Tenant isolation in hosted services
This article describes a new technique for achieving safety when hosting online services, with respect to customer-data security. It also presents an open-source software package, pgtenant, implementing this technique.
When you host an online service for multiple customers, particularly one that stores sensitive information on their behalf, one of the challenges you must solve is to ensure that each customer can only ever access their own data and never someone else’s. This should be true even in case of programming errors.
If the data were stored in plain files, it might suffice to place each customer’s data in its own separate folder or directory. It would be hard for a software bug accidentally to deliver the wrong customer’s data¹ — it would actively have to circumvent the separate-folders safeguard. This makes it a pretty safe approach.
However, very often data in online services is stored partly or entirely in relational databases rather than in plain files. In a relational database, there are a couple of ways to get the same effect as separating files into different folders. One is simple and safe, but costly in terms of computing resources. Another is economical but error-prone: bugs that leak data to the wrong customer can happen passively, by leaving out necessary logic. This is so easy to do accidentally that it’s almost inevitable.
The customers whose data you store are sometimes called “tenants” of your service, and the problem of keeping their data properly separated is called “tenant isolation.” This article describes a new approach to tenant isolation— an approach both economical and safe.
Background
In a relational database, information is collected into tables. Each table defines some number of columns. Data is then added to the table in rows, each row consisting of values for each of the table’s defined columns. This may sound complicated but is in fact dead simple. Here is the time-honored example of an “employee” table in a hypothetical HR database:
Here, the columns are Name, SSN, Hire date, and Salary. There are two rows, one for each of two employees.
To extract information from a relational database, one writes queries in SQL, the Structured Query Language. With SQL it is possible to say “show me the names of employees hired before 2005” or “show me the SSN of Alice Smith.” Those queries look something like this:
SELECT Name FROM Employees WHERE HireDate < '1/1/2005';
SELECT SSN FROM Employees WHERE Name = 'Alice Smith';
Now imagine that this database is not internal to a single company’s HR department, but is instead part of a cloud-hosted HR service. Suppose that many companies use this service to store and manage information about their employees. Then each such company is a tenant of the database, and it’s up to the HR-service provider to isolate the tenants from each other.
One approach is to create separate instances of the database, one per tenant. Software querying data from one instance would have a very hard time mistakenly getting data from a different instance. But multiple instances are slow to set up and expensive to run. This is the “simple, safe, but costly” option mentioned above.
Another approach, and by far the more common, is to combine all tenants’ data into a single database, and to distinguish one tenant’s data from another’s by adding a “tenant ID” column to each row of each table.
Imagine someone from ThingTech wants a list of employees earning more than 35,000. They ask the HR service for this information, and behind the scenes the HR service turns this into a SQL query like this:
SELECT Name FROM Employees WHERE Salary > 35000;
The list comes back: Alice Smith, Carla Doe, and Don Baker. “Who are Carla Doe and Don Baker?” wonders the ThingTech employee. The software has leaked StuffCo’s data in response to ThingTech’s query.
To prevent this, the software is required to narrow each search to a specific tenant ID, like this:
SELECT Name FROM Employees WHERE Salary > 35000
AND TenantID = 'ThingTech';
The problem is that a typical service will have many different tables and many different queries against them. The developers of the service must remember to include a tenant ID column in every table and a “TenantID =” clause in every query. This is bound to get overlooked from time to time. This is the “economical but error-prone” approach.
Automation to the rescue
A data-leaking bug is likely because the same thing needs to be done to very many SQL queries. (Namely, the inclusion of a tenant-ID clause.) When something is mindless and repetitive it’s easy occasionally to skip that step by mistake.
However, when something is mindless and repetitive it is the perfect candidate for automation by software. We’d have greater faith in the tenant-ID approach if, before executing any database query, we could somehow ensure that it included a proper “TenantID =” clause.
One way to achieve this would be to design a SQL query builder. Instead of putting query strings like this directly into our code:
SELECT Name FROM Employees WHERE Salary > 35000;
we would instead ask the builder to create those query strings for us, by telling the builder what kind of query we want. The builder would unfailingly include a “TenantID =” clause in its result.
For simple queries, this isn’t too bad:
builder := NewSQLBuilder(“ThingTech”)
builder.Select(“Name”)
builder.Table(“Employees”)
builder.Where(“Salary > 35000”)
sql := builder.Query()
but this interface is inadequate for expressing more complex queries. We might try to elaborate the builder API to make it more expressive, but the more different kinds of queries one wants to describe to the builder, the closer one comes to redesigning all of SQL, only badly.
Also, to someone reading our code, a series of calls to a SQL builder is never as clear as the simple SQL string they replace. Readability is a very high priority in software engineering. Sacrificing it makes it likely that in the zeal to avoid one kind of bug we just introduce other new ones.
Happily, there is another way, in which we continue to write SQL strings in our code, but those strings are prevented from getting to the database before they can be made tenant-safe. In other words, we continue writing:
SELECT Name FROM Employees WHERE Salary > 35000;
and rely on a transformation layer converting this automatically to
SELECT Name FROM Employees WHERE Salary > 35000
AND TenantID = 'ThingTech';
on its way to being executed by the database.
This involves much more than just pasting “AND TenantID =” onto the end of every query. The transformer needs to parse each query and understand what is in it. If it involves multiple tables, it needs to add a tenant-ID clause for each table. If it doesn’t have a WHERE clause, it has to add one. If it has a WHERE clause with a top-level “OR,” it has to parenthesize it before adding an “AND.” And of course if any other clauses follow the WHERE (such as “GROUP BY” or “LIMIT”), it has to make sure the tenant-ID checks get inserted in the right spot.
This is the approach taken in pgtenant, the Go package implementing these ideas. It creates a tenant-specific wrapper for a database-connection object. The wrapper intercepts each attempt to execute a query. It transforms that query using a caller-supplied column name (such as TenantID
) and value for the tenant-ID before issuing it to the database.
With this approach we no longer have to remember to include “AND TenantID =” in the SQL queries we write. We do have to remember to include a tenant-ID column in every table we create, but if we ever forget, we’ll quickly discover and fix the mistake, because the first query that tries to use that table will fail with a “no such column” error, after being transformed to expect a TenantID column.
The SQL transformer in pgtenant understands many types of SQL query but not every possible one. There are several features of SQL that pgtenant has never yet been needed for, and so to save time and effort those features have been omitted from pgtenant. We are content for the transformer to produce an error if it ever encounters a query it isn’t programmed to handle. Encountering such an error is our cue to update the transformer so that it can understand that kind of query. In this way the capabilities of the transformer expand incrementally.
Because of this, if you use pgtenant in your software, there is the possibility of introducing new, unhandled queries, and not discovering the mistake until after a release to production. This would result in users sometimes encountering no-such-column errors: not as bad as a data leak, but still undesirable.
Good testing practices should help to prevent this from happening. Every query in your codebase can be added to a test suite. The pgtenant transformer must demonstrate its ability to transform every query correctly before the tests can pass, and there should be no software release without passing tests.
But even great testing practices are never perfect. As with the original problem of having to remember a “TenantID =” clause in every query, now we have to remember to add every query, anywhere in the codebase, to a test suite, and sometimes this can be overlooked.
To help you remember to test every query, pgtenant requires you to specify a whitelist of all permitted queries. The database wrapper refuses to issue any query to the database that doesn’t appear in this list. It’s easy to use the same list of queries for both the whitelist and a test suite. You might overlook adding a query to the test suite; you’re less likely to overlook adding it to both the test suite and the whitelist.²
Database queries are often static: they are known ahead of time, they’re hard-coded into the software, and they can be whitelisted in this way.³ Some queries are dynamic: they are assembled only when needed, from query fragments based on user input. There are too many possibilities to enumerate in advance and therefore they can’t be included in the whitelist. For cases like these pgtenant includes an “escape hatch” that allows unwhitelisted queries to pass through to the database — after transformation, of course.
The implementation
Pgtenant is written in Go and can be found at github.com/bobg/pgtenant. Access to the database happens through Go’s built-in database/sql package. A third-party package, github.com/lfittl/pg_query_go, parses SQL queries.
SQL, the language, is (somewhat) standardized and is (somewhat) the same across many types of database — Postgresql, MySQL, and Oracle, among others. But the details of connecting to a database server, sending SQL queries to it, and collecting results are completely different. These differences are encapsulated by Go into the database/sql/driver package. Each database type implements its own version of the driver.Driver
interface, which is responsible for opening database connections. Those connections must satisfy the driver.Conn
interface, which is where the logic for issuing queries and collecting responses lives.
Pgtenant thus has its own implementation of driver.Conn
— and therefore also a driver.Driver
whose Open
method can produce one — as a wrapper around a Postgresql driver.Conn
object (from the github.com/lib/pq package). Our driver.Conn
transforms its input into tenant-safe form before passing it along to the Postgresql one.
Our Driver type appears here. Its Open
and OpenConnector
methods use a pq.Connector
that can open a nested Postgresql connection. That nested connector is created here and its Connect
method called here.
We ensure that the nested connection we receive satisfies our ctxConn
interface, which is just a collection of database/sql/driver
-defined interfaces that our connection type needs. We then put the new nested connection into our own Conn
object, defined here.
Our Conn
object defines some driver.Conn
methods, like Close
and Begin
, very simply: by delegating to the nested connection it contains. QueryContext
and ExecContext
also delegate to the nested connection, but they do query transformation first, via the method Conn.transform
.
Transforming a query string begins by “normalizing” it to remove variations in whitespace that would destroy its ability to be used as a lookup key. Once that’s done, it’s used to index the whitelist (defined here). This is the aforementioned list of permitted queries and their transformations. If it’s found there, the result is simply returned as-is.
If the query is not in the whitelist, then it has to have been explicitly permitted with the aforementioned “escape hatch”: the pgtenant.WithQuery
function, which attaches the query to the Context
object for the current request. (Here is a simple example of WithQuery
being used to permit a query that does not appear in the whitelist.) If it hasn’t been, an error is returned (here).
If that test succeeds, then as an optimization we check a cache to see if we’ve recently transformed the same query, and if we have, we return the cached output.
Finally, if it’s not a precomputed or a recent transformation, we parse the query into a syntax tree, run doTransform
on it, then cache the result and return it.
The doTransform
function first walks the syntax tree looking for the highest-numbered positional parameter referred to in the query. These are the $1, $2, … placeholders that get filled in at runtime with the value arguments passed to QueryContext
and ExecContext
. Our tenant ID will be appended to those values (see below), so it gets a positional parameter with a number that’s one higher than the highest one already in the query. After choosing a number for the tenant ID positional parameter, doTransform
instantiates a transformer object and calls its transformTree
method on the syntax tree.
The transformTree
method is a thin wrapper around transformStmt
, which starts another walk of the syntax tree, passing to its subroutines two new objects: a bytes.Buffer
for collecting the transformed query, and an environ
, which records information about table names encountered during the tree walk.
Four top-level statement types are handled: inserts, updates, deletes, and selects. Let’s take a closer look at transformSelect
, which illustrates the techniques common to all four statement types.
A Postgresql SELECT
statement handled by the transformer has this overall syntax:
WITH ctes
SELECT targets
FROM sources
WHERE conditions
GROUP BY items
HAVING items
ORDER BY items
LIMIT expression
Every part is optional, except for SELECT targets
.
The WITH
clause defines some ctes (“common table expressions”), each of which gets a name and behaves like a table for purposes of selecting from or joining with. This is handled here. (And here is one example of a WITH … SELECT
query.)
Now we add the SELECT
verb to the output (accumulating in the bytes.Buffer
from transformStmt
, referred to here as w
, an io.Writer
). At this point, some SELECT
statements filter their output with the DISTINCT
keyword. This is an example of a SQL feature that pgtenant does not yet support, so if DISTINCT
is seen the transformer produces an error.
In this section, we list the targets (column names, usually) being selected. We handle SELECT *
specially, and produce an error if SELECT *
appears as part of an INSERT
(another not-yet-implemented feature). The targets are a comma-separated list, and each one is individually transformed with the transformSelectCol
method in a call to commaSeparated
. Finally we add the tenant ID column as one of the ones being selected.
Next we handle the FROM
clause. Its items are also produced by the commaSeparated
function, with each item individually transformed, this time by transformNode
. Then we handle the WHERE
, GROUP BY
, HAVING
, ORDER BY
, and LIMIT
clauses, and then we’re done.
As you can see, most of this logic is dedicated simply to walking the syntax tree in order to re-render the SQL string that produced it in the first place. In fact this is true for most of the rest of transform.go
, too. The changes that are needed to the original SQL are few:
- When we encounter
INSERT INTO table (column, …)
we add the tenant ID column name to the list of columns. - If it’s an
INSERT … VALUES
we add the tenant ID value to the list of values (to match the column name we added to the column list). - If it’s an
INSERT … SELECT
instead, we rely on a nested call totransformSelect
to add the tenant ID value. - If the
INSERT
has anON CONFLICT
clause, we add the tenant ID column name to the list of columns. - When we encounter a
WHERE
, we add anAND tenantIDColumn = tenantIDValue
for each table involved. (And when we don’t encounter aWHERE
, we sometimes add one.) Expressions are parenthesized as needed to prevent ambiguity, so for exampleWHERE foo = $1 OR bar = $2
becomesWHERE (foo = $1 OR bar = $2) AND tenant_id = $3
.
Back to the QueryContext
or ExecContext
call that initiated query transformation. Once the transformed query is ready, with its added positional parameter for the tenant ID column, a value for that parameter must be appended to the list of query arguments supplied by the caller. This happens here and here.
The value for the tenant ID comes from the Context
object, using the ID
function (here and here). Naturally the tenant ID must have been attached to the Context
object at some earlier point, using WithTenantID
. An example is here.
This technique has proven itself through years of trouble-free deployment in hosted services with dozens of tenants. It achieves high performance by caching the transformation of most queries, avoiding on-the-fly transformation. It achieves correctness by using those transformations as test vectors, and safety by funneling all database calls through the transformation layer.
Although this implementation is specific to Postgresql, the general approach is suited for all SQL databases.
Endnotes
- Especially if the software takes certain precautions to make wrong-customer data difficult or impossible to access, such as using
chdir
orchroot
before performing data-related operations. - Since the test vector includes input strings together with their expected post-transformation output, the test vector, which is also a whitelist, serves a third role as a precomputed cache of transformed queries, improving performance by relieving us of having to invoke the transformer at runtime.
- With a query like
SELECT Name FROM Employees WHERE Salary > 35000
, the 35000 is not actually part of the query. It’s supplied separately, as an argument, so the same query can be used for any salary threshold without having to list every possible variant. The query string itself isSELECT Name FROM Employees WHERE Salary > $1
, and $1 is replaced with argument number 1 at query time.