I am a big proponent of using the right tool for the job. And as far as databases are concerned, I believe that in most cases a relational database is still the correct tool. However, one historical drawback of relational databases has been a difficulty in retrieving hierarchical data in a way that is easily compatible with most programming languages. Fortunately, support for modern data types such as JSON and advanced features including recursive common table expressions (CTEs) provide some excellent ways to tackle this problem.
My relational database of choice is PostgreSQL, which is open source and feature rich. It is fantastic, and you should definitely be using it. Postgres version 9.3 was recently released with even better support for the JSON data type among its new features. Here are some methods I routinely use to retrieve structured, hierarchical data using raw SQL in a single query.
Nested Objects
When working with data in your backend code, you often need quick access to the relationships within your data model. Including nested JSON objects in your result set is a great way to accomplish this naturally. This approach works particularly well if you’re interfacing with the database using node.js or another platform with fast JSON decoding.
Imagine the following hypothetical database of projects managed by employees:
With some data:
Suppose we wanted to return a list of projects and the employees responsible for them as an array of nested objects. PostgreSQL provides some useful JSON functions for manipulating data, and we will use the row_to_json
function to nest objects directly in the query results. row_to_json
provides the ability to turn a database row into a json object, which is the key. Consider the following query:
If we’re using the pg node.js library to interface with Postgres, this query will return the following directly to our Javascript:
Exactly what we’re looking for. It is very natural to work with this kind of data structure in Javascript or any other imperative language.
Augmenting Nested Objects
Sometimes it is necessary to return additional fields along with a given object that may not be directly included in the database table (due to normalization or other factors). For example, perhaps we are storing the date a project was assigned but would also like to return the age of that project in days. Let’s add to our previous data model:
This time we are going to reverse the desired results and query for a list of employees and their respective projects. But let’s add the age of each project in addition to the date assigned. How can we accomplish this?
One approach would be to use a subquery to create a new virtual table, and then use row_to_json
to transform that virtual table row into JSON. We will do essentially the same thing, but use a common table expression to redefine the project table with an additional “age” column. Also note that since employee->project is a one-to-many relationship, we are going to use the json_agg
aggregate function to return a JSON array of objects instead of a single object as we did previously with row_to_json
.
This query returns the following JSON to our Javascript backend. Note it is filtered to a single employee to reduce the size of the result set:
Recursive Common Table Expressions
Your dataset may include more complicated relationships in the form of tree-structured, hierarchical data as well. For example, tracking the management structure of our sample organization, where each “employee” reports to an immediate superior. This forms a tree of employees with the boss as the root node followed by subordinates. Probably the most common way to model such a tree is using the Adjacency List Model. In this representation, each employee record contains a reference to their immediate superior. If they are the head of the organization, this reference is null.
Let’s modify our existing schema once more to reflect this:
We can now use a recursive CTE (common table expression) to return this tree of data in a single query along with the depth of each node. Recursive CTEs allow you to reference the virtual table within its own definition. They take the form of two queries joined by a union, where one query acts as the terminating condition of the recursion and the other joins to it. Technically they are implemented iteratively in the underlying engine, but it can be useful to think recursively when composing the queries.
The above query will return Jon Snow, employee_id = 1
, and all employees subordinate to him either directly or through the tree. In our small dataset, the result looks like this:
This is fairly straightforward in our example, but it can be a powerful tool on larger trees especially when extended and combined with other techniques.
Combining Everything
We can use recursive CTEs in conjunction with the JSON functions to produce some really useful results. The following query will return the record for “Hareth”, employee_id = 7
, and a nested list of his superiors up to and including the root node. Each superior will be a complete employee object including their projects.
This query returns the following nested objects:
This is really just the tip of the iceberg, but as you can see there is considerable power available natively from PostgreSQL. It is great to be able to work with data directly from a database query without having to reformat it, use a mapping layer, or run any additional queries.