5 min read

(For more resources related to this topic, see here.)

XML is one solution to dealing with hierarchical data, but it isn’t the most natural for the relational database. Instead, you often wind up with nested categories, or filesystem-like folder hierarchies, or links back to older records. A popular way to structure a relational database for data with this shape is using a self reference, a link back to a parent object in the same table.For instance, you might want to model categories that can have subcategories with arbitrary nesting. A simple table might look like this:

CREATE TABLE categories (
id SERIAL PRIMARY KEY,
name VARCHAR,
parent_id INTEGER REFERENCES categories
);

What makes this structure recursive is that self-referencing parent_id column, which refers to the table we’re defining from within its own definition. We would treat categories with a NULL value for the parent_id column as top-level categories, categories that do not belong within any other category.

To get a feel for the kind of data, let’s put a few categories in there for an online retailer. Say we sell shirts and books, and books we further divide into fiction and non-fiction, and then we’ll put programming books inside non-fiction. It might look like the following:

INSERT INTO categories (id, name, parent_id) VALUES
(1, 'Shirts', NULL),
(2, 'Books', NULL),
(3, 'Fiction', 2),
(4, 'Non-fiction', 2),
(5, 'Programming', 4);

Usually you won’t put these in manually like this, but you can see that Fiction and Non-fiction are children of the Books category because their parent_id values are 2, the value for Books is NULL and Programming has the parent_id value as 4, which is the id value for Non-fiction.

Suppose you want to make navigation breadcrumbs given a category. You need to look up that category and then you need to look up the parent category. You need to do this until the parent category is NULL. In a procedural pseudocode, the process might look like the following:

def write_breadcrumbs(category):
row = getOne("SELECT * FROM categories
WHERE id = ?", category)
while row['parent_id'] != NULL:
write(row['name'])
write(' > ')
row = getOne("SELECT * FROM categories
WHERE id = ?", row['parent_id'])

This kind of solution leads to the N+1 query problem. There’s an action you want to take for a particular value, but to take that action you have to run an arbitrary number of separate queries. Recursive queries in PostgreSQL provide you with a way to have the database do the heavy lifting instead. Because we’ll be using recursion in the SQL, let’s first see what a recursion formulation would look like in our pseudocode:

def write_breadcrumbs(category):
row = getOne("SELECT * FROM categories
WHERE id = ?", category)
write(row['name'])
if row['parent_id'] != NULL:
write(' > ')
write_breadcrumbs(row['parent_id'])

It’s debatable whether this is better code; it’s shorter, and it has fewer bugs, but it also might expose the developer to the possibility of stack overflows. Recursive functions always have some similar structure though, some number of base cases that do not call the function recursively and some number of inductive cases that work by calling the same function again on slightly different data. The final destination will be one of the base cases. The inductive cases will peel off a small piece of the problem and solve it, then delegate the rest of the work to an invocation of the same function.

PostgreSQL’s recursive query support works with something called the common table expressions (CTEs). The idea is to make a named alias for a query. We won’t delve into the details too much here, but all recursive queries will have the same basic structure:

WITH RECURSIVE recursive-query-name AS (
SELECT <base-case> FROM table
UNION ALL
SELECT <inductive-case> FROM table
JOIN <recursive-query-name> ON )
SELECT * FROM <recursive-query-name>;

For an example, let’s get all the categories above Programming. The base case will be the Programming category itself. The inductive case will be to find the parent of a category we’ve already seen:

WITH RECURSIVE programming_parents AS (
SELECT * FROM categories WHERE id = 5
UNION ALL
SELECT categories.* FROM categories
JOIN programming_parents ON
programming_parents.parent_id = categories.id)
SELECT * FROM programming_parents;

This works as we’d hope.

Output of a simple recursive query

Without using this trick we’d have to do three separate queries to get this information, but with the trick it will always take one query no matter how deeply nested the categories are.

We can also go in the other direction and build up something like a tree underneath a category by searching for categories that have a category we’ve already seen as a parent category. We can make the hierarchy more explicit by building up a path as we go:

WITH RECURSIVE all_categories AS (
SELECT *, name as path FROM categories WHERE parent_id IS NULL
UNION ALL
SELECT c.*, p.path || '/' || c.name
FROM categories AS c
JOIN all_categories p ON p.id = c.parent_id)
SELECT * FROM all_categories;

Finding all the categories with their path

We can be more discriminating with the search by looking for a particular category as the starting value in the base case:

WITH RECURSIVE all_categories AS (
SELECT *, name as path FROM categories WHERE id = 2
UNION ALL
SELECT c.*, p.path || '/' || c.name
FROM categories AS c
JOIN all_categories p ON p.id = c.parent_id)
SELECT * FROM all_categories;

The category tree rooted at Books

This is hugely useful with hierarchical data.

Summary

In this article we learned about recursive queries as one of the most important features of PostgreSQL.

Resources for Article:


Further resources on this subject:


LEAVE A REPLY

Please enter your comment!
Please enter your name here