6 min read

In this article by Hans JurgenSchonig, the author of the book Mastering PostgreSQL 9.6, we will learn about advanced SQL.

Introducing grouping sets

Every advanced user of SQL should be familiar with GROUP BY and HAVING clauses. But are you also aware of CUBE, ROLLUP, and GROUPING SETS? If not this articlemight be worth reading for you.

Loading some sample data

To make this article a pleasant experience for you, I have compiled some sample data, which has been taken from the BP energy report at http://www.bp.com/en/global/corporate/energy-economics/statistical-review-of-world-energy.html.

Here is the data structure, which will be used:

test=# CREATE TABLE t_oil (
  region text,
  country text,
  year int,
  production int,
  consumption int
);
CREATE TABLE

The test data can be downloaded from our website using curl directly:

test=# COPY t_oil FROM PROGRAM
'curl www.cybertec.at/secret/oil_ext.txt';
COPY 644

On some operating systems curl is not there by default or has not been installed so downloading the file before might be the easier option for many people.

All together there is data for 14 nations between 1965 and 2010, which are in two regions of the world:

test=# SELECT region, avg(production)
FROM t_oil
GROUP BY region;
    region     |          avg
---------------+---------------------
Middle East   | 1992.6036866359447005
North America | 4541.3623188405797101
(2 rows)

Applying grouping sets

The GROUP BY clause will turn many rows into one row per group. However, if you do reporting in real life, you might also be interested in the overall average. One additional line might be needed.

Here is how this can be achieved:

test=# SELECT region, avg(production)
  FROM t_oil
GROUP BY ROLLUP (region);
    region     |          avg
---------------+-----------------------
Middle East   | 1992.6036866359447005
North America | 4541.3623188405797101
               | 2607.5139860139860140
(3 rows)

The ROLLUP clause will inject an additional line, which will contain the overall average. If you do reporting it is highly likely that a summary line will be needed. Instead of running two queries, PostgreSQL can provide the data running just a single query.

Of course this kind of operation can also be used if you are grouping by more than just one column:

test=# SELECT   region, country, avg(production)                                                                                                 FROM    t_oil                                                                                                                                    WHERE   country IN ('USA', 'Canada', 'Iran', 'Oman')                                                                                             GROUP BY ROLLUP (region, country);
    region     | country |          avg
---------------+---------+-----------------------
 Middle East   | Iran    | 3631.6956521739130435
 Middle East   | Oman    |  586.4545454545454545
 Middle East   |         | 2142.9111111111111111
 North America | Canada  | 2123.2173913043478261
 North America | USA     | 9141.3478260869565217
 North America |         | 5632.2826086956521739
               |         | 3906.7692307692307692
(7 rows)

In this example, PostgreSQL will inject three lines into the result set. One line will be injected for Middle East, one for North America. On top of that we will get a line for the overall averages. If you are building a web application the current result is ideal because you can easily build a GUI to drill into the result set by filtering out the NULL values.

The ROLLUPclause is nice in case you instantly want to display a result. I always used it to display final results to end users. However, if you are doing reporting, you might want to pre-calculate more data to ensure more flexibility. The CUBEkeyword is what you might have been looking for:

test=# SELECT   region, country, avg(production)                                                                                                 FROM    t_oil                                                                                                                                    WHERE   country IN ('USA', 'Canada', 'Iran', 'Oman')                                                                                             GROUP BY CUBE (region, country);
    region     | country |          avg
---------------+---------+-----------------------
 Middle East   | Iran    | 3631.6956521739130435
 Middle East   | Oman    |  586.4545454545454545
 Middle East   |         | 2142.9111111111111111
 North America | Canada  | 2123.2173913043478261
 North America | USA     | 9141.3478260869565217
 North America |         | 5632.2826086956521739
               |         | 3906.7692307692307692
               | Canada  | 2123.2173913043478261
               | Iran    | 3631.6956521739130435
               | Oman    |  586.4545454545454545
               | USA     | 9141.3478260869565217
(11 rows)

Note that even more rows have been added to the result. The CUBEwill create the same data as: GROUP BY region, country + GROUP BY region + GROUP BY country + the overall average. So the whole idea is to extract many results and various levels of aggregation at once. The resulting cube contains all possible combinations of groups.

The ROLLUP and CUBE are really just convenience features on top of GROUP SETS. With the GROUPING SETS clause you can explicitly list the aggregates you want:

test=# SELECT   region, country, avg(production)                                                                                                 FROM    t_oil                                                                                                                                    WHERE   country IN ('USA', 'Canada', 'Iran', 'Oman')                                                                                             GROUP BY GROUPING SETS ( (), region, country);
    region     | country |          avg
---------------+---------+-----------------------
 Middle East   |         | 2142.9111111111111111
 North America |         | 5632.2826086956521739
               |         | 3906.7692307692307692
               | Canada  | 2123.2173913043478261
               | Iran    | 3631.6956521739130435
               | Oman    |  586.4545454545454545
               | USA     | 9141.3478260869565217
(7 rows)

In this I went for three grouping sets: The overall average, GROUP BY region and GROUP BY country. In case you want region and country combined, use (region, country).

Investigating performance

Grouping sets are a powerful feature, which help to reduce the number of expensive queries. Internally,PostgreSQL will basically turn to traditional GroupAggregates to make things work. A GroupAggregate node requires sorted data so be prepared that PostgreSQL might do a lot of temporary sorting:

test=# explain SELECT   region, country, avg(production)                                                                                         FROM    t_oil                                                                                                                                    WHERE   country IN ('USA', 'Canada', 'Iran', 'Oman')                                                                                             GROUP BY GROUPING SETS ( (), region, country);
                            QUERY PLAN
---------------------------------------------------------------
GroupAggregate  (cost=22.58..32.69 rows=34 width=52)
   Group Key: region
   Group Key: ()
   Sort Key: country
     Group Key: country
   ->  Sort  (cost=22.58..23.04 rows=184 width=24)
         Sort Key: region
         ->Seq Scan on t_oil
    (cost=0.00..15.66 rows=184 width=24)
         Filter: (country = ANY
('{USA,Canada,Iran,Oman}'::text[]))
(9 rows)

Hash aggregates are only supported for normal GROUP BY clauses involving no grouping sets. According to the developer of grouping sets (AtriShama), adding support for hashes is not worth the effort so it seems PostgreSQL already has an efficient implementation even if the optimizer has fewer choices than it has with normal GROUP BY statements.

Combining grouping sets with the FILTER clause

In real world applications grouping sets can often be combined with so called FILTER clauses. The idea behind FILTER is to be able to run partial aggregates.

Here is an example:

test=# SELECT   region,                                                                                                                                  avg(production) AS all,                                                                                                                          avg(production) FILTER (WHERE year < 1990) AS old,                                                                                               avg(production) FILTER (WHERE year >= 1990) AS new                                                                                       FROM    t_oil                                                                                                                                    GROUP BY ROLLUP (region);
    region     |       all      |      old       |      new                           
---------------+----------------+----------------+----------------
 Middle East   | 1992.603686635 | 1747.325892857 | 2254.233333333
 North America | 4541.362318840 | 4471.653333333 | 4624.349206349
               | 2607.513986013 | 2430.685618729 | 2801.183150183
(3 rows)

The idea here is that not all columns will use the same data for aggregation. The FILTER clauses allow you to selectively pass data to those aggregates. In my example, the second aggregate will only consider data before 1990 while the second aggregate will take care of more recent data.

If it is possible to move conditions to a WHERE clause it is always more desirable as less data has to be fetched from the table. The FILTERis only useful if the data left by the WHERE clause is not needed by each aggregate.

The FILTER works for all kinds of aggregates and offers a simple way to pivot your data.

Summary

We have learned about advanced feature provided by SQL. On top of the simple aggregates,PostgreSQL provides, grouping sets to create custom aggregates. 

Resources for Article:


Further resources on this subject:


LEAVE A REPLY

Please enter your comment!
Please enter your name here