SQL:1999 grouping combinations now supported in MonetDB

As of MonetDB Jun2020 release, we have added support for ROLLUP, CUBE and GROUPING SETS features defined by the SQL:1999 standard in the GROUP BY clause. These keywords allow one to perform multiple grouping set combinations in a single statement, containing partial and total computed results. At the same time, the GROUPING aggregate, which identifies the grouping columns used in the output, has also been implemented.

The grammar is as follows:

group_by_clause:
  GROUP BY group_by_element [',' group_by_element ... ]

group_by_element:
  identifier
| ROLLUP '(' identifier [',' identifier ... ] ')'
| CUBE '(' identifier [',' identifier ... ] ')'
| GROUPING SETS '(' group_by_element [',' group_by_element ... ] ')'
| '(' ')'

The expression '(' ')' means a grouping without columns, i.e. a global aggregate.

In practice, whenever ROLLUP, CUBE and GROUPING SETS are used inside the GROUP BY clause, the incoming data is split for the generated grouping combinations, then each grouping is computed, then the results are returned. This can be seen as multiple GROUP BY clauses being generated, computed, then merged with UNION statements. Depending on the grouping keyword used, the generated grouping combination is different.

ROLLUP represents a hierarchy of columns. For n grouping columns, ROLLUP generates n+1 groupings. Aan example, ROLLUP (a,b,c) translates into four grouping combinations: (a,b,c), (a,b), (a) and ().
CUBE represents all possible combinations of values in the supplemented columns. For n grouping columns, CE generates 2^n groupings. As an example, CUBE (a,b,c) translates into eight grouping combinations: (b,c), (a,b), (a,c), (b,c), (a), (b), (c) and ().
GROUPING SETS computes the list of grouping combinations supplied as they are with no changes. For n grouping columns, GROUPING SETS generates n groupings. As an example, GROUPING SETS (a,b,c) translates into three grouping combinations: (a), (b) and (c).

Nested and mixed groupings:

GROUPING SETS allows nesting of combinations, however, they are merged altogether. Aan example, GROUPING SETS (a, GROUPING SETS (b), ROLLUP (c, d)), translates into (a), (b), (c, d), (c), ().
The elements of ROLLUP and CUBE may contain lists of columns, in which they are treated as a single unit. Aan example, ROLLUP (a,b,c,d), which translates into (a,b,c,d), (a,b,c), (a,b), (a) and (), is different fm ROLLUP ((a,b),(c,d)), which translates into (a,b,c,d), (a,b) and ().
If multiple grouping elements are given in the GROUP BY clause, the output combinations are the cross product of every generated sublist altogether. As an example, GROUP BY ROLLUP(a,b), c, CUBE(d,e) means ((a,b),(a),()) x (c) x ((d,e),(d),(e),()), which translates into (a,b,c,d,e), (a,b,c,d), (a,b,c,e), (a,b,c), (a,c,d,e), (a,c,d), (a,c,e), (c,d,e), (c,d), (c,e), (c).

The GROUPING aggregate has also been added to identify the grouping columns used for each row computed in the output. This aggregate takes as input identifiers for the group columns used in the GROUP BY clause. The output is bitmask indicating which of the arguments were not used in the output grouping set. The last grouping element corresponds to the least-significant bit.

Demonstration

Below, we list example queries with the added functionalities.

CREATE TABLE categories (id INTEGER, description TEXT);
CREATE TABLE sections (id INTEGER, description TEXT);
CREATE TABLE products (id INTEGER, categoryid int, sectionid INTEGER,
                       description TEXT, price DECIMAL(6,2));
CREATE TABLE sales (productid INTEGER, sale_day DATE, units INTEGER);

INSERT INTO categories VALUES (1, 'fresh food'), (2, 'dry food'), (3, 'drinks');
INSERT INTO sections VALUES (1, 'front'), (2, 'side'), (3, 'back');
INSERT INTO products VALUES
    (1, 1, 1, 'apples', 1.5), (2, 1, 2, 'melons', 4.0),
    (3, 2, 2, 'peanuts', 2.0), (4, 3, 1, 'water', 1.0),
    (5, 3, 3, 'wine', 5.0), (6, 2, 3, 'walnuts', 1.5);
INSERT INTO sales VALUES
    (1, date '2020-03-01', 10), (2, date '2020-03-01', 3),
    (4, date '2020-03-01', 4), (1, date '2020-03-02', 6),
    (4, date '2020-03-02', 5), (5, date '2020-03-02', 2),
    (1, date '2020-03-03', 7), (3, date '2020-03-03', 4),
    (2, date '2020-03-03', 3), (5, date '2020-03-03', 1),
    (6, date '2020-03-03', 1);

-- In the first example we compute the total sales per product per day, then
--   the total sales per product, and finally the overall total using the
--   ROLLUP construct.
-- If a column was not used for the aggregate, it gets a NULL value by default.
--   In our query, we use COALESCE to give them a meaningful name.

SELECT COALESCE(products.description, 'all_products') AS description,
       COALESCE(sale_day, 'all_days') AS sale_day, totals.total
  FROM (
    SELECT productid, sale_day, SUM(units * price) AS total
      FROM products
      LEFT JOIN sales ON sales.productid = products.id
     GROUP BY ROLLUP(productid, sale_day)
) AS totals
 LEFT JOIN products ON products.id = totals.productid
ORDER BY sale_day NULLS LAST, productid NULLS LAST;

description	sale_day	total
apples	2020-03-01	15.0000
melons	2020-03-01	12.0000
water	2020-03-01	4.0000
apples	2020-03-02	9.0000
water	2020-03-02	5.0000
wine	2020-03-02	10.0000
apples	2020-03-03	10.5000
melons	2020-03-03	12.0000
peanuts	2020-03-03	8.0000
wine	2020-03-03	5.0000
walnuts	2020-03-03	1.0000
apples	all_days	34.5000
melons	all_days	24.0000
peanuts	all_days	8.0000
water	all_days	9.0000
wine	all_days	15.0000
walnuts	all_days	1.5000
all_products	all_days	92.0000

-- In the following query, we compute the total sales by category, section and
--   global. The left joins in the outer query are used to get the description.
SELECT COALESCE(sections.description, 'all_sections') AS section,
       COALESCE(categories.description, 'all_categories') AS category,
       totals.total
  FROM (
    SELECT categoryid, sectionid, SUM(units * price) AS total
      FROM products
      LEFT JOIN sales ON sales.productid = products.id
     GROUP BY GROUPING SETS ((categoryid), (sectionid), ())
) AS totals
LEFT JOIN categories ON totals.categoryid = categories.id
LEFT JOIN sections ON totals.sectionid = sections.id;

description	sale_day	total
front	all_categories	43.5000
side	all_categories	32.0000
back	all_categories	16.5000
all_sections	fresh food	58.5000
all_sections	drinks	24.0000
all_sections	dry food	9.5000
all_sections	all_categories	92.0000

-- We can use the GROUPING aggregate to identify which of the columns were used
--   in the grouping, then filter the result based on this information. In the
--   next example we create a view with the computed results of the previous
--   example and then we filter the category exclusive aggregates.

CREATE VIEW sales_totals AS
  SELECT GROUPING(categoryid) AS category_aggregates,
         GROUPING(sectionid) AS section_aggregates,
         categoryid, sectionid, SUM(units * price) AS total
    FROM products
    LEFT JOIN sales ON sales.productid = products.id
   GROUP BY GROUPING SETS ((categoryid), (sectionid), ());

SELECT description, total
  FROM sales_totals
  LEFT JOIN categories ON sales_totals.categoryid = categories.id
 WHERE category_aggregates = 0;

description	total
fresh food	58.5000
drinks	24.0000
dry food	9.5000

Implementation notes

Some implementation notes are left here:

Only identifiers are allowed inside ROLLUP, CUBE and GROUPING SETS clauses. Some other databases allow expressions as a non SQL standard extension.
The SQL type of the result of the GROUPING aggregate varies with the number of grouping columns. Up to 7 columns, it is a TINYINT; between 8 and 15 columns a SMALLINT; between 16 and 31 columns an INTEGER; between 32 and 63 columns a BIGINT; and finally a HUGEINT for 64 to 127 columns if available.

Developer and support

The grouping combinations extension was developed by Pedro Ferreira, a software developer at MonetDB Solutions.