The Feb2013 release saw a number of significant changes to the internals of MonetDB. In this piece I'll write about some of that.
Internally MonetDB uses data structures called Binary Association Tables (BAT for short). These structures contain two columns of arbitrary types of the same length, one called the head column, the other the tail column. Most internal operators can deal with BATs with any types, where it makes sense. There is one special type, called OID (Object IDentifier) which is a number with a special significance. If multiple BATs are associated with each other (e.g., together they form a multi-column table), the OID values indicate which rows in the various BATs belong together.
OIDs come in two types. One is "materialized", i.e. taking up real memory. The other is "virtual". The latter can be used when a column consists of OIDs where each entry is exactly one larger than the previous. In this case, we don't need to store the actual values but we can just store the starting value. The column itself then takes up no space at all. An OID column where each value is one larger than the previous (including virtual OID columns) is called "dense".
In the implementation of SQL we found that in almost all cases, BATs used the virtual OIDs in the head column. In combination with the generic implementation of internal operators, this turned out to cause avoidable overhead.
In SQL, when you do a selection over multiple columns (e.g. SELECT * FROM t WHERE t.a = X AND t.b = Y AND t.c = Z), what we needed to do was the following.
First a select was done over the first column. The internal select operator is generic, so it can deal with arbitrary column types and it produces a subset of the input column of all rows that match the value. The actual tail value is then not needed anymore (so we already had a version that only copied the relevant rows from the head column) and the head values are then used to create a subset of the second BAT. Then we could do the second select over the subset of the second BAT, resulting in a similar partial copy of the head column.
We saw this and finally came up with a solution: subselect. Subselect in its simplest form is similar to the old select that discards the tail values. There are, however, a number of significant differences. Firstly, subselect will only accept input BATs with a dense OID head column and arbitrary tail column. Secondly, subselect produces a dense-headed BAT with the OID values of matching rows in the tail. Those OID values came from the head column of the input BAT.
A slightly more complex form of subselect takes the same input BAT and selection parameters as the simplest form, but in addition it takes the result of a previous call to subselect as input. This extra BAT is called the candidate list. This version of subselect will do the same selection as the simpler version, except it will only consider values that are indicated by the candidates list, returning a subset of the candidates list.
Using this new implementation, the multi-column select now becomes a bit simpler. First a subselect is done over the first column. This results in a list of OIDs. Then a subselect is done over the second column with the result of the previous subselect as candidates list. Etc. There is no need for extra intermediate copies of the input BATs.
Another idiom that happens a lot in SQL is grouped aggregation (e.g. SELECT avg(t.a) FROM t where t.b = X GROUP BY t.c). In the old situation, we needed to make several extra copies of the input columns in order to create a BAT that could be passed to the grouped aggregation function.
In comes the sub-aggregation family of interfaces. These interface take a dense-headed input BAT together with a BAT that indicates the grouping of the values. The grouping BAT is also dense-headed, and has an OID tail. Equal values in the tail indicate that the corresponding values in the input BAT are in the same group. To make things easier, there is a third input dense-headed input BAT that indicates which (and how many) groups are in the grouping BAT. All tail values in the grouping BAT must occur in the head of this extra BAT. In addition to these inputs, the functions also take a candidates list which is identical in format to the one subselect uses and produces. The result of the aggregation is a dense-headed BAT that contains the aggregated value for each of the groups.
Sorting was another cause of multiple copies of intermediates. The old interface just reordered the input BAT so that the head column was sorted. Note however that in the SQL implementation, all significant values are in the tail, so that means that BATs have to flipped (head becomes tail and vice versa--luckily a very cheap operation).
The new interface, subsort, takes a dense-headed BAT and produces two outputs. One is a dense-headed BAT that gives the ordered result, and one is a dense-headed BAT that indicates in the tail the way the input had to be reordered. In addition, subsort produces grouping information in the exact form that the grouped aggregations need. All outputs are in fact optional. If they are not needed, they don't have to be produced.
Subsort optionally takes grouping information as input. However, unlike aggregates, members of a group need to be consecutive. Instead of sorting the whole column when grouping information is provided, subsort will sort each group individually. It again optionally produces the new ordering and grouping information where the new groups are subsets of the input groups (i.e. equal values in different input groups are in different output groups).
Subsort also optionally takes ordering information as produced by a previous run as input. When combined with grouping information, the input BAT needs to be reordered according to the ordering information after which the grouping information indicates to which groups values belong.
Using these outputs it is easy to sort over multiple columns. First sort the most significant column, producing ordering and grouping information. Then provide these outputs as inputs when sorting the next column. Etc.
We added a number of new internal interfaces and either reimplemented the old interfaces using the new, or just removed the old interfaces. The SQL code generator now uses the new interfaces instead of the old. In most cases that we have seen this has resulted in often very significant speedups.
I didn't talk about the new grouping code that can also easily be cascaded. It produces outputs compatible with the output of subsort and compatible with the input to the grouped aggregate functions.
We are continuing with this work. For instance, join could also use candidate lists on its inputs.