Data compression

mk Tue, 06/21/2011 - 07:36

A distinctive feature of column stores is their aggressive use of data compression. Compression, however, is a double-edged sword: it trades CPU cycles against the movement of large data files over relatively slow networks, disks, or memory interconnects. Its effectiveness strongly depends on the mapping from the database schema into data structures and their maintenance cost, on the relational algorithms, the system architecture, and the data distribution. The compression ratios cited in the literature depend on the input size, which is commonly assumed to be in CSV format, on the data distribution, and on the database storage footprint, with or without auxiliary data structures such as indices.

MonetDB applies different compression techniques automatically at many levels.

The column store representation is highly optimized. The basic storage structure is a dense array, i.e. without holes to accommodate future insertions and without overhead caused by the data structure itself (e.g. B-trees). This dense representation allows the database files to be mapped directly into memory. The storage width ranges from 1 byte (e.g. tinyint) to 8 bytes (e.g. double). NULL values are part of the domain space, which avoids auxiliary bit masks at the expense of 'losing' a single value from the domain.
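The idea of folding NULL into the value domain can be sketched as follows; the names and the choice of reserved value are illustrative assumptions, not MonetDB's actual internals:

```python
# Sketch: represent NULL as a reserved value of the domain itself, so no
# separate NULL bitmask is needed. For a signed 1-byte integer we sacrifice
# the minimum value; the usable domain shrinks to [-127, 127].
NULL_INT8 = -128  # hypothetical reserved marker

def is_null(v):
    """A value is NULL iff it equals the reserved domain value."""
    return v == NULL_INT8

# A dense column: a plain array with no holes and no per-value metadata.
column = [5, NULL_INT8, 42]
non_null = [v for v in column if not is_null(v)]
# non_null == [5, 42]
```

The cost is one lost value per domain; the benefit is that every column remains a fixed-width dense array that can be memory-mapped as-is.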

All strings are stored using dictionary encoding. This significantly reduces their storage space, but with large dictionaries the maintenance cost can become prohibitive. Therefore, for very large dictionaries, MonetDB resorts to a non-compressed string representation. The references into the dictionary occupy between 1 and 8 bytes, depending on the number of entries.
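A minimal sketch of this scheme, with hypothetical helper names rather than MonetDB's internal API, shows how the reference width follows from the dictionary size:

```python
def dict_encode(values):
    """Replace each string by a small integer reference into a dictionary."""
    dictionary = []   # unique strings, in order of first appearance
    index = {}        # string -> position in the dictionary
    refs = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        refs.append(index[v])
    return dictionary, refs

def ref_width_bytes(n_entries):
    """Bytes needed per reference: 1, 2, 4, or 8, depending on cardinality."""
    for width in (1, 2, 4, 8):
        if n_entries <= 1 << (8 * width):
            return width
    raise ValueError("dictionary too large")

dictionary, refs = dict_encode(["red", "blue", "red", "green", "blue"])
# dictionary == ["red", "blue", "green"], refs == [0, 1, 0, 2, 1]
# ref_width_bytes(3) == 1: three entries fit in single-byte references
```

This also illustrates the maintenance trade-off: every insert of a new string must probe and possibly grow the dictionary, which is why falling back to plain strings pays off once the dictionary no longer compresses well.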

During query evaluation, a dense range of results is represented by a column view, a small-footprint representation of the result set. It avoids both copying the result and storing it in a private column structure.
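The mechanism can be sketched as a zero-copy window over a base column, described only by an offset and a length; the class below is an illustrative assumption, not MonetDB's actual data structure:

```python
class ColumnView:
    """A zero-copy view over a dense range [offset, offset + length) of a
    base column. Reads are delegated to the base; nothing is copied."""

    def __init__(self, base, offset, length):
        self.base = base
        self.offset = offset
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        return self.base[self.offset + i]

base = list(range(100))
view = ColumnView(base, offset=10, length=5)  # rows 10..14, no copy made
# len(view) == 5, view[0] == 10, view[4] == 14
```

The footprint of the view is constant regardless of how many rows it covers, which is what makes it attractive for intermediate results in a column-at-a-time executor.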