Faster, robuster and feature richer: MonetDB in 2020 and beyond

Niels The MonetDB team has completed her major development strands for 2020. The MonetDB Oct2020 feature release includes significant improvements in performance under frequent bulk-update transactions, highly requested features such as point-in-time database backup and more analytics features, and many more extensions for the convenience of application developers.

“In 2020 we have conducted some major renovations of the MonetDB internals to improve speed, functionality and stability to strengthen its role in the enterprise market,” said Niels Nes, CTO and co-founder of MonetDB Solutions. “The Oct2020 release is the last functional release for this year. It lays the basis for some cool items to be expected next year.”

MonetDB, an innovative data management system known for its high-performance and robustness, benefits from over 25 years of open source development from a global developer community and has become the preferred open-source analytical database of organisations of all sizes addressing their big data and data analytics challenges.

Performance Improvements

MonetDB was initially designed to achieve the maximum speed for OLAP queries while assuming updates are rare. This assumption started to change because more and more users who have adopted MonetDB for their analytical workload also perform frequent bulk updates (INS/DEL/UPD) on their databases. Our primary goal is to ensure that analytical queries do not suffer from performance degradations in such use cases.

Linear hash We have extended MonetDB’s static hash index, which was too quickly discarded under updates and recreated by the next OLAP queries, to use a technique based on Linear Hashing. The index can grow gracefully with new data arriving, which reduces the impact of data insertions on OLAP queries.

Candidate lists MonetDB uses candidate lists internally to specify which rows of a column participate in a computation, e.g. a SELECT applies the condition on the items qualified by its input candidate list and returns a possibly shorter candidate list indicating the selected items. We have introduced a new type of candidate list, called negative candidate list, which, as opposed to a positive candidate list, denotes the tuples that should not be considered in the remainder of the query execution.

MonetDB TPC-H SF100 performance

CASE & COALESCE Both candidate lists reduce the impact of DELETEs for OLAP queries. When a small percentage of tuples are deleted, we use a negative candidate list to keep track of them; otherwise, we use a positive candidate list. In the latest release, they are used to speed up CASE and COALESCE, often used by data scientists, by getting rid of conditional executions.

Performance chronicle On our reference benchmark TPC-H SF-100 (i.e. 100GB data), the overall performance of MonetDB has improved with more than 50% since 2017 on the same testing desktop (Intel NIC I7, 4-cores 2.1 GHz, 32GB RAM and 1TB NVMe).

Analytical features

ROLLUP, CUBE, GROUPING SETS Database systems aiming at the data analytics market should be strong in the SQL functionality offered by the system. We have completed our support for the SQL windowing and aggregation functions by implementing COVER_POP, CORR, GROUP_CONCAT, etc., and added support for the SQL:1999 data warehouse functionality ROLLUP, CUBE and GROUPING SETS.

Resource limits MonetDB automatically runs queries with the maximum parallelisation possible. Therefore, an unexpectedly gruelling OLAP query can exhaust all hardware resources. In a multitenant setting, this can render the MonetDB server unresponsive to other users’ queries. To counter these cases, we have added several functions with which a DBA can limit the number of CPU cores and the amount of memory available to individual users.

MonetDB/e NYC taxi benchmark

MonetDB/e One of the most significant renovations is the isolation of the database kernel into a standalone library for embedding in programming languages. It is covered in a separate post.

Improved Robustness

Memory fixes With the help of our commercial enterprise users, we have cornered several memory leaks and glitches that only manifested themselves in long-running production environments under heavy concurrent users and mixed OLAP queries with transactions workload. These cases typically take several days/weeks before they escalate to a system error.

SQLancer With the help of SQLancer, an automatic DBMS testing tool, we have been able to identify >100 potential problems in corner cases of the SQL processor. Those problems were found and resolved before they could reach the MonetDB users.

Hot snapshot A SQL procedure HOT_SNAPSHOT allows a DBA to take a snapshot of a running database server to a compressed tar file. It creates a point-in-time backup.

Developer Conveniences

Scoping SQL is a language intergalactic but never designed as a real programming language. One of the areas this is felt is the use of (global) variables, table and column names, and schema functions. Instead of relying on identifier markers, we have cleaned up the name resolution scheme. Variables with the same name at a query are now resolved under the following precedence rules: (1) tables, views and CTEs at the FROM clause; (2) variable declared in the body of function/procedure, i.e. local variable; (3) function/procedure parameter; and (4) variable from the global scope.

IPv6 MonetDB now supports both IPv4 and IPv6 protocols. The settings for specifying how mserver5 should listen to “the Internet” have been overhauled. If “listenaddr” equals “localhost” or “all”, we listen to both IPv4 and IPv6 (if available), if “127.0.0.1” or “0.0.0.0”, we listen to IPv4 only, if “::1” or “::” we listen to IPv6 only. The first of each pair is the loopback interface only; the second is all interfaces. If “listenaddr” is “none”, then no IP port is opened; instead, you have to use a UNIX domain socket. If “port” is 0, we let the operating system choose a free port.

SQL compliance In the Oct2020 release, we made our implementation of several SQL features compliant to the standard. For instance, the general logarithm function log(x, base) is now log(base, x). The compatibility and implicit casting between interval types and other numeric types are removed; instead, INTERVAL types must be used explicitly.

Tracer A general logging system, called a tracer, has been introduced. It provides users finer-grained control of the levels of information to log, i.e. from CRITICAL, ERROR, WARNING, INFO to DEBUG, and a consistent reporting scheme.

CMake An important change was the methods for source-based compilation. After two decades of being a happy user of the Linux Autotools library, we renovated the building process to a new infrastructure completely based-on CMake. This change secures better portability in the years ahead on the many platforms MonetDB already supports. Besides, this reduces the compilation time to only ~1.5 minutes for a debugging build and ~4.5 minutes for an optimised production build.

Outlook towards 2021

“The MonetDB team is already working hard on some cool stuff to be released in 2021,” says Ying Zhang, COO, product manager and co-founder of MonetDB Solutions. “In close cooperation with our customers, we identify the shortest route towards an even more powerful environment”.

Some topics being brewed in the software kitchen are:

  • Storage layer improvements, less data copying.
  • Resource control with user session profiles.
  • Privacy using obfuscation and masking.
  • MonetDB/e as a lightweight server.
  • A mix of views and base tables.
  • Candidate list storage reduction.
  • Indexes for faster information retrieval and regex.
  • Partial column indices.
  • End-to-end solutions in the Cloud.