Thanks to the recently published SQL:2023 standard, 2024 continues to be a year of new SQL features.
Further, we have planned some changes deep in the MonetDB kernel so that MonetDB will be able to better support new applications, such as vector and graph data storage and processing.
Finally, the new security-related features should be ready for release, and the new testing tools for MonetDB should be fully operational this year.
After the most recent feature release of Dec2023, we plan to produce only one major release this year.
The release will probably be done in the second half of the year and contain the following major features:
MonetDB users can expect many new SQL language features and improvements to existing features, for instance, changes to how UNIQUE constraints treat NULL values, support for CHECK constraints in CREATE/ALTER TABLE, and additional syntax for the GROUP BY and ORDER BY clauses.
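As a small illustration of what a CHECK constraint does, the sketch below uses SQLite (only because it ships with Python); the constraint syntax itself is standard SQL, which is what MonetDB plans to follow:

```python
import sqlite3

# A standard SQL CHECK constraint rejects rows that violate the predicate.
# SQLite is used here purely for portability of the example.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT, age INTEGER CHECK (age >= 0))")
con.execute("INSERT INTO people VALUES ('alice', 30)")   # accepted
try:
    con.execute("INSERT INTO people VALUES ('bob', -5)")  # rejected by CHECK
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```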
The coming major release will have support for encrypting the connections between MonetDB clients and the server using TLS/SSL.
In addition, since DBAs often want to know which applications or services are
making certain connections or executing certain queries, the sys.sessions
table is being extended to include much more of this type of information.
There are two main improvements on this topic: NULL value compression and an n-ary UNION operator. This is a very small but important step towards our wish to have “Dynamic numerical columns” (see the Wish List below).
NULL value compression: a new optimiser has been introduced to store NULL values separately, using only 1 bit per value instead of 8 bits.
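The idea can be sketched as a validity bitmap: instead of spending a full byte (or a sentinel value) per NULL flag, one bit per value is packed into a separate buffer. This is an illustration only, not MonetDB's actual implementation:

```python
def pack_null_bitmap(values):
    """Return (non-NULL payload, bitmap) where bit i is set iff values[i] is NULL."""
    bitmap = bytearray((len(values) + 7) // 8)  # 1 bit per value, rounded up
    payload = []
    for i, v in enumerate(values):
        if v is None:
            bitmap[i // 8] |= 1 << (i % 8)      # mark position i as NULL
        else:
            payload.append(v)                    # NULLs never enter the payload
    return payload, bytes(bitmap)

payload, bitmap = pack_null_bitmap([7, None, 3, None, None, 1])
```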
N-ary UNION operator: traditionally, UNION operators are processed in a binary tree fashion, which allows only a limited degree of parallelism and can cause redundant data processing. To improve data ingestion from multiple sources, e.g. relational tables, external files or external database/cloud storage, we will introduce an n-ary UNION operator, called MUNION. MUNION handles multiple data sources as a list of relations, in which all items can be processed in parallel. With MUNION, we can significantly reduce query times, because the relational trees become much simpler and shorter, redundant data processing is avoided, and much more of the execution can run in parallel. MUNION is also part of making MonetDB ready for its new parallel pipeline processing engine (see below).
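The difference from a binary tree of unions can be sketched as follows: treat the inputs as one flat list and scan them all concurrently, then concatenate. Function names here are illustrative, not MonetDB's API:

```python
from concurrent.futures import ThreadPoolExecutor

def scan(source):
    # Stand-in for reading one input relation (table, file, remote store).
    return list(source)

def munion(sources):
    # All sources are scanned in parallel rather than pairwise in a tree.
    with ThreadPoolExecutor() as pool:
        parts = pool.map(scan, sources)  # preserves input order
    return [row for part in parts for row in part]

result = munion([[1, 2], [3], [4, 5, 6]])
```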
We are extending MonetDB’s internal catalogue to distinguish functions by their full signatures, including module name, function name, function type (e.g., aggregate function or data loader), parameters and their data types, and return values and their types. To speed up function lookup, we use a tree-based index.
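A toy model of signature-based lookup: the catalogue key combines module, name, function kind, and argument types, so overloaded functions resolve unambiguously. A sorted list with binary search stands in for the tree-based index; all names are illustrative:

```python
import bisect

catalog = []  # sorted list of (signature_key, implementation) pairs

def register(module, name, kind, arg_types, impl):
    key = (module, name, kind, tuple(arg_types))
    bisect.insort(catalog, (key, impl))          # keep keys in sorted order

def lookup(module, name, kind, arg_types):
    key = (module, name, kind, tuple(arg_types))
    i = bisect.bisect_left(catalog, (key,))      # binary search on the key
    if i < len(catalog) and catalog[i][0] == key:
        return catalog[i][1]
    raise KeyError(key)

def int_sum(xs):
    return sum(xs)

register("sys", "sum", "aggregate", ["int"], int_sum)
```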
We expect two or three bug fix releases for the current MonetDB version Dec2023, then, after the new feature release, one or two bug fix releases on top of the new version.
Until now, MonetDB has mostly used general-purpose malloc libraries to allocate and free memory when needed. Relying on the underlying operating system for memory management saves some development effort. However, a running MonetDB server can make a large number of such system calls, which can be expensive and prone to memory fragmentation.
Therefore, we have been developing a new memory allocator framework, whose goal is to provide an efficient and unified memory allocation interface for all layers of a MonetDB server, while avoiding memory fragmentation.
This work is under active development and hopefully, we’ll be able to release it in 2025.
MonetDB stores and processes data one column at a time. So far, this scheme has proven to be highly efficient for analytical queries. However, there is still plenty of room for performance improvements.
We’ve been working on the implementation of a second query processing engine for MonetDB for quite some time. We call it the parallel pipeline engine, until we find a better name for it.
An SQL query is compiled into one or more pipelines of relational operators (multiple subpipelines are needed if the query contains blocking operators such as GROUP BY). The pipeline engine (logically) divides a column into multiple blocks and pushes each block through one instance of the operator pipeline. In this way, the pipeline engine can process as many data blocks simultaneously as there are available worker threads.
The pipeline engine is more complex to implement than the column-at-a-time engine, but for some queries and use cases, it has significant benefits, e.g. (much) higher parallelism, more fine-grained control over CPU and memory resource usage, and the ability to stop query processing much earlier in case of an error or if the necessary answer has already been found.
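The block-wise scheme described above can be sketched in a few lines: a column is split into blocks, and each block is pushed through the whole operator pipeline independently, so as many blocks run concurrently as there are worker threads. The operators here are illustrative placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(column, operators, block_size=4, workers=4):
    # Logically divide the column into blocks.
    blocks = [column[i:i + block_size] for i in range(0, len(column), block_size)]

    def push(block):
        for op in operators:   # each block traverses the full pipeline
            block = op(block)
        return block

    with ThreadPoolExecutor(max_workers=workers) as pool:
        out = pool.map(push, blocks)   # blocks processed concurrently, in order
    return [x for block in out for x in block]

# A two-operator pipeline: a selection followed by a projection.
result = run_pipeline(
    list(range(10)),
    [lambda b: [x for x in b if x % 2 == 0], lambda b: [x * 10 for x in b]],
)
```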
The pipeline engine will coexist with the existing column based engine. Therefore, we will also develop new optimisers to determine the best execution strategy for a given query and its data.
This work is under active development and we don’t know yet when it will be ready for release.
For many years, MonetDB has run a testing system that tests different versions of MonetDB on all supported operating systems with different configurations every night. After checking in some changes, MonetDB developers can consult the “testweb” the next day to see if their changes have caused any problems, and fix them accordingly. However, this “testweb” system is limited to functional tests, i.e. it runs the test queries once using small datasets and presents the testing results in a simple text-based interface.
Ideally, we want a more extensive testing system that i) reacts to new check-ins (almost) instantly by recompiling the code and rerunning the basic tests; ii) can conduct stress tests on MonetDB, including very-long-running tests and scalability tests; and iii) offers a graphical interface to compare testing results across dates, configurations, etc. We are working on various tools to realise this wish list and are gradually extending our testing system.
MonetDB already supports many functions to load and query JSON data. However, JSON data is currently stored only as character strings, which are time-consuming to process in large quantities.
Given our experience with MonetDB/XQuery, we have some good ideas about how we can significantly improve MonetDB’s support for JSON data with a native storage scheme, i.e. storing data of different types using the corresponding binary data types. In addition, JSON data has a much simpler structure than XML data, which gives us more optimisation opportunities.
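The core idea of such a native scheme can be sketched as “shredding” JSON documents into typed columns: each key gets its own column holding concrete values instead of raw text, so e.g. numeric predicates no longer need to re-parse strings. This is purely illustrative:

```python
import json

docs = ['{"name": "a", "score": 3}', '{"name": "b", "score": 5}']

# Shred each document into per-key columns with native typed values.
columns = {}
for doc in docs:
    for key, value in json.loads(doc).items():
        columns.setdefault(key, []).append(value)  # ints stay ints, not text
```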
Currently, MonetDB stores numerical data according to their SQL specification. For instance, the size of an SQL INTEGER is 4 bytes, so we use 4 bytes to store INTEGERs. However, this is the maximal size, not the minimal size.
If one has an INTEGER column of, say, ages, one does not need 4 bytes to store each age. Instead, all ages (assuming current human ages) fit comfortably in a TINYINT, i.e. 1 byte.
In real-world use cases, a numerical column often does not use all values from the domain of its data type. Therefore, much storage space can be saved by dynamically choosing the optimal width for the storage of a numerical column based on its actual values. Furthermore, less data also means less processing time.
A first challenge is to determine the optimal width for a given column; in particular, when to expand or shrink a column due to updates and how to deal with outliers.
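A minimal sketch of the width-selection part: pick the narrowest standard width that holds every current value of a signed integer column. How to react to updates and outliers remains the open question:

```python
def minimal_width_bytes(values):
    """Smallest of 1/2/4/8 bytes that fits all signed values in the column."""
    lo, hi = min(values), max(values)
    for width in (1, 2, 4, 8):          # TINYINT, SMALLINT, INTEGER, BIGINT
        limit = 1 << (8 * width - 1)    # signed range is [-limit, limit)
        if -limit <= lo and hi < limit:
            return width
    raise OverflowError("value does not fit in 8 bytes")
```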
A second challenge is to determine an optimal strategy to store multiple columns of a smaller data size in one column of a larger size, e.g. one BIGINT column of 8 bytes can hold 4 SMALLINT or 8 TINYINT columns. Such a storage scheme can be particularly useful to accelerate search queries on embedding vectors.
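Packing narrow values into one wide word can be sketched with plain bit operations; here eight unsigned 1-byte values share a single 8-byte integer, matching the BIGINT example above (signedness handling is left out for brevity):

```python
def pack8(values):
    """Pack 8 unsigned bytes (0..255) into one 64-bit word."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)   # byte i occupies bits 8i..8i+7
    return word

def unpack8(word):
    """Recover the 8 packed bytes from a 64-bit word."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]
```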
We’d like to implement per-column encryption. The interesting challenge here is to have a proper optimiser that automatically determines the optimal encryption and decryption algorithms based on criteria such as data type, data statistics, and the desired level of security versus processing time.
The columnar storage scheme of MonetDB is a natural match for streaming time series data, so it is relatively easy to enrich MonetDB’s functionality to also support time series data. The benefit is twofold.
On the user side, applications would no longer need separate systems for their relational and time series data, which increases efficiency while decreasing costs. On the database side, the presence of two types of data opens up new opportunities to optimise the processing of one type of data using information from the other.
A first step towards supporting time series data in MonetDB is to add time series specific functions, such as missing values interpolation using some dynamic algorithms.
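One such function can be sketched as linear interpolation of missing samples between their nearest known neighbours (this simple version assumes the series starts and ends with known values; real time series support would offer several, more dynamic algorithms):

```python
def interpolate(series):
    """Fill None gaps by linear interpolation between known neighbours."""
    out = list(series)
    for i, v in enumerate(out):
        if v is None:
            # Nearest known value to the left (possibly already interpolated)
            left = next(j for j in range(i - 1, -1, -1) if out[j] is not None)
            # Nearest original value to the right
            right = next(j for j in range(i + 1, len(out)) if out[j] is not None)
            frac = (i - left) / (right - left)
            out[i] = out[left] + frac * (series[right] - out[left])
    return out
```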
Further work includes, for instance, optimising time series data processing using available relational data, and an optimised storage scheme for time series data. The latter topic is related to the “Dynamic numerical columns” above.
We are very happy to see Property Graph Queries being included in SQL:2023, which makes it much easier to support graph queries in relational database systems. Similar to the time series support, extending a relational database with support for graph queries benefits both the users and the database server. However, the main challenge is, of course, how to map graph queries onto the relational data. Further challenges include an efficient graph data storage scheme, graph query optimisation, dynamic plan generation, runtime filtering, etc.