With the announcement of MonetDB/e, the full-fledged embeddable column store for data analytics, it’s now high time to elaborate on its potential. In this short note, we illustrate its superb performance using the NYC taxi example.
What’s the programming interface?
Entering the world of embedded database systems, one cannot escape the pervasive use of SQLlite. With a claimed installed base of more than one billion, there is a lot of code out there and users are used to its programmatic interface. Therefore, we took the Python/SQLite3 interface as a target and implemented it in such a way that monetdbe becomes a simple drop-in module. MonetDB/e is based on the SQL standard and is compliant with the Python DB-API 2.0 specification. We included integration with pandas.py and nump.py as they are the preferred interfaces for non-database objects in data analytics.
How does it work?
Add MonetDB/e to your local setup using, e.g. pip install monetdbe, and there you go. Just import monetdbe in your Python code the way you do with every Python library. The complete specification of the interface is described in the manual. Here we illustrate the functionality by repeating the experiment reported by Uwe Korn in the NYC Taxi benchmark. The code for this experiment can be found in the monetdbe-examples repository. It is based on a single month of data from the New York City Taxi & Limousine Commission Trip Record, i.e. about 1.5GB of raw data and three statistical queries: counting distinct values, statistics over groups, and a regression query. The SQLite3 version was re-run on our benchmarking box, an Intel NIC I7 with 32GB RAM and 1TB NVMe disk, to ground the comparison. The data loading and query execution times are shown in the table below.
The “Python/MonetDB/e 0.8.4” and “Python/SQLite3 3.30.0” numbers illustrate the effect of the binary wheel drop-in after simply replacing all occurrences of ‘sqlite3’ with ‘monetdbe’. The remaining rows compare these numbers against different MonetDB server settings. It shows that the embedded version is equivalent to the full parallel version (default MonetDB setting), which exploits all cores of the machine. The last row illustrates the performance if we force MonetDB to use a single core only.
Switching to MonetDB/e brings new features. For example, it has a much richer collection of data types. In the original benchmark, type DOUBLE was used because SQLite does not support DECIMAL, which would have been sufficient and faster (see the comparison between “MonetDB + DOUBLE” and “MonetDB + DECIMAL”).
What does it all mean?
Such a simple taxi ride provides only a glimpse of the functionality and performance one might experience when joining the ride. Given that MonetDB/e encapsulates the full power of the MonetDB server code, we are assured that the performance and stability are maintained over its release cycles. The MonetDB server is evidently only marginally faster than the embedded Python variant. All the hard work is done by the server codebase. We can claim that, again, a columnar approach to data analytics as realised by MonetDB/e provides a >10x improvement out of the box over the alternative considered here, achieved with minimal code changes.
Why would you replace SQLite?
No one-size-fits-all database management system exists and MonetDB/e is not a silver bullet either. However, replacing SQLite with MonetDB/e should be considered in the following cases:
- You want to process larger amounts of data
- You want to exploit the multi-core hardware to do the job efficiently
- You want the breadth of functionality offered by a full-fledged SQL DBMS
- You want analytical functions, fast grouping, and fast loading
- You want a seamless path from prototype development using the in-memory mode to server-based production deployment
Not immediately convinced? Just try it out on a sample application or get inspired by the showcases in the monetdbe-examples repository.
What are the caveats?
There are a few caveats for SQLite programmers. SQLite and MonetDB/e do not align 100% on the same interpretation of the SQL standard. MonetDB/e is much more strict. This may result in minor differences in error handling or even some surprises in the results returned when it concerns null handling, rounding statistics and aggregate queries. However, addressing those issues in your application will only improve its quality. The MonetDB/e migration documentation contains a few hints and best practices.
With the Python interface of MonetDB/e we have greatly extended the scope of programmatic data analytics that requires data management and placed it at your fingertips. No overhead, easy programming and a pleasure to experience its performance.
Please drop us a note about your experiences and pass on this message to those who can benefit from the performance of the new taxi driver on the block.
The MonetDB team,
July 12, 2020