MonetDBLite for Python

For some time now, we have had MonetDBLite for R, which allows you to run a serverless MonetDB instance embedded in your R process. In the same vein, We are releasing MonetDBLite for Python today. Using MonetDBLite for Python, you can easily use MonetDB’s powerful features from within your Python client without having to run and maintain a separate database server. As both MonetDB and Python run within the same process, data transfer between the two is extremely efficient, as no actual transfer takes place.

Installation

MonetDBLite for Python is available for Linux, Mac OS X and Windows. It requires numpy to be installed. Installation of MonetDBLite is straightforward using pip:

pip install monetdblite

The source code and development version of MonetDBLite can be found on Github.

Usage

MonetDBLite can be used in two ways.

It is compliant with the Python DB API and can be used like any other database connector. Like SQLite, however, no actual connection to a database will be made. Instead, the “connect” creates a database and runs it within your process.
There is a separate minimalistic interface that can be called directly.

Python DB API

To use the Python DB API, a connection must be made to a database. The only required paramater for creating a connection is the location of the database directory. This must be a directory on the local file system that is write-accessible to your Python process. It can either be a non-existing folder (in which case a new database will be created) or an existing database (in which case the old database will be loaded).

Connecting to a Database

# create a new database or connect to an existing database in /tmp/dbfarm
import monetdblite
conn = monetdblite.connect('/tmp/dbfarm')

After creating a database, a cursor can be created and the database can be queried.

# create a new cursor c = conn.cursor()
# query the database
c.execute('SELECT * FROM tables')
# fetch the results and print them
print(c.fetchall())

Note that calling fetchall() converts the result set to a list of tuples, where each tuple represents a single row in the result set. This is the standard behavior of fetchall() as specified in the DB API. However, it is also inefficient when a lot of data is retrieved, as all the data has to be converted to Python objects. Instead, the following more efficient fetch functions are also supported:

# fetch the data as a dictionary of numpy arrays
# keys are column names, values are numpy arrays containing the data
print(c.fetchnumpy())
# fetch the data as a pandas dataframe, requires the pandas package
print(c.fetchdf())

Inserting Data

Inserting data into the database can be done using the standard DB API. In the following example, we create a table called ‘integers’ and insert the values [0..10] into the table.

c.execute('CREATE TABLE integers (i INTEGER)')
c.executemany('INSERT INTO integers VALUES (%s)', [[x] for x in range(10)])
c.execute('SELECT * FROM integers')
print(c.fetchall())
[[0], [1], [2], [3], [4], [5], [6], [7], [8], [9]]

However, like in the previous example, the standard DB API is slow here. If you have a set of large NumPy arrays or a large pandas DataFrame this method of inserting data will take a long time. For this reason, we support two additional functions that allow you to efficiently insert large amounts of data into MonetDB. Tables can be efficiently created from a dictionary of NumPy arrays or a Pandas DataFrame using the create function. Appending to tables can be done using the insert function.

import numpy
# create the table 'integers2' with elements [0..10]
c.create('integers2', {'i': numpy.arange(10)})
c.execute('SELECT * FROM integers2')
print(c.fetchnumpy())

# {'i': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])}
# append to the table using a pandas data frame
import pandas
c.insert('integers2',
pandas.DataFrame.from_dict( {'i': numpy.arange(10) + 10}))
c.execute('SELECT * FROM integers2')
print(c.fetchnumpy())
#{'i': array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])}

The create statement will infer the column names and column types from the data that is passed to it, while the insert statement will attempt to convert the passed data to the types of the existing table. If you want columns in your table to have specific types, it is a good idea to explicitly create it using a ‘CREATE TABLE’ statement, and then appending the data to it using an insert statement.

Transactions

Transactions are supported as well. A transaction can be started using the transaction function. After which the transaction can be either commited using the commit function, or cancelled using the rollback function. For example, the following code will have no permanent effect on the database because the transaction is rolled back.

# start a transaction
c.transaction()
# create a table 'integers3' with content
c.create('integers3', {'i': numpy.arange(10)})
# rollback the transaction
c.rollback()
# table 'integers3' no longer exists

Shutting Down The Database

Because of global variables, only a single instance of MonetDBLite can run at a time. However, the currently running database can be shutdown by calling the close() function on the connector. After shutting down a database in a Python process, a new database can be opened by connecting to it again.

# shut down the database
conn.close()
# connect to a different database directory
conn = monetdblite.connect('/tmp/different-dbfarm')

Simple API In addition to the Python DB API, there is also a simple API that can be used to use MonetDBLite. Note that the Python DB API implementation directly calls this simple API as well, so there should not be a major performance difference between the simple API and the Python DB API unless the Python DB API is used in a way such that Python Objects are constructed from data in NumPy arrays.

It is recommended to use the Python DB API as it is the standard way of communicating with databases in Python. That being said, the simple API is, well, simpler. As such, it might be preferable if you don’t care about porting your code to other databases.

Initialization

MonetDBLite can be initialized using the monetdblite.init() command, pointing at a directory.

import monetdblite
# initialize monetdblite
monetdblite.init('/path/to/database')

Retrieving Data

The database can then be queried using the monetdblite.sql() command. The return value of this command is a dictionary of NumPy masked arrays, where the keys are the column names and the values are the actual values.

monetdblite.sql('SELECT * FROM tables')

##Inserting Data

Data can be inserted using the create() and insert() functions. These are identical to the create() and insert() functions described above.

# create the integers table with a single column (i)
# and insert 100 values into the
column monetdblite.create('integers', {'i': numpy.arange(100)})
# insert 100 values into the table 'integers'
monetdblite.insert('integers', {'i': numpy.arange(100)})
# retrieve the column again
monetdblite.sql('SELECT * FROM integers')

Shutdown

The database can be shutdown using the shutdown() command.

# shutdown the currently running monetdblite instance monetdblite.shutdown()
# initialize monetdblite again with a different database
monetdblite.init('/path/to/different/database')

Multiple Clients

By default, monetdblite uses a single client for each query. A single client can only run a single query at a time within a single transaction. It is possible to separate queries with different clients, allowing you to run multiple queries and transactions in parallel. First, a client must be created with the monetdblite.connectclient() command. The client can then be passed to subsequent queries using the optional client parameter.

# create a new client connection
conn = monetdblite.connectclient()
# use the connection in a
query monetdblite.sql('SELECT * FROM tables', client = conn)
# close the connection del conn

About the Author Mark Raasveldt is a PhD student at the CWI. If you have any questions regarding this blogpost, you can contact him at m.raasveldt@cwi.nl.