
Loading Bulk Data

There are many ways to insert data into an SQL table. Choosing the best method for your application can make the difference between a slow and a pleasantly fast experience. The pros and cons of the various methods supported by MonetDB are explained here.

Simple value insertion

The easiest way to insert data into a table defined over three integer columns is to use the INSERT INTO SQL query:

INSERT INTO table_name VALUES (1, 2, 3);

This method is well suited for low-volume inserts. However, it is a seriously inefficient way of doing things in MonetDB when you intend to load thousands or even millions of tuples at the same time. MonetDB, by default, runs in auto-commit mode, which means that each SQL insert query is a full-fledged database transaction: the query changes the database and therefore causes a write to disk to safeguard the change. This can become a serious bottleneck for hard disk drives. A typical 7,200 rpm SATA drive can do about 75 to 100 I/O operations per second (see Wikipedia). Since each INSERT is at least one I/O operation (depending on, among others, the underlying operating system), this limits throughput severely. Using an SSD instead improves the situation somewhat.

A much better way to do lots of inserts is to make sure the inserts are bundled into a single transaction. This can easily be done by surrounding the inserts with a START TRANSACTION and COMMIT, as in:

START TRANSACTION;
INSERT INTO table VALUES (1, 2, 3);
...
COMMIT;

Another way to accomplish the same result is to clear the auto-commit mode in the server. How this is done depends on how you connect to the server. In ODBC this can be done with a call to the function SQLSetConnectAttr(), in JDBC with a call to the setAutoCommit() method. In the mclient tool, it can be done interactively with the \a command or by using the -a command line option.
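For example, in JDBC disabling auto-commit looks roughly as follows. This is only a minimal sketch: the connection URL, credentials and table name are placeholders, and the MonetDB JDBC driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TransactionalInsert {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; adjust the URL and credentials.
        try (Connection con = DriverManager.getConnection(
                "jdbc:monetdb://localhost:50000/demo", "monetdb", "monetdb");
             Statement st = con.createStatement()) {
            con.setAutoCommit(false);   // bundle all inserts into one transaction
            st.executeUpdate("INSERT INTO table_name VALUES (1, 2, 3)");
            st.executeUpdate("INSERT INTO table_name VALUES (4, 5, 6)");
            // ... many more inserts ...
            con.commit();               // one commit, hence one safeguarding write
        }
    }
}

Disabling auto-commit this way has the same effect as the explicit START TRANSACTION and COMMIT shown above: all the inserts are written to disk as a single transaction.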

Note that the SQL variant

INSERT INTO table VALUES (1, 2, 3), (3, 4, 5), (6, 7, 8);

is also supported by MonetDB. However, this variant is not recommended for large numbers of tuples either. It uses a lot of stack space on the server, and stack space is a very limited commodity. In other words, if you try to insert too many tuples in one go, the query will fail, or worse, the server will crash.

A major cost factor in all insertion statements is deciphering the call structure and turning the literals into the internal format. The overhead of the former can be removed with a slightly faster way of inserting data: a prepared query. In ODBC you would first call:

SQLPrepare(hdl, "INSERT INTO table VALUES (?, ?, ?)", SQL_NTS);

Then you need to call SQLBindParameter() for each of the parameters in the query, and finally call

SQLExecute(hdl);

for each insert.  As before, all insert queries should be done in a single transaction.

In JDBC a similar sequence using a PreparedStatement can be used. Also consider using the JDBC batching functionality.
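A minimal sketch of that JDBC variant, combining a PreparedStatement with batching (the connection details and table name are again placeholders), could look like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchedInsert {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:monetdb://localhost:50000/demo", "monetdb", "monetdb")) {
            con.setAutoCommit(false);              // keep everything in one transaction
            try (PreparedStatement ins = con.prepareStatement(
                    "INSERT INTO table_name VALUES (?, ?, ?)")) {
                for (int i = 0; i < 1_000_000; i++) {
                    ins.setInt(1, i);
                    ins.setInt(2, 2 * i);
                    ins.setInt(3, 3 * i);
                    ins.addBatch();                // collect rows on the client
                    if ((i + 1) % 10_000 == 0) {
                        ins.executeBatch();        // ship a batch of rows to the server
                    }
                }
                ins.executeBatch();                // flush any remaining rows
            }
            con.commit();
        }
    }
}

The statement is parsed and prepared only once, and batching reduces the number of client-server round trips.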

Error handling

Unfortunately, input files may contain errors. Rejected lines are collected in a predefined table that contains a copy of the failed line and its line number in the source file. The rejects table should be explicitly cleared before you load a new file:

SELECT * from sys.rejects;

CALL sys.clearrejects();

Parallel data insertion

The methods discussed so far share another disadvantage: even if the server has multiple CPUs/cores, it will use only one of them at a time for these simple queries. A better way is to use the COPY INTO statement. COPY INTO is a single query that takes a complete file (or stream) of comma-separated values (CSV, or user-defined delimiters) and inserts the data in one go, using the system's cores in parallel. Since COPY INTO is a single query, it does not matter whether auto-commit mode is enabled. What does matter is the intended use of the bulk-loaded data. If you plan to use the data only during a single session or for in-memory processing, use CREATE GLOBAL TEMPORARY TABLE ... ON COMMIT PRESERVE ROWS; this avoids the I/O needed to safely store the data on disk for later use. In most cases, though, a regular CREATE TABLE, despite being slower on the first load, gives great benefits when you continue querying the data in the future.
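As an illustration of the temporary-table route, a hedged JDBC sketch (the table name, column layout and file path are hypothetical; global temporary tables live in MonetDB's tmp schema) might look like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SessionOnlyLoad {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:monetdb://localhost:50000/demo", "monetdb", "monetdb");
             Statement st = con.createStatement()) {
            // Rows are kept for the duration of this session only.
            st.execute("CREATE GLOBAL TEMPORARY TABLE tmp.session_data"
                     + " (a INT, b INT, c INT) ON COMMIT PRESERVE ROWS");
            // The path is a server-side path; see the COPY INTO forms below.
            st.execute("COPY INTO tmp.session_data FROM '/path/to/data.csv'");
        }
    }
}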

COPY INTO comes in many variants.  The easiest form is:

COPY INTO table FROM 'file';

The 'file' must be an SQL string that contains the absolute path name of the file to be inserted. The file is read directly by the server, so it has to be accessible via the server's file system. By default, the field delimiter is '|', i.e. a vertical bar, and the record delimiter is '\n', i.e. a newline. The delimiters can be changed:

COPY INTO table FROM 'file' USING DELIMITERS '|','\n';

Often CSV-encoded files use quote characters to surround field values. By default, no quote characters are recognized, but by using the following form, they will be recognized:

COPY INTO table FROM 'file' USING DELIMITERS '|','\n','"';

When using quote characters, the other delimiters can be part of the value, and the quote character itself can be part of the value if it is preceded by a \ (backslash) or if it is doubled.  The backslash itself must also be doubled.

Often NULL values need to be inserted as well. By default, a NULL value is indicated by using NULL in the CSV file, but often the actual value used is different, e.g. 'no value' or simply an empty string. This can be indicated by using:

COPY INTO table FROM 'file' USING DELIMITERS '|','\n','"' NULL AS '';

When the data is not directly accessible to the server but is accessible to the client, the 'file' in the above queries can be replaced by STDIN:

COPY INTO table FROM STDIN;

When using STDIN, the contents of the CSV file should follow the query.  This form is not directly usable in ODBC or JDBC, but see SQLcopyinto.java for an example of how to use this from Java.  If using the mclient tool and the CSV data is in a file by itself, you can use the command (also see the mclient manual):

mclient -d database -s "COPY INTO table FROM STDIN USING DELIMITERS ',','\\n','\"'" - < file

The COPY INTO forms mentioned so far share one important drawback.  When a lot of data is to be inserted, the server doesn't know how much memory to allocate for the tables, and so will likely allocate too little.  This means that during the insertion process, the server has to grow the allocated memory area.  This can be an expensive operation.  Therefore, it is better to give the server a count of how many records are to be inserted:

COPY n RECORDS INTO table FROM 'file';

Here n should be a number that is at least as large as the actual number of records to be inserted. If the file contains more than n records, only n will be inserted; if it contains fewer, all values will be inserted. Giving a higher number is especially useful if multiple COPY INTO queries are to be done on the same table. The first COPY INTO, when the table is still empty, should specify the total number of records that will eventually be inserted, so that the server allocates enough memory when the tables are first created (they are only really created once data is inserted). This form is the fastest way of inserting data into MonetDB.

One more elaboration of the COPY INTO form is to specify an offset into the CSV file.  Often CSV files contain header information that is not part of the actual data.  This header information can be skipped by using the form

COPY n OFFSET m RECORDS INTO table FROM 'file';

Here n is the number of records to insert and m is the position of the first record to be inserted, where the first record in the file has position 1 (i.e. counting is 1-based).

For completeness, if the number of records is unknown but there is a header that needs to be skipped, the form is

COPY OFFSET m INTO table FROM 'file';