There are many ways in which you can insert data into an SQL table in MonetDB. Here we will explain the pros and cons of various methods.
The easiest way to insert data is to use the INSERT INTO SQL statement, for example:
INSERT INTO table VALUES (1, 2, 3);
The benefit is clear: this is very straightforward. However, this is a seriously inefficient way of doing things in MonetDB.
MonetDB, by default, runs in auto-commit mode. This means that each query runs as a separate database transaction, which in turn means that each query that changes the database causes a write to the disk in order to safeguard the change. This is a serious bottleneck. As you can see on Wikipedia, a typical 7,200 rpm SATA drive can do about 75 to 100 I/O operations per second. Since each INSERT is at least one I/O operation (depending on, among other things, the underlying operating system), this limits throughput on such a drive to at most about 100 inserts per second.
A much better way to do lots of inserts is to make sure the inserts are bundled into a single transaction. This can easily be done by surrounding the inserts with a START TRANSACTION and COMMIT, as in:
START TRANSACTION;
INSERT INTO table VALUES (1, 2, 3);
INSERT INTO table VALUES (4, 5, 6);
COMMIT;
Another way of accomplishing the same result is to disable auto-commit mode on the connection. How this is done depends on how you connect to the server. In ODBC this can be done with a call to the function SQLSetConnectAttr(), in JDBC with a call to the setAutoCommit() method. In the mclient tool, it can be done when using it interactively with the \a command, and from the Apr2012 release onward, with the -a command line option.
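To illustrate, here is a minimal JDBC sketch of this approach; it assumes the MonetDB JDBC driver is on the classpath, and the URL, credentials, and table t are placeholders for the example:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BulkInsert {
    public static void main(String[] args) throws Exception {
        // URL, credentials, and table name are assumptions for this sketch.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:monetdb://localhost:50000/database", "monetdb", "monetdb")) {
            conn.setAutoCommit(false);  // bundle all inserts into one transaction
            try (Statement st = conn.createStatement()) {
                for (int i = 0; i < 10000; i++) {
                    st.executeUpdate("INSERT INTO t VALUES (" + i + ", " + i + ", " + i + ")");
                }
            }
            conn.commit();  // a single commit safeguards all rows at once
        }
    }
}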
Note that the variant
INSERT INTO table VALUES (1, 2, 3), (3, 4, 5), (6, 7, 8);
is also supported by MonetDB. However, this variant is not recommended for large numbers of tuples, since it uses a lot of stack space on the server, and stack space is a very limited commodity. In other words, if you try to insert too many tuples in one go, the query will fail, or worse, the server will crash.
A slightly faster way of inserting data is to use a prepared query. In ODBC you would call:
SQLPrepare(hdl, "INSERT INTO table VALUES (?, ?, ?)", SQL_NTS);
You will also need to call SQLBindParameter() for each of the parameters in the query, and then call SQLExecute() for each insert. As before, all inserts should be done in a single transaction.
In JDBC a similar sequence using a PreparedStatement can be used. Also consider using the JDBC batching functionality.
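As a sketch, that JDBC sequence could look as follows, assuming an already-open Connection (as above) and a table t with three integer columns; the batch size of 10,000 is an arbitrary choice for the example:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchInsert {
    // Table t and the row/batch counts are assumptions for this example.
    static void insertRows(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement ps =
                conn.prepareStatement("INSERT INTO t VALUES (?, ?, ?)")) {
            for (int i = 0; i < 100000; i++) {
                ps.setInt(1, i);
                ps.setInt(2, 2 * i);
                ps.setInt(3, 3 * i);
                ps.addBatch();              // queue the row on the client
                if ((i + 1) % 10000 == 0) {
                    ps.executeBatch();      // ship a chunk of rows to the server
                }
            }
            ps.executeBatch();              // flush any remaining rows
        }
        conn.commit();
    }
}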
The methods discussed so far share another disadvantage: even if the server has multiple CPUs/cores, only one of them will be used at a time, because each insert is a separate query. There is a way to use the CPU power of the system more fully: COPY INTO. COPY INTO is a single query that takes a complete file of comma-separated values (CSV) and inserts the data in one go. The values don't actually have to be comma-separated; any delimiter will do. Since COPY INTO is a single query, it doesn't matter whether auto-commit mode is enabled or not.
COPY INTO comes in many variants. The easiest form is:
COPY INTO table FROM 'file';
The 'file' should be an SQL string that contains the absolute path name of the file to be inserted. The file is read directly by the server, and so has to be accessible to the server. By default, the field delimiter is '|', i.e. a vertical bar, and the record delimiter is '\n', i.e. a newline. The delimiters can be changed:
COPY INTO table FROM 'file' USING DELIMITERS '|','\n';
Often CSV-encoded files use quote characters to surround field values. By default, no quote characters are recognized, but by using the following form, they will be recognized:
COPY INTO table FROM 'file' USING DELIMITERS '|','\n','"';
When using quote characters, the other delimiters can be part of the value, and the quote character itself can be part of the value if it is preceded by a \ (backslash) or if it is doubled. The backslash itself must also be doubled.
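For example, with the double quote as the quote character, the field "say ""hi""" (or equivalently "say \"hi\"") is read as say "hi", and "a\\b" is read as a\b.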
Often NULL values need to be inserted as well. By default, a NULL value is indicated by the string NULL in the CSV file, but often the actual representation is different, e.g. an empty field. This can be indicated by using:
COPY INTO table FROM 'file' USING DELIMITERS '|','\n','"' NULL AS '';
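To illustrate the combined options, a made-up input file for a three-column table could contain:
1|"a|b"|2
3||4
The middle field of the first record is the string a|b (the quotes protect the embedded field delimiters), while the middle field of the second record is empty and is therefore loaded as NULL.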
When the data is not directly accessible to the server but is accessible to the client, the 'file' in the above queries can be replaced by STDIN:
COPY INTO table FROM STDIN;
When using STDIN, the contents of the CSV file should follow the query. This form is not directly usable in ODBC or JDBC, but see SQLcopyinto.java for an example of how to use this from Java. If using the mclient tool and the CSV data is in a file by itself, you can use the command (also see the mclient manual):
mclient -d database -s "COPY INTO table FROM STDIN USING DELIMITERS ',','\\n','\"'" - < file
The COPY INTO forms mentioned so far share one important drawback. When a lot of data is to be inserted, the server doesn't know how much memory to allocate for the tables, and so will likely allocate too little. This means that during the insertion process, the server has to grow the allocated memory area. This can be an expensive operation. Therefore, it is better to give the server a count of how many records are to be inserted:
COPY n RECORDS INTO table FROM 'file';
Here n should be a number that is at least as large as the actual number of records to be inserted. If the file contains more than n records, only n will be inserted; if the file contains fewer, all of them will be inserted. Giving a higher number is especially useful if multiple COPY INTO queries are to be done on the same table. The first COPY INTO, when the table is still empty, should state the total count of to-be-inserted records, so that the server allocates enough memory when the tables are first created (they are only really created once data is inserted).
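For example, to load two files of at most one million records each into an empty table (the file names and counts here are made up), the first query can announce the grand total:
COPY 2000000 RECORDS INTO table FROM 'file1';
COPY 1000000 RECORDS INTO table FROM 'file2';
The first query inserts only the records actually present in file1, but the count of 2000000 lets the server allocate room for both files at once.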
This form is the fastest way of inserting data into MonetDB.
One more elaboration of the COPY INTO form is to specify an offset into the CSV file. Often CSV files contain header information that is not part of the actual data. This header information can be skipped by using the form
COPY n OFFSET m RECORDS INTO table FROM 'file';
Here n is the number of records to insert and m is the first record to be inserted, where the first record in the file is number 1 (i.e. counting is 1-based).
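For example, to skip a single header line in a file that holds at most one million data records (the count is an assumption), use:
COPY 1000000 OFFSET 2 RECORDS INTO table FROM 'file';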