Loading Bulk Data

There are many ways in which you can insert data into an SQL table. Using the best method for your application makes the difference between a slow and a pleasantly fast experience. The pros and cons of the various methods supported by MonetDB are explained here.

Simple value insertion

The easiest way to insert data into a table defined over three integer columns is to use the INSERT INTO SQL query:

INSERT INTO table_name VALUES (1, 2, 3);

This method is well suited for low-volume inserts. However, it is a seriously inefficient way of doing things in MonetDB when you intend to load thousands or even millions of tuples at the same time. MonetDB, by default, runs in auto-commit mode, which means that each SQL insert query is a full-fledged database transaction: every single insert changes the database and therefore forces a write to disk to safeguard the change. This can become a serious bottleneck for hard disk drives. A typical 7,200 rpm SATA drive can do about 75 to 100 I/O operations per second (see Wikipedia). Since each INSERT is at least one I/O operation (depending on, among others, the underlying operating system), this limits throughput severely, to roughly a hundred single-row transactions per second. Using an SSD instead improves the situation somewhat.

A much better way to do lots of inserts is to make sure the inserts are bundled into a single transaction. This can easily be done by surrounding the inserts with a START TRANSACTION and COMMIT, as in:

START TRANSACTION;
INSERT INTO table VALUES (1, 2, 3);
...
COMMIT;

Another way to accomplish the same result is to clear the auto-commit mode in the server. How this is done depends on how you connect to the server. In ODBC this can be done with a call to the function SQLSetConnectAttr(), in JDBC with a call to the setAutoCommit() method. In the mclient tool, it can be done interactively with the \a command, or at startup by using the -a command line option.

Note that the SQL variant

INSERT INTO table VALUES (1, 2, 3), (3, 4, 5), (6, 7, 8);

is also supported by MonetDB. However, this variant is not to be recommended for large numbers of tuples either. It uses a lot of stack space on the server, and stack space is a very limited commodity. In other words, if you try to insert too many tuples in one go, the query will fail, or worse, the server will crash.

A major cost factor in all insertion statements is deciphering the call structure and turning the literals into the internal format. The overhead of the former can be removed with a slightly faster way of inserting data using a prepared query. In ODBC you would first call:

SQLPrepare(hdl, "INSERT INTO table VALUES (?, ?, ?)", SQL_NTS);

Then you  need to call SQLBindParameter() for each of the parameters in the query, and finally call

SQLExecute(hdl);

for each insert.  As before, all insert queries should be done in a single transaction.

In JDBC a similar sequence using a PreparedStatement can be used. Also consider using the JDBC batching functionality.
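
Independently of the client API, MonetDB also supports server-side prepared statements at the SQL level through PREPARE and EXEC, which can be used from mclient to the same effect. The following is a minimal sketch, assuming a table t with three integer columns; the numeric statement id is reported by the server in response to the PREPARE and may differ per session:

PREPARE INSERT INTO t VALUES (?, ?, ?);
-- suppose the server reports statement id 0 for this prepared statement
EXEC 0(1, 2, 3);
EXEC 0(4, 5, 6);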

Error handling

Unfortunately, input files may contain errors. Rejected lines are collected in a predefined table, sys.rejects, which holds a copy of the failed line and its line number in the source file. The rejects table should be explicitly cleared before you load a new file.

SELECT * FROM sys.rejects;

CALL sys.clearrejects();

Parallel data insertion

The methods discussed so far share another disadvantage: even if the server has multiple CPUs/cores, it will only use one at a time for these simple insert queries. A better way is to use the COPY INTO statement. COPY INTO is a single query that takes a complete file (or stream) of comma-separated values (CSV or user-defined delimiters) and inserts the data in one go, using the system's cores in parallel. Since COPY INTO is a single query, whether or not auto-commit mode is enabled doesn't matter. What matters is the intended use of the bulk-loaded data. If you plan to use the data only during a single session or for in-memory processing, load it into a table created with CREATE GLOBAL TEMPORARY TABLE ... ON COMMIT PRESERVE ROWS; this avoids the I/O needed to securely store the data on disk for later use. In most cases, though, a regular CREATE TABLE, despite being slower upon first load, gives great benefits when you continue querying the data in the future.
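
For instance, a session-local staging table could be set up along the following lines. This is a sketch with made-up table and file names; note that MonetDB keeps temporary tables in the tmp schema, so the COPY INTO refers to tmp.stage:

CREATE GLOBAL TEMPORARY TABLE stage (i INT, s STRING, d DECIMAL(5,2))
    ON COMMIT PRESERVE ROWS;
COPY INTO tmp.stage FROM '/tmp/stage.csv';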

COPY INTO comes in many variants.  The easiest form is:

COPY INTO table FROM 'file';

The 'file' must be an SQL string that contains the absolute path name of the file to be inserted. The file is read directly by the server, so it has to be accessible via the server's file system. By default, the field delimiter is '|', i.e. a vertical bar, and the record delimiter is '\n', i.e. a newline. The delimiters can be changed:

COPY INTO table from 'file' USING DELIMITERS '|','\n';

Often CSV-encoded files use quote characters to surround field values. By default, no quote characters are recognized, but by using the following form, they will be recognized:

COPY INTO table from 'file' USING DELIMITERS '|','\n','"';

When using quote characters, the other delimiters can be part of the value, and the quote character itself can be part of the value if it is preceded by a \ (backslash) or if it is doubled.  The backslash itself must also be doubled.
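
As an illustration, with DELIMITERS '|','\n','"' the following input lines are all valid for a hypothetical table t with an integer and a string column; the first embeds the field separator in a quoted value, the second doubles the quote character, and the third escapes it with a backslash:

1|"value with | inside"
2|"a ""doubled"" quote"
3|"an \"escaped\" quote"

The stored string values are: value with | inside, a "doubled" quote, and an "escaped" quote.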

Often NULL values need to be inserted as well.  By default, a NULL value is indicated by using NULL in the CSV file, but often the actual value used is different, e.g. 'no value'.  This can be indicated by using:

COPY INTO table from 'file' USING DELIMITERS '|','\n','"' NULL AS '';

When the data is not directly accessible to the server but is accessible to the client, the 'file' in the above queries can be replaced by STDIN:

COPY INTO table FROM STDIN;

When using STDIN, the contents of the CSV file should follow the query.  This form is not directly usable in ODBC or JDBC, but see SQLcopyinto.java for an example of how to use this from Java.  If using the mclient tool and the CSV data is in a file by itself, you can use the command (also see the mclient manual):

mclient -d database -s "COPY INTO table FROM STDIN USING DELIMITERS ',','\\n','\"'" - < file

The COPY INTO forms mentioned so far share one important drawback.  When a lot of data is to be inserted, the server doesn't know how much memory to allocate for the tables, and so will likely allocate too little.  This means that during the insertion process, the server has to grow the allocated memory area.  This can be an expensive operation.  Therefore, it is better to give the server a count of how many records are to be inserted:

COPY n RECORDS INTO table FROM 'file';

Here n should be a number that is at least as large as the actual number of records to be inserted. If the file contains more than n records, only n will be inserted; if the file contains fewer, all of them will be inserted. Giving a higher number is especially useful if multiple COPY INTO queries are to be done on the same table. The first COPY INTO, when the table is still empty, should specify the total count of to-be-inserted values so that the server will allocate enough memory when the tables are first created (they are only really created once data is inserted). This form is the fastest way of inserting data into MonetDB.
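
For example, if two files of 500,000 records each are to be loaded into an empty table (file names made up for illustration), announce the total on the first load:

COPY 1000000 RECORDS INTO table FROM 'file1';
COPY 500000 RECORDS INTO table FROM 'file2';

The first statement inserts only the 500,000 records actually present in 'file1', but lets the server allocate space for the full million up front.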

One more elaboration of the COPY INTO form is to specify an offset into the CSV file.  Often CSV files contain header information that is not part of the actual data.  This header information can be skipped by using the form

COPY n OFFSET m RECORDS INTO table FROM 'file';

Here n is the number of records to insert and m is the position of the first record to be inserted, where the first record in the file has position 1 (i.e. counting is 1-based).

For completeness, if the number of records is unknown but there is a header that needs to be skipped, the form is

COPY OFFSET m INTO table FROM 'file';

CSV Bulk Loads

The COPY INTO command enables fast insertion of multiple tuples from a text file or from standard input. Each tuple in the input file is terminated by a record separator (default '\n') and fields are separated by the field separator (default '|'). The field values should adhere to the syntax for value literals. Alternative separators should adhere to the SQL lexical syntax for string values. A different NULL value representation can be specified using the NULL AS null_string option. Furthermore, fields can optionally be enclosed with a user-defined quote character. The text file should use UTF-8 encoding when specified as file_name; when data are read FROM STDIN, the encoding used by mclient applies.

The input files must be accessible by the database server. Therefore, they must reside on or be accessible to the machine on which the database server is running. Also, the input files must be identified with an absolute path name. The STDIN file designator reads data streaming from the client application. The end of file (EOF) character signifies the end of the sequence.

The full syntax for the COPY INTO command is the following:

COPY [ int_val [ OFFSET int_val ] RECORDS ] INTO table_name
          [ '(' column_name ',' ... ')' ] FROM
          ( file_name ',' ... | STDIN ) [ '(' header [ STRING ] ',' ... ')' ]
          [ [USING] DELIMITERS  field_separator
            [',' record_separator [ ',' string_quote ]]]
          [ NULL [AS] null_string ] [ LOCKED ] [ BEST EFFORT ] [ NO CONSTRAINT ]

Note that the various features must be specified in the order shown above. For instance the query:

sql>COPY INTO tbl FROM STDIN BEST EFFORT NULL AS '';
syntax error, unexpected sql NULL, expecting SCOLON in: "copy into tbl from stdin best effort null"

is not accepted by the SQL parser because the BEST EFFORT feature is specified before the NULL AS feature.

COPY INTO from different input sources

COPY INTO is able to read data from different input sources, including standard input, a single CSV file, multiple CSV files, and compressed CSV files. For each example, we create the following table to start with:

sql>CREATE TABLE tbl (i INT, s STRING, d DECIMAL(5,2));
operation successful (3.719ms)

From standard input

The most straightforward way is to read data from the standard input STDIN:

sql>COPY INTO tbl FROM STDIN;
more>1|abc|0.1
more>2|def|0.2
more>3|ghi|0.3
more>^D
3 affected rows (116.271ms)

sql>SELECT * FROM tbl;
+------+------+---------+
| i    | s    | d       |
+======+======+=========+
|    1 | abc  |    0.10 |
|    2 | def  |    0.20 |
|    3 | ghi  |    0.30 |
+------+------+---------+
3 tuples (3.633ms)

From a CSV file

Assume data are stored in a file named /tmp/example1.csv:

$ cat /tmp/example1.csv
1|dfs|3.1
2|ewe|3.23
3|erw|2.43

We can load the data using the following COPY INTO query:

sql>COPY INTO tbl FROM '/tmp/example1.csv';
3 affected rows (47.725ms)
sql>SELECT * FROM tbl;
+------+------+---------+
| i    | s    | d       |
+======+======+=========+
|    1 | dfs  |    3.10 |
|    2 | ewe  |    3.23 |
|    3 | erw  |    2.43 |
+------+------+---------+
3 tuples (3.287ms)

From multiple files

If the data are distributed across multiple files and have the same schema, one can load them all at once. Assume we have the following example files /tmp/example2.1.csv and /tmp/example2.2.csv:

$ cat /tmp/example2.1.csv
1|dfs|3.1
2|ewe|3.23
3|erw|2.43
$ cat /tmp/example2.2.csv
4|dfs|3.2
5|ewe|3.3
6|erw|2.3

we can use them in the COPY INTO command as follows:

sql>COPY INTO tbl FROM '/tmp/example2.1.csv', '/tmp/example2.2.csv';
6 affected rows (152.470ms)
sql>SELECT * FROM tbl;
+------+------+---------+
| i    | s    | d       |
+======+======+=========+
|    1 | dfs  |    3.10 |
|    2 | ewe  |    3.23 |
|    3 | erw  |    2.43 |
|    4 | dfs  |    3.20 |
|    5 | ewe  |    3.30 |
|    6 | erw  |    2.30 |
+------+------+---------+
6 tuples (2.362ms)

From compressed files

One can pass compressed files to COPY INTO to be loaded. Supported compressions include bzip2, gzip, lzma and snappy.

First, let us compress our example CSV file:

$ bzip2 -k /tmp/example1.csv
$ gzip -k /tmp/example1.csv
$ xz  -k /tmp/example1.csv
$ ls /tmp/example1.csv.*
/tmp/example1.csv.bz2    /tmp/example1.csv.gz /tmp/example1.csv.xz

Now, we can load all three compressed files in one COPY INTO statement:

sql>COPY INTO tbl FROM '/tmp/example1.csv.bz2', '/tmp/example1.csv.gz', '/tmp/example1.csv.xz';
9 affected rows (93.635ms)
sql>SELECT * FROM tbl;
+------+------+---------+
| i    | s    | d       |
+======+======+=========+
|    1 | dfs  |    3.10 |
|    2 | ewe  |    3.23 |
|    3 | erw  |    2.43 |
|    1 | dfs  |    3.10 |
|    2 | ewe  |    3.23 |
|    3 | erw  |    2.43 |
|    1 | dfs  |    3.10 |
|    2 | ewe  |    3.23 |
|    3 | erw  |    2.43 |
+------+------+---------+
9 tuples (1.903ms)

Specifying the number of records

It is strongly recommended to specify the maximum number of records to be inserted. This allows the database server to allocate enough space for the table in advance, so that it can avoid guessing and subsequent extending of table space, which may involve potentially expensive copying. If the exact number is unknown, a (slight) overestimation is generally better for performance than an underestimation.

If the number of actual records in the input source is smaller than the number of records specified, COPY INTO will simply stop at the end of input.

If the number of actual records in the input source is larger than the number of records specified, COPY INTO will only load the number of specified records.

The example below shows how to specify the number of records (the earlier example CSV file is used). First we give an overestimation, then we give an underestimation:

sql>COPY 4 RECORDS INTO tbl FROM '/tmp/example1.csv';
3 affected rows (46.254ms)
sql>SELECT * FROM tbl;
+------+------+---------+
| i    | s    | d       |
+======+======+=========+
|    1 | dfs  |    3.10 |
|    2 | ewe  |    3.23 |
|    3 | erw  |    2.43 |
+------+------+---------+
3 tuples (3.408ms)
sql>DELETE FROM tbl;
3 affected rows (2.488ms)
sql>COPY 2 RECORDS INTO tbl FROM '/tmp/example1.csv';
2 affected rows (44.747ms)
sql>SELECT * FROM tbl;
+------+------+---------+
| i    | s    | d       |
+======+======+=========+
|    1 | dfs  |    3.10 |
|    2 | ewe  |    3.23 |
+------+------+---------+
2 tuples (4.205ms)

Specifying an offset

A portion of the input file can be skipped using the OFFSET feature. Specifying an offset n directs COPY INTO to start reading at record n. The offset of the first record is 1. More formally, the offset directive counts n unquoted record separators. Here we use a file named /tmp/example2.csv:

$ cat /tmp/example2.csv
i|s|d
1|dfs|3.1
2|ewe|3.23
3|erw|2.43

Since the first line of the file does not have the same schema as the table, it must be ignored. So, we specify that the COPY INTO should start at the second record (i.e. 'OFFSET 2'):

sql>COPY 3 OFFSET 2 RECORDS INTO tbl FROM '/tmp/example2.csv';
3 affected rows (46.947ms)
sql>SELECT * from tbl;
+------+------+---------+
| i    | s    | d       |
+======+======+=========+
|    1 | dfs  |    3.10 |
|    2 | ewe  |    3.23 |
|    3 | erw  |    2.43 |
+------+------+---------+
3 tuples (3.530ms)

Specifying columns to load and their order

By default, a CSV file must contain data for all columns specified in a table, and in the same order as when the table was created. Additional syntax is supported to load data from CSV files that do not satisfy these requirements. Here we show with several examples how this feature works. 

Scenario 1: skip column in the input file

Assume we have the following input file:

$ cat /tmp/cars.csv
Year|Make|Model
1997|Ford|E350
2000|Mercury|Cougar

However, when loading this file, we want to skip the second column containing the "Make" information:

sql>CREATE TABLE cars (y STRING, make STRING, model STRING);
operation successful (4.148ms)
sql>COPY 2 OFFSET 2 RECORDS INTO cars(y, model) FROM '/tmp/cars.csv';
2 affected rows (52.367ms)
sql>SELECT * FROM cars;
+------+------+--------+
| y    | make | model  |
+======+======+========+
| 1997 | null | E350   |
| 2000 | null | Cougar | 
+------+------+--------+
2 tuples (2.141ms)

Scenario 2: missing column and out of order

In the following CSV file, the data for the column "Model" is missing, while the columns "Year" and "Make" appear in the reverse order of how they were specified when the table "cars" was created:

$ cat /tmp/cars2.csv
Make|Year
Ford|1997
Mercury|2000

The following example shows how we can tell COPY INTO which columns are contained in the input CSV file and in which order:

sql>COPY 2 OFFSET 2 RECORDS INTO cars(y,make) from '/tmp/cars2.csv'(make,y);
2 affected rows (33.402ms)
sql>select * from cars;
+------+---------+-------+
| y    | make    | model |
+======+=========+=======+
| 1997 | Ford    | null  |
| 2000 | Mercury | null  | 
+------+---------+-------+
2 tuples (1.968ms)

Customise delimiters

The COPY INTO command is versatile enough to accept records in many formats. The basic idea is that each field of a record is separated from the other fields by a specified character string, and each record is separated from other records by another character string. Moreover, since string values can contain either of those character strings, they may need to be quoted, so as to be read correctly.

The user can specify the format of the file with

[[USING ] DELIMITERS field_separator [, record_separator [, string_quote ]]]

syntax. The input should then comply with the following grammar:

[ [ [quote] [[escape]char]* [quote] ] field_separator ]* record_separator

Quote characters in quoted fields may be escaped with a backslash. Field and record separators can be embedded in quoted fields.

By default, fields are separated by the pipe character ('|'), records by the new line ('\n') character.

The queries below show how to: i) change the default field separator to a single character; ii) change the default field separator to a two-character string; and iii) additionally change the default string quote:

sql>COPY INTO tbl FROM STDIN DELIMITERS ',';
more>1,abc,0.1
more>2,def,0.2
more>2 affected rows (41.766ms)

sql>COPY INTO tbl FROM STDIN DELIMITERS ';;';
more>3;;ghi;;0.03
more>4;;jkl;;0.04
more>2 affected rows (39.128ms)

sql>COPY INTO tbl FROM STDIN DELIMITERS ';;','\n','|';
more>5;;|klm|;;0.05
more>6;;|opq|;;0.06
more>2 affected rows (37.899ms)

sql>SELECT * FROM tbl;
+------+------+---------+
| i    | s    | d       |
+======+======+=========+
|    1 | abc  |    0.10 |
|    2 | def  |    0.20 |
|    3 | ghi  |    0.03 |
|    4 | jkl  |    0.04 |
|    5 | klm  |    0.05 |
|    6 | opq  |    0.06 |
+------+------+---------+
6 tuples (3.281ms)

Customise representation of the NULL value

By default, the character sequence "NULL" (case insensitive) represents the NULL value:

sql>CREATE TABLE str (s STRING);
operation successful (2.712ms)
sql>COPY INTO str FROM STDIN;
more>NULL
more>null
more>Null
more>nUll
more>4 affected rows (50.040ms)

sql>SELECT * FROM str;
+------+
| s    |
+======+
| null |
| null |
| null |
| null |
+------+
4 tuples (4.095ms)
sql>SELECT * FROM str WHERE s = 'null';
+---+
| s |
+===+
+---+
0 tuples (1.817ms)
sql>SELECT * FROM str WHERE s IS NULL;
+------+
| s    |
+======+
| null |
| null |
| null |
| null |
+------+
4 tuples (2.491ms)

The NULL AS feature allows a user to specify an alternative string as the representation of the NULL value. Whenever the alternative string is read during the loading process, the NULL value is inserted into the table. Any valid string can be specified as the representation of the NULL value.

In the example below, the first COPY INTO treats an empty string ('') as the NULL value, while the second COPY INTO uses a long text to represent the NULL value, which can typically be used to prevent any text in the input from being treated as the NULL value. The keyword AS is optional.

sql>CREATE TABLE nullas (i INT, s STRING);
operation successful (9.010ms)
sql>COPY INTO nullas FROM STDIN NULL AS '';
more>1|
more>2|null
more>3|"NuLL"
more>^D
3 affected rows (40.528ms)
sql>COPY INTO nullas FROM STDIN NULL 'this-string-does-not-exist-in-the-input';
more>4|
more>5|null
more>6|NULL
more>^D
3 affected rows (43.968ms)

sql>SELECT * FROM nullas;
+------+--------+
| i    | s      |
+======+========+
|    1 | null   |
|    2 | null   |
|    3 | "NuLL" |
|    4 |        |
|    5 | null   |
|    6 | NULL   |
+------+--------+
6 tuples (1.999ms)
sql>SELECT * FROM nullas WHERE s IS NULL;
+------+------+
| i    | s    |
+======+======+
|    1 | null |
+------+------+
1 tuple (4.391ms)
sql>SELECT * FROM nullas WHERE s = 'null';
+------+------+
| i    | s    |
+======+======+
|    2 | null |
|    5 | null |
+------+------+
2 tuples (0.969ms)
sql>SELECT * FROM nullas WHERE s = '"NuLL"';
+------+--------+
| i    | s      |
+======+========+
|    3 | "NuLL" |
+------+--------+
1 tuple (3.484ms)
sql>SELECT * FROM nullas WHERE s = 'NULL';
+------+------+
| i    | s    |
+======+======+
|    6 | NULL |
+------+------+
1 tuple (3.484ms)

Ignore errors in the input with the BEST EFFORT mode

Normally, if there is an error in the input, such as a missing field, COPY INTO will fail. Using the BEST EFFORT mode, COPY INTO continues parsing the input and writes all the records that failed to the sys.rejects table.

sql>CREATE TABLE cars (y STRING, make STRING, model STRING);
operation successful (0.734ms)
sql>COPY INTO cars FROM STDIN DELIMITERS ',','\n' BEST EFFORT;
more>2000,
more>2004,Honda,Pilot
more>2001,Honda,Civic
more>,,,
more>3 affected rows (28.604ms)
sql>SELECT * FROM cars;
+------+-------+-------+
| y    | make  | model |
+======+=======+=======+
| 2004 | Honda | Pilot |
| 2001 | Honda | Civic |
|      |       |       |
+------+-------+-------+
3 tuples (2.233ms)

The above example shows that only the valid records are retained. In this scenario, the first record is omitted.

sql>SELECT * FROM rejects;
+-------+-------+------------------------+--------+
| rowid | fldid | message                | input  |
+=======+=======+========================+========+
|     0 |     1 | Column value 2 missing | 2000,  |
+-------+-------+------------------------+--------+

Tips to boost performance

Here are several tips to possibly further improve the performance of bulk data loading:

  • Add constraints after data loading: if you need to load multiple chunks of data, it is advisable to add the integrity constraints to the table only after all files of the batch have been loaded; see the sketch after this list. The ALTER statements perform bulk integrity checking, which is often more efficient than checking after each small amount of data is loaded. When adding more data to an existing table, you can temporarily remove the existing constraints on the table, load the new data and finally re-add the constraints.
  • LOCKED mode for a single user: in many bulk-loading situations, the original data file can be kept as a backup or recreated for disaster handling. This relieves the database system from having to prepare for recovery and saves significant storage space. The LOCKED qualifier can be used in this situation to skip the logging operation that is done by default. (WARNING: this feature must be used with great care. When using LOCKED mode, it is the responsibility of the database user to make sure that there is currently only a single client connection accessing the database. The database server will not check this.)
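
A minimal sketch of the first tip, with made-up table, file, column and constraint names: load all chunks into the still unconstrained table (announcing the total record count on the first COPY, as recommended earlier), then add the key in one bulk check:

COPY 2000000 RECORDS INTO t FROM '/path/part1.csv';
COPY 1000000 RECORDS INTO t FROM '/path/part2.csv';
ALTER TABLE t ADD CONSTRAINT t_pk PRIMARY KEY (id);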

Binary Bulk Loads

A database is often populated with the SQL COPY statement using an ASCII representation of a table with column and record separators, e.g. a CSV file. This scheme is preferred for almost all situations. Its performance is mostly determined by the parsing cost of turning the ASCII value representation into the native binary one. The MonetDB copy process is highly optimized for multi-core systems, where several threads in parallel handle variable-length records efficiently.

When large tables are migrated between MonetDB instances, or when large tables are produced by a trusted external program, a slightly faster loading track can be used instead. This saves rendering the data into ASCII and subsequently parsing the data being exchanged. For this to work, the user creates the binary version of the BAT storage and 'attaches' it to the SQL catalog using a variation of the COPY command.

To illustrate, consider the following snippet:

create table Tmp( i integer, f real, s string); 
copy binary into Tmp from ('path_to_file_i', 'path_to_file_f', 'path_to_file_s');

Each attachment file is produced by a program that writes the binary image of the BAT directly, i.e. a binary dump of a C array. The SQL types char (1 byte), tinyint (8 bits), smallint (16 bits), int/integer (32 bits), and bigint (64 bits) correspond directly to C types; real and double are mapped onto the C types float and double, respectively. For variable-length strings, the file must have one C-based string value per line, terminated by a newline, and it is processed without escape-character conversion. Fixed-length strings are handled the same way. MonetDB assumes that all files are aligned, i.e. the i-th value in each file corresponds to the i-th record in the table.

All other types (including UTF8 and escaped characters) should be handled using the default COPY INTO language construct.

The file names must be absolute paths to the column input files. They must reside on the same file system as the database farm and they will replace the content of the table Tmp. The files with numeric data are moved into place to avoid copying. String input files are retained in their original location and should be removed by the user.

WARNING: currently, upon discovering an error in the input, the files are 'gone'.

Apache Pig

In many cases, the data that is supposed to be stored in MonetDB is created by a large-scale data aggregation process. Apache Pig is a popular tool for this task. MonetDB binary column files can be generated directly from Pig using the MonetDB-Pig-Layer.

Inside a Pig script, this is used as follows:

STORE raw INTO './results/' USING nl.cwi.da.monetdb.loader.hadoop.MonetDBStoreFunc;
 
This will generate an SQL schema file, a set of column files, and a load command. Keep in mind that currently only primitive Java types such as Integer, Long, Float, Double, and String are supported.