CSV Bulk Loads

The COPY INTO command enables fast insertion of multiple tuples from a text file or from the standard input. Each tuple in the input is terminated by a record separator (default '\n') and fields are separated by the field separator (default '|'). The field values should adhere to the syntax for value literals. Alternative separators should adhere to the SQL lexical syntax for string values. A different NULL value representation can be specified using the NULL AS null_string option. Furthermore, fields can optionally be enclosed in a user-defined quote character. A text file specified as file_name should use UTF-8 encoding; when data are read FROM STDIN, the same encoding as mclient is used.

The input files must be accessible by the database server. Therefore, they must reside on or be accessible to the machine on which the database server is running. Also, the input files must be identified with an absolute path name. The STDIN file designator reads data streaming from the client application. The end of file (EOF) character signifies the end of the sequence.
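
For example, a file residing on the client machine rather than on the server can be streamed through STDIN with mclient (a sketch, assuming a database named mydb and the table and file used in the examples below):

$ mclient -d mydb -s "COPY INTO tbl FROM STDIN" - < /tmp/example1.csv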

The full syntax for the COPY INTO command is the following:

COPY [ int_val [ OFFSET int_val ] RECORDS ] INTO table_name
          [ '(' column_name ',' ... ')' ] FROM
          ( file_name ',' ... | STDIN ) [ '(' header [ STRING ] ',' ... ')' ]
          [ [USING] DELIMITERS  field_separator
            [',' record_separator [ ',' string_quote ]]]
          [ NULL [AS] null_string ] [ LOCKED ] [ BEST EFFORT ] [ NO CONSTRAINT ]

Note that the various features must be specified in the order shown above. For instance, the query:

sql>COPY INTO tbl FROM STDIN BEST EFFORT NULL AS '';
syntax error, unexpected sql NULL, expecting SCOLON in: "copy into tbl from stdin best effort null"

is not accepted by the SQL parser because the BEST EFFORT feature is specified before the NULL AS feature.
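
With the options given in the order shown in the syntax above, the statement is accepted (a sketch using the tbl table created in the examples below; server output omitted):

sql>COPY INTO tbl FROM STDIN NULL AS '' BEST EFFORT;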

COPY INTO from different input sources

COPY INTO is able to read data from different input sources, including standard input, a single CSV file, multiple CSV files and compressed CSV files. For each example, we start by creating the following table:

sql>CREATE TABLE tbl (i INT, s STRING, d DECIMAL(5,2));
operation successful (3.719ms)

From standard input

The most straightforward way is to read data from the standard input STDIN:

sql>COPY INTO tbl FROM STDIN;
more>1|abc|0.1
more>2|def|0.2
more>3|ghi|0.3
more>^D
3 affected rows (116.271ms)

sql>SELECT * FROM tbl;
+------+------+---------+
| i    | s    | d       |
+======+======+=========+
|    1 | abc  |    0.10 |
|    2 | def  |    0.20 |
|    3 | ghi  |    0.30 |
+------+------+---------+
3 tuples (3.633ms)

From a CSV file

Assume data are stored in a file named /tmp/example1.csv:

$ cat /tmp/example1.csv
1|dfs|3.1
2|ewe|3.23
3|erw|2.43

We can load the data using the following COPY INTO query:

sql>COPY INTO tbl FROM '/tmp/example1.csv';
3 affected rows (47.725ms)
sql>SELECT * FROM tbl;
+------+------+---------+
| i    | s    | d       |
+======+======+=========+
|    1 | dfs  |    3.10 |
|    2 | ewe  |    3.23 |
|    3 | erw  |    2.43 |
+------+------+---------+
3 tuples (3.287ms)

From multiple files

If the data are distributed across multiple files and have the same schema, one can load them all at once. Assume we have the following example files /tmp/example2.1.csv and /tmp/example2.2.csv:

$ cat /tmp/example2.1.csv
1|dfs|3.1
2|ewe|3.23
3|erw|2.43
$ cat /tmp/example2.2.csv
4|dfs|3.2
5|ewe|3.3
6|erw|2.3

we can use them in the COPY INTO command as follows:

sql>COPY INTO tbl FROM '/tmp/example2.1.csv', '/tmp/example2.2.csv';
6 affected rows (152.470ms)
sql>SELECT * FROM tbl;
+------+------+---------+
| i    | s    | d       |
+======+======+=========+
|    1 | dfs  |    3.10 |
|    2 | ewe  |    3.23 |
|    3 | erw  |    2.43 |
|    4 | dfs  |    3.20 |
|    5 | ewe  |    3.30 |
|    6 | erw  |    2.30 |
+------+------+---------+
6 tuples (2.362ms)

From compressed files

One can pass compressed files to COPY INTO to be loaded. Supported compressions include bzip2, gzip, lzma and snappy.

First, let us compress our example CSV file:

$ bzip2 -k /tmp/example1.csv
$ gzip -k /tmp/example1.csv
$ xz  -k /tmp/example1.csv
$ ls /tmp/example1.csv.*
/tmp/example1.csv.bz2    /tmp/example1.csv.gz /tmp/example1.csv.xz

Now, we can load all three compressed files in one COPY INTO statement:

sql>COPY INTO tbl FROM '/tmp/example1.csv.bz2', '/tmp/example1.csv.gz', '/tmp/example1.csv.xz';
9 affected rows (93.635ms)
sql>SELECT * FROM tbl;
+------+------+---------+
| i    | s    | d       |
+======+======+=========+
|    1 | dfs  |    3.10 |
|    2 | ewe  |    3.23 |
|    3 | erw  |    2.43 |
|    1 | dfs  |    3.10 |
|    2 | ewe  |    3.23 |
|    3 | erw  |    2.43 |
|    1 | dfs  |    3.10 |
|    2 | ewe  |    3.23 |
|    3 | erw  |    2.43 |
+------+------+---------+
9 tuples (1.903ms)

Specifying the number of records

It is strongly recommended to specify the maximum number of records to be inserted. This allows the database server to allocate enough space for the table in advance, avoiding guesswork and the subsequent extension of the table space, which may involve potentially expensive copying. If the exact number is unknown, a (slight) overestimation is generally better for performance than an underestimation.

If the number of actual records in the input source is smaller than the number of records specified, COPY INTO will simply stop at the end of input.

If the number of actual records in the input source is larger than the number of records specified, COPY INTO will only load the number of specified records.

The example below shows how to specify the number of records (the earlier example CSV file is used). First we give an overestimation, then we give an underestimation:

sql>COPY 4 RECORDS INTO tbl FROM '/tmp/example1.csv';
3 affected rows (46.254ms)
sql>SELECT * FROM tbl;
+------+------+---------+
| i    | s    | d       |
+======+======+=========+
|    1 | dfs  |    3.10 |
|    2 | ewe  |    3.23 |
|    3 | erw  |    2.43 |
+------+------+---------+
3 tuples (3.408ms)
sql>DELETE FROM tbl;
3 affected rows (2.488ms)
sql>COPY 2 RECORDS INTO tbl FROM '/tmp/example1.csv';
2 affected rows (44.747ms)
sql>SELECT * FROM tbl;
+------+------+---------+
| i    | s    | d       |
+======+======+=========+
|    1 | dfs  |    3.10 |
|    2 | ewe  |    3.23 |
+------+------+---------+
2 tuples (4.205ms)

Specifying an offset

A portion of the input file can be skipped using the OFFSET feature. Specifying an offset n directs COPY INTO to start reading at record n. The offset of the first record is 1. More formally, the offset directive counts n unquoted record separators. Here we use a file named /tmp/example2.csv:

$ cat /tmp/example2.csv
i|s|d
1|dfs|3.1
2|ewe|3.23
3|erw|2.43

Since the first line of the file does not have the same schema as the table, it must be ignored. So, we specify that the COPY INTO should start at the second record (i.e. 'OFFSET 2'):

sql>COPY 3 OFFSET 2 RECORDS INTO tbl FROM '/tmp/example2.csv';
3 affected rows (46.947ms)
sql>SELECT * from tbl;
+------+------+---------+
| i    | s    | d       |
+======+======+=========+
|    1 | dfs  |    3.10 |
|    2 | ewe  |    3.23 |
|    3 | erw  |    2.43 |
+------+------+---------+
3 tuples (3.530ms)

Specifying columns to load and their order

By default, a CSV file must contain data for all columns specified in a table, and in the same order as when the table was created. Additional syntax is supported to load data from CSV files that do not satisfy these requirements. The following examples show how this feature works.

Scenario 1: skip column in the input file

Assume we have the following input file:

$ cat /tmp/cars.csv
Year|Make|Model
1997|Ford|E350
2000|Mercury|Cougar

When loading this file, we want to skip the second column, which contains the "Make" information:

sql>CREATE TABLE cars (y STRING, make STRING, model STRING);
operation successful (4.148ms)
sql>COPY 2 OFFSET 2 RECORDS INTO cars(y, model) FROM '/tmp/cars.csv';
2 affected rows (52.367ms)
sql>SELECT * FROM cars;
+------+------+--------+
| y    | make | model  |
+======+======+========+
| 1997 | null | E350   |
| 2000 | null | Cougar | 
+------+------+--------+
2 tuples (2.141ms)

Scenario 2: missing column and out of order

In the following CSV file, data for the column "Model" is missing, while the data for the columns "Year" and "Make" are in the reverse order with respect to how the table "cars" was created:

$ cat /tmp/cars2.csv
Make|Year
Ford|1997
Mercury|2000

The following example shows how we can tell COPY INTO which columns are contained in the input CSV file and in which order:

sql>COPY 2 OFFSET 2 RECORDS INTO cars(y,make) from '/tmp/cars2.csv'(make,y);
2 affected rows (33.402ms)
sql>select * from cars;
+------+---------+-------+
| y    | make    | model |
+======+=========+=======+
| 1997 | Ford    | null  |
| 2000 | Mercury | null  |
+------+---------+-------+
2 tuples (1.968ms)

Customise delimiters

The COPY INTO command is versatile enough to accept records in many formats. The basic idea is that each field of a record is separated from the other fields using a specified character string, and each record is separated from the other records using another character string. Moreover, since string values can contain either of those character strings, they might need to be quoted in order to be read correctly.

The user can specify the format of the file with

[[USING ] DELIMITERS field_separator [, record_separator [, string_quote ]]]

syntax. The input should then comply with the following grammar:

[[[quote] [[escape]char]* [quote]] field_separator]* record_separator

Quote characters in quoted fields may be escaped with a backslash. Field and record separators can be embedded in quoted fields.

By default, fields are separated by the pipe character ('|'), records by the new line ('\n') character.

The queries below show how to i) change the default field separator to a different single character; ii) change the default field separator to a string of two characters; and iii) change the default string quote:

sql>COPY INTO tbl FROM STDIN DELIMITERS ',';
more>1,abc,0.1
more>2,def,0.2
more>2 affected rows (41.766ms)

sql>COPY INTO tbl FROM STDIN DELIMITERS ';;';
more>3;;ghi;;0.03
more>4;;jkl;;0.04
more>2 affected rows (39.128ms)

sql>COPY INTO tbl FROM STDIN DELIMITERS ';;','\n','|';
more>5;;|klm|;;0.05
more>6;;|opq|;;0.06
more>2 affected rows (37.899ms)

sql>SELECT * FROM tbl;
+------+------+---------+
| i    | s    | d       |
+======+======+=========+
|    1 | abc  |    0.10 |
|    2 | def  |    0.20 |
|    3 | ghi  |    0.03 |
|    4 | jkl  |    0.04 |
|    5 | klm  |    0.05 |
|    6 | opq  |    0.06 |
+------+------+---------+
6 tuples (3.281ms)
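
Quoted fields may also embed the field and record separators, and a quote character inside a quoted field can be escaped with a backslash. A minimal sketch (reusing the tbl table from above; server output omitted), in which the value stored in column s is m,n"o:

sql>COPY INTO tbl FROM STDIN DELIMITERS ',','\n','"';
more>7,"m,n\"o",0.07
more>^D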

Customise representation of the NULL value

By default, the character sequence "NULL" (case insensitive) represents the NULL value:

sql>CREATE TABLE str (s STRING);
operation successful (2.712ms)
sql>COPY INTO str FROM STDIN;
more>NULL
more>null
more>Null
more>nUll
more>4 affected rows (50.040ms)

sql>SELECT * FROM str;
+------+
| s    |
+======+
| null |
| null |
| null |
| null |
+------+
4 tuples (4.095ms)
sql>SELECT * FROM str WHERE s = 'null';
+---+
| s |
+===+
+---+
0 tuples (1.817ms)
sql>SELECT * FROM str WHERE s IS NULL;
+------+
| s    |
+======+
| null |
| null |
| null |
| null |
+------+
4 tuples (2.491ms)

The NULL AS feature allows a user to specify an alternative string as the representation of the NULL value. Whenever the alternative string is read during the loading process, the NULL value is inserted into the table. Any valid string can be specified as the representation of the NULL value.

In the example below, the first COPY INTO treats an empty string ('') as the NULL value, while the second COPY INTO uses a long text to represent the NULL value, which typically can be used to prevent any text in the input from being treated as the NULL value. The keyword AS is optional.

sql>CREATE TABLE nullas (i INT, s STRING);
operation successful (9.010ms)
sql>COPY INTO nullas FROM STDIN NULL AS '';
more>1|
more>2|null
more>3|"NuLL"
more>^D
3 affected rows (40.528ms)
sql>COPY INTO nullas FROM STDIN NULL 'this-string-does-not-exist-in-the-input';
more>4|
more>5|null
more>6|NULL
more>^D
3 affected rows (43.968ms)

sql>SELECT * FROM nullas;
+------+--------+
| i    | s      |
+======+========+
|    1 | null   |
|    2 | null   |
|    3 | "NuLL" |
|    4 |        |
|    5 | null   |
|    6 | NULL   |
+------+--------+
6 tuples (1.999ms)
sql>SELECT * FROM nullas WHERE s IS NULL;
+------+------+
| i    | s    |
+======+======+
|    1 | null |
+------+------+
1 tuple (4.391ms)
sql>SELECT * FROM nullas WHERE s = 'null';
+------+------+
| i    | s    |
+======+======+
|    2 | null |
|    5 | null |
+------+------+
2 tuples (0.969ms)
sql>SELECT * FROM nullas WHERE s = '"NuLL"';
+------+--------+
| i    | s      |
+======+========+
|    3 | "NuLL" |
+------+--------+
1 tuple (3.484ms)
sql>SELECT * FROM nullas WHERE s = 'NULL';
+------+------+
| i    | s    |
+======+======+
|    6 | NULL |
+------+------+
1 tuple (3.484ms)

Ignore errors in the input with the BEST EFFORT mode

Normally, if there is an error in the input, such as a missing field, COPY INTO will fail. In the BEST EFFORT mode, COPY INTO continues parsing the input and writes all the records that failed to the sys.rejects table.

sql>CREATE TABLE cars (y STRING, make STRING, model STRING);
operation successful (0.734ms)
sql>COPY INTO cars FROM STDIN DELIMITERS ',','\n' BEST EFFORT;
more>2000,
more>2004,Honda,Pilot
more>2001,Honda,Civic
more>,,,
more>3 affected rows (28.604ms)
sql>SELECT * FROM cars;
+------+-------+-------+
| y    | make  | model |
+======+=======+=======+
| 2004 | Honda | Pilot |
| 2001 | Honda | Civic |
|      |       |       |
+------+-------+-------+
3 tuples (2.233ms)

The example above shows that only the valid records are retained. In this scenario, the first record is omitted.

sql>SELECT * FROM rejects;
+-------+-------+------------------------+--------+
| rowid | fldid | message                | input  |
+=======+=======+========================+========+
|     0 |     1 | Column value 2 missing | 2000,  |
+-------+-------+------------------------+--------+
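
If your MonetDB version provides the sys.clearrejects() procedure, the reject list can be emptied before the next load (a sketch):

sql>CALL sys.clearrejects();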

Tips to boost performance

Here are several tips to possibly further improve the performance of bulk data loading:

  • Add constraints after data loading: if one needs to load multiple chunks of data, it is advisable to add the integrity constraints to the table after all files of this batch have been loaded. The ALTER statements perform bulk integrity checking, which is often more efficient than checking each small batch of data as it is loaded (see the sketch after this list). When adding more data to an existing table, one can temporarily remove existing constraints on the table, load the new data and finally re-add the constraints.
  • LOCKED mode for single user: in many bulk loading situations, the original data file can be saved as a backup or recreated for disaster handling. This relieves the database system from having to prepare for recovery and saves significant storage space. The LOCKED qualifier can be used in this situation to skip the logging operation that is done by default. (WARNING: this feature must be used with great care. When using the LOCKED mode, it is the responsibility of the database user to make sure that there is currently only a single client connection accessing the database. The database server will not check this.)
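
Below is a minimal sketch of these tips combined, with hypothetical table and file names: the data are loaded first (with LOCKED, assuming a single client connection), and the primary key constraint is added afterwards so that the integrity check runs once in bulk:

sql>CREATE TABLE sales (id INT, amount DECIMAL(10,2));
sql>COPY INTO sales FROM '/tmp/sales1.csv' LOCKED;
sql>COPY INTO sales FROM '/tmp/sales2.csv' LOCKED;
sql>ALTER TABLE sales ADD CONSTRAINT sales_pk PRIMARY KEY (id);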