Data extracted from external applications is often in CSV (Comma Separated Values) form, or a variant that uses another delimiter such as a semicolon (;), a vertical bar (|) or a TAB character. MonetDB provides the COPY INTO statement to load data from this kind of file.
Loading data using COPY INTO is much faster than inserting it with a series of INSERT statements. This is because loading a line of CSV data is much more efficient than parsing and executing an INSERT statement, and also because execution of a SQL script is inherently sequential while the work of COPY INTO can be spread over multiple CPU cores.
Example:
COPY INTO mytable
FROM '/path/to/my/data.csv' ON CLIENT
USING DELIMITERS ',', E'\n', '"';
The details of the supported file formats are described in section Supported file formats. If an error occurs during loading, details of the error are stored in a special system table and loading stops unless the BEST EFFORT clause is set. For details, see section Error Handling.
COPY [ n RECORDS ] [ OFFSET k ]
INTO [ schema_name '.' ] table_name [ '(' column_name [ ',' ... ] ')' ]
FROM {
file_name [ ',' ... ] [ '(' header [ ',' ... ] ')' ] [ ON SERVER ]
| file_name [ ',' ... ] [ '(' header [ ',' ... ] ')' ] ON CLIENT
| STDIN [ '(' header [ ',' ... ] ')' ]
}
[ [ USING ] DELIMITERS field_sep [ ',' record_sep [ ',' string_quote ] ] ]
[ DECIMAL [ AS ] decimal_sep [ ',' thousands_sep ] ]
[ [ NO ] ESCAPE ]
[ NULL [ AS ] null_string ]
[ BEST EFFORT ]
[ FWF '(' width [ ',' ... ] ')' ]
The start of the statement has 5 possible forms. In each case, n is the maximum number of records to read, and k is the line number of the first record to read.
COPY INTO
COPY n RECORDS INTO
COPY OFFSET k INTO
COPY n OFFSET k RECORDS INTO
COPY n RECORDS OFFSET k INTO
If n is given, MonetDB will not load more than n records. It also uses the number to preallocate storage, so if you know in advance there will be (at most) n records, letting MonetDB know may improve performance.
The line number k starts counting at 1. This means that values 0 and 1 both mean "start reading at the top of the file", 2 means "skip one line", 3 means "skip two lines", etc.
Many CSV files start with a header line giving the names of the columns. The OFFSET clause can be used to skip this header line.
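For example, the following sketch (with an illustrative path) skips a one-line header and loads at most 1000 records:
COPY 1000 RECORDS OFFSET 2 INTO mytable FROM '/path/to/data.csv' ON CLIENT;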
INTO [ schema_name '.' ] table_name [ '(' column_name [ ',' ... ] ')' ]
The INTO table_name clause takes an optional list of column names. If given, the listed columns are loaded from the input file while the other columns are filled with default values.
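As an illustrative sketch (table and file names are hypothetical), columns omitted from the list receive their default values:
CREATE TABLE people(name TEXT, age INT, added TIMESTAMP DEFAULT CURRENT_TIMESTAMP);
-- the header list (name, age), described below, tells MonetDB the file has two fields
COPY INTO people(name, age) FROM '/path/to/people.csv' (name, age) ON CLIENT;
-- the 'added' column is filled with its default for every loaded row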
FROM {
file_name [ ',' ... ] [ '(' header [ ',' ... ] ')' ] [ ON SERVER ]
| file_name [ ',' ... ] [ '(' header [ ',' ... ] ')' ] ON CLIENT
| STDIN [ '(' header [ ',' ... ] ')' ]
}
The FROM clause has three forms, each with an optional header list. Though not indicated in the grammar above, it is allowed to enclose the file names in parentheses. If multiple file names are given, the data from those files is concatenated while reading, as in the sketch below. The RECORDS option gives the total number of records read from all files together, but if an OFFSET clause is present, the offset is applied to each file in turn.
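For instance (file names are illustrative), several files can be listed in parentheses and are read back to back:
COPY INTO mytable FROM ('part1.csv', 'part2.csv', 'part3.csv') ON CLIENT;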
On Windows especially, it is recommended to write the file names as raw strings, as this allows you to write a Windows path without doubling the backslashes. Alternatively, you can simply write forward slashes instead of backslashes. This means the following are all equivalent:
COPY INTO mytable FROM R'c:\dumps\data.csv';
COPY INTO mytable FROM E'c:\\dumps\\data.csv';
COPY INTO mytable FROM 'c:\\dumps\\data.csv';
COPY INTO mytable FROM 'c:/dumps/data.csv';
The header list can be used to give names to the fields in the CSV file. In combination with the column list given in the INTO clause, this can be used to reorder the data from the file, or to use only a subset of the fields. For example,
COPY INTO foo(x, y) FROM 'file' (y, dummy, x)
reads a three-column CSV file, storing the third CSV column in the first table column, the first CSV column in the second table column, and ignoring the middle CSV column.
It is important to note that if no header list is given, the header names are taken from the columns of the destination table even if a column list was specified in the INTO clause. This means for example that
CREATE TABLE foo(i INT, j INT, k INT);
COPY INTO foo(i, j) FROM 'file.csv';
will assume that file.csv has three columns i, j and k, not two as you might expect.
With ON SERVER, which is the default, the file name must be an absolute path on the system on which the database server (mserver5) is running. The server must have read privileges on the file. Note that for example if MonetDB is running on a Linux server with SELinux enabled, it might not be easy to find a location on the server where you as a user have write access and where MonetDB has read access.
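A minimal sketch, assuming the server itself can read the (illustrative) absolute path:
COPY INTO mytable FROM '/var/tmp/data.csv' ON SERVER;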
When ON CLIENT is specified, the server will not itself open the file, but will ask the client to open it on its behalf. This is more flexible than ON SERVER, but the client application needs to be specifically written to support it. Currently this is possible for C/C++-based applications written using the Mapi library, Java applications using the JDBC driver, and Python applications using pymonetdb.
With ON CLIENT, the file name need not be an absolute path. In fact, because the client application can interpret the file name however it pleases, it does not need to be a real file at all. With the implementation built into jdbcclient, the file name is interpreted as an absolute path or relative to the current working directory of the client. With the implementation built into mclient, the file is first tried relative to the current working directory of the client, and if that fails and a file is known from which the query is being read, relative to that file.
With FROM STDIN, the data follows directly under the COPY INTO statement. This form is only really usable from mclient and is used by msqldump.
sql>COPY 3 RECORDS INTO foo FROM STDIN;
more>1|one
more>2|two
more>3|three
3 affected rows
sql>
If no n RECORDS count is specified, loading stops at the first empty line. This can cause confusion if the table has only a single column and the input may contain empty fields.
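As a sketch (names is a hypothetical single-column table), an explicit record count lets an empty line be read as a record containing an empty field instead of ending the input:
sql>COPY 3 RECORDS INTO names FROM STDIN;
more>alice
more>
more>carol
3 affected rows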
[ [ USING ] DELIMITERS field_sep [ ',' record_sep [ ',' string_quote ] ] ]
This clause allows you to set the field separator, record terminator and quote character as described in section Supported file formats. It is recommended to write the record separator as an E-string to make the meaning of the backslash explicit:
COPY INTO mytable FROM 'file.csv' ON CLIENT
USING DELIMITERS ',', E'\n', '"';
Note that the record separator \n also matches \r\n as commonly used on Windows systems. Explicitly specifying \r\n yields a warning and is treated as equivalent to \n.
DECIMAL [ AS ] decimal_sep [ ',' thousands_sep ]
This clause allows you to configure which decimal separator and thousands separator will be used for DECIMAL, TIMESTAMP and REAL/DOUBLE columns. The default decimal separator is the period/dot (.). There is no default thousands separator.
Example:
CREATE TABLE foo(d DECIMAL(8,3));
COPY 1 RECORDS INTO foo FROM STDIN DECIMAL AS ',', '_';
1_234,567
SELECT * FROM foo;
-- Result:
+------------+
| d |
+============+
| 1234.567 |
+------------+
ESCAPE
NO ESCAPE
The ESCAPE clause can be used to disable or enable backslash escapes as described in section Supported file formats.
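For example (with an illustrative file name), backslash handling can be switched off when backslashes in the data are plain characters:
COPY INTO mytable FROM 'file.csv' ON CLIENT NO ESCAPE;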
NULL [ AS ] null_string
This clause indicates how NULL is represented in the CSV file. By default, NULL values in the CSV file are indicated by the word "NULL", but another frequently used convention is to use the empty string. This can be configured as follows:
COPY INTO table FROM 'file'
USING DELIMITERS ',', E'\n', '"'
NULL AS '';
BEST EFFORT
This clause enables BEST EFFORT mode as described in section Error Handling.
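For example (illustrative names), loading then continues past rejected rows:
COPY INTO mytable FROM 'file.csv' ON CLIENT BEST EFFORT;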
FWF ( width1, width2, ... )
This clause can be used instead of the USING DELIMITERS clause to indicate that the fields in the file have fixed widths and are not separated by a separator character. It lists the widths of the fields. Leading and trailing whitespace in the fields is stripped.
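As a sketch (file name and widths are illustrative), a file whose records consist of a 4-character field, a 10-character field and an 8-character field could be loaded with:
COPY INTO mytable FROM 'fixed.dat' ON CLIENT FWF(4, 10, 8);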
The details of the file format can be configured using the USING DELIMITERS clause. As an alternative to column separators, fixed-width fields can be specified using the FWF clause.
The input file must use the UTF-8 character encoding. On Windows, character encodings are often called code pages. Note that other encodings such as latin1 and UTF-16 are not supported; the file must be UTF-8.
To convert file data to UTF-8 character encoding you can use the
iconv command-line program.
Example:
iconv -f ISO-8859-1 -t UTF-8 "my-data-file.csv" -o "my-data-file.utf8.csv"
If you do not know the encoding of a source data file you can use the Linux file command:
file -i "my-data-file.csv"
It will show the detected file type and character set encoding.
The input consists of records, each terminated by a record terminator. By default, every line is a record. Both Windows-style and Unix/Linux/macOS-style line endings are allowed.
The fields in a record are separated or terminated by a column separator character. By default this is the vertical bar (|). By "separated or terminated" we mean that it is allowed to have an additional column separator character at the end of the line. For example, when loading two columns both of the following lines are valid:
1|one
2|two|
If the column separator can occur within the fields, a quote character can be used to indicate the start and end of a field. The most common choice of quote character is the double quote ("). By default, no quote character is used. If the quote character itself occurs in the value it can either be doubled or escaped with a backslash:
1,"Text containing a doubled "" quote character"
2,"Text containing an escaped \" quote character"
Backslash escapes are enabled by default but can be disabled using the NO ESCAPE clause. Backslash escapes also support octal, hexadecimal and unicode escapes, but using those is not recommended:
"a doubled "" quote"
"a backslash-escaped \" quote"
"a hex-escaped \x22 quote"
"an oct-escaped \042 quote"
"a unicode-escaped \u0022 quote and \u2665 heart ♥"
"a long unicode-escaped \U00000022 quote and \U0001F308 rainbow 🌈"
If an error occurs while loading, for example because of malformed data, the row, column and cause of the error are retained in a special system table which can be accessed as sys.rejects. The contents of this table persist until the end of the session, or until sys.clearrejects() is called.
For example, suppose file data.csv contains the following three lines:
Einstein|1879
Newton|1643
Eratosthenes|276 BC
The SQL script
CREATE TABLE foo(name VARCHAR(20), birth INT);
COPY INTO foo FROM 'data.csv' ON CLIENT;
will fail with:
Failed to import table 'foo', line 3 field birth 'int' expected in '276 BC'
SELECT * FROM sys.rejects;
+-------+-------+-----------------------------------------------+----------------------+
| rowid | fldid | message | input |
+=======+=======+===============================================+======================+
| 3 | 2 | line 3 field birth 'int' expected in '276 BC' | Eratosthenes|276 BC |
+-------+-------+-----------------------------------------------+----------------------+
Normally, loading stops at the first error, so every COPY INTO attempt adds at most one row to sys.rejects. However, if you use BEST EFFORT the number of entries added to sys.rejects is essentially unlimited. Do not forget to call sys.clearrejects() to release these system resources.
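For example:
CALL sys.clearrejects();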