Loading from Binary Files

This section applies to the COPY BINARY INTO command as found MonetDB version Jul2021 and onward. Earlier versions also had a COPY BINARY INTO but it worked differently. The most notable differences are that text was newline terminated rather than NUL terminated, that the list of supported datatypes was significantly shorter and that there was no control over byte order.

Usually the preferred way of populating a database is by importing a CSV file using the COPY INTO command. This command also supports variant file formats such as tab separated files and fixed width records. The primary advantage of this way of data interchange is that almost every application has a way of producing and/or consuming these types of file. A disadvantage is that even though MonetDB's CSV parser is highly tuned, converting from a textual format to the internal binary format can still be time consuming.

If very large amounts of data need to be inserted quickly, COPY BINARY INTO can be a good alternative. COPY BINARY INTO allows you to skip the parsing overhead by passing the data in a binary form that MonetDB can ingest very efficiently. The main down side is that it is entirely MonetDB specific and probably needs custom application code to generate the data in the appropriate form.

Syntax

COPY [ BIG ENDIAN | LITTLE ENDIAN | NATIVE ENDIAN ] BINARY
INTO [ schema_name . ] table_name
     [ '(' column_name [ ',' column_name [ ... ] ] ')' ]
FROM file_name [ ',' file_name [ ... ] ]
[ ON CLIENT | ON SERVER ]

With COPY BINARY INTO, every column is read from a a separate file. The number and order of file names given must match the number of columns listed, or the number of columns of the table if no column list is present. As with the regular COPY INTO, the data can be read on the server or on the client. Because every column has its own file, no delimiters can be configured and there also is no BEST EFFORT mode.

Every column is read from a separate file. For fixed width data such as numbers, the data is simply the concatenation of the binary representation of each value. Both Big Endian and Little Endian byte orders are supported, see below.

Text is represented as NUL terminated UTF-8 encoded strings.

Temporal types are represented as structured values containing a small integer for each field.

For each data type there is a special binary pattern to represent NULL values.

Example

Assume we have the following table definition:

CREATE TABLE foo(i INT, t TEXT);

We will use Python's struct module to generate some binary data to fill it. Note that for large amounts of data this might not be very efficient as it first materializes the data as a large in-memory Python array.

#!/usr/bin/env python3
import struct
open('col_i.bin', 'wb').write(struct.pack('<4i', 42, 43, 44, 45))
open('col_t.bin', 'wb').write(b'foo\x00bar\x00baz\x00quux\x00')

This program creates two files: col_i.bin is a 16-byte file containing four 4-byte/32-bits integers:

┌─────────────────────────┬─────────────────────────┐
│ 2a 00 00 00 2b 00 00 00 ┊ 2c 00 00 00 2d 00 00 00 │
└─────────────────────────┴─────────────────────────┘

The other file is a 17-byte file containing the four strings, each terminated with a NUL byte:

┌─────────────────────────┬─────────────────────────┬────┐
│ 66 6f 6f 00 62 61 72 00 ┊ 62 61 7a 00 71 75 75 78 ┊ 00 │
│ f  o  o  \0 b  a  r  \0 ┊ b  a  z  \0 q  u  u  x  ┊ \0 │
└─────────────────────────┴─────────────────────────┴────┘

To copy this data into MonetDB, issue the command

COPY LITTLE ENDIAN BINARY
INTO foo
FROM 'col_i.bin', 'col_t.bin' ON CLIENT;

Numeric data layout

First a word about byte order. On modern systems, bytes are 8 bits wide. That means that to store a 16-bit number we need 2 bytes. Take for example the number 2021 = 7 x 256 + 229. It is stored as a combination as a high byte 7 and a low byte 229. There are two obvious ways to store the number 2021 in a byte file:

  1. on Little Endian systems, the low byte comes first: [ 229 | 7 ]. Intel/AMD-based computers systems and most ARM-based computer systems use a little endian byte order.

  2. on Big Endian systems, the high byte comes first: [ 7 | 229 ]. Many older computer systems such as IBM mainframes and SPARC/MIPS systems use a big endian byte order. The Java virtual machine also prefers big endian numbers, see for example Java's DataOutputStream.

In the first few words of the COPY BINARY INTO statement you can indicate the byte order of the incoming data: BIG ENDIAN, LITTLE ENDIAN and NATIVE ENDIAN, where native endian means the preferred byte order of the system MonetDB is running on. When omitted, native endian is assumed.

The byte width of the file for each numeric column depends on the data type:

SQL data typebyte width
tinyint1
smallint2
int4
bigint8
hugeint16, if supported
decimal(1..2,X)1
decimal(3..4,X)2
decimal(5..9,X)4
decimal(10..18,X)8
decimal(19..38,X)16, if supported
float(24)4
float(53)8

The NULL representation for the integers and decimals has 0x80 in the highest byte and 0 in the other bytes. When interpreted as a signed integer, this byte sequence is just outside the range for the corresponding SQL integer. For example, when reading little endian smallint values, the bytes [ 00 80 ] are interpreted as the signed integer -32768 which falls outside smallint's supported range of -32767..+32767.

The NULL representation for floating point values is any not-a-number (NaN) value.

Text data layout

For text columns (VARCHAR(*), STRING, TEXT) and text-like datatypes such as URL, the file should contain the concatenated values, each followed by a NUL byte. That is, the values are NUL-terminated, not NUL-separated. The text encoding MUST be UTF-8.

The NULL representation for textual columns consists of the single byte \x80. This is not valid UTF-8 so it cannot be confused with a value.

Temporal types

Temporal types are represented as small structured values consisting of several small integers holding the day, the hour, the month, etc. The exact layout is defined in the C header file monetdb/copybinary.h provided with MonetDB and reproduced below:

typedef struct {
	uint8_t day;
	uint8_t month;
	int16_t year;
} copy_binary_date; // natural size: 32, natural alignment: 16

typedef struct {
	uint32_t ms;
	uint8_t seconds;
	uint8_t minutes;
	uint8_t hours;
	uint8_t padding; // implied in C, explicit elsewhere
} copy_binary_time;		 // natural size: 64, natural alignment: 32

typedef struct {
	copy_binary_time time;
	copy_binary_date date;
} copy_binary_timestamp; // natural size: 96, natural alignment: 32

The day and month fields are 1-based.

Any invalid value, such as month=32, is treated as a NULL.