Bug 6753 - SQL COPY INTO FROM compressed (bz2,xz,gz) file fails
Summary: SQL COPY INTO FROM compressed (bz2,xz,gz) file fails
Status: NEW
Alias: None
Product: SQL
Classification: Unclassified
Component: all (show other bugs)
Version: 11.33.3 (Apr2019)
Hardware: Other Linux
: Normal normal
Assignee: SQL devs
Depends on:
Reported: 2019-08-20 13:56 CEST by Stefan Manegold
Modified: 2019-08-20 16:13 CEST (History)
0 users


Note You need to log in before you can comment on or make changes to this bug.
Description Stefan Manegold cwiconfidential 2019-08-20 13:56:17 CEST
trying to bulk-load from a compressed (bz2,xz,gz) CSV file using

COPY OFFSET 3 INTO "data" FROM '$PWD/data.csv.bz2' DELIMITERS ',','\r\n','' NULL AS '';

fails with

"Failed to import table 'data', incomplete record at end of file"

The problem occurs with Apr2019_release (Apr2019_3), Apr2019_SP1_release (APR2019_7) and the latest head of the Apr2019 branch (changeset ad472c9f29d3).
Comment 1 Stefan Manegold cwiconfidential 2019-08-20 14:03:03 CEST
Please find a test at http://homepages.cwi.nl/~manegold/Bug-6753.tar
Comment 2 Sjoerd Mullender cwiconfidential 2019-08-20 14:15:33 CEST
Try delimiter '\n' instead of '\r\n'.
Comment 3 Stefan Manegold cwiconfidential 2019-08-20 16:13:51 CEST

Indeed, all MonetDB decompression code seems to silently replace '\r\n' (CRLF) by '\n' (LF) during decompression (at least on non-Windows systems, i.e., on systems where '\n' (LF) is the standard end-of-line marker, rather than '\r\n' (CRLF)).

The plain (non-compressed) CSV reader, however, seems to not do such replacement.

Consequently, when loading from CSV files that do contain '\r\n' (CRLF) line-endings,
MonetDB's copy into need to specify
delimiter '\r\n' when loading from a plain (uncompressed) CSV file, and
delimiter   '\n' when loading from a compressed CSV file.

when using '\r\n' with compressed files, the error reported here occurs, and
when using   '\n' on plain (uncompressed) CSV files, '\r' occurs in the loaded data (provided the last column is a clob/string/varchar/char(-like; i.e., non-numerical) column).

Please take this as an implicit documentation of this current "inconsistency" in the MonetDB code base.