Bug 6753

Summary: SQL COPY INTO FROM compressed (bz2,xz,gz) file fails
Product: SQL Reporter: Stefan Manegold <Stefan.Manegold>
Component: allAssignee: SQL devs <bugs-sql>
Status: NEW ---    
Severity: normal    
Priority: Normal    
Version: 11.33.3 (Apr2019)   
Hardware: Other   
OS: Linux   

Description Stefan Manegold cwiconfidential 2019-08-20 13:56:17 CEST
trying to bulk-load from a compressed (bz2,xz,gz) CSV file using

COPY OFFSET 3 INTO "data" FROM '$PWD/data.csv.bz2' DELIMITERS ',','\r\n','' NULL AS '';

fails with

"Failed to import table 'data', incomplete record at end of file"


The problem occurs with Apr2019_release (Apr2019_3), Apr2019_SP1_release (APR2019_7) and the latest head of the Apr2019 branch (changeset ad472c9f29d3).
Comment 1 Stefan Manegold cwiconfidential 2019-08-20 14:03:03 CEST
Please find a test at http://homepages.cwi.nl/~manegold/Bug-6753.tar
Comment 2 Sjoerd Mullender cwiconfidential 2019-08-20 14:15:33 CEST
Try delimiter '\n' instead of '\r\n'.
Comment 3 Stefan Manegold cwiconfidential 2019-08-20 16:13:51 CEST
Thanks!

Indeed, all MonetDB decompression code seems to silently replace '\r\n' (CRLF) by '\n' (LF) during decompression (at least on non-Windows systems, i.e., on systems where '\n' (LF) is the standard end-of-line marker, rather than '\r\n' (CRLF)).

The plain (non-compressed) CSV reader, however, seems to not do such replacement.

Consequently, when loading from CSV files that do contain '\r\n' (CRLF) line-endings,
MonetDB's copy into need to specify
delimiter '\r\n' when loading from a plain (uncompressed) CSV file, and
delimiter   '\n' when loading from a compressed CSV file.

Otherwise,
when using '\r\n' with compressed files, the error reported here occurs, and
when using   '\n' on plain (uncompressed) CSV files, '\r' occurs in the loaded data (provided the last column is a clob/string/varchar/char(-like; i.e., non-numerical) column).

Please take this as an implicit documentation of this current "inconsistency" in the MonetDB code base.