A good strategy for loading bulk data into monetdb
nikkne at gmx.ch
Mon Apr 25 11:45:06 CEST 2016
I'm trying to load large amounts of data into MonetDB (migrating from MySQL). Since I'm using the Jul2015-SP4 release, can't create local files on the server, and can't use external tools, my approach is as follows:
For each table, I partition the data based on the day, and then:
10 create one table in MonetDB for each day, plus a merge table that combines all these daily tables
20 fetch data from mysql (mysql -B -N -C ...)
30 _pipe_ data to mclient (that is invoked with mclient ... -c "COPY 2000000 RECORDS FROM STDIN INTO ..." -)
40 load data into corresponding table (the right day)
50 sleep some seconds
60 goto 20
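In shell terms, the loop above looks roughly like this. All names (events, mydb, the ts/val columns, the day predicate) are placeholders for illustration, not my real schema, and I use mclient's -s statement flag here:

```
#!/bin/sh
# Rough sketch of steps 10-60; table, column and database names are made up.
DB=some_schema

# step 10: one physical table per day, plus a merge table over them
mclient -d "$DB" -s "CREATE TABLE events_20160425 (ts TIMESTAMP, val INT)"
mclient -d "$DB" -s "CREATE MERGE TABLE events (ts TIMESTAMP, val INT)"
mclient -d "$DB" -s "ALTER TABLE events ADD TABLE events_20160425"

# steps 20-40: stream one day's rows from MySQL straight into MonetDB
# (mysql -B -N gives headerless tab-separated rows, matching the delimiters)
mysql -B -N -C mydb -e "SELECT ts, val FROM events WHERE day = '2016-04-25'" \
  | mclient -d "$DB" \
      -s "COPY 6000000 RECORDS FROM STDIN INTO events_20160425 USING DELIMITERS '\t','\n'" -

sleep 10   # step 50, then repeat (step 60) for the next day
```

The RECORDS count and delimiters would of course need adjusting per table.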
Each daily table is around 6'000'000 rows, but I have some that have ~150'000'000 rows.
However, I noticed that when I do this, I quickly run out of disk space, because MonetDB creates very large logs (in /data/monetdb/some_schema/sql_logs). That folder is normally around 600 KB, but once the migration starts it grows to several GB. Not only that: /data/monetdb/some_schema/bat also grows a lot.
The surprising part, and the motivation for this email, is how well MonetDB compresses the data. If I migrate only one day's worth of data, then restart monetdbd (once or twice) and run 'analyze some_schema', the sql_logs size drops back to normal levels and I reclaim a lot of space (even in the bat directory). I tried adding a sleep(120) after each partition and running 'analyze some_schema' after each iteration, but couldn't get the same behaviour (and it is also quite slow). Is there a way to trigger this compression and reduce the disk space overhead without restarting mserver, given the above constraints?
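For reference, the only sequence that reliably reclaims the space for me is the full restart plus ANALYZE described above, i.e. roughly (dbfarm path as in my setup):

```
# stop/start the daemon on the dbfarm, then re-run ANALYZE;
# after this, sql_logs shrinks back and space in bat/ is reclaimed
monetdbd stop /data/monetdb
monetdbd start /data/monetdb
mclient -d some_schema -s "ANALYZE some_schema"
```

This is obviously not something I want to script between every partition, hence the question.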