Hi,

I posted a question on StackOverflow regarding a specific performance issue. I'm not sure whether the right people monitor S/O regularly, so I thought I'd post it here too.

Here it is in case you do not want to follow the link:

I'm testing MonetDB as a solution for a data-science project. I have a table with 21K columns and 6.5M rows (which may or may not grow, perhaps up to 20M rows). All but three of the columns are features stored as 32-bit floats.

My aim is to use the integrated Python in MonetDB so that I can train without exporting the data from the DB every time. In addition, I need to query specific columns, so the columnar storage can be a significant advantage. I compiled MonetDB 11.31.13 to get the embedded Python support. The OS is CentOS 7 and the storage is not SSD. The server has 48 cores and ~300GB of memory. I created a (unique) index on the table (without running analyze).

I noticed that when I run

SELECT * FROM [TABLE_NAME] SAMPLE 50;

it takes a very long time to complete. I then tried:

SELECT f1, f2, ..., f501 FROM [TABLE_NAME] SAMPLE 50;

SELECT f1, f2, ..., f1001 FROM [TABLE_NAME] SAMPLE 50;

SELECT f1, f2, ..., f2001 FROM [TABLE_NAME] SAMPLE 50;

...

SELECT * FROM [TABLE_NAME] SAMPLE 50;
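
For anyone who wants to reproduce the test, here is a minimal sketch of how the column lists in these queries could be generated rather than typed by hand. It assumes the feature columns are literally named f1, f2, ..., fN as in the queries above; the helper name build_sample_query and the table name my_table are hypothetical placeholders, not part of my actual setup.

```python
def build_sample_query(table_name: str, n_columns: int, sample_rows: int = 50) -> str:
    """Build a SELECT over columns f1..f<n_columns> with a SAMPLE clause.

    Assumes the feature columns are named f1, f2, ..., fN (an assumption
    taken from the queries in this post).
    """
    columns = ", ".join(f"f{i}" for i in range(1, n_columns + 1))
    return f"SELECT {columns} FROM {table_name} SAMPLE {sample_rows};"


# Column counts used in the test steps; "my_table" is a placeholder name.
for n in (501, 1001, 2001):
    query = build_sample_query("my_table", n)
    # Print only the start of each query, since the full text is very long.
    print(n, query[:40] + " ...")
```

Each generated string can then be passed to mclient in the same way as the hand-written queries.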

I ran the queries locally with mclient and used time to measure how long each one took. I noticed two things:

1.   There is a period during which a single core is at 100% CPU. The more columns, the longer this period lasts. Only when it finishes do I see all cores working, data being consumed, etc. In addition, during that period the query does not appear in the result of select * from sys.queue(); In the end, getting 50 rows from the table took almost 4 hours.

2.   The number of columns doubles between steps in the test, but the time it takes to get a result triples.

So my question is: is this behaviour expected, or does it reflect something I did wrong?

The data requested from the table should be around 4MB (50 * 21000 * 4 bytes), so this is a very long wait for such a small amount of data.
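
To put the second observation in numbers: if one assumes the time follows a power law in the number of columns (time proportional to columns^k, which is only an assumption on my part), then "columns double, time triples" corresponds to an exponent of log(3)/log(2). A small sketch, with the hypothetical helper name scaling_exponent:

```python
import math


def scaling_exponent(column_ratio: float, time_ratio: float) -> float:
    """Exponent k such that time_ratio == column_ratio ** k.

    Assumes a power-law relationship between column count and query time;
    this is an illustrative model, not something measured in detail.
    """
    return math.log(time_ratio) / math.log(column_ratio)


# Columns double (x2) while time triples (x3) between test steps.
k = scaling_exponent(2.0, 3.0)
print(f"exponent k = {k:.3f}")  # prints "exponent k = 1.585"
```

In other words, under that assumption the time grows faster than linearly in the number of columns, which is what makes the full 21K-column query so slow.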
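
The size estimate can be checked with a few lines of Python. This is only a rough sketch: it treats all 21000 columns as 4-byte values, although three of them are not float features, and it ignores any protocol or formatting overhead.

```python
rows = 50            # rows returned by SAMPLE 50
columns = 21_000     # approximate column count (all treated as 32-bit floats)
bytes_per_value = 4  # size of a 32-bit float

total_bytes = rows * columns * bytes_per_value
print(total_bytes)                        # 4200000
print(f"{total_bytes / 1e6:.1f} MB")      # 4.2 MB (decimal megabytes)
print(f"{total_bytes / 2**20:.2f} MiB")   # about 4 MiB (binary megabytes)
```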

Help is very much appreciated!