Hi All, 

I just tried MonetDB with a large table (600,000 rows by 50,000 columns).  There are two issues:
1. Performance was low: it took about 4 seconds to insert a single row when the number of columns was large (50,000), so loading all 600,000 rows would take roughly a month at that rate. Insertion was reasonably fast (about 100 rows per second) when the number of columns was small (for example, 10).
2. Serious errors occurred very often. For example, after inserting a random number of records (around 50), monetdbd started to report errors like:
     monetdbd: internal error while starting mserver, please refer to the logs
   So I opened the log file, and the error was:
    2012-09-14 08:26:36 ERR merovingian[13730]: client error: unknown or impossible state: 4
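
For reference, each generated insert_data.sql (see the insert_data_sql target in the Makefile below) contains one wide statement with a literal per column, i.e. (truncated here for readability):

    INSERT INTO "test_big_table" VALUES (0.323, 0.323, ..., 0.324);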

The Makefile listed below (also attached) will create a database farm, start the database server, create a database, create a big table, and then insert records into that table.  Here it is:

#=====start of Makefile=====================
ROW_NUM=600000
COLUMN_NUM=50000
COLUMN_NUM_MINUS_ONE=$(shell echo $(COLUMN_NUM)-1 | bc)

all: clean
	monetdbd create ./mydbfarm
	monetdbd start ./mydbfarm
	monetdb create test_big_db
	monetdb release test_big_db
	$(MAKE) -s create_table_sql > create_big_table.sql
	$(MAKE) -s insert_data_sql > insert_data.sql
	mclient -d test_big_db create_big_table.sql
	for i in $$(seq 1 $(ROW_NUM)); do \
	    mclient -d test_big_db insert_data.sql; \
	    echo "done inserting $$i rows"; \
	done
	mclient -d test_big_db -s "select * from test_big_table" > test_big_table.txt

create_table_sql:
	@echo 'START TRANSACTION;'
	@echo 'CREATE TABLE "test_big_table" ('
	@for i in $$(seq 1 $(COLUMN_NUM_MINUS_ONE)); do \
	    echo '    "gen'$$i'" DOUBLE,'; \
	done
	@echo '    "gen$(COLUMN_NUM)" DOUBLE'
	@echo ');'
	@echo 'COMMIT;'

insert_data_sql:
	@printf 'INSERT INTO "test_big_table" VALUES ('
	@for j in $$(seq 1 $(COLUMN_NUM_MINUS_ONE)); do \
	    printf '0.323, '; \
	done
	@echo '0.324);'

insert_all_data_sql:
	@echo 'START TRANSACTION;'
	@for i in $$(seq 1 $(ROW_NUM)); do \
	    printf 'INSERT INTO "test_big_table" VALUES ('; \
	    for j in $$(seq 1 $(COLUMN_NUM_MINUS_ONE)); do \
	        printf '0.323, '; \
	    done; \
	    echo '0.324);'; \
	done
	@echo 'COMMIT;'

clean:
	-killall mserver5
	-killall monetdbd
	rm -rf ./mydbfarm
	rm -f create_big_table.sql insert_data.sql test_big_table.txt

#=====end of Makefile======================
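
As an aside, per-row INSERTs over a fresh mclient connection each time is probably the slowest possible load path. For comparison in the benchmark, here is a sketch of a bulk-load variant using MonetDB's COPY INTO (the bulk_load target name and the /tmp paths are my own choices, and the CSV file must be readable by the server process):

#=====sketch: bulk load via COPY INTO=========
bulk_load:
	@# build one CSV line with COLUMN_NUM comma-separated values
	@seq 1 $(COLUMN_NUM) | sed 's/.*/0.323/' | paste -sd, - > /tmp/row.csv
	@# repeat that line ROW_NUM times to form the full data file
	@for i in $$(seq 1 $(ROW_NUM)); do cat /tmp/row.csv; done > /tmp/big.csv
	@mclient -d test_big_db -s \
	  "COPY $(ROW_NUM) RECORDS INTO test_big_table FROM '/tmp/big.csv' USING DELIMITERS ',','\n'"
#=====end of sketch===========================

My understanding is that COPY INTO parses the file server-side in one pass, so it should give a fairer picture of what the engine can do than one INSERT per round trip.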


Thank you very much!


Best wishes,
Joshua Shuai Yuan


On Thu, Sep 13, 2012 at 12:43 PM, Joshua Shuai Yuan <shuaiyuan.emory@gmail.com> wrote:
Sure, I will see what I can do.

Best wishes,
Joshua Shuai Yuan


On Wed, Sep 12, 2012 at 11:40 PM, Wang, Fusheng <fusheng.wang@emory.edu> wrote:

Hi Joshua,

The table we are managing is on the scale of 30K x 5K: 150M cells. If each table cell (double type) needs 8 bytes to represent (ideally), the space needed will be a couple of GB, or at most on the scale of tens of GB. If that is the case, a distributed setup may not be needed, as the data can mostly be cached by the database. MonetDB does support multiple cores and multiple disks, but to my knowledge a setup across multiple machines is not supported.
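(Concretely: 30,000 x 5,000 = 150,000,000 cells, and 150,000,000 x 8 bytes is about 1.2 GB of raw data.)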

Even though MonetDB claims to support an unlimited number of columns, we should still be cautious about what performance we can achieve for the queries we want to provide. A pilot study could give us some guidelines. Do you think you can set up MonetDB and create a benchmark table we can run some performance studies on?

Interestingly, the many-columns issue is also discussed on a well-known database blog:

http://www.dbms2.com/2011/03/13/so-how-many-columns-can-a-single-table-have-anyway/

One guy commented:

“Genomics models were the primary driver. these folks typically have short but wide datasets of the order of 300,000 columns.”

So we are not alone. And we need only a fraction of that column count (30K versus 300K), so we are lucky!

Fusheng

From: Joshua Shuai Yuan [mailto:shuaiyuan.emory@gmail.com]
Sent: Wednesday, September 12, 2012 11:14 PM
To: Wang, Fusheng
Cc: Qin, Zhaohui
Subject: Re: Array database for large matrix

Hi Dr. Wang,

That's really good news.  Does it support distributed databases?  Or do we need a distributed one?

Best wishes,
Joshua Shuai Yuan

On Wed, Sep 12, 2012 at 3:54 PM, Wang, Fusheng <fusheng.wang@emory.edu> wrote:

Hi guys,

It looks like the matrix structure can be nicely supported by array databases like MonetDB, which supports an unlimited number of columns per table. It's also open source.

http://www.monetdb.org/Home/Features

I know the group quite well, and I will chat with them about the use case to see if it's a good fit.

Thanks,

Fusheng
