Hi Stefan

I've reimported my 350M rows on a different machine (rather than my workstation) and I'm re-running the simple query:

sql>select count(*) from t_order2 where orderid = 'XXXX';

It's still taking hours on a RAID SSD box with ~130GB RAM, and it's generating a multi-GB thash file.  CPU usage is very low throughout.

Whereas this one completes in seconds on the first run and milliseconds thereafter:

sql>select count(*) from t_order2 where orderid = 'XXXX' and time > '1970-01-01';
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (6.1s)
sql>select count(*) from t_order2 where orderid = 'XXXX' and time > '1970-01-01';
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (465.389ms)

The explain output for the first query is slightly different from when I was running it locally:

sql>explain select count(*) from t_order2 where orderid = 'XXXX';
+-----------------------------------------------------------------+
| mal                                                             |
+=================================================================+
| function user.s6_2{autoCommit=true}(A0:str):void;               |
|     X_3 := sql.mvc();                                           |
|     X_7 := sql.bind(X_3,"sys","t_order2","orderid",0);          |
|     X_4:bat[:oid,:oid]  := sql.tid(X_3,"sys","t_order2");       |
|     X_37 := algebra.subselect(X_7,X_4,A0,A0,true,true,false);   |
|     (X_10,r1_10) := sql.bind(X_3,"sys","t_order2","orderid",2); |
|     X_38 := algebra.subselect(r1_10,A0,A0,true,true,false);     |
|     X_13 := sql.bind(X_3,"sys","t_order2","orderid",1);         |
|     X_39 := algebra.subselect(X_13,X_4,A0,A0,true,true,false);  |
|     X_15 := sql.subdelta(X_37,X_4,X_10,X_38,X_39);              |
|     X_17 := sql.projectdelta(X_15,X_7,X_10,r1_10,X_13);         |
|     X_18 := aggr.count(X_17);                                   |
|     sql.exportValue(1,"sys.L1","L1","wrd",64,0,6,X_18,"");      |
| end s6_2;                                                       |
| # optimizer.mitosis()                                           |
| # optimizer.dataflow()                                          |
+-----------------------------------------------------------------+
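
I notice the plan only mentions optimizer.mitosis() and optimizer.dataflow() in the trailing comments.  If it would help isolate things, I'm happy to re-run the slow query with a different optimizer pipeline - my understanding (possibly wrong) is that something like this disables the parallel rewrites for the session:

sql>set optimizer = 'sequential_pipe';
sql>select count(*) from t_order2 where orderid = 'XXXX';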

Is that thash file a lazily created index that'll speed up subsequent queries, or will it be recreated on every run?  If the former, is there any way to specify up front that I want it created and maintained?
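
If it is a persistent, lazily built hash, I'd presumably be able to pre-warm it straight after the bulk load with a throwaway point lookup - something like this, assuming the hash built by the first equality selection is kept and reused:

sql>-- throwaway lookup purely to trigger the lazy hash build on orderid
sql>select count(*) from t_order2 where orderid = '';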

Is there some documentation I can read about the on-disk format?  What is the theap file?  For my orderid column it's much larger than rows x VARCHAR(6) (7?).  Is it some potentially sparse free store used for memory mapping?
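
If it helps, I can also dump the per-column sizes - assuming my build has the storage() introspection function (I may have the name wrong):

sql>-- per-column on-disk sizes, including the string heaps, for t_order2
sql>select * from storage() where "table" = 't_order2';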

Is there any other information I can provide that might help track down why my first query is so slow?  I'm really keen to use MonetDB in an app I'm developing, and being able to satisfy simple point lookups as well as the lightning-fast aggregations it already handles would be very useful!
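
For example, I could run the slow query under TRACE and send over the per-instruction timings:

sql>trace select count(*) from t_order2 where orderid = 'XXXX';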

-Will

On 29 November 2013 10:54, Stefan Manegold <Stefan.Manegold@cwi.nl> wrote:
Will,

did / can you run your first (slow) query twice in a row and report the speeds of both runs?

Thanks!
Stefan

----- Original Message -----
>
>
> On 29/11/13 11:36, Will Muldrew wrote:
> > Hi
> >
> > I've been trying out monetdb on a 350M row table (modern linux
> > workstation with 24GB RAM + SSD).  This table has a sorted TIMESTAMP
> > column and a second unsorted VARCHAR(6) column which is non-unique
> > (perhaps 50M distinct values fairly randomly distributed).
> >
> > This query is very slow, spends a huge amount of time writing, and has
> > very low CPU (intermediate results?):
> >
> >    select count(*) from t_order where orderId = 'XXXXXX'
> >
> > This takes hours to complete.
> >
> > However, this query is a lot faster, even though the timestamp clause is
> > redundant (all my data is > '2012-01-01'):
> It decided it can use parallel processing effectively.
> Why it did not in the first query is unknown.
>
> >
> >    select count(*) from t_order where orderId = 'XXXXXX' and time >
> > '1970-01-01';
> >
> > The first run is slowish (tens of seconds - I guess because we're reading
> > the orderId column fully), but subsequent runs are sub-second.
> Yes, that is a caching effect.
> >
> > Obviously I could write an app which puts the spurious timestamp clause
> > into the where, but it's still confusing and not ideal.  Is there
> > something I can do?
> >
> regards, Martin

--
| Stefan.Manegold@CWI.nl | DB Architectures   (DA) |
| www.CWI.nl/~manegold/  | Science Park 123 (L321) |
| +31 (0)20 592-4212     | 1098 XG Amsterdam  (NL) |
