Hi Will,

Thanks for the update!

I suspect that the first query triggers the build of a hash index for the point (equality) predicate on the orderid column.
The index will then be used by subsequent queries on the same column --- at least as long as the server is running;
for now, indexes are not made persistent and do not survive a server restart.
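You should be able to see this effect by running the very same point query twice within one server session (sketched below with your table; the second run should then hit the already built index):

sql>select count(*) from t_order2 where orderid = 'XXXX'; -- first run: pays for building the hash index
sql>select count(*) from t_order2 where orderid = 'XXXX'; -- second run: should reuse the index and be much faster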

Building such an index for a large string (varchar) column appears to be quite expensive ---
we'll have to investigate whether/where there is room for improvement.

Having said that, I'm wondering whether your (unique?) orderids are indeed alphanumeric,
and thus need to be of type varchar, or whether they are purely numeric, and thus could be of type
integer?
The latter should show better performance and require less storage ...
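If they are indeed purely numeric, the conversion could be as simple as the following sketch (the name t_order2_int is made up; BIGINT assumes the values might not fit into a 32-bit integer):

CREATE TABLE t_order2_int AS
  SELECT CAST(orderid AS BIGINT) AS orderid, "time"
    FROM t_order2
  WITH DATA;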

With your second query, our optimizer favors the inequality predicate on time,
because that is of type timestamp, which is basically an integer.
It will thus first scan the time column (unless the loaded data is ordered,
in which case it will even use binary search), and only then refine the result with the
equality predicate on orderid (only for those tuples that qualify the time predicate).
Thus, in this case, a hash index would not be of any benefit (in the future),
and hence will not be built automatically on the fly.
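This also explains the workaround you already found: the (logically redundant) range predicate on the sorted time column steers the plan towards the cheap scan / binary search first:

sql>select count(*) from t_order2 where orderid = 'XXXX' and time > '1970-01-01';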

Storage footprint:
For fixed-width (numerical) types, MonetDB basically uses a C array of the respective type
to store the data; hence, the footprint of a column is #tuples * width_of_type .
For variable-width data types (basically all variants of strings, also if defined with a fixed length),
MonetDB uses a dictionary-like encoding, taking 1-8 bytes per tuple (depending on the number of distinct values
and the length of the values) for an internal dictionary index, plus the actual values (the latter possibly
best-effort duplicate-eliminated --- the details are quite "complicated" and for now only 'documented'
in the code ...).
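For example, your sorted timestamp column should take 350 M tuples * 8 bytes = ~2.8 GB, while for the orderid column the .theap file you asked about is exactly that heap of actual string values, next to the per-tuple dictionary index. In case your MonetDB version provides it, the sys.storage() function reports the actual per-column footprint (a sketch; the exact set of result columns may differ between versions):

sql>select "column", count, typewidth, columnsize, heapsize from sys.storage() where "table" = 't_order2';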

Hope this explains things for now.

It would still be great if you could let the first run of the first query finish,
and then run it a second time (without restarting the server) and report both times
--- provided you have the time and resources to block your machine for some
more time ...

Best,
Stefan


Will Muldrew <will.muldrew@gmail.com> wrote:
Hi Stefan

I've reimported my 350M rows on a different machine (rather than my workstation) and I'm re-running the simple query:

sql>select count(*) from t_order2 where orderid = 'XXXX';

It's still taking hours on a RAID SSD, ~130GB RAM box and generating a multi GB thash file.  CPU usage is very low.

Whereas this one is taking seconds/millis:

sql>select count(*) from t_order2 where orderid = 'XXXX' and time > '1970-01-01';
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (6.1s)
sql>select count(*) from t_order2 where orderid = 'XXXX' and time > '1970-01-01';
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (465.389ms)

The explain output for the first query is slightly different from when I was running it locally:

sql>explain select count(*) from t_order2 where orderid = 'XXXX';
+-----------------------------------------------------------------+
| mal                                                             |
+=================================================================+
| function user.s6_2{autoCommit=true}(A0:str):void;               |
|     X_3 := sql.mvc();                                           |
|     X_7 := sql.bind(X_3,"sys","t_order2","orderid",0);          |
|     X_4:bat[:oid,:oid]  := sql.tid(X_3,"sys","t_order2");       |
|     X_37 := algebra.subselect(X_7,X_4,A0,A0,true,true,false);   |
|     (X_10,r1_10) := sql.bind(X_3,"sys","t_order2","orderid",2); |
|     X_38 := algebra.subselect(r1_10,A0,A0,true,true,false);     |
|     X_13 := sql.bind(X_3,"sys","t_order2","orderid",1);         |
|     X_39 := algebra.subselect(X_13,X_4,A0,A0,true,true,false);  |
|     X_15 := sql.subdelta(X_37,X_4,X_10,X_38,X_39);              |
|     X_17 := sql.projectdelta(X_15,X_7,X_10,r1_10,X_13);         |
|     X_18 := aggr.count(X_17);                                   |
|     sql.exportValue(1,"sys.L1","L1","wrd",64,0,6,X_18,"");      |
| end s6_2;                                                       |
| # optimizer.mitosis()                                           |
| # optimizer.dataflow()                                          |
+-----------------------------------------------------------------+

Is that thash file a lazily created index that'll speed up subsequent queries, or will I be creating it every time?  If the former, is there any way to specify up front that you want to create and maintain it?

Is there some documentation I can read about the on-disk format?  What's theap?  For my orderId column it's much larger than rows x VARCHAR(6) (7?).  Is it some potentially sparse free store for mem mapping?

Is there any other information I can provide that might help find out why my first query is so slow?  I'm really keen to use monetdb in an app I'm developing, but being able to satisfy simple lookups as well as the lightning fast aggregations it does would be very useful!

-Will

On 29 November 2013 10:54, Stefan Manegold <Stefan.Manegold@cwi.nl> wrote:
Will,

did / can you run your first (slow) query twice in a row and report the speeds of both runs?

Thanks!
Stefan

----- Original Message -----
>
>
> On 29/11/13 11:36, Will Muldrew wrote:
> > Hi
> >
> > I've been trying out monetdb on a 350M row table (modern linux
> > workstation with 24GB RAM + SSD).  This table has a sorted TIMESTAMP
> > column and a second unsorted VARCHAR(6) column which is non-unique
> > (perhaps 50M distinct values fairly randomly distributed).
> >
> > This query is very slow, spends huge amount of time writing, and has
> > very low CPU (intermediate results?):
> >
> >    select count(*) from t_order where orderId = 'XXXXXX'
> >
> > This takes hours to complete.
> >
> > However, this query is a lot faster, even though the timestamp clause is
> > redundant (all my data is > '2012-01-01'):
> It decided it can use parallel processing effectively.
> Why it did not in the first query is unknown.
>
> >
> >    select count(*) from t_order where orderId = 'XXXXXX' and time >
> > '1970-01-01';
> >
> > The first run is slowish (10s of seconds - I guess because we're reading
> > the orderId column fully), however subsequent runs are sub-second.
> Yes, that is caching effect.
> >
> > Obviously I could write an app which puts the spurious timestamp clause
> > into the where, but it's still confusing and not ideal.  Is there
> > something I can do?
> >
> regards, Martin
> _______________________________________________
> users-list mailing list
> users-list@monetdb.org
> https://www.monetdb.org/mailman/listinfo/users-list
>

--
| Stefan.Manegold@CWI.nl | DB Architectures   (DA) |
| www.CWI.nl/~manegold/  | Science Park 123 (L321) |
| +31 (0)20 592-4212     | 1098 XG Amsterdam  (NL) |

_______________________________________________
users-list mailing list
users-list@monetdb.org
https://www.monetdb.org/mailman/listinfo/users-list




--
| Stefan.Manegold@CWI.nl | Database Architectures (DA) |
| www.CWI.nl/~manegold | Science Park 123 (L321) |
| +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |