Pathological behaviour when building large hash index
Hi
I've been investigating very slow queries on my 350M row dataset when selecting for equality on a non-unique string id. e.g.
select count(*) from t_order where orderId = 'FOO';
This query takes ~4.5 hours to build a thash index. Subsequent repeats hit the cache and are sub-millisecond.
select count(*) from t_order where time > '1970-01-01' and orderId = 'FOO'
This second one is consistently finished in a few seconds or less, and doesn't create any indexes.
The first query is a bit of a show-stopper. It spends a huge amount of time in BAThash (gdk_search.c), and incurs a lot of page faults while reading and writing at random across a multi-GB mmapped thash file. Additionally, these page faults are hampered by the OS trying to write dirty pages out in the background. I've got plenty of RAM overall (130GB), but the 'active' page proportion for the thash file seems to be stuck at about 25% - giving me a 0.75 probability of a miss.
- Is there some way to make my OS less keen to evict pages? (swappiness = 0 already) I should have plenty of room to have the whole thash file resident. Additionally, allowing it to sit on dirty pages longer would reduce the total write IO.
- Is there some way to simply prevent large thash creation? For many applications I'd rather have slower consistent queries than incur a massively slow query following a restart (not great in a prod environment!)
- Is there some way to generate the hash index with a more sympathetic algo that doesn't degrade so steeply? e.g. some hash + sort. (I have literally no experience with this!)
-Will Muldrew
Hi Will,
could you please also share with us which version of MonetDB you are using, and OS you are running?
Please find more comments & questions inlined below.
----- Original Message -----
Hi
I've been investigating very slow queries on my 350M row dataset when selecting for equality on a non-unique string id. e.g.
select count(*) from t_order where orderId = 'FOO';
Is your orderId column unique, or are there duplicate values?
This query takes ~4.5 hours to build a thash index. Subsequent repeats hit the cache and are sub-millisecond.
select count(*) from t_order where time > '1970-01-01' and orderId = 'FOO'
This second one is consistently finished in a few seconds or less, and doesn't create any indexes.
The first query is a bit of a show-stopper. It spends a huge amount of time in BAThash (gdk_search.c), and incurs a lot of page faults while reading and writing at random across a multi-GB mmapped thash file. Additionally, these page faults are hampered by the OS trying to write dirty pages out in the background. I've got plenty of RAM overall (130GB), but the 'active' page proportion for the thash file seems to be stuck at about 25% - giving me a 0.75 probability of a miss.
- Is there some way to make my OS less keen to evict pages? (swappiness = 0 already) I should have plenty of room to have the whole thash file resident. Additionally, allowing it to sit on dirty pages longer would reduce the total write IO.
In fact, if there is sufficient memory in total, there would be no need to flush dirty (non-persistent) pages at all, but we have not yet found a good way to hint this to the OS --- madvise() did not have much/any useful effect for us on Linux.
Maybe some of the issues discussed here play a role: http://engineering.linkedin.com/performance/optimizing-linux-memory-manageme... ?
While we have not yet encountered the NUMA problem reported there, we have encountered the transparent huge page problems, and usually disable them as also suggested in the above posting.
- Is there some way to simply prevent large thash creation? For many applications I'd rather have slower consistent queries than incur a massively slow query following a restart (not great in a prod environment!)
Not easily right now. However, you can either add the "fake" time predicate as you did above, which (in this case) triggers MonetDB's optimizer to push the time predicate (either because it's a timestamp (int) rather than a varchar (string), or because the time column happens to be ordered) below the orderId predicate and thus avoids the hash build, or change the following line(s) in gdk/gdk_select.c (in case you built MonetDB yourself from source):
    hash = b->batPersistence == PERSISTENT &&
           (size_t) ATOMsize(b->ttype) > sizeof(BUN) / 4 &&
           BATcount(b) * (ATOMsize(b->ttype) + 2 * sizeof(BUN)) < GDK_mem_maxsize / 2;
->
    hash = FALSE;
We'll consider whether we can provide a better solution in a future release.
- Is there some way to generate the hash index with a more sympathetic algo that doesn't degrade so steeply? e.g. some hash + sort. (I have literally no experience with this!)
We'll also look into this. However, building a hash index inherently incurs random access, so chances are slim ...
Best, Stefan
Hi Stefan
Here are my version details:
MonetDB-11.15.17, CentOS release 6.3, kernel 3.6.2
I think I might proceed with a patched version with the hash disabled. That's totally fine for me right now, though obviously it'd be nice to be on the trunk!
W.r.t. a less random-access hash generation - this is all very hand-wavey, but how about: 1) generate a sequence of (hash, oid) pairs to some file or mmapped area; 2) use some well-known "external sort" algo to sort these pairs by hash (e.g. split, qsort, merge); 3) build your final hash by iterating through this sorted data structure.
Your memory writing would then be sequential and you'd get good page hit characteristics. You just need to choose a decent sort algo - ideally one which works transparently over mmapped data.
What do you think? I might even have a crack at it myself :)
-Will
Hi Stefan
I've added that hash = FALSE patch, and now my simple query returns reliably in 6 seconds. It's still a lot slower than the indexed one (obviously), and 12x slower than the one with the time > '1970-01-01' optimiser nudge (which causes parallelism), but it'll do for now.
Thanks
-Will
Hi Will,
thanks for reminding me: the optimizer still sees your query as a point query and thus prevents parallelization in order to build/use the hash index.
Maybe you can "fool" the optimizer (and I hope no one else is reading what I'm writing now ... ;-)) by "disguising" the point predicate as a range predicate, say, "where orderId between 'FOO' and 'FOO'" --- in fact, I rather hope (for us) that it will not work ...
Stefan
Sorry!
sql>select count(*) from t_order2 where orderId between 'AAAAAA' and 'AAAAAA';
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (596.202ms)
hm, that's with the index in place?
If so: good for our optimizer, but unfortunately not so good for you ...
If not:
- good for you!
- impressive for our "execution engine" (scan)
(- "embarrassing" for our optimizer ... well, it's a minor detail ...)
;-)
Stefan
No index! On 5 Dec 2013 17:05, "Stefan Manegold" Stefan.Manegold@cwi.nl wrote:
hm, that's with the index in place?
If so: good for our optimizer, but unfortunately not so good for you ...
If not:
- good for you!
- impressive for our "execution engine" (scan)
(- "embarrassing" for our optimizer ... well, it's a minor detail ...)
;-)
Stefan
----- Original Message -----
Sorry!
sql>select count(*) from t_order2 where orderId between 'AAAAAA' and 'AAAAAA';
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (596.202ms)
On 5 December 2013 16:49, Stefan Manegold < Stefan.Manegold@cwi.nl > wrote:
Hi Will,
thanks for reminding me: the optimizer still sees your query as a point query and thus prevents parallelization in order to build/use the hash index.
Maybe you can "fool" the optimizer (and I hope no one else is reading what I'm writing now ... ;-)) by "disguising" the point predicate as a range predicate, say, "where orderId between 'FOO' and 'FOO'" --- in fact, I rather hope (for us) that it will not work ...
Stefan
----- Original Message -----
Hi Stefan
I've added that hash = FALSE patch, and now my simple query returns reliably in 6 seconds. It's still a lot slower than the indexed one (obviously), and 12x slower than the one with the time > '1970-01-01' optimiser nudge (which causes parallelism), but it'll do for now.
Thanks
-Will
On 5 December 2013 16:32, Will Muldrew < will.muldrew@gmail.com > wrote:
Hi Stefan
Here are my version details:
MonetDB-11.15.17, CentOS release 6.3, kernel 3.6.2
I think I might proceed with a patched version with the hash disabled. That's totally fine for me right now, though obviously it'd be nice to be on the trunk!
W.r.t. a less random-access hash generation. This is all very hand-wavey, but how about: 1) generate a sequence of (hash, oid) pairs to some file or mmapped area; 2) use some well-known "external sort" algo to sort these pairs by hash (e.g. split, qsort, merge); 3) build your final hash by iterating through this sorted data structure.
Your memory writing would then be sequential and you'd get good page-hit characteristics. You just need to choose a decent sort algo - ideally one which works transparently over mmapped data.
What do you think? I might even have a crack at it myself :)
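A minimal in-memory Python sketch of those three steps (illustrative only; a real implementation would use an external, out-of-core sort and write MonetDB's actual thash bucket layout, neither of which is shown here):

```python
def build_hash_index(values, n_buckets=8):
    """Sketch of the proposal: hash, sort by hash, then build buckets
    with purely sequential writes.  Toy layout, not MonetDB's thash."""
    # 1) one sequential pass producing (hash, oid) pairs
    pairs = [(hash(v) % n_buckets, oid) for oid, v in enumerate(values)]
    # 2) sort the pairs by hash value; on real data this would be an
    #    external sort (split, qsort, merge) over an mmapped scratch file
    pairs.sort()
    # 3) a final sequential pass groups the oids of each bucket together
    index = {}
    for h, oid in pairs:
        index.setdefault(h, []).append(oid)
    return index

# oids of equal values always land in the same bucket:
idx = build_hash_index(["FOO", "BAR", "FOO", "BAZ"])
```

The point of the detour through the sort is that steps 1 and 3 touch memory strictly sequentially, so dirty pages are written once and in order instead of at random.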
-Will
On 5 December 2013 15:54, Stefan Manegold < Stefan.Manegold@cwi.nl > wrote:
Hi Will,
could you please also share with us which version of MonetDB you are using, and OS you are running?
Please find more comments & questions inlined below.
----- Original Message -----
Hi
I've been investigating very slow queries on my 350M row dataset when selecting for equality on a non-unique string id. e.g.
select count(*) from t_order where orderId = 'FOO';
Is your orderId column unique, or are there duplicate values?
This query takes ~4.5 hours to build a thash index. Subsequent repeats hit the cache and are sub-millisecond.
select count(*) from t_order where time > '1970-01-01' and orderId = 'FOO'
This second one is consistently finished in a few seconds or less, and doesn't create any indexes.
The first query is a bit of a show-stopper. It spends a huge amount of time in BAThash (gdk_search.c), and incurs a lot of page faults while reading and writing at random across a multi-GB mmapped thash file. Additionally these page faults are hampered by the OS trying to write dirty pages out in the background. I've got plenty of RAM overall (130GB), but the 'active' page proportion for the thash file seems to be stuck at about 25% - giving me a 0.75 probability of a miss.
- Is there some way to make my OS less keen to evict pages? (swappiness = 0 already) I should have plenty of room to have the whole thash file resident. Additionally, allowing it to sit on dirty pages longer would reduce the total write IO.
In fact, if there is sufficient memory in total, there would be no need to flush dirty (non-persistent) pages at all, but we have not yet found a good way to hint this to the OS --- madvise() did not have much/any useful effect for us on Linux.
Maybe some of the issues discussed here play a role: http://engineering.linkedin.com/performance/optimizing-linux-memory-manageme... ?
While we have not yet encountered the NUMA problem reported there, we have encountered the transparent huge page problems, and usually disable them as also suggested in the above posting.
w.r.t. the slow hash index creation, I did find an OS setting that improved things dramatically - first hash building query is now ~3 minutes rather than 4.5 hrs.
I found that my /proc/sys/vm/dirty_background_ratio setting was 0, and so the OS kicked in and started writing as soon as pages were dirty. This gets in the way of the page-fault servicing. I set it to 20, which now means the OS doesn't start flushing for a while, so mserver gets to max out CPU dirtying up pages during the thash creation. A little later the flusher kicks in and mserver slows down to ~0.1 CPU; flushing completes, mserver picks back up to 100%, etc. Overall we make much quicker progress.
I'm still confused by why my OS doesn't seem to want to keep all the pages of thash in memory. If I look in /proc/{pid}/numa_maps I can see that during the thash creation, the 'active' page count for the thash mem-mapped file (which I believe is the memory-resident portion) doesn't want to increase to more than ~0.25 of the total 'mapped' page count. My assumption is that if this proportion were closer to 1.0, then my memory access wouldn't be causing page faults in the first place. I've got lots of RAM spare, and am using interleaved NUMA settings, so I don't understand what's preventing 'active' from growing. Does anyone have an idea, or know how I can diagnose this?
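One way to watch that ratio over time is to parse the numa_maps line for the thash mapping. A hedged sketch: the mapped=/active= key=value fields follow the Linux numa_maps format, but the example line, path, and values below are invented for illustration:

```python
def active_fraction(numa_maps_line):
    """Compute the active/mapped page ratio for one /proc/<pid>/numa_maps
    line.  Diagnostic sketch, not a MonetDB tool: it just picks out the
    'key=value' fields and divides active pages by mapped pages."""
    fields = dict(f.split("=") for f in numa_maps_line.split() if "=" in f)
    mapped = int(fields.get("mapped", 0))
    active = int(fields.get("active", 0))
    return active / mapped if mapped else 0.0

# Example line shaped like the ones described above (values invented):
line = ("7f0000000000 interleave:0-3 file=/db/t_order.thash "
        "mapped=1000000 active=250000 dirty=300000")
print(active_fraction(line))  # prints 0.25 -- the ~25% ceiling observed
```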
-Will
----- Original Message -----
w.r.t. the slow hash index creation, I did find an OS setting that improved things dramatically - first hash building query is now ~3 minutes rather than 4.5 hrs.
I found that my /proc/sys/vm/dirty_background_ratio setting was 0, and so the OS kicked in and started writing as soon as pages were dirty. This gets in the way of the page fault servicing. I set it to 20, which now means that the OS doesn't start flushing for a while and so mserver gets to max out CPU dirtying up pages during the thash creation. A little bit later, the flusher kicks in and mserver slows down to 0.1 CPU. Flushing completes, mserver picks up to 100%, etc... We make a lot quicker progress.
This is great news. Thank you very much for sharing!
On our Fedora 18 machines, the default dirty_background_ratio appears to be 10.
I'm still confused by why my OS doesn't seem to want to keep all the pages of thash in memory. If I look in /proc/{pid}/numa_maps I can see that during the thash creation, for the thash mem mapped file the 'active' page count (which I believe is the memory resident portion) seems to not want to increase to more than ~0.25 of the total 'mapped' page count. My assumption is that if this proportion was closer to 1.0, then my memory access wouldn't be causing page faults in the first place. I've got lots of RAM spare, and am using interleaved numa settings, so I don't understand what's preventing active growing. Does anyone have an idea, or know how I can diagnose this?
I don't know right now, but I hope we can collect all related experiences and knowledge in this thread and turn it into general advice.
Thanks, again, for your patience and willingness to experiment and share your experiences!
Stefan
-Will
No problem - monetdb is an exciting project!
-Will
Will et al.,
a value "0" for /proc/sys/vm/dirty_background_ratio could also mean that /proc/sys/vm/dirty_background_bytes was set to a non-0 value; cf. https://www.kernel.org/doc/Documentation/sysctl/vm.txt
Did you try whether setting /proc/sys/vm/dirty_background_ratio to a value > 20 would result in the 'active' proportion in /proc/{pid}/numa_maps growing beyond 25%?
Stefan
----- Original Message -----
No problem - monetdb is an exciting project!
On 6 December 2013 11:37, Stefan Manegold < Stefan.Manegold@cwi.nl > wrote:
----- Original Message -----
W.r.t. the slow hash index creation, I did find an OS setting that improved things dramatically - the first hash-building query is now ~3 minutes rather than 4.5 hrs.
I found that my /proc/sys/vm/dirty_background_ratio setting was 0, so the OS kicked in and started writing as soon as pages were dirty. This gets in the way of servicing page faults. I set it to 20, which means the OS doesn't start flushing for a while, so mserver gets to max out CPU dirtying pages during the thash creation. A little later the flusher kicks in and mserver slows down to ~0.1 CPU; flushing completes, mserver picks back up to 100%, and so on. Overall we make much quicker progress.
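For anyone following along, the writeback tunables discussed here can be inspected programmatically. A minimal Python sketch (assuming Linux and the standard /proc/sys/vm layout; the helper name is my own, not MonetDB's):

```python
from pathlib import Path

def read_vm_setting(name):
    """Return a /proc/sys/vm tunable as an int, or None if unavailable."""
    path = Path("/proc/sys/vm") / name
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None

# dirty_background_ratio and dirty_background_bytes are mutually
# exclusive: setting one zeroes the other, which is why a ratio of 0
# can mean "bytes is in charge" rather than "flush immediately".
for name in ("dirty_background_ratio", "dirty_background_bytes",
             "dirty_ratio", "swappiness"):
    print(name, "=", read_vm_setting(name))
```

On a non-Linux box the helper simply returns None for every tunable.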
This is great news. Thank you very much for sharing!
On our Fedora 18 machines, the default dirty_background_ratio appears to be 10.
I'm still confused about why my OS doesn't seem to want to keep all the pages of thash in memory. If I look in /proc/{pid}/numa_maps I can see that during the thash creation, for the thash mem-mapped file the 'active' page count (which I believe is the memory-resident portion) doesn't want to increase beyond ~0.25 of the total 'mapped' page count. My assumption is that if this proportion were closer to 1.0, my memory accesses wouldn't be causing page faults in the first place. I've got lots of RAM spare and am using interleaved NUMA settings, so I don't understand what's preventing 'active' from growing. Does anyone have an idea, or know how I can diagnose this?
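To put numbers on the 'active' vs 'mapped' proportion, the numa_maps tokens can be parsed mechanically. A small Python sketch (the sample line below is invented; the key=value field style follows proc(5)):

```python
def numa_counts(line):
    """Extract the numeric key=value counters from one numa_maps line."""
    counts = {}
    for tok in line.split():
        if "=" in tok:
            key, _, val = tok.partition("=")
            if val.isdigit():
                counts[key] = int(val)
    return counts

# Made-up example line in the numa_maps style (page counts, not bytes):
sample = ("7f2a00000000 interleave:0-3 file=/dbfarm/bat/thash "
          "mapped=1048576 active=262144 dirty=262144 N0=65536 N1=65536")
c = numa_counts(sample)
print(c["active"] / c["mapped"])  # -> 0.25, the stuck proportion
```

In practice one would feed it the real lines from /proc/{pid}/numa_maps and watch the ratio during the index build.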
I don't right now, but I hope we can collect all related experiences and knowledge in this thread and turn it into general advice.
Thanks, again, for your patience and willingness to experiment and share your experiences!
Stefan
-Will
On 5 December 2013 19:57, Will Muldrew < will.muldrew@gmail.com > wrote:
No index! On 5 Dec 2013 17:05, "Stefan Manegold" < Stefan.Manegold@cwi.nl > wrote:
hm, that's with the index in place?
If so: good for our optimizer, but unfortunately not so good for you ...
If not:
- good for you!
- impressive for our "execution engine" (scan)
(- "embarrassing" for our optimizer ... well, it's a minor detail ...)
;-)
Stefan
----- Original Message -----
Sorry!
sql>select count(*) from t_order2 where orderId between 'AAAAAA' and 'AAAAAA';
+------+
| L1   |
+======+
|    0 |
+------+
1 tuple (596.202ms)
On 5 December 2013 16:49, Stefan Manegold < Stefan.Manegold@cwi.nl > wrote:
Hi Will,
thanks for reminding me: the optimizer still sees your query as a point query and thus prevents parallelization in order to build/use the hash index.
Maybe you can "fool" the optimizer (and I hope no one else is reading what I'm writing now ... ;-)) by "disguising" the point predicate as a range predicate, say, "where orderId between 'FOO' and 'FOO'" --- in fact, I rather hope (for us) that it will not work ...
Stefan
----- Original Message -----
Hi Stefan
I've added that hash = FALSE patch, and now my simple query returns reliably in 6 seconds. It's still a lot slower than the indexed one (obviously), and 12x slower than the one with the time > '1970-01-01' optimiser nudge (which causes parallelism), but it'll do for now.
Thanks
-Will
On 5 December 2013 16:32, Will Muldrew < will.muldrew@gmail.com > wrote:
Hi Stefan
Here are my version details:
MonetDB-11.15.17
CentOS release 6.3
kernel 3.6.2
I think I might proceed with a patched version with the hash disabled. That's totally fine for me right now, though obviously it'd be nice to be on the trunk!
W.r.t. a less random-access hash generation: this is all very hand-wavey, but how about:
1) generate a sequence of (hash, oid) pairs to some file or mmapped area;
2) use some well-known "external sort" algo to sort these pairs by hash (e.g. split, qsort, merge);
3) build your final hash by iterating through this sorted data structure.
Your memory writes would then be sequential and you'd get good page-hit characteristics. You just need to choose a decent sort algo - ideally one which works transparently over mmapped data.
What do you think? I might even have a crack at it myself :)
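The three-step idea above can be sketched in a few lines of Python (purely illustrative: a bucket-chained index, an in-memory sort standing in for the external sort, and made-up names - not MonetDB internals):

```python
def build_hash_index(values, nbuckets):
    # 1) one sequential pass: compute the bucket of every (oid, value)
    pairs = [(hash(v) % nbuckets, oid) for oid, v in enumerate(values)]
    # 2) sort by bucket (on large data this would be an external sort)
    pairs.sort()
    # 3) sequential pass over the sorted pairs: bucket writes are now
    #    clustered instead of scattered across the whole index
    buckets = [[] for _ in range(nbuckets)]
    for b, oid in pairs:
        buckets[b].append(oid)
    return buckets

def probe(index, values, key, nbuckets):
    # re-check values to weed out hash collisions within a bucket
    return [oid for oid in index[hash(key) % nbuckets] if values[oid] == key]

values = ["FOO", "BAR", "FOO", "BAZ"]
idx = build_hash_index(values, nbuckets=8)
print(probe(idx, values, "FOO", nbuckets=8))  # -> [0, 2]
```

The point of the detour through the sort is locality: step 3 touches the bucket array in order, rather than jumping around a multi-GB mmapped file.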
-Will
On 5 December 2013 15:54, Stefan Manegold < Stefan.Manegold@cwi.nl > wrote:
Hi Will,
could you please also share with us which version of MonetDB you are using, and OS you are running?
Please find more comments & questions inlined below.
----- Original Message -----
Hi
I've been investigating very slow queries on my 350M row dataset when selecting for equality on a non-unique string id. e.g.
select count(*) from t_order where orderId = 'FOO';
Is your orderId column unique, or are there duplicate values?
This query takes ~4.5 hours to build a thash index. Subsequent repeats hit the cache and are sub-millisecond.
select count(*) from t_order where time > '1970-01-01' and orderId = 'FOO'
This second one is consistently finished in a few seconds or less, and doesn't create any indexes.
The first query is a bit of a show-stopper. It spends a huge amount of time in BAThash (gdk_search.c), and incurs a lot of page faults while reading and writing at random across a multi-GB mmapped thash file. Additionally these page faults are hampered by the OS trying to write dirty pages out in the background. I've got plenty of RAM overall (130GB), but the 'active' page proportion for the thash file seems to be stuck at about 25% - giving me a 0.75 probability of a miss.
- Is there some way to make my OS less keen to evict pages? (swappiness = 0 already) I should have plenty of room to have the whole thash file resident. Additionally, allowing it to sit on dirty pages longer would reduce the total write IO.
In fact, if there is sufficient memory in total, there would be no need to flush dirty (non-persistent) pages at all, but we have not yet found a good way to hint this to the OS --- madvise() did not have much/any useful effect for us on Linux.
Maybe some of the issues discussed here play a role: http://engineering.linkedin.com/performance/optimizing-linux-memory-manageme... ?
While we have not yet encountered the NUMA problem reported there, we have encountered the transparent huge page problems, and usually disable them as also suggested in the above posting.
- Is there some way to simply prevent large thash creation? For many applications I'd rather have slower consistent queries than incur a massively slow query following a restart (not great in a prod environment!)
Not easily right now. However, you can either add the "fake" time predicate as you did above, which (in this case) triggers MonetDB's optimizer to push the time predicate (either because it's a timestamp (int) rather than a varchar (string), or because the time column happens to be ordered) below the orderId predicate and thus avoids the hash build, or change the following line(s) in gdk/gdk_select.c (in case you built MonetDB yourself from source):
    hash = b->batPersistence == PERSISTENT &&
           (size_t) ATOMsize(b->ttype) > sizeof(BUN) / 4 &&
           BATcount(b) * (ATOMsize(b->ttype) + 2 * sizeof(BUN)) < GDK_mem_maxsize / 2;
->
    hash = FALSE;
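Reading that heuristic back (my paraphrase, with assumed sizes): the hash is built only for a persistent column whose values are wide enough and whose estimated index fits in half the memory budget. A toy Python rendering of the same condition:

```python
SIZEOF_BUN = 8  # bytes, assuming a 64-bit build (an assumption, not checked)

def would_build_hash(persistent, atom_size, count, mem_max):
    """Mirror of the gdk_select.c condition quoted above (approximate)."""
    return (persistent
            and atom_size > SIZEOF_BUN // 4
            and count * (atom_size + 2 * SIZEOF_BUN) < mem_max // 2)

# 350M rows of 8-byte values against a 130 GB budget: the estimated
# index (~8.4 GB) fits well under half of it, so the build is triggered.
print(would_build_hash(True, 8, 350_000_000, 130 * 2**30))  # -> True
```

This is why Will's table qualifies: the size test passes comfortably, and only the hash = FALSE patch (or the fake range predicate) keeps the build from starting.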
We'll consider whether we can provide a better solution in a future release.
- Is there some way to generate the hash index with a more sympathetic algo that doesn't degrade so steeply? E.g. some hash + sort. (I have literally no experience with this!)
We'll also look into this. However, building a hash index inherently incurs random access, so chances are slim ...
Best, Stefan
-Will Muldrew
Hi Stefan
[13:24:23] $ cat /proc/sys/vm/dirty_background_bytes
0
I think it was that way before (I know it's mutually exclusive with the ratio).
The new setting didn't appear to have any effect on the active ratio.
-Will
participants (2)

- Stefan Manegold
- Will Muldrew