Hashjoin performance with large vs small tables

Stefan Manegold Stefan.Manegold at cwi.nl
Mon May 11 18:45:53 CEST 2015

----- On May 11, 2015, at 6:36 PM, Roberto Cornacchia roberto.cornacchia at gmail.com wrote:

> Correction:
> This join takes 430ms.
> I forced swapping l and r, thus built the hash table on the larger bat, and then
> it takes 0.8ms.
> It takes 0.8ms the second time.
> The first time, it needs to create the hash table, and then it takes about 30ms.
> Still, much better than 430ms.

OK, but the question remains: where does this difference come from?

> Also, those 430ms are not invested. The second time will still take 430ms. So
> hashing on a very small bat is never a good investment. On the contrary,
> hashing on a larger (but not too much) table is a good investment. The next
> time a similar query comes in, it will be sub-millisecond.

Well, this is a trade-off that is in general hard to judge.
If the bigger table / BAT is a base table/BAT, the hash table will (nowadays)
be made persistent and *could* be reused --- whether it indeed will be reused,
we cannot predict. If the bigger table is a transient intermediate result,
re-use is unlikely ...
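
To make the trade-off concrete, here is a minimal sketch in Python. It is
not MonetDB code: build_hash, hash_join and the cache dict are hypothetical
names, and the dict merely stands in for a hash table that is kept around
after the first join. Building on the larger input costs time once; later
joins against the same input only pay for probing with the smaller one.

```python
from collections import defaultdict

def build_hash(values):
    """Map each value to the list of positions where it occurs."""
    table = defaultdict(list)
    for pos, v in enumerate(values):
        table[v].append(pos)
    return table

def hash_join(probe, build, cache=None):
    """Return (probe_pos, build_pos) pairs whose values are equal.

    `cache` is a hypothetical dict that plays the role of a persistent
    hash: it is keyed by id(build), so a later call on the same list
    object skips the build phase. This is a toy keying scheme and only
    safe while the caller keeps `build` alive.
    """
    key = id(build)
    if cache is not None and key in cache:
        table = cache[key]            # reuse: no build cost this time
    else:
        table = build_hash(build)     # cost proportional to len(build)
        if cache is not None:
            cache[key] = table
    # Probe cost is proportional to len(probe) plus the number of matches.
    return [(i, j) for i, v in enumerate(probe) for j in table.get(v, ())]

small = [3, 7]
large = list(range(1000))
cache = {}
first = hash_join(small, large, cache)   # builds the hash on `large`
second = hash_join(small, large, cache)  # reuses it, probes 2 values only
```

Whether the one-time build pays off depends on whether a later query really
joins against the same input again, which is the part that cannot be
predicted.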

Having said that, is your smaller table a base table, or an intermediate result
that is (or might be) a tiny slice of a large (huge) base table?
If the latter, the current code might build the hash on the entire parent BAT
rather than on the tiny slice ...
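
As an illustration of why that would matter, the following Python sketch
contrasts the two options. It is an assumption-laden toy, not the actual
MonetDB logic: the function names are made up, and a "slice" is modelled
simply as a position range [lo, hi) over the parent.

```python
from collections import defaultdict

def build_hash(values):
    """Map each value to the list of positions where it occurs."""
    table = defaultdict(list)
    for pos, v in enumerate(values):
        table[v].append(pos)
    return table

def probe_slice_via_parent(parent_table, lo, hi, key):
    """Look up `key` in a hash built on the whole parent and keep only
    the positions that fall inside the slice [lo, hi)."""
    return [p for p in parent_table.get(key, ()) if lo <= p < hi]

def probe_slice_direct(parent, lo, hi, key):
    """Build a hash on the slice only, then look up `key`.
    Positions are shifted back so they refer to the parent."""
    table = build_hash(parent[lo:hi])
    return [lo + p for p in table.get(key, ())]

parent = [i % 5 for i in range(100)]   # large parent with many duplicates
lo, hi = 10, 15                        # tiny slice of 5 values

parent_table = build_hash(parent)      # build cost covers all 100 values
via_parent = probe_slice_via_parent(parent_table, lo, hi, 2)
direct = probe_slice_direct(parent, lo, hi, 2)
```

Both variants return the same positions, but the first one pays to hash the
whole parent and then walks over matches that lie outside the slice, while
the second touches only the slice.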

Also: Which version of MonetDB are we talking about?


| Stefan.Manegold at CWI.nl | DB Architectures   (DA) |
| www.CWI.nl/~manegold/  | Science Park 123 (L321) |
| +31 (0)20 592-4212     | 1098 XG Amsterdam  (NL) |
