Hashjoin performance with large vs small tables

Roberto Cornacchia roberto.cornacchia at gmail.com
Mon May 11 18:30:54 CEST 2015


Hi there,

Triggered by a suspiciously slow join, I had a look at BATsubjoin, in
particular at how it is decided which algorithm to use.

I looked at a couple of cases with a join on strings.
This is one  (pseudo syntax):

l.count ~16M
l.T.sorted = 0
l.T.revsorted = 0

r.count = 1
r.T.sorted = 1
r.T.revsorted = 1

In this case, a hash table on r is created (because it's smaller).
This join takes *430ms*.
I forced swapping l and r, thus built the hash table on the larger bat, and
then it takes *0.8ms*.

Notice that though r.count = 1 is a bit of an extreme case, it does happen
often in practice. And the case with a few tuples would not be very
different.

The question I wanted to ask is:

is it always right to build the hash table on the small table?
Perhaps some more heuristics can help?
Like: if the smaller is very small, and the larger is not extremely large,
then hash on the larger one?

Roberto
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.monetdb.org/pipermail/developers-list/attachments/20150511/7565278c/attachment.html>


More information about the developers-list mailing list