hash join - why hashing on the smaller bat?

Roberto Cornacchia roberto.cornacchia at gmail.com
Fri Apr 22 14:02:26 CEST 2016


Thanks Lefteris,

I am aware that what I encounter with my data / queries can be different
from typical tpch queries.
Perhaps simplistic, but what seems to work rock-solid for us, at the
moment, is this additional rule before the last one:

} else if (rcount < 128 && lpcount > rpcount) {
/* Spinque-specific: prefer to hash on the larger bat when the other one is
very small */
swap = 1;
reason = "left is very small";
} else if (lpcount < rpcount) {
/* no hashes, not sorted, create hash on smallest BAT */
swap = 1;
reason = "left is smaller";
}

The constant 128 was admittedly chosen quite arbitrarily, it could be less.
The joins we encounter most frequently with this pattern have sizes like
100M-500M against 1-5 tuples. All these joins perform consistently better
(~ 5x) when this rule is in place.
That is also why being able to reuse the investment made in hashing the
larger bat would be so important (the other question).

NB: I am not suggesting that our heuristics should become standard in
MonetDB - only that perhaps it is not such an uncommon pattern and it could
be worth some more thought.

Roberto


On 22 April 2016 at 13:04, Lefteris <lsidir at gmail.com> wrote:

> Hi Roberto,
>
> we are hashing on the smaller BAT to avoid many cache misses and going
> up and down on a huge bat that might not even fit in memory, where the
> smaller might fit even in some L3 or L2. Usually that is the case with
> dimension tables vs fact tables.
>
> I have also experiment with it, but my "mistake" is that I only test
> with tpch, and there it works that hashing on small bat is better. For
> your data if you see a difference of 4-5 times I think it makes sense
> to investigate.
>
> I think we have to arrange a meeting:)
>
> Lefteris
>
> On Thu, Apr 21, 2016 at 6:21 PM, Roberto Cornacchia
> <roberto.cornacchia at gmail.com> wrote:
> > Related to my previous question about persisting hashes, I would like to
> > throw another one.
> >
> > BATsubjoin has a series of heuristics to decide what type of join
> > implementation to use. When using hash-join, the latest rule says: if
> > nothing else applied, build a hash on the smaller bat.
> >
> > Could you tell me what is the rationale for this?
> >
> > From what I could verify:
> > - when sizes are comparable: it doesn't really make much difference which
> > side is hashed
> > - when sizes differ much: sure, building the hash table on that is much
> > cheaper, but the join as a whole becomes 4-5 times slower then when
> hashing
> > on the larger bat.
> >
> > In which case hashing on the larger bat is a good option?
> >
> > Cheers,
> > Roberto
> >
> > _______________________________________________
> > users-list mailing list
> > users-list at monetdb.org
> > https://www.monetdb.org/mailman/listinfo/users-list
> >
> _______________________________________________
> users-list mailing list
> users-list at monetdb.org
> https://www.monetdb.org/mailman/listinfo/users-list
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.monetdb.org/pipermail/users-list/attachments/20160422/538189e7/attachment.html>


More information about the users-list mailing list