hash join - why hashing on the smaller bat?

Lefteris lsidir at gmail.com
Fri Apr 22 13:04:54 CEST 2016


Hi Roberto,

we hash on the smaller BAT to avoid cache misses: probing a hash table
built on a huge BAT means jumping back and forth over data that might
not even fit in memory, whereas a hash table on the smaller BAT might
fit in the L3 or even the L2 cache. That is usually the situation with
dimension tables vs. fact tables.
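
As a rough illustration of the idea (a minimal Python sketch, not the
actual BATsubjoin code; the names hash_join, build and probe are made
up for this example), the join builds the hash table on whichever
input is smaller and scans the other one:

```python
from collections import defaultdict

def hash_join(left, right):
    """Equi-join two lists of keys; return (left_pos, right_pos) pairs.

    Illustrative only: this is not how MonetDB implements BATsubjoin.
    """
    # Build on the smaller input so the hash table stays small,
    # which is what gives it a chance to stay cache-resident.
    swap = len(left) > len(right)
    build, probe = (right, left) if swap else (left, right)

    # Build phase: key -> list of positions in the build input.
    table = defaultdict(list)
    for pos, key in enumerate(build):
        table[key].append(pos)

    # Probe phase: one sequential pass over the larger input.
    result = []
    for ppos, key in enumerate(probe):
        for bpos in table.get(key, ()):
            # Always report pairs as (left position, right position).
            result.append((ppos, bpos) if swap else (bpos, ppos))
    return result
```

With a small "dimension" list and a larger "fact" list, the hash table
is built on the dimension side regardless of the argument order, and
the large side is only read sequentially.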

I have also experimented with it, but my "mistake" is that I only
tested with TPC-H, and there hashing on the small BAT works better.
For your data, if you see a difference of 4-5 times, I think it makes
sense to investigate.
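
One way to structure such a comparison outside the server is sketched
below (Python, with made-up names join_count and timed, and synthetic
data sizes chosen arbitrarily). It only shows how to time both build
sides on the same inputs; Python dictionaries do not behave like
MonetDB's hash tables, so the timings say nothing about what you will
measure in MonetDB itself.

```python
import random
import time

def join_count(build, probe):
    """Count equi-join matches, hashing `build` and scanning `probe`."""
    table = {}
    for key in build:
        table[key] = table.get(key, 0) + 1
    return sum(table.get(key, 0) for key in probe)

def timed(build, probe):
    """Return (match count, elapsed seconds) for one build/probe order."""
    start = time.perf_counter()
    matches = join_count(build, probe)
    return matches, time.perf_counter() - start

if __name__ == "__main__":
    random.seed(42)
    small = list(range(1_000))                                 # assumed sizes
    large = [random.randrange(2_000) for _ in range(200_000)]  # assumed sizes

    m_small, t_small = timed(small, large)   # hash on the smaller input
    m_large, t_large = timed(large, small)   # hash on the larger input
    assert m_small == m_large                # same join result either way
    print(f"build on small: {t_small:.4f}s, build on large: {t_large:.4f}s")
```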

I think we have to arrange a meeting:)

Lefteris

On Thu, Apr 21, 2016 at 6:21 PM, Roberto Cornacchia
<roberto.cornacchia at gmail.com> wrote:
> Related to my previous question about persisting hashes, I would like to
> throw another one.
>
> BATsubjoin has a series of heuristics to decide what type of join
> implementation to use. When using hash-join, the latest rule says: if
> nothing else applied, build a hash on the smaller bat.
>
> Could you tell me what is the rationale for this?
>
> From what I could verify:
> - when sizes are comparable: it doesn't really make much difference which
> side is hashed
> - when sizes differ much: sure, building the hash table on the smaller
> bat is much cheaper, but the join as a whole becomes 4-5 times slower
> than when hashing on the larger bat.
>
> In which cases is hashing on the larger bat a good option?
>
> Cheers,
> Roberto
>
> _______________________________________________
> users-list mailing list
> users-list at monetdb.org
> https://www.monetdb.org/mailman/listinfo/users-list
>

