hash join - why hashing on the smaller bat?

Lefteris lsidir at gmail.com
Fri Apr 22 14:35:45 CEST 2016


I agree Roberto, we should identify in a general way the cases where
building on larger is more beneficial and optimize for that.

Also, you should be able to reuse the hash, let me double check on
this and get back to you.

Lefteris

On Fri, Apr 22, 2016 at 2:02 PM, Roberto Cornacchia
<roberto.cornacchia at gmail.com> wrote:
> Thanks Lefteris,
>
> I am aware that what I encounter with my data / queries can be different
> from typical tpch queries.
> Perhaps simplistic, but what seems to work rock-solid for us, at the moment,
> is this additional rule before the last one:
>
> } else if (rcount < 128 && lpcount > rpcount) {
> /* Spinque-specific: prefer to hash on the larger bat when the other one is
> very small */
> swap = 1;
> reason = "left is very small";
> } else if (lpcount < rpcount) {
> /* no hashes, not sorted, create hash on smallest BAT */
> swap = 1;
> reason = "left is smaller";
> }
>
> The constant 128 was admittedly chosen quite arbitrarily, it could be less.
> The joins we encounter most frequently with this pattern have sizes like
> 100M-500M against 1-5 tuples. All these joins perform consistently better (~
> 5x) when this rule is in place.
> That is also why being able to reuse the investment made in hashing the
> larger bat would be so important (the other question).
>
> NB: I am not suggesting that our heuristics should become standard in
> MonetDB - only that perhaps it is not such an uncommon pattern and it could
> be worth some more thought.
>
> Roberto
>
>
> On 22 April 2016 at 13:04, Lefteris <lsidir at gmail.com> wrote:
>>
>> Hi Roberto,
>>
>> we are hashing on the smaller BAT to avoid many cache misses and going
>> up and down on a huge bat that might not even fit in memory, where the
>> smaller might fit even in some L3 or L2. Usually that is the case with
>> dimension tables vs fact tables.
>>
>> I have also experiment with it, but my "mistake" is that I only test
>> with tpch, and there it works that hashing on small bat is better. For
>> your data if you see a difference of 4-5 times I think it makes sense
>> to investigate.
>>
>> I think we have to arrange a meeting:)
>>
>> Lefteris
>>
>> On Thu, Apr 21, 2016 at 6:21 PM, Roberto Cornacchia
>> <roberto.cornacchia at gmail.com> wrote:
>> > Related to my previous question about persisting hashes, I would like to
>> > throw another one.
>> >
>> > BATsubjoin has a series of heuristics to decide what type of join
>> > implementation to use. When using hash-join, the latest rule says: if
>> > nothing else applied, build a hash on the smaller bat.
>> >
>> > Could you tell me what is the rationale for this?
>> >
>> > From what I could verify:
>> > - when sizes are comparable: it doesn't really make much difference
>> > which
>> > side is hashed
>> > - when sizes differ much: sure, building the hash table on that is much
>> > cheaper, but the join as a whole becomes 4-5 times slower then when
>> > hashing
>> > on the larger bat.
>> >
>> > In which case hashing on the larger bat is a good option?
>> >
>> > Cheers,
>> > Roberto
>> >
>> > _______________________________________________
>> > users-list mailing list
>> > users-list at monetdb.org
>> > https://www.monetdb.org/mailman/listinfo/users-list
>> >
>> _______________________________________________
>> users-list mailing list
>> users-list at monetdb.org
>> https://www.monetdb.org/mailman/listinfo/users-list
>
>
>
> _______________________________________________
> users-list mailing list
> users-list at monetdb.org
> https://www.monetdb.org/mailman/listinfo/users-list
>


More information about the users-list mailing list