Hashjoin performance with large vs small tables

Stefan Manegold Stefan.Manegold at cwi.nl
Tue May 12 12:30:36 CEST 2015


Roberto,

do you run a debug or an optimized build?

if the former, try the latter or at least disable assertions.

With a debug build (i.e., with assertions enabled)
you might meassure the costs for (potentially expensive) assertions,
rather than for pure execution ...

Stefan

----- On May 12, 2015, at 11:00 AM, Roberto Cornacchia roberto.cornacchia at gmail.com wrote:

> One more detail: I ran all tests with sequential_pipe
> 
> On 12 May 2015 at 10:58, Roberto Cornacchia < roberto.cornacchia at gmail.com >
> wrote:
> 
> 
> 
> 
> 
> On 11 May 2015 at 22:58, Stefan Manegold < Stefan.Manegold at cwi.nl > wrote:
> 
> 
> Roberto,
> 
> just to recap all facts:
> 
> - MonetDB Oct2014-SP3
> 
> - equality join between 2 string-BATs
> 
> - both BATs are persistent "base" BATs
> 
> Correct
> 
> 
> 
> - larger BAT has 16 M BUNs
> 
> 
> Apologies, I had misread the count here. It is 1.6M
> But this doesn't change things.
> 
> 
> 
> - smaller BAT has 1 BUN
> 
> - when (forcefully) building the hash on the larger one,
> and then performing a single probe from the smaller one,
> the first ("cold") join takes 30 ms (building the hash table),
> while any next ("hot") one takes 0.8 ms (re-using the pre-built hash table).
> This suggests ~29.2 ms for building the 16 M hash table
> and 0.8 ms for a single probe into that hash table.
> 
> 
> Correct (except 16M -> 1.6M)
> 
> 
> - when building the 1 BUN hash table on the smaller one,
> and then performing 16 M probes from the larger one,
> the (first?) ("cold"?) join takes 430 ms?
> 
> 
> Correct (except 16M -> 1.6M)
> 
> 
> + How long does a subsequent ("hot") join take (re-using the pre-built hash
> table)?
> 
> 
> Exactly the same, as expected. The pre-built hash table on the 1-tuple bat can
> hardly be useful
> 
> 
> 
> Could you run detailed profiling (e.g., using valgrind/callgrind) to analyze
> where
> the time goes in all 4 cases (hash on larger vs. hash on smaller & "cold" vs.
> "hot")?
> 
> 
> Could you share your data to reproduce and analyze the problem?
> 
> 
> I'm sending data and profiling by email.
> Thank you.
> 
> 
> 
> 
> Thanks!
> 
> Stefan
> 
> 
> ----- On May 11, 2015, at 6:49 PM, Roberto Cornacchia
> roberto.cornacchia at gmail.com wrote:
> 
>>> Also, those 430ms are not invested. The second time will still take 430ms. So
>>> hashing on a very small bat is never a good investment. On the contrary,
>>> hashing on a larger (but not too much) table is a good investment. The next
>>> time a similar query comes in, it will be sub-millisecond.
>> 
>> Well, this is a trade-off that in in general hard to judge.
>> If the bigger table / BAT is a base table/BAT, the hash table will (nowadays)
>> be made persistent and *could* be reused --- whether it indeed will be reused,
>> we cannot predict. If the bigger table is a transient intermediate result,
>> re-use is unlikely ...
>> 
>> 
>> That's fair.
>> 
>> 
>> Having said that, is your smaller table a base table or an intermediate result
>> that is (might be) a tiny slice of a large (huge) base table?
>> Then current code might build the hash on the entire parent BAT rather than on
>> the tiny slice ...
>> 
>> 
>> They both are base tables. The tiny table is created and a single insert is
>> done. The large one is also a regular table, with NOT NULL constraint on the
>> join column and the entire table is marked read-only.
>> 
>> 
>> 
>> Also: Which version of MonetDB are we talking about?
>> 
>> 
>> Oct2014 SP3
>> 
>> 
>> Stefan
>> 
>> --
>>| Stefan.Manegold at CWI.nl | DB Architectures (DA) |
>>| www.CWI.nl/~manegold/ | Science Park 123 (L321) |
>>| +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |
>> _______________________________________________
>> developers-list mailing list
>> developers-list at monetdb.org
>> https://www.monetdb.org/mailman/listinfo/developers-list
>> 
>> 
>> _______________________________________________
>> developers-list mailing list
>> developers-list at monetdb.org
>> https://www.monetdb.org/mailman/listinfo/developers-list
> 
> --
>| Stefan.Manegold at CWI.nl | DB Architectures (DA) |
>| www.CWI.nl/~manegold/ | Science Park 123 (L321) |
>| +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |
> _______________________________________________
> developers-list mailing list
> developers-list at monetdb.org
> https://www.monetdb.org/mailman/listinfo/developers-list
> 
> 
> 
> _______________________________________________
> developers-list mailing list
> developers-list at monetdb.org
> https://www.monetdb.org/mailman/listinfo/developers-list

-- 
| Stefan.Manegold at CWI.nl | DB Architectures   (DA) |
| www.CWI.nl/~manegold/  | Science Park 123 (L321) |
| +31 (0)20 592-4212     | 1098 XG Amsterdam  (NL) |


More information about the developers-list mailing list