Hashjoin performance with large vs small tables

Stefan Manegold Stefan.Manegold at cwi.nl
Mon May 11 18:40:17 CEST 2015


Hi Roberto,

thanks for reporting!

This need detailed profiling to understand what's happening.
Why are 16M look-ups to a 1-tuple hash table 500x more expensive
than building a 16M hash table (assuming that is it indeed built on the fly,
i.e., does not already exist), and then performing a single look-up ... ?

Stefan


----- On May 11, 2015, at 6:30 PM, Roberto Cornacchia roberto.cornacchia at gmail.com wrote:

> Hi there,
> 
> Triggered by a suspiciously slow join, I had a look at BATsubjoin, in particular
> at how it is decided which algorithm to use.
> 
> I looked at a couple of cases with a join on strings.
> This is one (pseudo syntax):
> 
> l.count ~16M
> l.T.sorted = 0
> l.T.revsorted = 0
> 
> r.count = 1
> r.T.sorted = 1
> r.T.revsorted = 1
> 
> In this case, a hash table on r is created (because it's smaller).
> This join takes 430ms .
> I forced swapping l and r, thus built the hash table on the larger bat, and then
> it takes 0.8ms .
> 
> Notice that though r.count = 1 is a bit of an extreme case, it does happen often
> in practice. And the case with a few tuples would not be very different.
> 
> The question I wanted to ask is:
> 
> is it always right to build the hash table on the small table?
> Perhaps some more heuristics can help?
> Like: if the smaller is very small, and the larger is not extremely large, then
> hash on the larger one?
> 
> Roberto
> 
> 
> _______________________________________________
> developers-list mailing list
> developers-list at monetdb.org
> https://www.monetdb.org/mailman/listinfo/developers-list

-- 
| Stefan.Manegold at CWI.nl | DB Architectures   (DA) |
| www.CWI.nl/~manegold/  | Science Park 123 (L321) |
| +31 (0)20 592-4212     | 1098 XG Amsterdam  (NL) |


More information about the developers-list mailing list