[Monetdb-developers] hashjoin and strHash

Lefteris lsidir at gmail.com
Fri Dec 18 17:01:07 CET 2009

Hi Wouter,

funny think, I had the same exact problem and we were thinking about
this issue. The idea here is that this observation for strings might
not be always true, and it is a situation that cannot be always
determined on the kernel level. Correct me if I am wrong, but your
benefit on query comes because the hash in the large BAT is already
there, that's why the second time you get 0.01? You mention hot run so
I assume the BAT is already there with a hash index. While in the
original situation the hash is on the small BAT thus you don't benefit
from the hot run. But if a big BAT of strings is to be used again it
is unknown in the gdk level. So, I solved the problem by forcing the
hash index on the big BAT in a higher level (in Monet5) where it knows
something more about the application (in my case RDF store). Can you
do instead that? force the hash index in a higher level for you
application? If gdk see a hash index already there, then it will
choose that independent of the size.


On Fri, Dec 18, 2009 at 4:22 PM, Wouter Alink <wouter.alink at gmail.com> wrote:
> Dear developers,
> I would like to propose a change in GDK and hear opinions. It is about
> the following issue:
> in the BATjoin code, if there is no possibility to do a fetch or merge
> join, a hashjoin is performed. A hashtable is created for the smallest
> BAT. The reasons (i could think of) for choosing the smallest BAT for
> the hashtable are that less space is required for the hashtable (which
> in turn causes less cache misses when doing a lookup) and also because
> the hashfunction used is assumed to be very inexpensive (it needs to
> be calculated for each item in the large bat each time a join is
> performed).
> I can see that the hashfunction can be very efficient for data types
> without indirection, but I feel that for data types like strings in
> some cases this is a little different. If a string BAT for example
> contains many different values (i.e. is not a bat which contains
> enumeration values) the hashfunction will not be inexpensive anymore
> (many cache misses), as each hashfunction call needs to hash a whole
> (arbitrary long) string at an arbitrary place in the heap.
> Is it perhaps possible to specify that, when a BAT of type 'str' has
> many different values a hashtable may be build on the large BAT
> instead of on the small BAT?
> Reason that I ask this: I was analysing costs of a query in which I
> had a few short strings (26 tuples, 1-column table: varchar) which I
> wanted to look up in a dictionary (9M tuples, 2-column table:
> int,varchar).  "SELECT a.id FROM longlist AS a JOIN smalllist as b ON
> a.strvalue=b.strvalue;"
> The result is a small list of integers (26 or less tuples). This
> operation currently takes roughly 1.5 seconds for a hot run, mostly
> due to 9M strHash operations. By applying the patch below the
> execution time for a hot run dropped down to .01 seconds. The
> performance gain is caused by only having to perform strHash on the
> items in the small bat once the hashtable for the large bat has been
> created.
> Any suggestions whether such a change is useful? Which benchmarks will
> be influenced?
> I guess this code change is probably not useful for large string BATs
> with only few different values, but perhaps a guess could be made how
> diverse the strings in a bat are (by taking a sample or perhaps simply
> by looking at the ratio batsize/heapsize), and based on that determine
> whether to build it on the large or small BAT?
> Greetings,
> Wouter
> Index: src/gdk/gdk_relop.mx
> ===================================================================
> RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_relop.mx,v
> retrieving revision
> diff -u -r1.167.2.4 gdk_relop.mx
> --- src/gdk/gdk_relop.mx        20 Nov 2009 13:04:06 -0000
> +++ src/gdk/gdk_relop.mx        18 Dec 2009 14:59:13 -0000
> @@ -1232,7 +1232,12 @@
>  @-
>  hash join: the bread&butter join of monet
>  @c
> -       /* Simple rule, always build hash on the smallest */
> +       /* Simple rule, always build hash on the smallest,
> +                except when it is a string-join, then we do the opposite */
> +       if (swap && rcount < lcount && l->ttype == TYPE_str) {
> +               ALGODEBUG THRprintf(GDKout, "#BATjoin:
> BATmirror(BAThashjoin(BATmirror(r), BATmirror(l)," BUNFMT "));\n",
> estimate);
> +               return BATmirror(BAThashjoin(BATmirror(r),
> BATmirror(l), estimate));
> +       }
>        if (swap && rcount > lcount) {
>                ALGODEBUG THRprintf(GDKout, "#BATjoin:
> BATmirror(BAThashjoin(BATmirror(r), BATmirror(l)," BUNFMT "));\n",
> estimate);
> ------------------------------------------------------------------------------
> This SF.Net email is sponsored by the Verizon Developer Community
> Take advantage of Verizon's best-in-class app development support
> A streamlined, 14 day to market process makes app distribution fast and easy
> Join now and get one step closer to millions of Verizon customers
> http://p.sf.net/sfu/verizon-dev2dev
> _______________________________________________
> Monetdb-developers mailing list
> Monetdb-developers at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/monetdb-developers

More information about the developers-list mailing list