This is a question about join order: in general, about how it changed from the Nov2019 to the Jun2020 release, and in particular with respect to custom joins (filter functions).

With a schema:
CREATE TABLE t1(s string);
CREATE TABLE t2(s string);
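
For concreteness, some made-up sample data (the real tables are much larger):

INSERT INTO t1 VALUES ('monet'), ('database');
INSERT INTO t2 VALUES ('money'), ('databases');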

Consider the following two queries, which differ only in the order of their conditions:
Q1:
SELECT t1.s, t2.s
FROM t1, t2
WHERE t1.s <> t2.s
AND [t1.s] maxlevenshtein [t2.s, 1];

Q2:
SELECT t1.s, t2.s
FROM t1, t2
WHERE [t1.s] maxlevenshtein [t2.s, 1]
AND t1.s <> t2.s;

[t1.s] maxlevenshtein [t2.s, 1] is equivalent to levenshtein(t1.s, t2.s) <= 1
(i.e. the two strings have a Levenshtein distance of at most 1).
This is a relatively expensive and highly selective function.
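
Written with the scalar function instead of the filter-function syntax, Q1 would read (assuming the scalar levenshtein function is also exposed at the SQL level):

SELECT t1.s, t2.s
FROM t1, t2
WHERE t1.s <> t2.s
AND levenshtein(t1.s, t2.s) <= 1;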

In Nov2019, both Q1 and Q2 are translated to:
- "maxlevenshtein" custom join
- "!=" selection on the result

In Jun2020, the two queries happen to be evaluated in the order in which they are written. Which means that Q1 is evaluated as:
- "!=" join
- "maxlevenshtein" selection

This last evaluation plan is unfortunately not viable at all. The "!=" join is not very selective, so the "maxlevenshtein" selection is run on far too many pairs, and without the optimizations that can be exploited in a join (in a join, it is possible to skip the actual Levenshtein computation for most candidate pairs). As a result, Q1 in Jun2020 is two orders of magnitude slower than Q2, which quickly leads to unreasonably long query times.
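
To illustrate the kind of skipping I mean, here is a rough SQL sketch of one such optimization, length pruning (the join implementation applies this internally, per candidate pair, rather than relying on any predicate order):

SELECT t1.s, t2.s
FROM t1, t2
WHERE abs(length(t1.s) - length(t2.s)) <= 1  -- cheap pre-filter: lengths differing by more than 1 rule out distance <= 1
AND levenshtein(t1.s, t2.s) <= 1             -- expensive computation, only on the surviving pairs
AND t1.s <> t2.s;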

Of course, this is just one specific case, and a very unfortunate one, due to the combination of two factors:
- The "!=" join is not selective enough
- The custom function is an expensive one

I guess my real questions are:
- Is it by chance (or as a by-product of more general join-ordering rules) that Nov2019 executes the custom join first in both Q1 and Q2, or was executing custom joins first an intentional choice?
- What would a reasonable approach be?
Is it reasonable to assume that a custom join is expected to use an expensive comparison function, and that its join implementation can be much more efficient than the corresponding selection implementation (by skipping unnecessary comparisons)?
If no such assumption can be made, could custom implementations be annotated with information about their selectivity and cost? For instance, something like the sketch below.
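
Purely hypothetical syntax, just to make the idea concrete (the WITH clause and the module/function names are made up):

CREATE FILTER FUNCTION maxlevenshtein(l string, r string, k int)
EXTERNAL NAME mymodule.maxlevenshtein
WITH (cost = 'expensive', selectivity = 0.001);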

Thanks for your input,
Roberto