Hi Stefan,

You are right, the selection on dict is unique so it also makes it sorted. 

See below some combinations of the same query run on sorted (S) vs. unsorted (U) tables ("sorted" refers to the join column).
Both tables use no selection, just a plain join between the two tables.

Counts are ~113M tuples fot termdoc, ~890K tuples for dict
dict always has a PK on termid.
Running on very recent Jul2015, 8threads default_pipe.

For each measurement, I took the best of 4 consecutive executions.


no sel / no FPtermdoc Utermdoc S
dict U1.8s2.0s
dict S2.3s1.5s

A note about the timings above: they fluctuated very much for the same query being run multiple times. Even up to 1s difference, and with no apparent trend. I still took the best of 4 consecutive executions.


no sel / FPtermdoc Utermdoc S
dict U1.7s1.8s
dict S1.7s1.8s

The obvious pattern seems that with FP constraints, the timings are much more stable. Good in general, but not exploiting sort information.


On 31 March 2016 at 10:33, Stefan Manegold <Stefan.Manegold@cwi.nl> wrote:
Hi Roberto,

what happens if you remove the filter on the primary side of the join,
i.e., the where clause in the subquery on dict?

Also, what happens if you sort table dict rather than table termdoc
(using query with and without selection)?

In other words, it the problem (only) the choice between using or not
the FK join index, or (also) related to pushing down or not the selection,
and/or choosing which of the join input it the build (inner) and which the
probe (outer) input?

Thanks!

Stefan

----- On Mar 31, 2016, at 10:20 AM, Roberto Cornacchia roberto.cornacchia@gmail.com wrote:

> I see, thanks Sjoerd.
>
> As a follow up, I'd like to share this - I hope it can be interesting.
>
> The short summary is : I found that FP-key constraints can hurt performance (at
> least in this case, I didn't make a comprehensive test), while I had always
> assumed they would help.
>
> CREATE TABLE "dict" (
> "termid" INTEGER NOT NULL,
> "term" CHARACTER LARGE OBJECT NOT NULL,
> CONSTRAINT "dict" PRIMARY KEY ("termid"),
> );
>
> CREATE TABLE "termdoc" (
> "termid" INTEGER,
> "docid" INTEGER,
> CONSTRAINT "termdoc_termid_fkey" FOREIGN KEY ("termid") REFERENCES "dict"
> ("termid")
> );
>
> This is the query:
>
> select termid,docid
> from termdoc, (select termid from dict where termid=100) as q
> where termdoc.termid=q.termid;
>
> I list here 4 cases:
>
> 1) with FP constraint, table "termdoc" not sorted on termid.
>
> This query performs a foregign-primary key join. The explain shows indeed that
> the _idxbat is used.
> This executes in about 150ms (the exact hw/sw specs are not important, this
> timing is only used for comparison).
>
> 2) without FP constraint, table "termdoc" not sorted on termid.
> If I remove the constraint, the same query executes in about 250ms. That's what
> I would expect. The table wasn't sorted, so removing the constraints means
> hashing needs to take place.
>
> 3) without FP constraint, table "termdoc" sorted on termid.
> Still without constraint, but now the table is stored with "ORDER BY termid".
> This takes about 60ms. Which confirms what you explained earlier, Sjoerd. The
> bat isn't marked as sorted, but the join finds out and exploits that anyway.
> Good.
>
> 4) with FP constraint, table "termdoc" sorted on termid.
> Now I re-introduce the FP-key constraint, whith the table still sorted on
> "termid".
> This takes again 150ms. That is, the usage of the _idxbat has priority over the
> fact that we have a more useful sorting already. I think I understand why:
> using or not the FP-key constraint for the join is a plan-generation-time
> decision, while using the sort information is an execution-time decision -
> which is too late, the plan as already been written for the FP-key path. That
> is quite unfortunate.
>
> Which takes me to the question: couldn't that have been solved at
> plan-generation-time, if the sorting information had been marked in the
> persistent bat as a result of the ORDER BY clause in the CREATE TABLE
> statement? (I understand why that information is lost due to indirection in the
> plan, but I suspect it could be retained with some more effort if needed).
>
> Roberto.
>
>
>
> On 30 March 2016 at 19:14, Sjoerd Mullender < sjoerd@monetdb.org > wrote:
>
>
> It could be a feature.
>
> Nowadays, when having a sorted column could be beneficial for the
> implementation of an operator, we do a scan of the column to see whether
> it is sorted. This is then recorded in the column descriptor. Also
> when we then find out that the column is not sorted, this is also
> remembered. This means that we don't have to try quite as hard to find
> out that a column is sorted early on. Of course, if we can logically
> determine that the result of an operation is sorted, we record that.
>
> Given that the implementation of your query first produces a list of
> positions that indicate in which order the column is sorted and then in
> a separate operation produces this sorted list, we have lost the
> knowledge that the result is sorted on the way. But we will find out
> when we do other operations.
>
> On 03/30/2016 05:27 PM, Roberto Cornacchia wrote:
>> Before filing a bug report, I'd like to ask here whether this is
>> intentional/expected.
>>
>> sql>create table t(a int);
>> operation successful (0.650ms)
>> sql>insert into t values (1),(0),(3),(2);
>> 4 affected rows (0.638ms)
>> sql>
>> sql>create table sorted_t as select a from t order by a with data;
>> operation successful (0.807ms)
>> sql>
>> sql>select table,column,sorted from sys.storage() where "table"='sorted_t';
>> +----------+--------+--------+
>> | table | column | sorted |
>> +==========+========+========+
>> | sorted_t | a | false |
>> +----------+--------+--------+
>>
>> I am pretty sure it use to be that the sorting used in a "CREATE TABLE
>> AS ... WITH DATA" would result in the tsorted property to be true for
>> the (first) sorted column.
>>
>> Now this isn't the case. Is that a bug or a feature?
>>
>> Roberto
>>
>>
>> _______________________________________________
>> users-list mailing list
>> users-list@monetdb.org
>> https://www.monetdb.org/mailman/listinfo/users-list
>>
>
> --
> Sjoerd Mullender
>
>
> _______________________________________________
> users-list mailing list
> users-list@monetdb.org
> https://www.monetdb.org/mailman/listinfo/users-list
>
>
>
> _______________________________________________
> users-list mailing list
> users-list@monetdb.org
> https://www.monetdb.org/mailman/listinfo/users-list

--
| Stefan.Manegold@CWI.nl | DB Architectures   (DA) |
| www.CWI.nl/~manegold/  | Science Park 123 (L321) |
| +31 (0)20 592-4212     | 1098 XG Amsterdam  (NL) |
_______________________________________________
users-list mailing list
users-list@monetdb.org
https://www.monetdb.org/mailman/listinfo/users-list