Possible bug in pcre.c

Sjoerd Mullender sjoerd at monetdb.org
Mon Apr 8 16:44:31 CEST 2019


It's complicated stuff, but I think the code is right.

The implementation uses two nested for loops.  The outer loop iterates
over the right input (since the right input is column of regular
expressions, the expressions only get compiled once and matched multiple
times).  The inner loop iterates over the left input.  (This is
different from most other implementation of join that iterate over the
left input first.)  Because of this, we know the right output is sorted.

lastl is de last inserted vale in the left output.  It is used to
maintain the dense and sorted flags of the left output (tseqbase,
tsorted, and trevsorted).

BATcount(r1) > 0, so this means we've already seen a match, which in
turn means that we've been past the assignment "lastl = lo;" just past
the if statement.  Note that there is no other assignment to lastl (just
the initialization to 0 at the top of the function).

nl == 0 means that for the current value of the right input (outermost
for loop) we haven't yet inserted any matches, but since we're executing
where we are, there are matches for that current value and one or more
values in the left input.  Note that nl is set to 0 inside the outer for
loop, and gets incremented inside the inner for loop for each match that
was inserted.

So, we are now looking at a new value in the right column that has
matched a value in the left column.  There are three possibilities: the
left input is smaller than, equal to, or larger than the last value for
the left input that was inserted into the left output.

If smaller, the left output is not sorted (smaller value follows
larger).  Also, we don't know about uniqueness anymore (we only maintain
that if the left output is strictly ascending).
If equal, the left output is definitely not unique.
If larger, the left output is not reverse sorted (larger value follows
smaller).

Since we're inserting an output for a new iteration of the right input,
we know that the right output is not reverse sorted.  If nl were greater
than 0, we're inserting a subsequent match of the right input, so the
right input could still be reverse sorted, though not unique anymore.

I hope this clears things up.

On 08/04/2019 15.27, Roberto Cornacchia wrote:
> Hi,
> 
> I was looking at this piece of code (end of pcrejoin loop), and I have
> the feeling something isn't right.
> Inside the body of 
>   if (nl == 0) 
> a condition over lastlis checked.
> 
> However, it seems to me that lastl gets a useful value only when nl > 0.
> If (nl == 0) then lastl must be 0.
> 
> 
> <loop> {
>                         /* actual body */
>                         ....
> 
> 
>                         /* track result properties */
> if (BATcount(r1) > 0) {
> if (lastl + 1 != lo)
> r1->tseqbase = oid_nil;
> if (nl == 0) {
> r2->trevsorted = 0;
> if (lastl > lo) {
> r1->tsorted = 0;
> r1->tkey = 0;
> } else if (lastl < lo) {
> r1->trevsorted = 0;
> } else {
> r1->tkey = 0;
> }
> }
> }
> APPEND(r1, lo);
> APPEND(r2, ro);
> lastl = lo;
> nl++;
> }
> 
> 
> _______________________________________________
> developers-list mailing list
> developers-list at monetdb.org
> https://www.monetdb.org/mailman/listinfo/developers-list
> 


-- 
Sjoerd Mullender

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <http://www.monetdb.org/pipermail/developers-list/attachments/20190408/4673f071/attachment.sig>


More information about the developers-list mailing list