Bug 6533

Summary: SIMILAR TO regular expression matching [patch]
Product: SQL Reporter: Richard Hughes <richard.monetdb>
Component: allAssignee: SQL devs <bugs-sql>
Status: NEW ---    
Severity: enhancement    
Priority: Normal    
Version: 11.27.11 (Jul2017-SP3)   
Hardware: x86_64 (amd64/em64t)   
OS: Linux   
Attachments: Implementation v1

Description Richard Hughes 2018-02-05 15:59:15 CET
Created attachment 592 [details]
Implementation v1

The attached is a first attempt at a patch implementing support for SIMILAR TO expressions: "select count(*) from t where v similar to '%(b|d)%';"

My specific motivating use-case was queries of the form "select count(*) from big where v like ? or v like ? or v like ? or ...". Those are slower than I'd like because they perform N passes over the v column; the neatest way I could figure out to make MonetDB perform only one pass was to add regexes and thereby get alternation support.

Points worth mentioning (in no particular order):

1) I don't have a copy of the ANSI SQL standard, so I worked off the PostgreSQL documentation (https://www.postgresql.org/docs/10/static/functions-matching.html#FUNCTIONS-SIMILARTO-REGEXP) excluding the substring() stuff.

2) That documentation doesn't talk about what exactly is supported in [] character classes. I have not filtered that part of the expression at all, so it's going to support the normal [^a-z] stuff and [:digit:] and the Perl extension [:^digit:]. It's highly unlikely that ANSI intended that last one to work, but I didn't see any harm in going beyond the standard there.

3) In fact, the whole similar2pcre() implementation is really scary. That needs to be reviewed very carefully to ensure I haven't managed to let any security problems through.

4) I went with the "similar to" syntax because it's the most standardish syntax I know of. I presume a function-based syntax would also work, however it would look slightly weird in the optimizers to put in a dependency on a specific extension. It may be worth observing that the always-reliable 'some random guy on Stack Overflow' says "SIMILAR TO is only supported in PostgreSQL because it ended up in early drafts of the SQL standard. They still haven't gotten rid of it. But there are plans to remove it and include regexp matches instead - or so I heard." (https://dba.stackexchange.com/questions/10694/pattern-matching-with-like-similar-to-or-regular-expressions-in-postgresql/10696#10696).

5) There aren't any tests in this patch. I presume PostgreSQL have some that we can steal but I haven't looked in to the process for importing them in to the MonetDB test suite.

6) I actually want case-insensitive matching. I'm planning to emulate that in my client application using the internationalizationally highly-dodgy '[tT][eE][sS][tT]', but if somebody can suggest a neat way to add an 'isimilar to' feature then I'm interested. My opinion is that it's not a good idea to add exactly that spelling because it's excessively non-standard.

7) I haven't tested the non-PCRE (i.e. libc regexec()) code path at all. I don't even know if it compiles.

8) I'm not clear on the policy which has been used to decide how many MAL functions to add. I've done the exact same entry points as LIKE so that the SQL and optimizer code can be common between the two, but it does look a little excessive versus simply having one entry point which is given all the necessary parameters.

9) I'm not confident about the code I've added to sql_upgrades.c. I need to look more carefully about how that file decides what code to run when. I'm also concerned about what happens if I maintain a local copy of this patch for an extended amount of time before it gets upstreamed by you guys.

10) I haven't implemented a LIKEjoin equivalent. It's not one of my use cases and it doesn't seem like it would be a common use case for other people and I hope (but haven't tested) that queries requiring it would still work, albeit slowly.

11) Since my use case was making things faster, I've also enabled the PCRE JIT. Some quick benchmarking suggests that this is about 55% of the total runtime of the non-JIT. I haven't benchmarked the cost of compiling the JIT and the TLB flush(es) involved. We could benchmark and calculate the threshold of the number of rows which are being scanned to make the overhead worth it. There's a long section in the PCRE docs titled "Availability of JIT support" which points out that a compile-time check is not necessarily easy to do; we could either add it to autoconf or accept that PCRE JIT support has been available for many years and it's unlikely that anybody would switch it off.

12) I've used PCRE_NO_UTF8_CHECK. This gives a 25% speed boost (i.e. 75% of the runtime) for non-JIT and is implied/mandatory for pcre_jit_exec() but comes with the warning "When it is set, the effect of passing an invalid UTF-8 string as a pattern is undefined. It may cause your program to crash or loop.". On the occasions where I've accidentally fed bogus data to MonetDB it has always caught any malformed UTF-8 but I'm not certain that this is absolutely guaranteed everywhere (e.g. select '\201\202').