Drastic speed improvement for str.toLower() / str.toUpper()

Roberto Cornacchia roberto.cornacchia at gmail.com
Thu Jan 14 17:34:32 CET 2016


I reported in https://www.monetdb.org/bugzilla/show_bug.cgi?id=3549
that string case conversion is very inefficient in MonetDB.

I had a look at the code.  For each UTF-8 character it performs a hash
lookup in the source-case BAT and finds the corresponding character in
the destination-case BAT.

However, this is overkill for ASCII characters:
- for letters, [A-Z] + 32 = [a-z]
- all other ASCII characters stay the same

With the assumption that single-byte characters are very frequent in most
texts, it makes sense to invest in a simple test and perform the hash
lookup only for multi-byte characters.
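
As a rough illustration of the idea (not the attached patch itself), the
fast path could look like the sketch below.  The names to_lower_utf8,
utf8_len and lookup_lower_multibyte are hypothetical, and
lookup_lower_multibyte is only a stand-in for the existing hash lookup:
here it copies the multi-byte sequence unchanged, so the output never
grows beyond the input length.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the existing hash lookup on the case BATs.
 * In this sketch it copies the multi-byte sequence unchanged and
 * returns the number of bytes written. */
static size_t
lookup_lower_multibyte(const unsigned char *src, size_t len,
                       unsigned char *dst)
{
    memcpy(dst, src, len);
    return len;
}

/* Length in bytes of a UTF-8 sequence, derived from its lead byte. */
static size_t
utf8_len(unsigned char c)
{
    if (c < 0x80)
        return 1;
    if ((c & 0xE0) == 0xC0)
        return 2;
    if ((c & 0xF0) == 0xE0)
        return 3;
    if ((c & 0xF8) == 0xF0)
        return 4;
    return 1;   /* invalid lead byte: treat it as a single byte */
}

/* Lower-case src into dst.  dst must have room for strlen(src) + 1
 * bytes (sufficient here only because the stand-in lookup never
 * changes the length of a sequence). */
static void
to_lower_utf8(const char *src, char *dst)
{
    const unsigned char *s = (const unsigned char *) src;
    unsigned char *d = (unsigned char *) dst;

    while (*s) {
        if (*s < 0x80) {
            /* ASCII fast path: [A-Z] + 32 = [a-z], rest unchanged */
            *d++ = (*s >= 'A' && *s <= 'Z')
                ? (unsigned char) (*s + 32) : *s;
            s++;
        } else {
            /* multi-byte character: fall back to the lookup */
            size_t n = utf8_len(*s);
            size_t avail = 0;

            /* do not read past the terminator of a truncated sequence */
            while (avail < n && s[avail] != 0)
                avail++;
            d += lookup_lower_multibyte(s, avail, d);
            s += avail;
        }
    }
    *d = 0;
}
```

The upper-case direction would be symmetric (subtract 32 for [a-z]).
The single comparison against 0x80 is the only extra work added for
multi-byte characters.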

I tested this on 831 MB (over 360K tuples) of standard English text:
- original str.toLower/str.toUpper: 101 seconds (8 MB/s)
- modified version: 3.6 seconds (230 MB/s)

I expect that even when the text consists mostly of multi-byte characters,
the added test would not hurt much.

A side observation, perhaps worth investigating, is why that hash lookup is
so expensive.

Please find my patch attached.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: convertCase.c.patch
Type: text/x-patch
Size: 1190 bytes
Desc: not available
URL: <http://www.monetdb.org/pipermail/developers-list/attachments/20160114/f5f2517f/attachment.bin>