Drastic speed improvement for str.toLower() / str.toUpper

Sjoerd Mullender sjoerd at monetdb.org
Thu Jan 14 22:08:00 CET 2016


I've just checked a change in that is similar to what you did here.
I've changed the order in which the tests are done, since I think my
order is more efficient.

On 01/14/2016 05:34 PM, Roberto Cornacchia wrote:
> Hello,
> 
> I had reported in https://www.monetdb.org/bugzilla/show_bug.cgi?id=3549
> that string case conversion is very inefficient in MonetDB.
> 
> I had a look at the code.  For each UTF8 character it performs a hash
> lookup from the origin case bat and finds the corresponding character in
> the destination case bat.
> 
> However, this is an overkill for ASCII characters:
> - for letters, [A-Z] + 32 = [a-z]
> - all other ASCII characters stay the same
> 
> With the assumption that single-byte characters are very frequent in
> most texts, it makes sense to invest in a simple test and perform the
> hash lookup only for multi-byte characters.
> 
> I tested this on 831MB (over 360K tuples) of standard English text:
> - original str.toLower/str.toUpper: 101 seconds (8 MB/s)
> - modified version: 3.6 seconds (230 MB/s)
> 
> I guess that even when the text is highly multi-byte oriented the added
> test wouldn't hurt that much.
> 
> A side-observation perhaps worth investigating is why that hash lookup
> is so expensive.
> 
> Please find my patch in attachment.
> 
> Roberto
> 
> 
> 
> _______________________________________________
> developers-list mailing list
> developers-list at monetdb.org
> https://www.monetdb.org/mailman/listinfo/developers-list
> 

-- 
Sjoerd Mullender

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 473 bytes
Desc: OpenPGP digital signature
URL: <http://www.monetdb.org/pipermail/developers-list/attachments/20160114/b45dc270/attachment.sig>


More information about the developers-list mailing list