Drastic speed improvement for str.toLower() / str.toUpper

Roberto Cornacchia roberto.cornacchia at gmail.com
Thu Jan 14 18:29:04 CET 2016


A correction.

By mistake, I had measured the original version with sequential_pipe and
the modified version with default_pipe, thus not fair.

However, now it becomes even more interesting.

Original:
- sequential_pipe: 101s (8MB/s)
- default_pipe: 214s (3.9 MB/s)

Modified:
- sequential_pipe: 15s (55 MB/s)
- default_pipe: 3.6s (230 MB/s)

So parallel execution hurts the original version, while it speeds up the
modified version.

Roberto

On 14 January 2016 at 17:34, Roberto Cornacchia <
roberto.cornacchia at gmail.com> wrote:

> Hello,
>
> I had reported in https://www.monetdb.org/bugzilla/show_bug.cgi?id=3549
> that string case conversion is very inefficient in MonetDB.
>
> I had a look at the code.  For each UTF8 character it performs a hash
> lookup from the origin case bat and finds the corresponding character in
> the destination case bat.
>
> However, this is an overkill for ASCII characters:
> - for letters, [A-Z] + 32 = [a-z]
> - all other ASCII characters stay the same
>
> With the assumption that single-byte characters are very frequent in most
> texts, it makes sense to invest in a simple test and perform the hash
> lookup only for multi-byte characters.
>
> I tested this on 831MB (over 360K tuples) of standard English text:
> - original str.toLower/str.toUpper: 101 seconds (8 MB/s)
> - modified version: 3.6 seconds (230 MB/s)
>
> I guess that even when the text is highly multi-byte oriented the added
> test wouldn't hurt that much.
>
> A side-observation perhaps worth investigating is why that hash lookup is
> so expensive.
>
> Please find my patch in attachment.
>
> Roberto
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.monetdb.org/pipermail/developers-list/attachments/20160114/d543fcd8/attachment.html>


More information about the developers-list mailing list