Drastic speed improvement for str.toLower() / str.toUpper()
roberto.cornacchia at gmail.com
Mon Jan 18 10:52:50 CET 2016
You're right, your version is slightly faster in my test set.
On 14 January 2016 at 22:08, Sjoerd Mullender <sjoerd at monetdb.org> wrote:
> I've just checked in a change that is similar to what you did here.
> I've changed the order in which the tests are done, since I think my
> order is more efficient.
> On 01/14/2016 05:34 PM, Roberto Cornacchia wrote:
> > Hello,
> > I had reported in https://www.monetdb.org/bugzilla/show_bug.cgi?id=3549
> > that string case conversion is very inefficient in MonetDB.
> > I had a look at the code. For each UTF-8 character it performs a hash
> > lookup in the origin-case BAT and finds the corresponding character in
> > the destination-case BAT.
> > However, this is overkill for ASCII characters:
> > - for letters, [A-Z] + 32 = [a-z]
> > - all other ASCII characters stay the same
> > With the assumption that single-byte characters are very frequent in
> > most texts, it makes sense to invest in a simple test and perform the
> > hash lookup only for multi-byte characters.
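The idea quoted above can be sketched roughly as follows. This is only an illustration of the ASCII fast path, not the attached patch and not the actual MonetDB code: `to_lower_utf8`, `utf8_len` and `convert_multibyte` are hypothetical names, and `convert_multibyte` is a stub that copies the sequence unchanged where the real code would do its hash lookup. It also assumes the output buffer is at least as large as the input, which holds here only because the stub never changes the length of a sequence.

```c
#include <stddef.h>

/* Hypothetical stand-in for the existing per-character hash lookup.
 * In this sketch it simply copies the multi-byte sequence unchanged. */
static size_t convert_multibyte(const unsigned char *src, size_t len,
                                unsigned char *dst)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
    return len;
}

/* Length of a UTF-8 sequence, derived from its lead byte. */
static size_t utf8_len(unsigned char c)
{
    if (c < 0x80)
        return 1;
    if ((c & 0xE0) == 0xC0)
        return 2;
    if ((c & 0xF0) == 0xE0)
        return 3;
    if ((c & 0xF8) == 0xF0)
        return 4;
    return 1;   /* invalid lead byte: pass it through as a single byte */
}

/* Lower-case a NUL-terminated UTF-8 string.
 * dst must have room for at least as many bytes as src, plus the NUL. */
void to_lower_utf8(const char *src, char *dst)
{
    const unsigned char *s = (const unsigned char *) src;
    unsigned char *d = (unsigned char *) dst;

    while (*s) {
        if (*s < 0x80) {
            /* Fast path: single-byte (ASCII) character, no lookup. */
            if (*s >= 'A' && *s <= 'Z')
                *d++ = (unsigned char) (*s + 32);
            else
                *d++ = *s;
            s++;
        } else {
            /* Slow path: multi-byte character, fall back to the lookup. */
            size_t n = utf8_len(*s);
            size_t avail = 0;

            /* Do not read past the terminator on a truncated sequence. */
            while (avail < n && s[avail] != 0)
                avail++;
            d += convert_multibyte(s, avail, d);
            s += avail;
        }
    }
    *d = 0;
}
```

The upper-case direction would be symmetric: test for `'a'`..`'z'` and subtract 32 instead.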
> > I tested this on 831 MB (over 360K tuples) of standard English text:
> > - original str.toLower/str.toUpper: 101 seconds (8 MB/s)
> > - modified version: 3.6 seconds (230 MB/s)
> > I guess that even when the text is highly multi-byte oriented the added
> > test wouldn't hurt that much.
> > A side-observation perhaps worth investigating is why that hash lookup
> > is so expensive.
> > Please find my patch attached.
> > Roberto
> > _______________________________________________
> > developers-list mailing list
> > developers-list at monetdb.org
> > https://www.monetdb.org/mailman/listinfo/developers-list
> Sjoerd Mullender