I don't seem to get the expected result, so let's see if I'm doing something silly.
- SQL signature: create function tokenize(id integer, s string, prob double) returns table (id integer, token string, prob double) external name batstr."UTF8tokenize";
- MAL signature: command batstr.UTF8tokenize(id:bat[:oid,:int],s:bat[:oid,:str],prob:bat[:oid,:dbl]) (:bat[:oid,:int],:bat[:oid,:str],:bat[:oid,:dbl]) address STRbat_utf8_tokenize_id_prob;
- C signature: batstr_export str STRbat_utf8_tokenize_id_prob(bat *r1, bat *r2, bat *r3, const bat *idx, const bat *s, const bat *prob);
Inspecting the MAL plan for a query like
SELECT * FROM tokenize (select id, s, prob from x);
I see the BAT version of the function being called inside the same tuple-oriented loop.
| X_64 := bat.new(nil:oid,nil:int);
| X_67 := bat.new(nil:oid,nil:str);
| X_69 := bat.new(nil:oid,nil:dbl);
| barrier (X_72,X_73) := iterator.new(X_8);
|     X_75 := algebra.fetch(X_11,X_72);
|     X_77 := algebra.fetch(X_14,X_72);
|     (X_79,X_80,X_81) := batstr.UTF8tokenize(X_73,X_75,X_77);
|     bat.append(X_64,X_79);
|     bat.append(X_67,X_80);
|     bat.append(X_69,X_81);
|     redo (X_72,X_73) := iterator.next(X_8);
| exit (X_72,X_73);
Executing this fails, obviously.
Can you spot where the problem is?
Roberto
On 6 June 2015 at 10:29, Niels Nes Niels.Nes@cwi.nl wrote:
On Thu, Jun 04, 2015 at 11:46:08AM +0200, Martin Kersten wrote:
On 04/06/15 11:36, Roberto Cornacchia wrote:
Hi Niels,
I have tried this in default and indeed it does work like a charm. (my UTF8tokenize UDF takes two values and outputs a 3-column table)
I noticed, though, that it results in a MAL loop:
| barrier (X_72,X_73) := iterator.new(X_8);
|     X_75 := algebra.fetch(X_11,X_72);
|     X_77 := algebra.fetch(X_14,X_72);
|     (X_79,X_80,X_81) := str.UTF8tokenize(X_73,X_75,X_77);
|     bat.append(X_64,X_79);
|     bat.append(X_67,X_80);
|     bat.append(X_69,X_81);
|     redo (X_72,X_73) := iterator.next(X_8);
| exit (X_72,X_73);
This of course is not going to be efficient. What if I write the bulk version of this function? Would that work?
In general, yes. If a bulk version exists, this code would not be
generated.
str.UTF8tokenize(X_73:bat[:oid,:str],X_75:bat[:oid,:str],X_77:bat[:oid,:str]):bat[:oid,:str]
batstr.UTF8tokenize(X_73:bat[:oid,:str],X_75:bat[:oid,:str],X_77:bat[:oid,:str]):bat[:oid,:str]
Niels
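To make the "bulk version" idea concrete, here is a minimal sketch in plain C of a function that consumes whole input columns and fills three parallel output columns in a single call, instead of being invoked once per row with the results appended afterwards. It deliberately does not use MonetDB's GDK/BAT API: `TokenColumns`, `tokenize_columns` and `TOKEN_MAX` are hypothetical names, and splitting on single spaces with a fixed token buffer is an assumption made only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOKEN_MAX 32  /* hypothetical fixed token buffer size */

/* Hypothetical column-wise result: three parallel arrays, one per output column. */
typedef struct {
    int    *id;
    char  (*token)[TOKEN_MAX];
    double *prob;
    size_t  count;     /* rows filled so far */
    size_t  capacity;  /* rows available in each array */
} TokenColumns;

/* Bulk call: walks all input rows once and fills the output columns.
 * Returns the total number of output rows produced. */
size_t tokenize_columns(const int *ids, const char *const *strs,
                        const double *probs, size_t nrows, TokenColumns *out)
{
    assert(ids != NULL && strs != NULL && probs != NULL && out != NULL);
    for (size_t i = 0; i < nrows; i++) {
        const char *s = strs[i];
        while (out->count < out->capacity) {
            s += strspn(s, " ");              /* skip separators */
            size_t len = strcspn(s, " ");     /* length of the next token */
            if (len == 0)
                break;                        /* no more tokens in this row */
            /* Truncate tokens that do not fit the fixed buffer. */
            size_t copy = len < TOKEN_MAX - 1 ? len : TOKEN_MAX - 1;
            memcpy(out->token[out->count], s, copy);
            out->token[out->count][copy] = '\0';
            out->id[out->count] = ids[i];     /* id is repeated for every token */
            out->prob[out->count] = probs[i]; /* so is prob */
            out->count++;
            s += len;
        }
    }
    return out->count;
}
```

The caller allocates the three output arrays up front and passes them in through `TokenColumns`; a real implementation would instead grow its result columns as needed.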
And if it does, would it then also work in Oct2014, as it would no
longer need the "union" trick?
Roberto
On 11 April 2015 at 14:06, Niels Nes <Niels.Nes@cwi.nl> wrote:
On Sat, Apr 11, 2015 at 11:03:22AM +0200, Roberto Cornacchia wrote:
> Hi there,
>
> I need a string tokenizer in MonetDB.
> The problem I have is not with the function itself, but with the fact
> that this is a 1 to N rows function.
>
> Implementing this for a single string value is easy enough, using a
> table function that takes a string and returns a table:
>
> create function tokenize(s string)
> returns table (token string)
> external name tokenize;
>
> select *
> from tokenize('one two three');
>
> That's fine.
> The issue I'm having is with extending this to a column of strings.
>
> Ideally, given a string column
>
> one two three
> four five six
> seven eight
>
> I'd like to get an output along these lines (simplistic representation
> here):
>
> one two three | one
> one two three | two
> one two three | three
> four five six | four
> four five six | five
> four five six | six
> seven eight | seven
> seven eight | eight
>
> I can certainly code the C function and the MAL wrapper to implement this,
> but I can't see how to map it to SQL, given that table functions don't
> accept identifiers as parameters.
>
> Any idea? Any possible workaround?
In the default branch you should be able to call tokenize on a column. It will output the 'union' of all per-row calls. If you would like the two-column output, you should take care of this in your tokenize function, i.e. return both the input and the token.
Niels
> Thanks,
> Roberto
>
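As an illustration of the "union of all per-row calls" behaviour, with the function itself returning both the input and the token, here is a minimal sketch in plain C. It is not MonetDB code and does not use the GDK/BAT API: `InputToken`, `tokenize_row`, `tokenize_column` and `TOKEN_MAX` are hypothetical names, and splitting on single spaces is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOKEN_MAX 32  /* hypothetical fixed token buffer size */

/* Hypothetical output row: the original input plus one of its tokens. */
typedef struct {
    const char *input;        /* points at the original string (not copied) */
    char token[TOKEN_MAX];
} InputToken;

/* Per-row call: split one string on spaces, writing at most cap rows to out.
 * Returns the number of rows written. */
size_t tokenize_row(const char *s, InputToken *out, size_t cap)
{
    const char *p = s;
    size_t n = 0;
    assert(s != NULL && out != NULL);
    while (n < cap) {
        p += strspn(p, " ");              /* skip separators */
        size_t len = strcspn(p, " ");     /* length of the next token */
        if (len == 0)
            break;                        /* end of string */
        /* Truncate tokens that do not fit the fixed buffer. */
        size_t copy = len < TOKEN_MAX - 1 ? len : TOKEN_MAX - 1;
        memcpy(out[n].token, p, copy);
        out[n].token[copy] = '\0';
        out[n].input = s;                 /* echo the input next to the token */
        n++;
        p += len;
    }
    return n;
}

/* Column call: the result is the concatenation of all per-row results. */
size_t tokenize_column(const char *const *col, size_t nrows,
                       InputToken *out, size_t cap)
{
    size_t total = 0;
    for (size_t i = 0; i < nrows; i++)
        total += tokenize_row(col[i], out + total, cap - total);
    return total;
}
```

Called on the three-row column from the original question, this yields eight (input, token) rows in input order.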
> _______________________________________________ > users-list mailing list > users-list@monetdb.org <mailto:users-list@monetdb.org> > https://www.monetdb.org/mailman/listinfo/users-list
--
Niels Nes, Manager ITF, Centrum Wiskunde & Informatica (CWI)
Science Park 123, 1098 XG Amsterdam, The Netherlands
room L3.14, phone ++31 20 592-4098
sip:4098@sip.cwi.nl
url: https://www.cwi.nl/people/niels e-mail: Niels.Nes@cwi.nl