Hi there,
I need a string tokenizer in MonetDB. The problem I have is not with the function itself, but with the fact that this is a 1 to N rows function.
Implementing this for a single string value is easy enough, using a table function that takes a string and returns a table:

  create function tokenize(s string)
  returns table (token string)
  external name tokenize;

  select *
  from tokenize('one two three');

That's fine. The issue I'm having is with extending this to a column of strings.
Ideally, given a string column

  one two three
  four five six
  seven eight

I'd like to get an output along these lines (simplistic representation here):

  one two three | one
  one two three | two
  one two three | three
  four five six | four
  four five six | five
  four five six | six
  seven eight   | seven
  seven eight   | eight

I can certainly code the C function and the MAL wrapper to implement this, but I can't see how to map it to SQL, given that table functions don't accept identifiers as parameters.

Any ideas? Any possible workaround?

Thanks,
Roberto

On Sat, Apr 11, 2015 at 11:03:22AM +0200, Roberto Cornacchia wrote:
> I can certainly code the C function and the MAL wrapper to implement this, but I can't see how to map it to SQL, given that table functions don't accept identifiers as parameters.
>
> Any ideas? Any possible workaround?

On the default branch you should be able to call tokenize on a column. It will output the 'union' of all per-row calls. If you would like the two-column output, you should take care of this in your tokenize function, i.e. return both the input and the token.

Niels
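
To make the 'union' of all per-row calls concrete, here is a minimal sketch in plain C. It does not use MonetDB's internal API: it assumes whitespace-separated tokens, and the names `pair`, `pair_list`, `tokenize_column` and `pair_list_free` are hypothetical. Each input string contributes one output row per token, and every row carries both the input and the token, which gives the two-column shape asked for in the question.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One output row: the original input string and one token taken from it. */
typedef struct {
    const char *input;  /* borrowed pointer to the source string */
    char *token;        /* heap-allocated copy of the token */
} pair;

/* Growable list of output rows (hypothetical helper, not MonetDB API). */
typedef struct {
    pair *rows;
    size_t count;
    size_t cap;
} pair_list;

/* Append one (input, token) row, copying len bytes starting at start. */
static void pair_list_push(pair_list *out, const char *input,
                           const char *start, size_t len)
{
    if (out->count == out->cap) {
        size_t cap = out->cap ? out->cap * 2 : 8;
        pair *rows = realloc(out->rows, cap * sizeof *rows);
        if (rows == NULL)
            abort();            /* sketch: give up on allocation failure */
        out->rows = rows;
        out->cap = cap;
    }
    char *tok = malloc(len + 1);
    if (tok == NULL)
        abort();
    memcpy(tok, start, len);
    tok[len] = '\0';
    out->rows[out->count].input = input;
    out->rows[out->count].token = tok;
    out->count++;
}

/* Tokenize every string of a column. The result is the concatenation
 * (the "union") of the per-row results, in input order. */
static pair_list tokenize_column(const char *const *column, size_t n)
{
    pair_list out = {NULL, 0, 0};
    assert(column != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const char *p = column[i];
        while (*p != '\0') {
            p += strspn(p, " \t\n");            /* skip separators */
            size_t len = strcspn(p, " \t\n");   /* length of next token */
            if (len == 0)
                break;                          /* only separators were left */
            pair_list_push(&out, column[i], p, len);
            p += len;
        }
    }
    return out;
}

/* Release the token copies and the row array. */
static void pair_list_free(pair_list *out)
{
    for (size_t i = 0; i < out->count; i++)
        free(out->rows[i].token);
    free(out->rows);
    out->rows = NULL;
    out->count = out->cap = 0;
}
```

For the three-row column in the question, `tokenize_column` yields eight rows, from (`one two three`, `one`) through (`seven eight`, `eight`). A real UDF would write into result columns instead of a heap-allocated list.
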
users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
Hi Niels,
That sounds perfect. I suppose you refer to this: http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=5db56a1d5bc5
Do you think I would have any luck trying to port this back to Oct2014?
On 11 April 2015 at 14:06, Niels Nes <Niels.Nes@cwi.nl> wrote:
> --
> Niels Nes, Manager ITF, Centrum Wiskunde & Informatica (CWI)
> Science Park 123, 1098 XG Amsterdam, The Netherlands
> room L3.14, phone ++31 20 592-4098
> sip:4098@sip.cwi.nl
> url: https://www.cwi.nl/people/niels
> e-mail: Niels.Nes@cwi.nl

On Sat, Apr 11, 2015 at 03:24:42PM +0200, Roberto Cornacchia wrote:
> Hi Niels,
>
> That sounds perfect. I suppose you refer to this: http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=5db56a1d5bc5
>
> Do you think I would have any luck trying to port this back to Oct2014?

I checked it in on the default branch, as this is a new feature rather than a bug fix. But the changeset should contain enough to backport it.

Niels
P.S. We are planning a release too...

Super! I'll give it a try.
Thanks Niels.
Hi Niels,
I have tried this in default and indeed it does work like a charm. (my UTF8tokenize UDF takes two values and outputs a 3-column table)
I noticed, though, that it results in a MAL loop:

| barrier (X_72,X_73) := iterator.new(X_8);
|     X_75 := algebra.fetch(X_11,X_72);
|     X_77 := algebra.fetch(X_14,X_72);
|     (X_79,X_80,X_81) := str.UTF8tokenize(X_73,X_75,X_77);
|     bat.append(X_64,X_79);
|     bat.append(X_67,X_80);
|     bat.append(X_69,X_81);
|     redo (X_72,X_73) := iterator.next(X_8);
| exit (X_72,X_73);

This of course is not going to be efficient. What if I write the bulk version of this function? Would that work? And if it does, would it then also work in Oct2014, as it would no longer need the "union" trick?
Roberto
On 04/06/15 11:36, Roberto Cornacchia wrote:
> This of course is not going to be efficient. What if I write the bulk version of this function? Would that work?

In general, yes. If a bulk version exists, this code would not be generated.

str.UTF8tokenize(X_73:bat[:oid,:str],X_75:bat[:oid,:str],X_77:bat[:oid,:str]):bat[:oid,:str]

> And if it does, would it then also work in Oct2014, as it would no longer need the "union" trick?
On Thu, Jun 04, 2015 at 11:46:08AM +0200, Martin Kersten wrote:
> In general, yes. If a bulk version exists, this code would not be generated.
>
> str.UTF8tokenize(X_73:bat[:oid,:str],X_75:bat[:oid,:str],X_77:bat[:oid,:str]):bat[:oid,:str]

batstr.UTF8tokenize(X_73:bat[:oid,:str],X_75:bat[:oid,:str],X_77:bat[:oid,:str]):bat[:oid,:str]
Niels
I don't seem to get the expected result, so let's see if I'm doing something silly.

- SQL signature:

  create function tokenize(id integer, s string, prob double)
  returns table (id integer, token string, prob double)
  external name batstr."UTF8tokenize";

- MAL signature:

  command batstr.UTF8tokenize(id:bat[:oid,:int],s:bat[:oid,:str],prob:bat[:oid,:dbl])
      (:bat[:oid,:int],:bat[:oid,:str],:bat[:oid,:dbl])
  address STRbat_utf8_tokenize_id_prob;

- C signature:

  batstr_export str STRbat_utf8_tokenize_id_prob(bat *r1, bat *r2, bat *r3, const bat *idx, const bat *s, const bat *prob);

Inspecting a MAL plan for a query like

  SELECT * FROM tokenize (select id, s, prob from x);

I see that the bat version of the function is being used inside the same tuple-oriented loop.

| X_64 := bat.new(nil:oid,nil:int);
| X_67 := bat.new(nil:oid,nil:str);
| X_69 := bat.new(nil:oid,nil:dbl);
| barrier (X_72,X_73) := iterator.new(X_8);
|     X_75 := algebra.fetch(X_11,X_72);
|     X_77 := algebra.fetch(X_14,X_72);
|     (X_79,X_80,X_81) := batstr.UTF8tokenize(X_73,X_75,X_77);
|     bat.append(X_64,X_79);
|     bat.append(X_67,X_80);
|     bat.append(X_69,X_81);
|     redo (X_72,X_73) := iterator.next(X_8);
| exit (X_72,X_73);

Executing this fails, obviously.

Can you spot where the problem is?

Roberto

Clone the following (small) repository and read the README.rst file in there: http://dev.monetdb.org/hg/MonetDB-extend/
In short, you need to define the SQL scalar function which should point to the MAL function str.UTF8tokenize, and you need to have the MAL bulk function batstr.UTF8tokenize. If you have those, SQL should figure it all out. (In particular, you should *not* have the SQL bulk function.)
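
The naming convention described here can be pictured with a small sketch in plain C. It is only a simplified model of the lookup, not MonetDB's actual resolution code: the registry contents and the names `registered`, `is_registered` and `pick_implementation` are hypothetical, and `str.scalaronly` is a made-up entry. The idea is that SQL names the scalar MAL function (`str.UTF8tokenize`), and the bulk variant is looked for under the same function name in the module prefixed with `bat`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical registry of MAL functions known to the server. */
static const char *const registered[] = {
    "str.UTF8tokenize",     /* scalar version, the one SQL points to */
    "batstr.UTF8tokenize",  /* bulk version, operates on whole columns */
    "str.scalaronly",       /* made-up function without a bulk version */
};

/* Return 1 if the fully qualified name is in the registry, else 0. */
static int is_registered(const char *qualified)
{
    size_t n = sizeof registered / sizeof registered[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(registered[i], qualified) == 0)
            return 1;
    return 0;
}

/* Write the name of the implementation to call into buf.
 * Returns 1 if the bulk version was found, 0 if only the scalar version
 * exists (so a per-row loop would be needed), and -1 if neither exists
 * or buf is too small. */
static int pick_implementation(const char *module, const char *func,
                               char *buf, size_t buflen)
{
    assert(module != NULL && func != NULL && buf != NULL);

    /* First try the bulk variant: same function, module prefixed by "bat". */
    int len = snprintf(buf, buflen, "bat%s.%s", module, func);
    if (len >= 0 && (size_t)len < buflen && is_registered(buf))
        return 1;

    /* Otherwise fall back to the scalar function itself. */
    len = snprintf(buf, buflen, "%s.%s", module, func);
    if (len >= 0 && (size_t)len < buflen && is_registered(buf))
        return 0;

    return -1;
}
```

In this model, asking for `str` and `UTF8tokenize` selects `batstr.UTF8tokenize`, while a function with no bulk counterpart falls back to its scalar version, which corresponds to the per-row loop seen in the plans earlier in the thread.
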
-- Sjoerd Mullender

Thanks Sjoerd,
What I posted before did indeed contain one mistake: the SQL scalar function was pointing to the MAL bulk function, not to the scalar one. I have now fixed it, but the bulk version is still not used (it exists and is defined in batstr, I believe with the right signature).

I already have a number of UDFs with both scalar and bulk implementations, and have no issues with those. Could it be that in this particular case (a function that takes a sub-select as input) the pattern for the bulk version is not recognised?

Roberto
On 8 June 2015 at 13:48, Sjoerd Mullender sjoerd@acm.org wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256
Clone the following (small) repository and read the README.rst file in there: http://dev.monetdb.org/hg/MonetDB-extend/
In short, you need to define the SQL scalar function which should point to the MAL function str.UTF8tokenize, and you need to have the MAL bulk function batstr.UTF8tokenize. If you have those, SQL should figure it all out. (In particular, you should *not* have the SQL bulk function.)
On 08/06/15 11:36, Roberto Cornacchia wrote:
I don't seem to get the expected result; let's see if I'm doing something silly.
- SQL signature: create function tokenize(id integer, s string, prob double) returns table (id integer, token string, prob double) external name batstr."UTF8tokenize";
- MAL signature: command batstr.UTF8tokenize(id:bat[:oid,:int],s:bat[:oid,:str],prob:bat[:oid,:dbl]) (:bat[:oid,:int],:bat[:oid,:str],:bat[:oid,:dbl]) address STRbat_utf8_tokenize_id_prob;
- C signature: batstr_export str STRbat_utf8_tokenize_id_prob(bat *r1, bat *r2, bat *r3, const bat *idx, const bat *s, const bat *prob);
Inspecting a MAL plan for a query like
SELECT * FROM tokenize (select id, s, prob from x);
I see the bat version of the function being used inside the same tuple-oriented loop.
| X_64 := bat.new(nil:oid,nil:int);
| X_67 := bat.new(nil:oid,nil:str);
| X_69 := bat.new(nil:oid,nil:dbl);
| barrier (X_72,X_73) := iterator.new(X_8);
| X_75 := algebra.fetch(X_11,X_72);
| X_77 := algebra.fetch(X_14,X_72);
| (X_79,X_80,X_81) := batstr.UTF8tokenize(X_73,X_75,X_77);
| bat.append(X_64,X_79);
| bat.append(X_67,X_80);
| bat.append(X_69,X_81);
| redo (X_72,X_73) := iterator.next(X_8);
| exit (X_72,X_73);
Executing this fails, obviously.
Can you spot where the problem is? Roberto
On 6 June 2015 at 10:29, Niels Nes <Niels.Nes@cwi.nl> wrote:
On Thu, Jun 04, 2015 at 11:46:08AM +0200, Martin Kersten wrote:
On 04/06/15 11:36, Roberto Cornacchia wrote:
Hi Niels,
I have tried this in the default branch and it does indeed work like a charm (my UTF8tokenize UDF takes two values and outputs a 3-column table).
I noticed, though, that it results in a MAL loop:
| barrier (X_72,X_73) := iterator.new(X_8);
| X_75 := algebra.fetch(X_11,X_72);
| X_77 := algebra.fetch(X_14,X_72);
| (X_79,X_80,X_81) := str.UTF8tokenize(X_73,X_75,X_77);
| bat.append(X_64,X_79);
| bat.append(X_67,X_80);
| bat.append(X_69,X_81);
| redo (X_72,X_73) := iterator.next(X_8);
| exit (X_72,X_73);
This of course is not going to be efficient. What if I write the bulk version of this function? Would that work?
In general, yes. If a bulk version exists, this code would not be generated.
str.UTF8tokenize(X_73:bat[:oid,:str],X_75:bat[:oid,:str],X_77:bat[:oid,:str]):bat[:oid,:str]
batstr.UTF8tokenize(X_73:bat[:oid,:str],X_75:bat[:oid,:str],X_77:bat[:oid,:str]):bat[:oid,:str]
Niels
And if it does, would it then also work in Oct2014, as it would no longer need the "union" trick?
Roberto
On 11 April 2015 at 14:06, Niels Nes <Niels.Nes@cwi.nl> wrote:
In the default branch you should be able to call tokenize on a column. It will output the 'union' of all per-row calls. If you would like the 2-column output, you should take care of this in your tokenize function, i.e. return both input and token.
Niels
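A minimal sketch of the semantics described above, in plain C and independent of the MonetDB/GDK API: each input row is tokenized on its own, and the column result is the union of the per-row outputs, with the row index carried along so that every token can be traced back to its input. The names token_row, tokenize_row, tokenize_column and MAX_TOKEN_LEN are hypothetical, and the splitting is byte-wise on ASCII whitespace, not the UTF-8-aware logic a real UTF8tokenize would need.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKEN_LEN 63 /* hypothetical limit, chosen for this sketch only */

/* One output row: the index of the input row plus one token taken from it. */
typedef struct {
    size_t row;
    char token[MAX_TOKEN_LEN + 1];
} token_row;

/* Scalar version: split one string on whitespace and write (row, token)
 * pairs to out. Returns the number of pairs written; stops when out is full. */
size_t tokenize_row(size_t row, const char *s, token_row *out, size_t cap)
{
    size_t n = 0;

    assert(s != NULL && out != NULL);
    while (*s != '\0' && n < cap) {
        /* skip leading whitespace */
        while (isspace((unsigned char)*s))
            s++;
        if (*s == '\0')
            break;
        /* scan to the end of the token */
        const char *start = s;
        while (*s != '\0' && !isspace((unsigned char)*s))
            s++;
        size_t len = (size_t)(s - start);
        if (len > MAX_TOKEN_LEN)
            len = MAX_TOKEN_LEN; /* over-long tokens are truncated here */
        memcpy(out[n].token, start, len);
        out[n].token[len] = '\0';
        out[n].row = row;
        n++;
    }
    return n;
}

/* Column version: the union of all per-row calls, in row order. */
size_t tokenize_column(const char *const *col, size_t nrows,
                       token_row *out, size_t cap)
{
    size_t n = 0;

    assert(col != NULL && out != NULL);
    for (size_t i = 0; i < nrows && n < cap; i++)
        n += tokenize_row(i, col[i], out + n, cap - n);
    return n;
}
```

Called on the three-row column from the original question ("one two three", "four five six", "seven eight"), tokenize_column would write eight (row, token) pairs, in row order.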
Sjoerd Mullender
Just to close the loop, I no longer see my problem. The signatures are now correct and the MAL plan is generated as expected. Perhaps I had run my last tests on the wrong MonetDB instance. No idea. Thanks for the pointer to the documentation, I hadn't seen it before. Roberto
participants (4)
- Martin Kersten
- Niels Nes
- Roberto Cornacchia
- Sjoerd Mullender