I don't seem to get the expected result; let's see if I'm doing something silly.

- SQL signature:
  create function tokenize(id integer, s string, prob double)
  returns table (id integer, token string, prob double)
  external name batstr."UTF8tokenize";

- MAL signature:
  command batstr.UTF8tokenize(id:bat[:oid,:int],s:bat[:oid,:str],prob:bat[:oid,:dbl]) (:bat[:oid,:int],:bat[:oid,:str],:bat[:oid,:dbl])
  address STRbat_utf8_tokenize_id_prob;

- C signature:
  batstr_export str STRbat_utf8_tokenize_id_prob(bat *r1, bat *r2, bat *r3, const bat *idx, const bat *s, const bat *prob);
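A minimal sketch of what a bulk version is meant to compute, using plain C arrays instead of BATs. This is not the actual UDF: the names token_row, token_table, table_append, tokenize_bulk and table_free are hypothetical, no GDK calls are used, and the splitting is on ASCII spaces only rather than proper UTF-8 tokenization. It only illustrates the 1-to-N shape: whole columns go in, and id and prob are repeated for every token that comes out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One output row: the input id and prob are repeated for every token. */
typedef struct {
    int id;
    char *token;   /* heap-allocated copy, owned by the table */
    double prob;
} token_row;

/* Growable array of rows, standing in for the three result BATs. */
typedef struct {
    token_row *rows;
    size_t count;
    size_t capacity;
} token_table;

/* Append one (id, token, prob) row; returns 0 on success, -1 if out of memory. */
static int table_append(token_table *t, int id, const char *tok, size_t len, double prob)
{
    if (t->count == t->capacity) {
        size_t ncap = t->capacity ? t->capacity * 2 : 8;
        token_row *nrows = realloc(t->rows, ncap * sizeof *nrows);
        if (nrows == NULL)
            return -1;
        t->rows = nrows;
        t->capacity = ncap;
    }
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, tok, len);
    copy[len] = '\0';
    t->rows[t->count].id = id;
    t->rows[t->count].token = copy;
    t->rows[t->count].prob = prob;
    t->count++;
    return 0;
}

/* Bulk tokenizer: consumes the three input columns in a single call and
 * emits one output row per space-separated token. */
static int tokenize_bulk(const int *ids, const char *const *strs,
                         const double *probs, size_t n, token_table *out)
{
    assert(out != NULL);
    for (size_t i = 0; i < n; i++) {
        const char *p = strs[i];
        while (*p != '\0') {
            p += strspn(p, " ");            /* skip leading spaces */
            size_t len = strcspn(p, " ");   /* length of the next token */
            if (len == 0)
                break;                      /* only trailing spaces were left */
            if (table_append(out, ids[i], p, len, probs[i]) != 0)
                return -1;
            p += len;
        }
    }
    return 0;
}

/* Release all token copies and the row array. */
static void table_free(token_table *t)
{
    for (size_t i = 0; i < t->count; i++)
        free(t->rows[i].token);
    free(t->rows);
    t->rows = NULL;
    t->count = t->capacity = 0;
}
```

Called once on three input arrays of length n with a zero-initialised token_table, this fills the table with one row per token; there is no per-tuple call from the outside, which is the point of a bulk implementation.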


Inspecting the MAL plan for a query like

  SELECT *
  FROM tokenize (select id, s, prob from x);

I see that the BAT version of the function is being used inside the same tuple-oriented loop:

|     X_64 := bat.new(nil:oid,nil:int);
|     X_67 := bat.new(nil:oid,nil:str);
|     X_69 := bat.new(nil:oid,nil:dbl);
| barrier (X_72,X_73) := iterator.new(X_8);
|     X_75 := algebra.fetch(X_11,X_72);
|     X_77 := algebra.fetch(X_14,X_72);
|     (X_79,X_80,X_81) := batstr.UTF8tokenize(X_73,X_75,X_77);
|     bat.append(X_64,X_79);
|     bat.append(X_67,X_80);
|     bat.append(X_69,X_81);
|     redo (X_72,X_73) := iterator.next(X_8);
| exit (X_72,X_73);

Executing this fails, obviously, since the loop passes scalar values to a function declared on BATs.

Can you spot where the problem is?
Roberto


On 6 June 2015 at 10:29, Niels Nes <Niels.Nes@cwi.nl> wrote:
On Thu, Jun 04, 2015 at 11:46:08AM +0200, Martin Kersten wrote:
> On 04/06/15 11:36, Roberto Cornacchia wrote:
> >Hi Niels,
> >
> >I have tried this in default and indeed it does work like a charm.
> >(my UTF8tokenize UDF takes two values and outputs a 3-column table)
> >
> >I noticed, though, that it results in a MAL loop:
> >
> >| barrier (X_72,X_73) := iterator.new(X_8);
> >|     X_75 := algebra.fetch(X_11,X_72);
> >|     X_77 := algebra.fetch(X_14,X_72);
> >|     (X_79,X_80,X_81) := str.UTF8tokenize(X_73,X_75,X_77);
> >|     bat.append(X_64,X_79);
> >|     bat.append(X_67,X_80);
> >|     bat.append(X_69,X_81);
> >|     redo (X_72,X_73) := iterator.next(X_8);
> >| exit (X_72,X_73);
> >
> >This of course is not going to be efficient.
> >What if I write the bulk version of this function? Would that work?
> In general, yes. If a bulk version exists, this code would not be generated.
>
> str.UTF8tokenize(X_73:bat[:oid,:str],X_75:bat[:oid,:str],X_77:bat[:oid,:str]):bat[:oid,:str]
batstr.UTF8tokenize(X_73:bat[:oid,:str],X_75:bat[:oid,:str],X_77:bat[:oid,:str]):bat[:oid,:str]

Niels
>
> >And if it does, would it then also work in Oct2014, as it would no longer need the "union" trick?
> >
> >Roberto
> >
> >
> >On 11 April 2015 at 14:06, Niels Nes <Niels.Nes@cwi.nl> wrote:
> >
> >    On Sat, Apr 11, 2015 at 11:03:22AM +0200, Roberto Cornacchia wrote:
> >     > Hi there,
> >     >
> >     > I need a string tokenizer in MonetDB.
> >     > The problem I have is not with the function itself, but with the fact
> >     > that this is a 1 to N rows function.
> >     >
> >     > Implementing this for a single string value is easy enough, using a
> >     > table function that takes a string and returns a table:
> >     >
> >     > create function tokenize(s string)
> >     > returns table (token string)
> >     > external name tokenize;
> >     >
> >     > select *
> >     > from tokenize('one two three');
> >     >
> >     > That's fine.
> >     > The issue I'm having is with extending this to a column of strings.
> >     >
> >     > Ideally, given a string column
> >     >
> >     > one two three
> >     > four five six
> >     > seven eight
> >     >
> >     > I'd like to get an output along these lines (simplistic representation
> >     > here):
> >     >
> >     > one two three | one
> >     > one two three | two
> >     > one two three | three
> >     > four five six | four
> >     > four five six | five
> >     > four five six | six
> >     > seven eight   | seven
> >     > seven eight   | eight
> >     >
> >     >
> >     > I can certainly code the C function and the MAL wrapper to implement this,
> >     > but I can't see how to map it to SQL, given that table functions don't
> >     > accept identifiers as parameters.
> >     >
> >     > Any idea? Any possible workaround?
> >    In default you should be able to call tokenize on a column.
> >    It will output the 'union' of all per row calls.
> >    If you would like the 2 column output, you should take care of
> >    this in your tokenize function, ie return both input and token.
> >
> >    Niels
> >     > Thanks, Roberto
> >     >
> >
> >     > _______________________________________________
> >     > users-list mailing list
> >     > users-list@monetdb.org
> >     > https://www.monetdb.org/mailman/listinfo/users-list
> >
> >
> >    --
> >    Niels Nes, Manager ITF, Centrum Wiskunde & Informatica (CWI)
> >    Science Park 123, 1098 XG Amsterdam, The Netherlands
> >    room L3.14,  phone ++31 20 592-4098     sip:4098@sip.cwi.nl
> >    url: https://www.cwi.nl/people/niels    e-mail: Niels.Nes@cwi.nl
> >

