Hi,
After reviewing the alternatives, SQL and Python UDFs, I was stuck either on performance (SQL UDFs) or on usability (Python UDFs: they cannot be used with aggregation, and their performance with dates is not great),
so I decided to go the hard way with C functions. As a bonus, this lets me change the functionality without worrying about dependencies, which was not the case with the other languages.
The purpose is to create a set of formatting functions for Year, Quarter, Month, Week and Day brackets, and of course I need to create the bulk version of each function for performance.
Starting from MTIMEdate_extract_year_bulk, I now have the simple function working, and I can call it successfully from mclient:

str
UDFyearbracket(str *ret, const date *v)
{
    if (*v == date_nil) {
        *ret = GDKstrdup(str_nil);
    } else {
        int year;
        fromdate(*v, NULL, NULL, &year);
        *ret = (str) GDKmalloc(15);
        sprintf(*ret, "%d", year);
    }
    return MAL_SUCCEED;
}

For the bulk version I get an error in the log: gdk_atoms.c:1345: strPut: Assertion `(v[i] & 0x80) == 0' failed.

str
UDFBATyearbracket(bat *ret, const bat *bid)
{
    BAT *b, *bn;
    BUN i, n;
    str *y;
    const date *t;

    if ((b = BATdescriptor(*bid)) == NULL)
        throw(MAL, "UDF.BATyearbracket", "Cannot access descriptor");
    n = BATcount(b);

    bn = COLnew(b->hseqbase, TYPE_str, BATcount(b), TRANSIENT);
    if (bn == NULL) {
        BBPunfix(b->batCacheid);
        throw(MAL, "UDF.BATyearbracket", "memory allocation failure");
    }
    bn->tnonil = 1;
    bn->tnil = 0;

    t = (const date *) Tloc(b, 0);
    y = (str *) Tloc(bn, 0);
    for (i = 0; i < n; i++) {
        if (*t == date_nil) {
            *y = GDKstrdup(str_nil);
        } else
            UDFyearbracket(y, t);
        if (strcmp(*y, str_nil) == 0) {
            bn->tnonil = 0;
            bn->tnil = 1;
        }
        y++;
        t++;
    }

    BATsetcount(bn, (BUN) (y - (str *) Tloc(bn, 0)));

    bn->tsorted = BATcount(bn) < 2;
    bn->trevsorted = BATcount(bn) < 2;

    BBPkeepref(*ret = bn->batCacheid);
    BBPunfix(b->batCacheid);
    return MAL_SUCCEED;
}

PS: I am not a C expert, but I can find my way around basic operations and pointers.
Any help or suggestions are appreciated.
Thank you.
Imad, I hope you succeed with this. Please post a comment if you get it working. Could those new functions then be incorporated into a future version of MonetDB, or perhaps be easily compiled into the current one? That way, users could suggest new useful functions in the future (it's a shame about SQL UDF performance).
Regards!
2016-12-28 14:48 GMT-03:00 imad hajj chahine imad.hajj.chahine@gmail.com:
users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
See https://dev.monetdb.org/hg/MonetDB-extend/ for a tutorial on how to create a UDF in C. You can use the URL to clone from.
On 12/28/2016 09:28 PM, Alberto Ferrari wrote:
Thank you Sjoerd,
Any idea how to convert an integer to a UTF-8 string? Does sprintf come with a variant that can handle UTF-8?
Thank you.
On Wed, Dec 28, 2016 at 11:08 PM, Sjoerd Mullender sjoerd@monetdb.org wrote:
Hi again Sjoerd,
After digging in the code I found GDKstrFromStr. Does this function handle conversion from a normal string to a UTF-8 string? Is this the correct way to call it:

str
UDFyearbracket(str *ret, const date *v)
{
    if (*v == date_nil) {
        *ret = GDKstrdup(str_nil);
    } else {
        int year;
        fromdate(*v, NULL, NULL, &year);
        *ret = (str) GDKmalloc(15);
        sprintf(*ret, "%d", year);
        GDKstrFromStr((unsigned char *)*ret, (unsigned char *)*ret, 15);
    }
    return MAL_SUCCEED;
}

Thank you.
On Wed, Dec 28, 2016 at 11:40 PM, imad hajj chahine < imad.hajj.chahine@gmail.com> wrote:
Since the MonetDB server is UTF-8 *only*, you should *never* have non-UTF-8 strings inside the server. If you have strings in some other encoding, they should be converted to UTF-8 by whatever client program you're using. mclient has options to do this (-e option). If you want to do conversions yourself, take a look at the iconv related code in common/stream/stream.c. Also, the console_read and console_write functions in that file can give you inspiration. They convert Windows wide characters (16-bit encodings of Unicode code points) to and from UTF-8. This would be close to converting ints to UTF-8.
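If "converting ints to UTF-8" is read as turning a Unicode code point (an integer) into its UTF-8 byte sequence, the idea can be sketched roughly as below. This is only an illustration of the encoding rules, not code taken from common/stream/stream.c; the name utf8_encode is a hypothetical helper and not a MonetDB function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of MonetDB): encode one Unicode code
 * point as UTF-8. "out" must have room for at least 4 bytes.
 * Returns the number of bytes written, or 0 if cp is not a valid
 * Unicode scalar value (surrogate or above U+10FFFF). */
size_t
utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        /* plain ASCII: one byte, high bit clear */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;   /* UTF-16 surrogates are not valid code points */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Note that for code points below 0x80 (which includes the digits '0' to '9') the UTF-8 encoding is the single ASCII byte itself.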
On 12/29/2016 01:10 AM, imad hajj chahine wrote:
Hi Sjoerd,
I tried to use iconv with no luck: I always get an empty string. I assumed that the output of sprintf is encoded in "ISO-8859-1". Can you please take a look at the following implementation:

str
UDFyearbracket(str *ret, const date *v)
{
    if (*v == date_nil) {
        *ret = GDKstrdup(str_nil);
    } else {
        iconv_t cv = iconv_open("UTF-8", "ISO-8859-1");
        int factor = 4;
        size_t fromlen, tolen;
        int year;
        char *buf;
        char *retChar = (char *)*ret;
        fromdate(*v, NULL, NULL, &year);
        buf = (char *) GDKmalloc(15);
        sprintf(buf, "%d", year);
        fromlen = strlen(buf);
        tolen = factor * fromlen + 1;
        retChar = (char *) GDKmalloc(tolen);
        iconv(cv, &buf, &fromlen, &retChar, &tolen);
        iconv_close(cv);
    }
    return MAL_SUCCEED;
}

Thanks
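Two details of the attempt above are worth noting: iconv() advances the input and output pointers it is given, so after the call retChar no longer points at the start of the converted text, and the function never assigns anything to *ret in the non-nil branch. For comparison, here is a minimal stand-alone sketch of how iconv is commonly driven. It uses plain malloc instead of GDKmalloc so that it compiles outside the server, and latin1_to_utf8 is a hypothetical helper name, not something from MonetDB.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated ISO-8859-1 string to a
 * newly malloc'ed UTF-8 string. Returns NULL on failure; the caller
 * frees the result. */
char *
latin1_to_utf8(const char *in)
{
    iconv_t cd;
    size_t inleft, outleft;
    char *out, *inp, *outp;

    assert(in != NULL);
    cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t) -1)
        return NULL;

    inleft = strlen(in);
    outleft = 2 * inleft;       /* a Latin-1 byte needs at most 2 UTF-8 bytes */
    out = malloc(outleft + 1);  /* +1 for the terminating NUL */
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    /* Work on copies of the pointers: iconv() moves them forward. */
    inp = (char *) in;
    outp = out;
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t) -1) {
        free(out);
        iconv_close(cd);
        return NULL;
    }
    *outp = '\0';               /* iconv() does not NUL-terminate */
    iconv_close(cd);
    return out;                 /* start of the buffer, not the advanced outp */
}
```

For a string of digits the result is byte-for-byte identical to the input, so the conversion changes nothing in that case.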
On Thu, Dec 29, 2016 at 1:08 PM, Sjoerd Mullender sjoerd@monetdb.org wrote:
Hey Imad,
One of the nice things about UTF-8 is that normal ASCII characters are valid UTF-8. Hence “normal strings” in C are already valid UTF-8. Try simply returning the output from sprintf, like this:

str
UDFyearbracket(str *ret, const date *v)
{
    if (*v == date_nil) {
        *ret = GDKstrdup(str_nil);
    } else {
        int year;
        char *buf;
        fromdate(*v, NULL, NULL, &year);
        buf = (char *) GDKmalloc(15);
        sprintf(buf, "%d", year);
        *ret = buf;
    }
    return MAL_SUCCEED;
}

Regards,
Mark
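The point that sprintf output for an integer is already valid UTF-8 can be checked with a small stand-alone sketch. It formats a year with snprintf and verifies that no byte has its high bit set, which is the same bit test that appears in the assertion message quoted earlier in the thread. The names is_ascii and format_year are hypothetical helpers for illustration only, and nothing here is MonetDB code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if every byte of s is 7-bit ASCII
 * (high bit clear), 0 otherwise. Pure ASCII is always valid UTF-8. */
int
is_ascii(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char) *s & 0x80) != 0)
            return 0;
    }
    return 1;
}

/* Hypothetical helper: write year as decimal text into buf.
 * Returns 1 on success, 0 if buf (of size len) is too small. */
int
format_year(int year, char *buf, size_t len)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, len, "%d", year);
    return n >= 0 && (size_t) n < len;
}
```

A call such as format_year(2016, buf, sizeof(buf)) followed by is_ascii(buf) returns 1 for any year, since "%d" only ever produces digits and an optional minus sign.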
On 29 Dec 2016, at 14:35, imad hajj chahine imad.hajj.chahine@gmail.com wrote:
users-list mailing list users-list@monetdb.org mailto:users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list https://www.monetdb.org/mailman/listinfo/users-list
-- Sjoerd Mullender
users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
Hey Imad,
Apologies, scrolling back I noticed that was actually your first attempt at writing the UDF. The source of your error is not encoding related, the error is misleading.
The problem is that in your bulk version you are using Tloc(bn, i) to assign to a string column. Tloc should only be used with constant-sized columns, such as integers or dates. For variable-sized columns such as strings, you should use BUNappend to add values to the column. The reason for that is that string columns are not stored as an array of character pointers, which your initial implementation assumes. Instead, string columns use integers to point into a heap of strings. You are assigning a pointer to one of these integers, which makes MonetDB think the strings are in some random part of your memory. There’s a high chance that that random part of memory does not contain a valid UTF-8 string, hence you get the encoding error.
Try the following bulk implementation instead, using BUNappend instead of Tloc to assign to your column.
Regards,
Mark
str
UDFBATyearbracket(bat *ret, const bat *bid)
{
    BAT *b, *bn;
    BUN i,n;
    const date *t;

    if ((b = BATdescriptor(*bid)) == NULL)
        throw(MAL, "UDF.BATyearbracket", "Cannot access descriptor");
    n = BATcount(b);

    bn = COLnew(b->hseqbase, TYPE_str, BATcount(b), TRANSIENT);
    if (bn == NULL) {
        BBPunfix(b->batCacheid);
        throw(MAL, "UDF.BATyearbracket", "memory allocation failure");
    }
    bn->tnonil = 1;
    bn->tnil = 0;

    t = (const date *) Tloc(b, 0);
    for (i = 0; i < n; i++) {
        if (*t == date_nil) {
            BUNappend(bn, str_nil, FALSE);
            bn->tnonil = 0;
            bn->tnil = 1;
        } else {
            char* ret;
            UDFyearbracket(&ret, t);
            BUNappend(bn, ret, FALSE);
        }
        t++;
    }

    BATsetcount(bn, n);

    bn->tsorted = BATcount(bn)<2;
    bn->trevsorted = BATcount(bn)<2;

    BBPkeepref(*ret = bn->batCacheid);
    BBPunfix(b->batCacheid);
    return MAL_SUCCEED;
}
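The storage layout Mark describes (fixed-size integers that point into a heap of strings) can be illustrated with a small standalone sketch. This is not MonetDB's actual BAT layout or API: `toy_strcol`, `toy_append`, `toy_get` and `toy_free` are hypothetical names, and the 32-bit offset width is an assumption made only for the illustration.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy string column: fixed-width offsets pointing into one shared byte heap.
 * Hypothetical names; this is NOT MonetDB's actual BAT layout or API. */
typedef struct {
    uint32_t *offsets;  /* one fixed-size entry per row */
    size_t    count;    /* rows in use */
    size_t    cap;      /* rows allocated */
    char     *heap;     /* concatenated NUL-terminated strings */
    size_t    heap_len; /* heap bytes in use */
    size_t    heap_cap; /* heap bytes allocated */
} toy_strcol;

/* Append a copy of s; returns 0 on success, -1 on allocation failure. */
static int toy_append(toy_strcol *c, const char *s)
{
    size_t len = strlen(s) + 1;            /* include the terminator */

    if (c->count == c->cap) {
        size_t ncap = c->cap ? c->cap * 2 : 8;
        uint32_t *p = realloc(c->offsets, ncap * sizeof *p);
        if (p == NULL)
            return -1;
        c->offsets = p;
        c->cap = ncap;
    }
    if (c->heap_len + len > c->heap_cap) {
        size_t ncap = c->heap_cap ? c->heap_cap * 2 : 64;
        while (ncap < c->heap_len + len)
            ncap *= 2;
        char *h = realloc(c->heap, ncap);
        if (h == NULL)
            return -1;
        c->heap = h;
        c->heap_cap = ncap;
    }
    memcpy(c->heap + c->heap_len, s, len); /* the column keeps its own copy */
    c->offsets[c->count++] = (uint32_t) c->heap_len;
    c->heap_len += len;
    return 0;
}

/* Row lookup: an offset only makes sense relative to the heap base. */
static const char *toy_get(const toy_strcol *c, size_t row)
{
    return c->heap + c->offsets[row];
}

static void toy_free(toy_strcol *c)
{
    free(c->offsets);
    free(c->heap);
    memset(c, 0, sizeof *c);
}
```

In a layout like this, writing a `char *` into a row slot would store an address where an offset is expected, so a later lookup would read from an unrelated memory location. That matches the failure mode described above, which is why values go in through an append call rather than through direct slot assignment.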
----- Original Message -----
From: "Mark Raasveldt" <m.raasveldt@cwi.nl>
To: "users-list" <users-list@monetdb.org>
Sent: Monday, January 2, 2017 4:32:32 PM
Subject: Re: C UDF
Hey Imad,
One of the nice things about UTF-8 is that normal ASCII characters are valid UTF-8. Hence “normal strings” in C are already valid UTF-8. Try simply returning the output from sprintf, like this:
str
UDFyearbracket(str *ret, const date *v)
{
    if (*v == date_nil) {
        *ret = GDKstrdup(str_nil);
    } else {
        int year;
        char *buf;
        fromdate(*v, NULL, NULL, &year);
        buf = (char *) GDKmalloc(15);
        sprintf(buf, "%d", year);
        *ret = buf;
    }
    return MAL_SUCCEED;
}
Regards,
Mark
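The point that plain ASCII output is already valid UTF-8 can be checked with a short standalone sketch. It applies the same high-bit test, `(v[i] & 0x80) == 0`, that appears in the assertion message at the start of the thread; what MonetDB does around that check is not shown here, and `is_ascii` and `format_year` are hypothetical helper names.

```c
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if every byte of s has its high bit clear (pure ASCII). */
static int is_ascii(const char *s)
{
    for (const unsigned char *p = (const unsigned char *) s; *p != '\0'; p++)
        if (*p & 0x80)
            return 0;
    return 1;
}

/* Format a year into buf; returns 1 if it fit and is pure ASCII. */
static int format_year(char *buf, size_t size, int year)
{
    int n = snprintf(buf, size, "%d", year);
    return n >= 0 && (size_t) n < size && is_ascii(buf);
}
```

Because "%d" only ever produces digits and an optional minus sign, `format_year` should succeed for any year that fits the buffer, with no encoding conversion involved.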
On 29 Dec 2016, at 14:35, imad hajj chahine imad.hajj.chahine@gmail.com wrote:
Hi Sjoerd,
I tried to use iconv with no luck; I always get an empty string. I assumed the encoding that I am getting from sprintf is "ISO-8859-1". Can you please take a look at the following implementation:

str
UDFyearbracket(str *ret, const date *v)
{
    if (*v == date_nil) {
        *ret = GDKstrdup(str_nil);
    } else {
        iconv_t cv = iconv_open("UTF-8", "ISO-8859-1");
        int factor = 4;
        size_t fromlen, tolen;
        int year;
        char *buf;
        char *retChar = (char *)*ret;
        fromdate(*v, NULL, NULL, &year);
        buf = (char *) GDKmalloc(15);
        sprintf(buf, "%d", year);
        fromlen = strlen(buf);
        tolen = factor * fromlen + 1;
        retChar = (char *) GDKmalloc(tolen);
        iconv(cv, &buf, &fromlen, &retChar, &tolen);
        iconv_close(cv);
    }
    return MAL_SUCCEED;
}
Thanks
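For readers who do need a real conversion, here is a minimal sketch of calling POSIX iconv outside MonetDB, using plain malloc instead of the GDK allocator. One detail that may explain the empty result above: iconv() advances the input and output pointers it is given, so the start of the output buffer has to be kept in a separate variable, and the result still has to be stored through the caller's return pointer. `latin1_to_utf8` is a hypothetical helper name.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Convert a NUL-terminated ISO-8859-1 string to a newly malloc'ed UTF-8
 * string. Returns NULL on failure; the caller frees the result. */
static char *latin1_to_utf8(const char *in)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t) -1)
        return NULL;

    size_t inleft = strlen(in);
    size_t outsize = 2 * inleft + 1;   /* a Latin-1 byte needs at most 2 UTF-8 bytes */
    char *out = malloc(outsize);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    /* iconv() moves these cursors forward, so `out` stays as the start pointer. */
    char *inp = (char *) in;
    char *outp = out;
    size_t outleft = outsize - 1;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t) -1) {
        free(out);
        return NULL;
    }
    *outp = '\0';                      /* iconv does not terminate the output */
    return out;
}
```

For digits-only strings such as a formatted year, the converted output is byte-for-byte identical to the input, so the conversion step is not needed in this UDF.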
Thank you Mark,
Actually, I managed to solve this issue late on Friday night. Even when using Tloc with integer values, I was getting random errors in the log and the database was shut down. Do I need to set the tnonil and tnil flags when using BATloop/bat_iterator/BUNappend?

Find below the complete implementation:

str
UDFyearbracket(str *ret, const date *v)
{
    if (*v == date_nil) {
        *ret = GDKstrdup(str_nil);
    } else {
        int y = 0;
        fromdate(*v, NULL, NULL, &y);
        *ret = (str)GDKmalloc(snprintf(NULL, 0, "%d", y) + 1);
        if (*ret == NULL)
            throw(MAL, "UDF.yearbracket", "memory allocation failure");
        sprintf(*ret, "%d", y);
    }
    return MAL_SUCCEED;
}

str
UDFBATyearbracket(bat *ret, const bat *bid)
{
    BAT *b, *bn;
    BATiter bi;
    BUN i,n;

    if ((b = BATdescriptor(*bid)) == NULL)
        throw(MAL, "UDF.BATyearbracket", "Cannot access descriptor");
    n = BATcount(b);

    bn = COLnew(b->hseqbase, TYPE_str, BATcount(b), TRANSIENT);
    if (bn == NULL) {
        BBPunfix(b->batCacheid);
        throw(MAL, "UDF.BATyearbracket", "memory allocation failure");
    }

    bi = bat_iterator(b);
    BATloop(b, i, n) {
        char *y = NULL;
        const date *t = (const date *) BUNtail(bi, i);
        if (*t == date_nil) {
            y = GDKstrdup(str_nil);
        } else
            UDFyearbracket(&y, t);
        if (BUNappend(bn, y, 0) != GDK_SUCCEED) {
            goto bailout;
        }
        GDKfree(y);
    }

    BBPkeepref(*ret = bn->batCacheid);
    BBPunfix(b->batCacheid);
    return MAL_SUCCEED;
bailout:
    BBPunfix(b->batCacheid);
    BBPunfix(bn->batCacheid);
    throw(MAL, "UDF.BATyearbracket", MAL_MALLOC_FAIL);
}
Thank You.
Hey Imad,
You don’t need to set the properties manually when using BUNappend; it will do it for you. Your implementation looks correct, although from a performance perspective I would offer you two tips:
- There is no need to allocate/free on every iteration; you can simply create a reasonably sized buffer once and use it for every iteration. A year can never be more than about 10 characters anyway.
- In the same vein, using snprintf to determine the length of the string on every iteration is a bit overkill.
Mark
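The two tips above can be sketched in plain C, independent of MonetDB. The `emit_fn` callback is a hypothetical stand-in for "append this string to the output column", and the 16-byte buffer assumes a 32-bit `int`, whose longest "%d" rendering is 11 characters. A single bounded snprintf call both formats the value and reports truncation, so no separate length-probing call is needed.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define YEAR_BUF_SIZE 16  /* assumed large enough for any 32-bit int plus NUL */

/* Hypothetical callback standing in for "append this string to the column". */
typedef int (*emit_fn)(const char *s, void *ctx);

/* Format each year into one stack buffer and hand it to emit().
 * Returns 0 on success, -1 if formatting or emit() fails. */
static int format_years(const int *years, size_t n, emit_fn emit, void *ctx)
{
    char buf[YEAR_BUF_SIZE];               /* set up once, reused for every row */

    for (size_t i = 0; i < n; i++) {
        int len = snprintf(buf, sizeof buf, "%d", years[i]);
        if (len < 0 || (size_t) len >= sizeof buf)
            return -1;                     /* output would have been truncated */
        if (emit(buf, ctx) != 0)           /* emit must copy: buf is reused next row */
            return -1;
    }
    return 0;
}

/* Example emit(): adds the length of each string to a running total. */
static int sum_lengths(const char *s, void *ctx)
{
    *(size_t *) ctx += strlen(s);
    return 0;
}
```

This only works when the consumer copies the string before the next iteration, which is the behavior the thread attributes to BUNappend.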
Thanks Mark,
When I use sprintf(*ret, "%d", y) on a pre-allocated buffer of 15 chars and write only 4 chars, will the unused characters be \0? And will this not cause any problem, since BUNappend takes a copy of the buffer and stops at the first \0?
So the implementation will be:
str
UDFyearbracket(str *ret, const date *v)
{
    if (*v == date_nil) {
        *ret = GDKstrdup(str_nil);
    } else {
        int y = 0;
        fromdate(*v, NULL, NULL, &y);
        sprintf(*ret, "%d", y);
    }
    return MAL_SUCCEED;
}

str
UDFBATyearbracket(bat *ret, const bat *bid)
{
    BAT *b, *bn;
    BATiter bi;
    BUN i,n;
    char *y;

    if ((b = BATdescriptor(*bid)) == NULL)
        throw(MAL, "UDF.BATyearbracket", "Cannot access descriptor");
    n = BATcount(b);

    bn = COLnew(b->hseqbase, TYPE_str, BATcount(b), TRANSIENT);
    if (bn == NULL) {
        BBPunfix(b->batCacheid);
        throw(MAL, "UDF.BATyearbracket", "memory allocation failure");
    }

    bi = bat_iterator(b);
    /* longest possible string: "-5867411-01-01" i.e. 14 chars without NUL
     * (see definition of YEAR_MIN/YEAR_MAX above) */
    y = (char *)GDKmalloc(15);
    BATloop(b, i, n) {
        const date *t = (const date *) BUNtail(bi, i);
        if (*t == date_nil) {
            y = GDKstrdup(str_nil);
        } else
            UDFyearbracket(&y, t);
        if (BUNappend(bn, y, 0) != GDK_SUCCEED) {
            goto bailout;
        }
    }
    GDKfree(y);

    BBPkeepref(*ret = bn->batCacheid);
    BBPunfix(b->batCacheid);
    return MAL_SUCCEED;
bailout:
    BBPunfix(b->batCacheid);
    BBPunfix(bn->batCacheid);
    throw(MAL, "UDF.BATyearbracket", MAL_MALLOC_FAIL);
}
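On the question of the unused bytes in the buffer, a small standalone sketch may help. It assumes nothing about how the GDK allocator initializes memory: the buffer is deliberately filled with 'X' first, standing in for whatever an allocator might hand back. `format_into_dirty_buffer` is a hypothetical helper name.

```c
#include <stdio.h>
#include <string.h>

/* Fill buf with 'X', then format year into it; returns strlen of the result.
 * The 'X' fill stands in for arbitrary leftover bytes in a fresh allocation. */
static size_t format_into_dirty_buffer(char *buf, size_t size, int year)
{
    memset(buf, 'X', size);
    snprintf(buf, size, "%d", year);   /* writes the digits plus a single '\0' */
    return strlen(buf);                /* stops at that first '\0' */
}
```

The bytes after the terminator are not necessarily \0, but that does not matter: every function that treats the buffer as a C string stops at the terminator that sprintf or snprintf wrote.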
Hey Imad,
Yes, that is the way strings in C generally work (see: https://en.wikipedia.org/wiki/Null-terminated_string). Your implementation looks good, except you overwrite your buffer variable (y) when you encounter date_nil. Consider something like this for your main loop:

    bi = bat_iterator(b);
    y = (char *)GDKmalloc(15);
    BATloop(b, i, n) {
        const date *t = (const date *) BUNtail(bi, i);
        char* res = str_nil;
        if (*t != date_nil) {
            UDFyearbracket(&y, t);
            res = y;
        }
        if (BUNappend(bn, res, FALSE) != GDK_SUCCEED) {
            goto bailout;
        }
    }
    GDKfree(y);
Mark
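The pattern in Mark's loop (one owned buffer plus a separate per-row result pointer) can be tried outside MonetDB with the sketch below. `TOY_NIL` and `TOY_NO_YEAR` are hypothetical stand-ins for str_nil and date_nil, and `copy_string` plays the role of the copy that the append call is said to make.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char TOY_NIL[] = "nil";   /* hypothetical stand-in for str_nil */
#define TOY_NO_YEAR (-999999)          /* hypothetical stand-in for date_nil */

/* Returns a malloc'ed copy of s, or NULL on allocation failure. */
static char *copy_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *p = malloc(len);
    if (p != NULL)
        memcpy(p, s, len);
    return p;
}

/* Fills out[0..n-1] with newly allocated strings; returns 0 or -1. */
static int years_to_strings(const int *years, size_t n, char **out)
{
    char *buf = malloc(16);            /* owned here, freed exactly once */
    if (buf == NULL)
        return -1;

    for (size_t i = 0; i < n; i++) {
        const char *res = TOY_NIL;     /* default: the shared nil sentinel */
        if (years[i] != TOY_NO_YEAR) {
            snprintf(buf, 16, "%d", years[i]);
            res = buf;                 /* buf itself is never reassigned */
        }
        out[i] = copy_string(res);
        if (out[i] == NULL) {
            while (i > 0)              /* undo the rows already copied */
                free(out[--i]);
            free(buf);
            return -1;
        }
    }
    free(buf);
    return 0;
}
```

Because `buf` is never reassigned, the single free at the end releases exactly the allocation made at the start. Reassigning the buffer variable on a nil row, as in the earlier version, would lose track of the original allocation.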
On 02 Jan 2017, at 18:20, imad hajj chahine imad.hajj.chahine@gmail.com wrote:
Thanks Mark,
When I use sprintf(*ret, "%d", y) on a pre-allocated buffer of 15 chars and write only 4 chars the unused characters will be \0 and this will not cause any problem as BUNappend will take a copy of the buffer and stop at the first \0?
So the implementation will be:
str UDFyearbracket(str *ret, const date *v) { if (*v == date_nil) { *ret = GDKstrdup(str_nil); } else { int y = 0; fromdate(*v, NULL, NULL, &y); sprintf(*ret, "%d", y); } return MAL_SUCCEED; }
str UDFBATyearbracket(bat *ret, const bat *bid) { BAT *b, *bn; BATiter bi; BUN i,n; char *y;
if ((b = BATdescriptor(*bid)) == NULL) throw(MAL, "UDF.BATyearbracket", "Cannot access descriptor"); n = BATcount(b);
bn = COLnew(b->hseqbase, TYPE_str, BATcount(b), TRANSIENT); if (bn == NULL) { BBPunfix(b->batCacheid); throw(MAL, "UDF.BATyearbracket", "memory allocation failure"); }
bi = bat_iterator(b); y = (char *)GDKmalloc(15); /* longest possible string: "-5867411-01-01" i.e. 14 chars without NUL (see definition of YEAR_MIN/YEAR_MAX above) */ BATloop(b, i, n) { const date *t = (const date *) BUNtail(bi, i); if (*t == date_nil) { y = GDKstrdup(str_nil); } else UDFyearbracket(&y, t); if (BUNappend(bn, y, 0) != GDK_SUCCEED) { goto bailout; } } GDKfree(y);
BBPkeepref(*ret = bn->batCacheid); BBPunfix(b->batCacheid); return MAL_SUCCEED;
bailout: BBPunfix(b->batCacheid); BBPunfix(bn->batCacheid); throw(MAL, "UDF.BATyearbracket", MAL_MALLOC_FAIL); }
On Mon, Jan 2, 2017 at 6:54 PM, Mark Raasveldt <m.raasveldt@cwi.nl mailto:m.raasveldt@cwi.nl> wrote: Hey Imad,
You don’t need to set the properties manually when using BUNappend; it will do it for you. Your implementation looks correct, although from a performance perspective I would offer you two tips:
- There is no need to allocate/free on every iteration, you can simply create a reasonably sized buffer once and use it for every iteration. A year can never be more than 10~ characters anyway.
- In the same vein, using snprintf to determine the length of the string on every iteration is a bit overkill.
Mark
On 02 Jan 2017, at 17:31, imad hajj chahine <imad.hajj.chahine@gmail.com mailto:imad.hajj.chahine@gmail.com> wrote:
Thank you Mark,
Actually I managed to solve this issue on late Friday night, even when using TLoc with integer values i was having random errors in the log and the db was shutdown. Do I need to set tonil and tnil flags when using BATloop/bat_iterator/BUNappend?
Find bellow the complete implementation:
str UDFyearbracket(str *ret, const date *v) { if (*v == date_nil) { *ret = GDKstrdup(str_nil); } else { int y = 0; fromdate(*v, NULL, NULL, &y); *ret = (str)GDKmalloc(snprintf(NULL, 0, "%d", y) + 1); if (*ret == NULL) throw(MAL, "UDF.yearbracket", "memory allocation failure"); sprintf(*ret, "%d", y); } return MAL_SUCCEED; }
str UDFBATyearbracket(bat *ret, const bat *bid) { BAT *b, *bn; BATiter bi; BUN i,n;
if ((b = BATdescriptor(*bid)) == NULL) throw(MAL, "UDF.BATyearbracket", "Cannot access descriptor"); n = BATcount(b);
bn = COLnew(b->hseqbase, TYPE_str, BATcount(b), TRANSIENT); if (bn == NULL) { BBPunfix(b->batCacheid); throw(MAL, "UDF.BATyearbracket", "memory allocation failure"); }
bi = bat_iterator(b); BATloop(b, i, n) { char *y = NULL; const date *t = (const date *) BUNtail(bi, i); if (*t == date_nil) { y = GDKstrdup(str_nil); } else UDFyearbracket(&y, t); if (BUNappend(bn, y, 0) != GDK_SUCCEED) { goto bailout; } GDKfree(y); }
BBPkeepref(*ret = bn->batCacheid); BBPunfix(b->batCacheid); return MAL_SUCCEED;
bailout: BBPunfix(b->batCacheid); BBPunfix(bn->batCacheid); throw(MAL, "UDF.BATyearbracket", MAL_MALLOC_FAIL); }
Thank You.
On Mon, Jan 2, 2017 at 5:58 PM, Mark Raasveldt <m.raasveldt@cwi.nl mailto:m.raasveldt@cwi.nl> wrote: Hey Imad,
Apologies, scrolling back I noticed that was actually your first attempt at writing the UDF. The source of your error is not encoding related, the error is misleading.
The problem is that in your bulk version you are using Tloc(bn, i) to assign to a string column. Tloc should only be used with constant-sized columns, such as integers or dates. For variable-sized columns such as strings, you should use BUNappend to add values to the column. The reason for that is that string columns are not stored as an array of character pointers, which your initial implementation assumes. Instead, string columns use integers to point into a heap of strings. You are assigning a pointer to one of these integers, which makes MonetDB think the strings are in some random part of your memory. There’s a high chance that that random part of memory does not contain a valid UTF-8 string, hence you get the encoding error.
Try the following bulk implementation instead, using BUNappend instead of Tloc to assign to your column.
Regards,
Mark
str UDFBATyearbracket(bat *ret, const bat *bid) { BAT *b, *bn; BUN i,n; const date *t;
if ((b = BATdescriptor(*bid)) == NULL) throw(MAL, "UDF.BATyearbracket", "Cannot access descriptor"); n = BATcount(b); bn = COLnew(b->hseqbase, TYPE_str, BATcount(b), TRANSIENT); if (bn == NULL) { BBPunfix(b->batCacheid); throw(MAL, "UDF.BATyearbracket", "memory allocation failure"); } bn->tnonil = 1; bn->tnil = 0; t = (const date *) Tloc(b, 0); for (i = 0; i < n; i++) { if (*t == date_nil) { BUNappend(bn, str_nil, FALSE); bn->tnonil = 0; bn->tnil = 1; } else { char* ret; UDFyearbracket(&ret, t); BUNappend(bn, ret, FALSE); } t++; } BATsetcount(bn, n); bn->tsorted = BATcount(bn)<2; bn->trevsorted = BATcount(bn)<2; BBPkeepref(*ret = bn->batCacheid); BBPunfix(b->batCacheid); return MAL_SUCCEED;
}
----- Original Message ----- From: "Mark Raasveldt" <m.raasveldt@cwi.nl mailto:m.raasveldt@cwi.nl> To: "users-list" <users-list@monetdb.org mailto:users-list@monetdb.org> Sent: Monday, January 2, 2017 4:32:32 PM Subject: Re: C UDF
Hey Imad,
One of the nice things about UTF-8 is that normal ASCII characters are valid UTF-8. Hence “normal strings” in C are already valid UTF-8. Try simply returning the output from sprintf, like this:
str UDFyearbracket(str *ret, const date *v)
{
    if (*v == date_nil) {
        *ret = GDKstrdup(str_nil);
    } else {
        int year;
        char *buf;
        fromdate(*v, NULL, NULL, &year);
        buf = (char *) GDKmalloc(15);
        sprintf(buf, "%d", year);
        *ret = buf;
    }
    return MAL_SUCCEED;
}
Regards,
Mark
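The point that plain ASCII output is already valid UTF-8 can be checked with a few lines of standalone C. The sketch below is not MonetDB code: is_ascii and format_year are hypothetical helpers, and the high-bit test simply mirrors the expression quoted in the assertion message from the original post, without claiming to reproduce what strPut does internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if every byte of s has its high bit clear, i.e. s is plain
 * 7-bit ASCII and therefore also valid UTF-8; returns 0 otherwise. */
static int is_ascii(const char *s)
{
    const unsigned char *p = (const unsigned char *) s;

    for (; *p != '\0'; p++)
        if (*p & 0x80)
            return 0;
    return 1;
}

/* Hypothetical helper: format a year as decimal digits into buf.
 * Returns 1 if the text fit into len bytes, 0 if it was truncated. */
static int format_year(char *buf, size_t len, int year)
{
    int n;

    assert(buf != NULL && len > 0);
    n = snprintf(buf, len, "%d", year);
    return n >= 0 && (size_t) n < len;
}
```

A decimal number printed with "%d" consists only of digits and possibly a minus sign, so is_ascii holds for anything format_year produces, and no iconv conversion is needed for such strings.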
On 29 Dec 2016, at 14:35, imad hajj chahine <imad.hajj.chahine@gmail.com> wrote:
Hi Sjoerd,
I tried to use iconv with no luck; I always get an empty string. I assumed that the strings I get from sprintf are encoded in "ISO-8859-1". Can you please take a look at the following implementation:
str UDFyearbracket(str *ret, const date *v)
{
    if (*v == date_nil) {
        *ret = GDKstrdup(str_nil);
    } else {
        iconv_t cv = iconv_open("UTF-8", "ISO-8859-1");
        int factor = 4;
        size_t fromlen, tolen;

        int year;
        char *buf;
        char *retChar = (char *)*ret;
        fromdate(*v, NULL, NULL, &year);
        buf = (char *) GDKmalloc(15);
        sprintf(buf, "%d", year);
        fromlen = strlen(buf);
        tolen = factor * fromlen + 1;
        retChar = (char *) GDKmalloc(tolen);
        iconv(cv, &buf, &fromlen, &retChar, &tolen);
        iconv_close(cv);
    }
    return MAL_SUCCEED;
}
Thanks
users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
Mark,
Does the same apply to integer values? That is, is it better to declare the int once and reuse it in every iteration? Do I have a performance issue with the following code:
str UDFyearlag(int *ret, const date *v1, const date *v2)
{
    if (*v1 == date_nil || *v2 == date_nil) {
        *ret = int_nil;
    } else {
        int y1 = 0, y2 = 0;
        fromdate(*v1, NULL, NULL, &y1);
        fromdate(*v2, NULL, NULL, &y2);
        *ret = y2 - y1;
    }
    return MAL_SUCCEED;
}

str UDFBATyearlag(bat *ret, const bat *bid1, const bat *bid2)
{
    BAT *b1, *b2, *bn;
    BATiter bi1, bi2;
    BUN i, n;

    b1 = BATdescriptor(*bid1);
    b2 = BATdescriptor(*bid2);
    if (b1 == NULL || b2 == NULL) {
        if (b1)
            BBPunfix(b1->batCacheid);
        if (b2)
            BBPunfix(b2->batCacheid);
        throw(MAL, "UDF.BATyearlag", "Cannot access descriptor");
    }
    n = BATcount(b1);

    bn = COLnew(b1->hseqbase, TYPE_int, BATcount(b1), TRANSIENT);
    if (bn == NULL) {
        BBPunfix(b1->batCacheid);
        BBPunfix(b2->batCacheid);
        throw(MAL, "UDF.BATyearlag", "memory allocation failure");
    }

    bi1 = bat_iterator(b1);
    bi2 = bat_iterator(b2);
    BATloop(b1, i, n) {
        int y;
        const date *t1 = (const date *) BUNtail(bi1, i);
        const date *t2 = (const date *) BUNtail(bi2, i);
        if (*t1 == date_nil || *t2 == date_nil) {
            y = int_nil;
        } else
            UDFyearlag(&y, t1, t2);
        if (BUNappend(bn, &y, 0) != GDK_SUCCEED) {
            goto bailout;
        }
    }

    BBPkeepref(*ret = bn->batCacheid);
    BBPunfix(b1->batCacheid);
    BBPunfix(b2->batCacheid);
    return MAL_SUCCEED;
bailout:
    BBPunfix(b1->batCacheid);
    BBPunfix(b2->batCacheid);
    BBPunfix(bn->batCacheid);
    throw(MAL, "UDF.BATyearlag", MAL_MALLOC_FAIL);
}
Hey Imad,
No, it does not apply to integers because you are not doing any heap allocation.
Mark
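The distinction drawn here, heap allocation for string results versus none for integer results, can be sketched in standalone C. The helper names below are hypothetical and plain malloc stands in for GDKmalloc; this is an illustration of the cost difference under those assumptions, not code taken from MonetDB.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Variant A: one heap allocation per row. The caller owns the result and
 * must free it, so a loop over n rows pays for n malloc/free pairs. */
static char *year_to_new_string(int year)
{
    char *s = malloc(15);

    if (s != NULL)
        snprintf(s, 15, "%d", year);
    return s;
}

/* Variant B: the caller supplies one buffer and reuses it for every row,
 * so the loop itself performs no heap allocation. */
static void year_into_buffer(char *buf, size_t len, int year)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "%d", year);
}

/* Integer results need neither: the value lives in a local variable,
 * so declaring it inside or outside the loop makes no difference here. */
static int year_lag(int y1, int y2)
{
    return y2 - y1;
}
```

A bulk loop that formats strings would typically prefer variant B, declaring the buffer once before the loop, while the `int y` inside the BATloop above costs nothing extra per iteration.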
On 02 Jan 2017, at 19:03, imad hajj chahine imad.hajj.chahine@gmail.com wrote:
Thank you Mark,
I should refresh my memory of C programming; I never thought I would code in C again.
On Tue, Jan 3, 2017 at 3:48 PM, Mark Raasveldt m.raasveldt@cwi.nl wrote:
Hey Imad,
No, it does not apply to integers because you are not doing any heap allocation.
Mark
Hi Mark,
Is it possible to return a table or multiple columns from a C function? Is there an example or an existing function in the code that I can look at?
Also, is it possible to use a C function to expand table rows instead of joining to a calendar table? What I am trying to do is expand a row and return one entry for each month bracket between startdate and enddate.
Thanks
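To make the row expansion being asked about concrete, here is a minimal standalone sketch of the month arithmetic only. It does not use or answer anything about the MonetDB C API: the ym type and expand_months function are hypothetical, and the sketch assumes positive years and start and end dates already reduced to a year and a month.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical (year, month) pair; month is 1..12, year assumed positive. */
typedef struct { int year; int month; } ym;

/* Write every month bracket from start to end inclusive into out, storing
 * at most cap entries. Returns the number of brackets in the range, which
 * may exceed cap; returns 0 when end precedes start. */
static size_t expand_months(ym start, ym end, ym *out, size_t cap)
{
    long first = (long) start.year * 12 + (start.month - 1);
    long last  = (long) end.year * 12 + (end.month - 1);
    long m;
    size_t n = 0;

    assert(start.month >= 1 && start.month <= 12);
    assert(end.month >= 1 && end.month <= 12);
    for (m = first; m <= last; m++, n++) {
        if (n < cap) {
            out[n].year  = (int) (m / 12);
            out[n].month = (int) (m % 12) + 1;
        }
    }
    return n;
}
```

For a row with startdate in November 2016 and enddate in February 2017, this yields four brackets: 2016-11, 2016-12, 2017-01 and 2017-02.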
On Tue, Jan 3, 2017 at 4:45 PM, imad hajj chahine <imad.hajj.chahine@gmail.com> wrote:
Thank you Mark,
I should refresh my memory of C programming; I never thought I would code in C again.
The error indicates that your string is not UTF-8 encoded.
On 12/28/2016 06:48 PM, imad hajj chahine wrote:
Hi,
After reviewing all the other alternatives like SQL and Python UDF, I was either stuck on performance with SQL UDF or on usability with Python UDF (unable to use with aggregation, and not such great performance with dates),
so I decided to go the hard way with C functions. As a bonus, it will give me the possibility to change the functionality without worrying about dependencies, which was not the case in other languages.
The purpose is to create a set of formatting functions for Year, Quarter, Month, Week and Day brackets, and of course I need to create the bulk version of each function for performance.
Starting from MTIMEdate_extract_year_bulk, I now have the simple function working and can successfully call it from mclient:

str
UDFyearbracket(str *ret, const date *v)
{
    if (*v == date_nil) {
        *ret = GDKstrdup(str_nil);
    } else {
        int year;
        fromdate(*v, NULL, NULL, &year);
        *ret = (str) GDKmalloc(15);
        sprintf(*ret, "%d", year);
    }
    return MAL_SUCCEED;
}

For the bulk version I get an error in the log: gdk_atoms.c:1345: strPut: Assertion `(v[i] & 0x80) == 0' failed.

str
UDFBATyearbracket(bat *ret, const bat *bid)
{
    BAT *b, *bn;
    BUN i,n;
    str *y;
    const date *t;

    if ((b = BATdescriptor(*bid)) == NULL)
        throw(MAL, "UDF.BATyearbracket", "Cannot access descriptor");
    n = BATcount(b);

    bn = COLnew(b->hseqbase, TYPE_str, BATcount(b), TRANSIENT);
    if (bn == NULL) {
        BBPunfix(b->batCacheid);
        throw(MAL, "UDF.BATyearbracket", "memory allocation failure");
    }
    bn->tnonil = 1;
    bn->tnil = 0;

    t = (const date *) Tloc(b, 0);
    y = (str *) Tloc(bn, 0);
    for (i = 0; i < n; i++) {
        if (*t == date_nil) {
            *y = GDKstrdup(str_nil);
        } else
            UDFyearbracket(y, t);
        if (strcmp(*y, str_nil) == 0) {
            bn->tnonil = 0;
            bn->tnil = 1;
        }
        y++;
        t++;
    }

    BATsetcount(bn, (BUN) (y - (str *) Tloc(bn, 0)));

    bn->tsorted = BATcount(bn)<2;
    bn->trevsorted = BATcount(bn)<2;

    BBPkeepref(*ret = bn->batCacheid);
    BBPunfix(b->batCacheid);
    return MAL_SUCCEED;
}
PS: I am not a C expert, but I can find my way with basic operations and pointers.
Any help or suggestions are appreciated.
Thank you.
participants (4)
- Alberto Ferrari
- imad hajj chahine
- Mark Raasveldt
- Sjoerd Mullender