Problems executing custom aggregation function on large datasets

Martin Kersten Martin.Kersten at cwi.nl
Tue Jul 9 19:20:40 CEST 2013


Hi

Indeed, handling aggregates over partitioned tables is not easy
at the MAL layer and is currently handled by the mergetable optimizer.
This optimizer, ofcourse, has no knowledge of possible other user
defined aggregates. Changing this optimizer is far from easy.

regards, Martin

On 7/9/13 5:58 PM, Scott Mathieson wrote:
> Hi,
>
> I was having the same issue, in that the mitosis pipeline is splitting the data in order to parallelize the processing of the data into groups but then failing to pass the group to data OID map to my aggregation function. The only way I found to get round this was to either declare your aggregation function in the aggr module namespace or change the opt_mitosis.c file to check for your module name and return 0  if it finds it (similar to what it does for certain functions in aggr module which is why the first option also works). This does mean that the calculation of the data groups is only done in a single thread.
>
> I my still learning my way around the code so don't yet have a full understanding of how the optimizer wires up the grouped data to the inputs of the aggregation functions.
>
> Regards,
>
> Scott
>
>
> -----Original Message-----
> From: users-list [mailto:users-list-bounces+scott.mathieson=pb.com at monetdb.org] On Behalf Of Sebastian Dorok
> Sent: 04 July 2013 08:35
> To: users-list at monetdb.org
> Subject: Re: Problems executing custom aggregation function on large datasets
>
> Okay got it: executing generated MAL explain plan step by step is bad idea. Debugging it worked better but lead to the same error even with 'nthreads' property of my database set to '1' (also without debugging ;).
>
> But I think I found my problem: My aggregation function is a holistic one like MEDIAN means I can't compute the final result using intermediate results. Luckily I figured out that MITOSIS was enabled in the SQL optimizer pipeline (I don't know why). When disabling it using NO_MITOSIS_PIPE everything works fine because the computation is not splitted horizontally. See attached MCLIENT output for details.
>
> So questions that remain:
> - How do I have to announce/ implement my aggregation function so that it can be computed even if MITOSIS is enabled (like MEDIAN)?
> - And a more general one: How do I implement a non holistic aggregation function? When I look at the explain plan for my aggregation function I can't see that the optimizer tries to compute intermediate results. How to achieve that?
>
>
> On 03.07.2013 15:11, Lefteris wrote:
>> Hi, the error that you are getting has nothing to do in my opinion
>> with your function. There are other reasons. Try to run your queries
>> through SQL mclient, passing one MAL instruction by hand in general
>> will not work.
>>
>> I see in your MAL explain plan
>>
>>        X_16:bat[:any,:str]  := udf.subgenotype(X_10,X_14,r1_27,true);
>>
>> which is your function. X_10 here is the "base" column you want to
>> aggregate and X_14 is the result of subgroupby of the "pos" column.
>> Try to run monetdb with gdk_nr_threads=1 so you only have 1 thread and
>> avoid fragmentation of the base tables, so is simpler to see what is
>> going on.
>>
>>
>>
>> On Wed, Jul 3, 2013 at 2:48 PM, Sebastian Dorok
>> <sebastian.dorok at st.ovgu.de> wrote:
>>> Hi all,
>>>
>>> I'm developing a custom aggregation function. Everything works fine
>>> until I use it on large datasets (ca. 270 Mio. rows).
>>>
>>> Here is what my function signature looks like:
>>>
>>> str AGGSimpleGenotype(int *ret, int *bid, int *gid, int *eid, bit
>>> ignore_nils);
>>>
>>> I assume that the BATs identified by bid and gid have the same length.
>>> That's why my function checks this circumstance. On my test dataset
>>> (ca. 200
>>> rows) everything works fine. When I use my production dataset my
>>> aggregation function returns an error complaining that the BATs have
>>> not the same length and stops aggregating (by design)! How can this happen?
>>>
>>> The SQL query is:
>>> select grpgenotype(base) from test.bases group by pos; I attached the
>>> MAL plan generated for the large dataset. When I execute it step by
>>> step I get an error in line 39 saying:
>>>
>>> mal>X_106 := sql.delta(X_56,X_70,X_71,X_8);
>>> MAPI  = (monetdb) /tmp/.s.monetdb.50000 QUERY = X_106 :=
>>> sql.delta(X_56,X_70,X_71,X_8); ERROR =
>>> !TypeException:user.main[1]:'sql.delta' undefined in: X_106:any :=
>>> sql.delta(X_56:bat[:oid,:str], X_70:bat[:oid,:oid],
>>> X_71:bat[:oid,:str],
>>> X_8:bat[:oid,:any])
>>>
>>> This error occurs before my function is entered but I think it must
>>> be related with my problem. Is it an internal MonetDB problem?
>>>
>>> I'm using the Feb2013-SP2 release.
>>>
>>> Any ideas? Thx.
>>>
>>> _______________________________________________
>>> users-list mailing list
>>> users-list at monetdb.org
>>> http://mail.monetdb.org/mailman/listinfo/users-list
>>>
>> _______________________________________________
>> users-list mailing list
>> users-list at monetdb.org
>> http://mail.monetdb.org/mailman/listinfo/users-list
>>
>
> ________________________________
>
> _______________________________________________
> users-list mailing list
> users-list at monetdb.org
> http://mail.monetdb.org/mailman/listinfo/users-list
>



More information about the users-list mailing list