Re: R aggregation fails when input is exactly 200K tuples

1 Oct 2015

The previous example is a simplification of a real aggregate I'm working
on, hence the two columns.

I just tried with a single column, and it still fails at exactly 200000
input tuples.

This time, however, I get a SIGSEGV instead of an R error.

This is a reproducible example:

START TRANSACTION;

-- the (fake) aggregate function
CREATE AGGREGATE sow(n int) RETURNS DOUBLE LANGUAGE R {
  sow_aggr <- function(df) { 42.0 }

  aggregate(n, list(aggr_group), sow_aggr)$x
};

-- function to generate input data
CREATE FUNCTION tt() RETURNS TABLE (g int, n int) LANGUAGE R {
  g <- rep(1:500, rep(400,500))
  data.frame(g,as.integer(10))
};

CREATE TABLE good as select * from tt() limit 199999 with data;
CREATE TABLE bad as select * from tt() limit 200000 with data;

select count(distinct g) from good;
select count(distinct g) from bad;

select g, sow(n) from good group by g;
select g, sow(n) from bad group by g;

ROLLBACK;
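
For reference, the size arithmetic of tt() can be checked in plain R, outside MonetDB. This is only a sketch: the zero-based aggr_group built below is an assumption about what the aggregate should receive (one group id per input row), not what MonetDB actually passes in the failing case.

```r
# Same expression as in tt(): each of the 500 group ids repeated 400 times
g <- rep(1:500, rep(400, 500))
n <- rep(as.integer(10), length(g))

stopifnot(length(g) == 200000)        # 500 * 400 rows in total
stopifnot(length(unique(g)) == 500)

# "good" keeps the first 199999 rows, so only the last group loses a row
good_g <- head(g, 199999)
stopifnot(length(unique(good_g)) == 500)
stopifnot(sum(good_g == 500) == 399)

# Assumed stand-in for aggr_group: one zero-based group id per input row
aggr_group <- g - 1L

sow_aggr <- function(df) { 42.0 }
res <- aggregate(n, list(aggr_group), sow_aggr)$x
stopifnot(length(res) == 500, all(res == 42))
```

So "bad" is the complete output of tt() and "good" is the same data minus one row of the last group. With a group vector of the right length, the R side handles all 200000 rows without complaint.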

On 1 October 2015 at 10:06, Roberto Cornacchia ... wrote:
Hi,
I have a suspicious failure at exactly 200K input tuples.
I declared a simple aggregate on two columns (here it doesn't aggregate,
for simplicity, it just returns 42).
START TRANSACTION;
CREATE table test (customer int, d string, n int);
INSERT INTO test VALUES(1,'2015-01-01', 100);
INSERT INTO test VALUES(1, '2015-01-02', 100);
INSERT INTO test VALUES(2, '2015-01-03', 100);
INSERT INTO test VALUES(2, '2015-01-01', 100);
INSERT INTO test VALUES(2, '2015-01-02', 100);
CREATE AGGREGATE sow(d string, n int) RETURNS DOUBLE LANGUAGE R {
  # aggregation function is constant, not important here
  sow_aggr <- function(df) {
    42.0
  }
  df <- cbind(d,n)
  as.vector(by(df, aggr_group, sow_aggr))
};
select customer, sow(d,n) from test group by customer;
ROLLBACK;
+----------+--------------------------+
| customer | L1                       |
+==========+==========================+
|        1 |                       42 |
|        2 |                       42 |
+----------+--------------------------+
The result is what I expected. This holds as long as table test has at most
199999 tuples. When it has exactly 200000 tuples, I get:
Error running R expression. Error message: Error in
tapply(seq_len(200000L), list(INDICES = c(0, 1, 2, 3, 4, 5, 6,  :
  arguments must have same length
Calls: as.data.frame ... by.data.frame -> structure -> eval -> eval ->
tapply
I checked the vector aggr_group, and indeed it is not 200000 long, as it
should be. Instead, it is just one longer than the number of distinct
values for customer (the grouping column).
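
The mismatch can be reproduced in plain R with made-up data, which suggests the R error is only a symptom of the short aggr_group. A minimal sketch: the vectors d, n, ok_group and bad_group are hypothetical stand-ins for what MonetDB passes in, a data.frame is used instead of cbind for simplicity, and check_group is a hypothetical guard, not part of any MonetDB API.

```r
# Hypothetical stand-ins for the columns passed to the aggregate
d <- rep("2015-01-01", 6)
n <- rep(100L, 6)

sow_aggr <- function(df) { 42.0 }
df <- data.frame(d, n)

# Expected shape: one group id per input row
ok_group <- c(0, 0, 0, 1, 1, 1)
stopifnot(all(as.vector(by(df, ok_group, sow_aggr)) == 42))

# Shape seen at 200000 rows: far fewer entries than input rows
bad_group <- c(0, 1, 2)
msg <- tryCatch(
  { by(df, bad_group, sow_aggr); "no error" },
  error = function(e) conditionMessage(e)
)
print(msg)

# Hypothetical guard for the top of the aggregate body: fail early with
# an explicit message instead of deep inside tapply
check_group <- function(values, group) {
  if (length(group) != length(values)) {
    stop(sprintf("aggr_group has %d entries, input has %d rows",
                 length(group), length(values)))
  }
  invisible(TRUE)
}
```

Calling check_group(n, aggr_group) as the first statement of the aggregate would at least make the R-level failure report both lengths.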
Any thoughts?
Roberto