Hi,

I'm seeing a suspicious failure when the input contains exactly 200K tuples.

I declared a simple aggregate on two columns (for simplicity it doesn't actually aggregate here; it just returns 42).


START TRANSACTION;

CREATE TABLE test (customer int, d string, n int);
INSERT INTO test VALUES(1, '2015-01-01', 100);
INSERT INTO test VALUES(1, '2015-01-02', 100);
INSERT INTO test VALUES(2, '2015-01-03', 100);
INSERT INTO test VALUES(2, '2015-01-01', 100);
INSERT INTO test VALUES(2, '2015-01-02', 100);


CREATE AGGREGATE sow(d string, n int) RETURNS DOUBLE LANGUAGE R {

  # aggregation function is constant, not important here
  sow_aggr <- function(df) {
   42.0
  }

  df <- cbind(d,n)
  as.vector(by(df, aggr_group, sow_aggr))
};

SELECT customer, sow(d,n) FROM test GROUP BY customer;

ROLLBACK;
+----------+--------------------------+
| customer | L1                       |
+==========+==========================+
|        1 |                       42 |
|        2 |                       42 |
+----------+--------------------------+


The result is what I expected, and it stays correct as long as table test has up to 199999 tuples. When it has exactly 200000 tuples, I get:

Error running R expression. Error message: Error in tapply(seq_len(200000L), list(INDICES = c(0, 1, 2, 3, 4, 5, 6,  : 
  arguments must have same length
Calls: as.data.frame ... by.data.frame -> structure -> eval -> eval -> tapply

I checked the vector aggr_group, and indeed its length is not 200000, as it should be. Instead, its length is just one more than the number of distinct values of customer (the grouping column).
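
The length mismatch itself can be reproduced in plain R, outside MonetDB. What follows is only a sketch with made-up sizes (6 rows, 3 group ids instead of 200000 rows); run_aggr is a hypothetical helper name, and the guard is just one possible way to make the problem visible. It does not explain or fix whatever hands the aggregate a short aggr_group.

```r
# Plain-R sketch of the length mismatch; no MonetDB involved.
# All sizes and values are made up for illustration.
d <- rep(c("2015-01-01", "2015-01-02", "2015-01-03"), times = 2)  # 6 rows
n <- rep(100L, 6)

# Same constant "aggregate" as in the report.
sow_aggr <- function(df) 42.0

# Hypothetical helper mirroring the body of the aggregate, plus a guard.
run_aggr <- function(d, n, aggr_group) {
  # by() needs exactly one group id per input row.
  if (length(aggr_group) != length(d)) {
    return(sprintf("aggr_group has %d entries, expected %d",
                   length(aggr_group), length(d)))
  }
  df <- cbind(d, n)
  as.vector(by(df, aggr_group, sow_aggr))
}

ok  <- run_aggr(d, n, c(0, 0, 1, 1, 1, 0))  # one group id per row
bad <- run_aggr(d, n, c(0, 1, 2))           # too short, as in the failing case

# Without the guard, by() fails inside tapply(), as in the error above.
raw_error <- tryCatch(
  as.vector(by(cbind(d, n), c(0, 1, 2), sow_aggr)),
  error = function(e) conditionMessage(e)
)

print(ok)
print(bad)
print(raw_error)
```

Putting a similar length check at the top of the aggregate body would at least report the two lengths instead of the less direct tapply error.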

Any thoughts?

Roberto