Hi,
I am seeing a suspicious failure when the input contains exactly 200000 tuples.
I declared a simple aggregate on two columns (for simplicity it does not actually aggregate here, it just returns 42).
START TRANSACTION;
CREATE TABLE test (customer int, d string, n int);
INSERT INTO test VALUES (1, '2015-01-01', 100);
INSERT INTO test VALUES (1, '2015-01-02', 100);
INSERT INTO test VALUES (2, '2015-01-03', 100);
INSERT INTO test VALUES (2, '2015-01-01', 100);
INSERT INTO test VALUES (2, '2015-01-02', 100);
CREATE AGGREGATE sow(d string, n int) RETURNS DOUBLE LANGUAGE R {
    # aggregation function is constant, not important here
    sow_aggr <- function(df) {
        42.0
    }
    df <- cbind(d, n)
    as.vector(by(df, aggr_group, sow_aggr))
};
SELECT customer, sow(d, n) FROM test GROUP BY customer;
ROLLBACK;
+----------+--------------------------+
| customer | L1                       |
+==========+==========================+
|        1 |                       42 |
|        2 |                       42 |
+----------+--------------------------+
This is the result I expected, and it stays correct as long as table test has up to 199999 tuples. With exactly 200000 tuples, I get:
Error running R expression. Error message: Error in tapply(seq_len(200000L), list(INDICES = c(0, 1, 2, 3, 4, 5, 6, :
arguments must have same length
Calls: as.data.frame ... by.data.frame -> structure -> eval -> eval -> tapply
I checked the vector aggr_group, and indeed it is not 200000 elements long, as it should be. Instead, its length is just one more than the number of distinct values of customer (the grouping column).
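For reference, here is a minimal sketch in plain R (no MonetDB involved) that produces the same "arguments must have same length" message whenever the grouping vector is shorter than the data. The toy sizes and the names good_group and bad_group are made up for illustration; they are not taken from the real table, and this only illustrates the symptom, not its cause inside MonetDB.

```r
# Plain-R sketch: by() fails when the grouping vector and the data
# have different lengths. All sizes here are toy values (assumption).
d <- rep("2015-01-01", 10)      # toy stand-in for column d
n <- rep(100L, 10)              # toy stand-in for column n
sow_aggr <- function(df) 42.0   # same constant aggregate as in the UDF

df <- cbind(d, n)

# Expected case: one group id per input row.
good_group <- rep(c(0, 1), each = 5)
ok <- as.vector(by(df, good_group, sow_aggr))

# Observed case (assumed shape): far fewer ids than input rows.
bad_group <- c(0, 1, 2)
msg <- tryCatch(
  as.vector(by(df, bad_group, sow_aggr)),
  error = function(e) conditionMessage(e)
)

print(ok)
print(msg)
```

A check such as stopifnot(length(aggr_group) == length(d)) at the top of the UDF body would probably surface the mismatch earlier and with a clearer message, assuming aggr_group is meant to carry one group id per input row.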