Bug 6936 - Query randomly failing with Object not found
Summary: Query randomly failing with Object not found
Status: RESOLVED NEEDINFO
Alias: None
Product: SQL
Classification: Unclassified
Component: all
Version: 11.37.7 (Jun2020)
Hardware: x86_64 (amd64/em64t) Linux
Importance: Normal major
Assignee: SQL devs
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-07-24 14:04 CEST by jpastuszek@protonmail.com
Modified: 2020-09-21 20:19 CEST
CC List: 3 users




Description jpastuszek@protonmail.com 2020-07-24 14:04:20 CEST
User-Agent:       Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0
Build Identifier: 

I am getting queries randomly aborted with error:

State: HY002, Native error: 0, Message: [unixODBC][MonetDB][ODBC Driver 11.37.7]Object not found

The server logs:

2020-07-22 14:54:56 ERR logs[6446]: #DFLOWworker2: createExceptionInternal: !ERROR: MALException:group.subgroup:HY002!Object not found
2020-07-22 14:56:11 ERR logs[6446]: #DFLOWworker3: createExceptionInternal: !ERROR: MALException:group.subgroup:HY002!Object not found
2020-07-22 14:56:17 ERR logs[6446]: #DFLOWworker1: createExceptionInternal: !ERROR: MALException:group.subgroup:HY002!Object not found

This is happening often but only for specific queries and in specific conditions.

The query needs to be of the form:

    SELECT
        sys.epoch(CAST('2020-07-15 14:12:00.000' AS TIMESTAMP)) + 0 AS chunk,
        s AS terms
    FROM d
    WHERE s = 'xyz'
    GROUP BY chunk, terms

Where table d is a merge table containing tables with 700k+ records in total, and column s is a STRING or JSON column (other column types won't trigger this). There has to be a TIMESTAMP-related mathematical operation (+ 0 is enough to trigger it). The GROUP BY has to have two columns (either one alone won't trigger it). The WHERE condition does not need to match any rows.
Also, one of the merged tables needs to have active INSERTs going on in the background with short-lived transactions (e.g. committed every second). The longer the transactions, the lower the chance of the query failing, but also the longer it takes for the query to stop failing when retried repeatedly. A rough sketch of this background load is shown below.
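
For illustration only, the background load has roughly this shape (a sketch, not the actual ingest code: d_part stands in for one of the tables merged into d, the inserted value is a placeholder, and the loop itself runs in an external client):

    -- hypothetical sketch of the background load; d_part is one of the
    -- tables merged into d, the value is a placeholder
    START TRANSACTION;
    INSERT INTO d_part (s) VALUES ('abc');
    COMMIT;
    -- the client sleeps ~1 second and repeats, so every insert is its
    -- own short-lived transaction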

The actual query that I simplified the above from was of this form:

    SELECT
        CAST((timestamp - CAST('2020-07-15 14:12:00.000' AS TIMESTAMP)) * 1000 / 5040000 AS BIGINT) AS chunk,
        logsource AS terms,
        CAST(count(*) AS DOUBLE) AS metric,
        count(*) AS count
    FROM logs.message
    WHERE processed_timestamp >= '2020-07-15 06:33:36.471' AND processed_timestamp <= '2020-07-22 22:33:36.471'
        AND timestamp >= '2020-07-15 14:33:36.471' AND timestamp <= '2020-07-22 14:33:36.471'
    GROUP BY chunk, terms

It is a core query for visualizing the data (a terms histogram) on a dashboard (Grafana), and I can't find a way to refactor it without triggering the failure at random. Also, by the design of the system, I need the data (logs) to be inserted in a semi-realtime manner in the background.
Retrying the query won't do either, as it can take many seconds before the query starts working again.

This is happening with Jun2020 and was not happening with Apr2019; it is a serious regression for my use case.

I tried to reproduce this issue without the original data (on a synthetic data set) but without luck so far.

Reproducible: Couldn't Reproduce
Comment 1 Ying Zhang cwiconfidential 2020-07-24 14:24:26 CEST
(In reply to jpastuszek@protonmail.com from comment #0)

> Where table d is a merge table containing tables with 700k+ records in
> total, and column s is a STRING or JSON column (other column types won't
> trigger this). There has to be a TIMESTAMP-related mathematical operation
> (+ 0 is enough to trigger it). The GROUP BY has to have two columns
> (either one alone won't trigger it). The WHERE condition does not need to
> match any rows.
> Also, one of the merged tables needs to have active INSERTs going on in
> the background with short-lived transactions (e.g. committed every
> second). The longer the transactions, the lower the chance of the query
> failing, but also the longer it takes for the query to stop failing when
> retried repeatedly.

This is quite an impressive analysis.



Can you give us the schema of your merge table and the ALTER TABLE ... ADD... statements?  And some sample data, if possible?
Comment 2 jpastuszek@protonmail.com 2020-07-24 15:10:59 CEST
The full schema is created like this:

CREATE MERGE TABLE bug.message (
    logentry_no BIGINT,
    logentry_id STRING,
    processed_timestamp TIMESTAMP,
    timestamp TIMESTAMP,
    logsource STRING,
    logsource_environment STRING,
    logsource_service STRING,
    logsource_location STRING,
    logsource_subsystem STRING,
    program STRING,
    program_type STRING,
    program_name STRING,
    program_log STRING,
    program_source STRING,
    program_thread STRING,
    log_level STRING,
    tags JSON,
    syslog_severity STRING,
    syslog_facility STRING,
    syslog_tag STRING,
    message STRING,
    structured_data JSON
)

CREATE TABLE bug.message_future (
    logentry_no BIGINT GENERATED ALWAYS AS
        IDENTITY (
           START WITH 2999010100000000000 INCREMENT BY 1
           MINVALUE 2999010100000000000 MAXVALUE 2999010199999999999
           CACHE 50 CYCLE
    ),
    logentry_id STRING,
    processed_timestamp TIMESTAMP,
    timestamp TIMESTAMP,
    logsource STRING,
    logsource_environment STRING,
    logsource_service STRING,
    logsource_location STRING,
    logsource_subsystem STRING,
    program STRING,
    program_type STRING,
    program_name STRING,
    program_log STRING,
    program_source STRING,
    program_thread STRING,
    log_level STRING,
    tags JSON,
    syslog_severity STRING,
    syslog_facility STRING,
    syslog_tag STRING,
    message STRING,
    structured_data JSON
)

ALTER TABLE bug.message ADD TABLE bug.message_future

CREATE TABLE bug.message_past (
    logentry_no BIGINT GENERATED ALWAYS AS
        IDENTITY (
           START WITH 1900010100000000000 INCREMENT BY 1
           MINVALUE 1900010100000000000 MAXVALUE 1900010199999999999
           CACHE 50 CYCLE
    ),
    logentry_id STRING,
    processed_timestamp TIMESTAMP,
    timestamp TIMESTAMP,
    logsource STRING,
    logsource_environment STRING,
    logsource_service STRING,
    logsource_location STRING,
    logsource_subsystem STRING,
    program STRING,
    program_type STRING,
    program_name STRING,
    program_log STRING,
    program_source STRING,
    program_thread STRING,
    log_level STRING,
    tags JSON,
    syslog_severity STRING,
    syslog_facility STRING,
    syslog_tag STRING,
    message STRING,
    structured_data JSON
)

ALTER TABLE bug.message ADD TABLE bug.message_past

CREATE TABLE bug.message_20200724 (
    logentry_no BIGINT GENERATED ALWAYS AS
        IDENTITY (
           START WITH 2020072400000000000 INCREMENT BY 1
           MINVALUE 2020072400000000000 MAXVALUE 2020072499999999999
           CACHE 50 CYCLE
    ),
    logentry_id STRING,
    processed_timestamp TIMESTAMP,
    timestamp TIMESTAMP,
    logsource STRING,
    logsource_environment STRING,
    logsource_service STRING,
    logsource_location STRING,
    logsource_subsystem STRING,
    program STRING,
    program_type STRING,
    program_name STRING,
    program_log STRING,
    program_source STRING,
    program_thread STRING,
    log_level STRING,
    tags JSON,
    syslog_severity STRING,
    syslog_facility STRING,
    syslog_tag STRING,
    message STRING,
    structured_data JSON
)

ALTER TABLE bug.message ADD TABLE bug.message_20200724

And many more daily partitions message_*.

The query that will randomly fail (you may need to run it between 1 and roughly 100 times before it fails, but once it does it will keep failing for a few seconds before recovering):

    SELECT
        sys.epoch(CAST('2020-07-15 14:12:00.000' AS TIMESTAMP)) + 0 AS chunk,
        logsource AS terms
    FROM logs.message
    WHERE logsource = 'xyz'
    GROUP BY chunk, terms

Note that it won't fail unless I have a process inserting records into a daily partition table (e.g. message_20200724) with short-lived transactions.

I will try to minimize this on Monday and perhaps provide a data set for you to load, going from this schema down to a simpler one. I tried to generate some data in a similar setting but without much luck reproducing the issue.
Comment 3 Pedro Ferreira 2020-07-24 20:52:47 CEST
I found this error with SQLancer. Maybe it has already been resolved for SP1. It had to do with MAL functions missing conditional execution/candidate lists.
Comment 4 jpastuszek@protonmail.com 2020-07-27 18:15:52 CEST
I have upgraded to Jun2020-SP1 and some of the queries that were failing are now fixed, but not the queries that I need.

It looks like the constant TIMESTAMP case is now fixed, but if I use a column it is still failing:

    SELECT
        sys.epoch(timestamp) + 0 AS chunk,
        logsource AS terms
    FROM logs.message
    WHERE logsource = 'xyz'
    GROUP BY chunk, terms 

Also, if I use a "terms" column of JSON type it works now; only STRING terms are an issue.

If I disable "mitosis" (SET optimizer='no_mitosis_pipe') everything works fine, so at least I have a usable work-around; a per-session sketch of applying it is shown below.
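
For reference, the work-around can be applied per session, right before the failing query (a sketch; it assumes the no_mitosis_pipe pipeline is available in this build, which can be checked in sys.optimizers):

    -- disable mitosis for the current session only (sketch)
    SET optimizer = 'no_mitosis_pipe';

    SELECT
        sys.epoch(timestamp) + 0 AS chunk,
        logsource AS terms
    FROM logs.message
    WHERE logsource = 'xyz'
    GROUP BY chunk, terms;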
Comment 5 Pedro Ferreira 2020-07-28 14:33:49 CEST
On the default branch I disabled the addition between temporal types and numeric types other than intervals, in order to make it compliant with the SQL standard. This may also be the issue in your query. Can you replace the integer constant with an interval (see the sketch below)?
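
For illustration, such an interval-based rewrite could look like this (a sketch only, not a confirmed fix: it keeps the chunk expression but adds a zero-length interval to the timestamp instead of adding a bare integer to the epoch value):

    SELECT
        sys.epoch(CAST('2020-07-15 14:12:00.000' AS TIMESTAMP) + INTERVAL '0' SECOND) AS chunk,
        logsource AS terms
    FROM logs.message
    WHERE logsource = 'xyz'
    GROUP BY chunk, terms;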
Comment 6 jpastuszek@protonmail.com 2020-07-30 11:00:03 CEST
This is also failing:

    SELECT
        sys.epoch(sys.epoch(timestamp) + INTERVAL '1' SECOND) AS chunk,
        logsource AS terms
    FROM logs.message
    WHERE logsource = 'xyz'
    GROUP BY chunk, terms
Comment 7 Pedro Ferreira 2020-07-30 11:36:57 CEST
I will keep running SQLancer and will eventually find this issue.
Comment 8 jpastuszek@protonmail.com 2020-07-31 14:30:03 CEST
OK, this reproduces the issue for me reliably:

CREATE TABLE ts1 (ts TIMESTAMP, s STRING);
CREATE TABLE ts2 (ts TIMESTAMP, s STRING);
CREATE TABLE tl1 (ts TIMESTAMP, s STRING);

INSERT INTO tl1 SELECT now(), 'a' AS s FROM sys.generate_series(0, 200000);

CREATE MERGE TABLE tsl (ts TIMESTAMP, s CLOB);

-- need 3 tables: two small or empty and one large; large needs to be 200k rows or more
ALTER TABLE tsl ADD TABLE ts1;
ALTER TABLE tsl ADD TABLE ts2;
ALTER TABLE tsl ADD TABLE tl1;
    
----- run the above first; then run the below

INSERT INTO ts1 (ts, s) VALUES (now(), NULL);
    
SELECT
    sys.epoch(ts) + 0 AS chunk,
    s AS terms
FROM sys.tsl
WHERE s = 'test'
GROUP BY chunk, terms; 
 
--- SQL Error [HY002]: Object not found
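
If it helps triage, one way to check whether mitosis actually partitioned the plan for this query is to prefix it with EXPLAIN and inspect the MAL plan (a sketch; the exact MAL output depends on the version and on the number of threads):

EXPLAIN SELECT
    sys.epoch(ts) + 0 AS chunk,
    s AS terms
FROM sys.tsl
WHERE s = 'test'
GROUP BY chunk, terms;
-- with mitosis enabled the plan should be split into multiple partition
-- slices; with optimizer='no_mitosis_pipe' it should stay in one piece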
Comment 9 jpastuszek@protonmail.com 2020-08-06 11:55:00 CEST
I have looked at monetdb5/optimizer/opt_mitosis.h and it looks like the explanation of the requirement for a table with 200k+ rows is there:

#define MINPARTCNT 100000   /* minimal record count per partition */

monetdb5/optimizer/opt_mitosis.c:

153         /* exploit parallelism, but ensure minimal partition size to
154          * limit overhead */
155             pieces = (int) MIN(rowcnt / MINPARTCNT, (BUN) threads);
156         }

We get "pieces > 1" only if rowcnt is 200k or more: with MINPARTCNT = 100000, pieces = MIN(rowcnt / 100000, threads), so rowcnt must be at least 200000 (and at least 2 threads must be available). So it is only then that mitosis kicks in.

Also, the reason for needing both small and large tables in the merge table is that the code first does: "/* locate the largest non-partitioned table */".

Also, disabling mitosis does make the issue go away.
Comment 10 Niels Nes cwiconfidential 2020-09-09 13:54:12 CEST
Running your latest example script (with mserver --forcemito, i.e. always partition) seems to no longer show the issue (on the Oct2020 branch).
Could you verify this?