Distributed Query Processing

Distributed Query Processing zhang Sun, 07/19/2015 - 17:43

Distributed query processing has been introduced since the Jul2015 release of MonetDB through the support for remote tables. The remote table technique complements merge table by allowing the partitions of a merge table to reside on different databases. Queries involving remote tables are automatically split into subqueries by the master database and executed on remote databases. The combination of merge table and remote table enables fine control of how to distribute data and query workloads to maximally benefit from the available CPU and RAM resources.

Setting up REMOTE TABLE-s

Remote table adopts a straightforward master-worker architecture: one can place the partition tables in different databases, each served by a worker MonetDB server, and then glue everything together in a MERGE TABLE in the master database served by the master MonetDB server. Each MonetDB server can act as both a worker and a master, depending on the roles of the tables it serves.

The following shell commands and SQL queries show how to place the partition tables t1 and t2 on two worker databases, and glue them together on a master database. The format of the address of a REMOTE TABLE is: mapi:monetdb://<host>:<port>/<dbname>, where all three parameters are compulsory. A worker database can reside on both a local cluster node and a remote machine. N.B.:

  1. The declaration of a REMOTE TABLE must match exactly the signature of its counterpart in the remote database, i.e., the same table name, the same columns names and the same column data types.
  2. Currently, at the creation time of a remote table, the remote database server is not contacted to verify the existence of the table. When a CREATE REMOTE TABLE report “operation successful”, it merely means that the information about the new remote table has been added to the local SQL catalogue. The check at the remote database server is delayed until the first actual query on the remote table.

    -- Start a server and client for worker-database1, and create partition table1
    $ mserver5 --dbpath=<path-to>/rt1 --set mapi_port=50001
    $ mclient –d rt1 –p 50001
    sql> CREATE TABLE t1 (i int);
    sql> INSERT INTO t1 VALUES (11), (13);

    -- Start a server and client for worker-database2, and create partition table2
    $ mserver5 --dbpath=<path-to>/rt2 --set mapi_port=50002
    $ mclient –d rt2 –p 50002
    sql> CREATE TABLE t2 (j int);
    sql> INSERT INTO t2 VALUES (23), (27);

    -- Start a server and client for the master-database,
    --   and create a MERGE TABLE containing two REMOTE TABLEs
    $ mserver5 --dbpath=<path-to>/mst
    $ mclient –d mst
    sql> CREATE MERGE TABLE mt1 (t int);
    sql> -- Identify t1, t2 as REMOTE TABLEs with their locations
    sql> CREATE REMOTE TABLE t1 (i int) on ‘mapi:monetdb://localhost:50001/rt1’;
    sql> CREATE REMOTE TABLE t2 (j int) on ‘mapi:monetdb://localhost:50002/rt2’;
    sql> -- Add the remote tables into the MERGE TABLE
    sql> ALTER TABLE mt1 ADD TABLE t1;
    sql> ALTER TABLE mt1 ADD TABLE t2;
    sql> -- Sanity check:
    sql>SELECT count(*) from t1;
    | L1   |
    |    2 |
    1 tuple (9.389ms)
    sql>SELECT count(*) from t2;
    | L1   |
    |    2 |
    1 tuple (8.520ms)
    sql>SELECT count(*) from mt1;
    | L1   |
    |    4 |
    1 tuple (10.481ms)