Lazy Logical Replication

Lazy Logical Replication mk Mon, 05/01/2017 - 21:55

EXPERIMENTAL CODE available in development branch as of May 2017.

Lazy logical replication has been added to the source code to realize an asynchronous logical replication management scheme using change set forwarding. Simplicity and ease of end-user control have been the driving arguments in its development.

What is Lazy Logical Replication?

In a data analytics environment the workload on a database is largely read only. Applications grind the data for business insights and summarizations through visual dashboards. Updates are often collected as (micro-) batches and injected into the data warehouse at regular intervals.

If the ingestion rate increases, updates become more complex, or the number of concurrent data analysis applications rises, it becomes mandatory to create a master/replica infrastructure. The master instance is responsible for handling all updates to the database, and replica instances are created to satisfy the responsiveness required by the applications.

A key observation for this common business scenario is that the replicas may lag a little behind. Because data analysts often look at long term patterns using statistical summarizations, and therefore the outcome is less dependent on what happened during the last minute. Furthermore, the replicated data warehouse is likely to run in a Cloud setting, or a cluster with a shared global filesystem. This creates room to simplify the synchronisation between instances, relying on the services provided by the filesystem. In particular, master and replicas share the same <dbfarm> directory.

The Lazy Logical Replication technique rests on detecting change sets in the persistent tables at the master instance, which are collected in a transactional safe way, and replayed at the replica. Replay can be interrupted to obtain a time-warped copy of the database.

When to consider Lazy Logical Replication?

The goal of this extension module is to ease backup and replication of a complete master database with a time-bounded delay. This means that both the master and the replicas run at a certain beat (eg in seconds) by which information is made available by the master or read by the replicas. Such an instance can be freely used for query workload sharing, database versioning, and (re-)partitioning. For example, a replica can be used to support a web application which also keeps application specific data in the same instance, eg session information.

Tables taken from a master can be protected against updates and inspections in a replica instance using the schema access policies defined by the master. Furthermore, updates against replicated tables are not automatically forwarded to the master. Any transaction change set replay that fails stops the cloning process. By default, only persistent tables are considered for replication, and all constraints maintained by the master are carried over to the replicas. Updates under the 'tmp' schema, ie temporary tables, are ignored.

For a backup, we need either all update logs for the entire lifetime of a database, or a binary database snapshot with a collection of logs that have recorded all changes since the snapshot was created. For a replication, also called a database clone, we take a snapshot and the log files that reflect the recent changes. Then, the logged updates are replayed against the snapshot until a specific point in time or transaction id is reached, as identified by the clone itself.

The underlying assumption of the techniques deployed is that the database resides on a proper (global/distributed) file system to guarantees recovery from most storage system related failures, eg using RAID disks or Log-Structured-File systems.

How to set up a master instance?

The safest way to create a master/replica instance is to start with an empty master database. Alternatively, one can stop the master instance, take a binary copy of the <dbfarm>/<dbname> directory and save it for initialization of the replica. A database instance can be set into the 'master' mode only once using the SQL command:

        CALL master();

An optional path to the log record directory can be given to reduce the IO latency, eg using a nearby SSD, or where there is ample of space to keep a long history, such as an HDD or a cold storage location. By default, the command creates a directory /<path-to>/<dbfarm>/<dbname>/wlc_logs to hold all logs, and a configuration file /<path-to>/<dbfarm>/<dbname>/wlc.config to hold the state of the transaction logs (WLC stands for Work Load Capture). It contains the following <key>=<value> pairs:

        snapshot=<path to a snapshot directory>
        logs=<path to the wlc log directory>
        state=<1: started, 2: stopped>
        batches=<next available batch file to be applied>
        beat=<maximal delay between log files, in seconds>
        write=<timestamp of the last transaction recorded>

A missing snapshot path denotes that we can start a replica from an empty database. The log files are stored as <dbname>_<batchnumber> in the "wlc_logs" directory. They belong to the snapshot. Each WLC log file contains a serial logs of committed compound transactions. The log records are represented as ordinary MAL statement blocks, which are executed in serial mode. Each transaction is identified by a unique id, its starting time, and the responsible database user. The log records must end with a COMMIT to be allowed for re-execution. Log records with a ROLLBACK tag are merely for off-line analysis by the DBA.

A transaction log file is created by the master using a heartbeat in seconds. A new transaction log file is published, after the system has been collecting transaction records for some time. The beat can be set using the SQL command:

        CALL masterbeat(<duration>);

Setting the master heartbeat to zero leads to one log file per transaction, and this may lead to a log directory with potentially a large amount of files. A default of 5 minutes should balance the polling overhead in most practical situations. The log file is shared within 'beat' seconds after the first transaction record was written into it.

The final step in the life time of a master instance is to stop transaction logging with the SQL command:

        CALL stopmaster();

This marks the end-of-life time for a snapshot. For example, when planning to do a large bulk load of the database, stopping logging avoids a double write into the database. The database can only be brought back into the master mode using a fresh snapshot.

One of the key challenges for a DBA is to keep the log directory manageable, because it grows with the speed in which updates are applied to the database. This calls for regularly checking for their disk footprint, and taking a new snapshot as a frame of reference. A master instance has no knowledge about the number of clones and their whereabouts. To ensure transaction ACID properties, the log records are stored on disk within the transaction brackets, which may cause extra I/O pressure. This can be alleviated by storing the database and logs files on an SSD or a Non-Volatile-Memory (NVM) device.

How to make a replica instance?

Every clone starts off with a copy of the binary snapshot identified by 'snapshot'. A fresh database can be turned into a clone using the call:

        CALL replicate('mastername')

It will grab the latest snapshot of the master and applies all available log files before releasing the database. Progress of the replication can be monitored using the -fraw option in mclient.

The clone process will iterate in the background through the log files, applying all updating transactions. An optional timestamp or transaction id can be passed to the replicate() command to apply the logs until a specific moment or transaction. This is particularly useful when an unexpected desastrous user action, eg dropping a persistent table, has to be recovered from.

        CALL replicate('mastername');
        CALL replicate('mastername',NOW()); -- stops after we are in sync
        ...
        CALL replicate(NOW()); -- partial roll forward
        ...
        CALL replicate();      -- continue nondisturbed synchronisation

        SELECT replicaClock(); -- returns the timestamp of the last replicated transaction.
        SELECT replicaTick();  -- returns the transaction id of the last replicated transaction.
        SELECT masterClock();  -- return the timestamp of the last committed transaction in the master.
        SELECT masterTick();   -- return the transaction id of the last committed transaction in the master.

Any failure encountered during a change set replay terminates the replication process, leaving a message in the merovingian log.

Features to be considered beyond the Alpha release

  • Include master/replica management commands into the monetdb control program.
  • Turn off logging of the replicated tables in the replica instance, since we already have them.
  • Roll forward of SQL session variables, i.e. optimizer_pipe. For updates we don't need special care for this.
  • Agents (message buss) to ship the log and snaphot files between different file systems to improve local access speed.
  • Selective replication by partitioning the log files based on predicates.
  • Deploy parallel updates on the change set files.