Transaction Replication


As of the Jun2016 release, MonetDB has added high-availability support through transaction-replication (NB: this feature is still experimental and should therefore be used with care!).

Transaction-replication is done by shipping the transaction logs from the master instance to a number of replicas. It follows a pull model, where each replica pulls transactions independently and asynchronously. Aside from the transaction logs, there is no further master-replica information exchange. Formally, this is considered Lazy Centralized replication with Limited Transparency.

By default, a MonetDB server stores transactions in Write-Ahead Log (WAL) files before the transactions are persisted in the primary BAT storage. During database startup, the transaction log files are read and any data changes not yet persisted in the BATs are made persistent. A MonetDB replica can be configured to read the WAL files of a master instance, load the transactions and persist the data in its own BATs.

On the master instance, MonetDB must be configured to keep all transaction log files, even those of transactions that have already been persisted in its primary storage, because by default the database server cleans up the log files of persisted transactions. The transaction log files on the master have to be shipped to the replica(s), which can be done using a highly-available shared file system or alternative means (see the sketch below).
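As an illustration of such an alternative means, the master WAL directory could be copied to a replica host periodically, e.g. with rsync; the host name and paths below are placeholders, not something MonetDB provides.

-- Ship the master WAL directory to a replica host (could be run periodically, e.g. from cron)
$ rsync -a /path/to/master-logs/ replica-host:/path/to/master-logs/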

On a replica instance, the location of the master transaction log files must be configured. In addition, the transaction drift threshold between the replica and the master must be set. The drift is the maximal allowed difference between the transactions processed by the master and those processed by the replica. If a replica detects that it has fallen behind the master by more than the set threshold, it will apply the transactions in the WAL files to catch up with the master. Finally, the replica must be set to run in read-only mode.

To avoid data inconsistencies, the master is the only instance that can apply changes to the data (create, insert, update, delete). As such, all replicas must run in read-only mode, where data changes are propagated only through the transaction-replication mechanism. To guarantee that a replica instance only replicates the master's WAL, a MonetDB instance will never function in replica mode unless it is started in read-only mode.

Setup

First and foremost, the shipping of the transaction log files between the master and the replicas must be configured. The most straightforward way is to use a shared file system, which also provides high-availability for the master transaction log files. The master and replica instances must run the same version of MonetDB.
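For example, assuming an NFS export is available (the server name, export path and mount point below are placeholders), the shared directory can be mounted on both the master and the replica hosts and later used as the log directory:

-- Mount a shared NFS export on both the master and the replica hosts (placeholder names)
$ sudo mkdir -p /mnt/monetdb-wal
$ sudo mount -t nfs nfs-server:/export/monetdb-wal /mnt/monetdb-wal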

Master:

  • The flag "gdk_keep_persisted_log_files" denotes the number of old WAL files to keep, i.e. WAL files of transactions that have already been persisted on the master instance. If the flag is not set (i.e. gdk_keep_persisted_log_files=0, the default), the master will clean up all old WAL files during garbage collection. If the flag is set to N, the master will keep the N most recent WAL files during garbage collection. Without these WAL files, a replica will not be able to read and replicate all transactions. To allow lagging replicas to catch up, one might set this flag to a fairly large number.
    • gdk_keep_persisted_log_files=1000
  • (Optionally) Set a custom location for the transaction log files. This is useful if you are using a shared file system and/or you want the WALs to be written to a different location than where your database resides (e.g. under a shared FS volume).
    • gdk_logdir=<path to transaction log directory>

Replica:

  • Set the location of the master transaction log files, made available locally through the chosen log-shipping mechanism.
    • gdk_shared_logdir=<path to master transaction log directory>
  • Set the shared transaction drift threshold, i.e. how far the replica may lag behind the master. It must be ≥ 0.
    • gdk_shared_drift_threshold=<desired drift threshold>
  • Set the read-only flag. The replica must always run in read-only mode, only replicating transactions from the master, never applying changes of its own.
    • --readonly

Here is an example of how to set up transaction-replication using a shared file system for the master WAL.

-- Start a master server, set gdk_keep_persisted_log_files to 1 and point gdk_logdir to a directory for the master WAL on a shared file system
$ mserver5 --dbpath=<path-to-master-db> --set mapi_port=50000 --set gdk_keep_persisted_log_files=1 --set gdk_logdir=<path-to-shared-fs-for-master-logs>
-- Start a replica server, point gdk_shared_logdir to the previously shared master WAL directory, set gdk_shared_drift_threshold to the desired value, and enable read-only mode
$ mserver5 --dbpath=<path-to-replica-db> --set mapi_port=50001 --set gdk_shared_logdir=<path-to-shared-fs-for-master-logs> --set gdk_shared_drift_threshold=0 --readonly
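
Once both servers are running, data loaded on the master should become visible on the replica as the logs are replicated. As a quick sanity check (the table name is a placeholder, and connection options such as credentials depend on your setup), the row counts on both instances can be compared with mclient:

-- Compare row counts on the master (port 50000) and the replica (port 50001); <table> is a placeholder
$ mclient -p 50000 -s "SELECT count(*) FROM <table>"
$ mclient -p 50001 -s "SELECT count(*) FROM <table>"
-- The replica may lag behind by up to the configured drift threshold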

Cluster configurations

The recommended cluster configuration is a warm standby, where:

  • There is a single master instance, which can do read and write operations
  • There are one or more replica instances, which process no queries, but only replicate the master transactions asynchronously
  • Upon master failure, a replica instance can be restarted in non-read-only mode to take over the role of master (see the sketch below).
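
A minimal promotion sketch, assuming the failed-over replica has already caught up on the shipped WAL files: its server is stopped and restarted without --readonly, and with master-side settings so that it keeps (and can in turn share) its own WAL files. All paths and ports below are placeholders.

-- Promote a replica: restart its server without --readonly so it accepts writes,
-- and keep/publish its own WAL files so it can act as the new master
$ mserver5 --dbpath=<path-to-replica-db> --set mapi_port=50000 --set gdk_keep_persisted_log_files=1 --set gdk_logdir=<path-to-shared-fs-for-new-master-logs>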

The warm standby configuration has the advantage that it provides increased fault tolerance and is relatively simple to set up. Since only the instance loading the data can be queried, queries always operate on an up-to-date store. Since the replication is done asynchronously, if the master goes down, the replacing replica may lag behind the former master. The only way to avoid that is to use a shared file system for the log shipping (see below). This ensures that, in case of master failure, the last logs are still available for a replica to replicate before it is restarted as the new master.

Log shipping

We also plan to extend the MonetDB-Daemon (monetdbd) to support log shipping and master-replica cluster configurations, to be used in conjunction with transaction-replication. Until then, it is recommended to use a shared file system for shipping the transaction logs. It is advisable to choose a shared file system that is also fault tolerant, such that the loss of the master instance does not lead to the loss of the master transaction logs as well. This way the shared file system provides both log shipping and log backup. Since the master instance preserves all transaction log files, the shared file system will hold a complete copy of the data of that database (in the form of transactions). An example of such a file system is given below.

If MonetDB is deployed in the cloud, the native storage of the cloud provider can be used. For example, in AWS one can set up WAL file replication on the Amazon Elastic File System (EFS). It has high durability and availability, making it ideal for both log shipping and backup.
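
For instance (the file system ID, region and mount point below are placeholders), an EFS volume is mounted over NFS on the master and replica hosts, and gdk_logdir / gdk_shared_logdir are then pointed at a directory on it:

-- Mount an EFS volume over NFS on both the master and the replica hosts (placeholder file system ID and region)
$ sudo mkdir -p /mnt/efs
$ sudo mount -t nfs4 -o nfsvers=4.1 fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs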

MonetDB is primarily designed to be used as an analytical database. As such, data are best loaded in large bulk transactions. This will also guarantee that only single large files are shipped to the replicas for replication, minimising the transaction drift.
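
As an illustration, a CSV file can be bulk-loaded on the master in a single transaction with COPY INTO; the table name, file path and port below are placeholders:

-- Bulk-load a CSV file on the master in one transaction (table and file path are placeholders)
$ mclient -p 50000 -s "COPY INTO <table> FROM '/path/to/data.csv' USING DELIMITERS ',', '\n'"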

Known problems and limitations

  • The master transaction log directory must be available upon replica startup, otherwise the replica instance will not start.
  • If a replica detects a missing master transaction log file, it will stop, since it will be unable to replicate all master transactions.
  • Although a replica does not refuse to execute read-only queries, doing so is unsafe: it can cause the replica to crash or corrupt its replicated database.
  • Currently, the database users must ensure the replication and availability of the master transaction log files. This can be done via a shared file system or an alternative mechanism. We do, however, plan to build this into monetdbd in the future.
  • Besides the "gdk_keep_persisted_log_files" flag, there is currently no mechanism/call to instruct the master to clean up the WAL files earlier (after all replicas have been synced). This could lead to increased disk space usage. Adding an external clean-up call is pending.