Replication

Committeddb is a distributed system composed of write-ahead logs (WALs) that are replicated using the rock-solid Raft consensus implementation written by the etcd team.

Raft

A Committeddb cluster runs a single Raft cluster, which records its data in the main (Primary) WAL. All data is stored in this WAL, including cluster metadata and user data. Raft guarantees that the WALs on all nodes in the system are in the same order and contain identical information up to the latest commit each node has received. A node may contain fewer entries than another node due to unavailability, but the entries it does contain will be identical in order and content to those on every other node in the cluster. Raft also takes care of catching nodes up after they become available again. The link above contains an explanation of Raft along with a visualization of how it works.

When you write to a Committeddb node, it forwards the write to the current Raft leader (if it isn't the leader itself). The leader decides whether to accept the proposal and then tells its peers about the new update. The proposal is not committed until a quorum (n/2 + 1) of nodes has accepted it. Before a node accepts a proposal it writes it to its disk-based WAL, providing strong durability semantics.
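
Below is a minimal Go sketch of that flow. The names (Node, Proposal, appendToWAL, replicateTo, forwardToLeader) are hypothetical illustrations, not Committeddb's API, and the synchronous loop is simplified compared to how etcd's Raft actually pipelines replication; it only shows the shape of the path: forward to the leader, persist to a disk-based WAL before acknowledging, and commit once a quorum of n/2 + 1 nodes has the proposal.

```go
package sketch

import "errors"

var errNoQuorum = errors.New("proposal was not accepted by a quorum")

// Proposal is a single write destined for a topic.
type Proposal struct {
	Topic string
	Data  []byte
}

// Node is a stand-in for a Committeddb cluster member.
type Node struct {
	isLeader bool
	peers    []uint64 // ids of the other nodes in the cluster
}

// quorum is the minimum number of nodes that must persist a proposal: n/2 + 1.
func quorum(clusterSize int) int {
	return clusterSize/2 + 1
}

// Propose forwards the write to the leader if this node is a follower; the
// leader replicates it and reports success once a quorum has written it to disk.
func (n *Node) Propose(p Proposal) error {
	if !n.isLeader {
		return n.forwardToLeader(p)
	}

	// The leader appends to its own disk-based WAL before counting itself.
	if err := n.appendToWAL(p); err != nil {
		return err
	}
	acks := 1

	// Each peer also writes the proposal to its WAL before acknowledging.
	for _, peer := range n.peers {
		if err := n.replicateTo(peer, p); err == nil {
			acks++
		}
	}

	if acks < quorum(len(n.peers)+1) {
		return errNoQuorum
	}
	return nil // committed: a majority has the proposal on durable storage
}

// Stub hooks; in Committeddb these steps are driven by etcd's Raft implementation.
func (n *Node) forwardToLeader(p Proposal) error        { return nil }
func (n *Node) appendToWAL(p Proposal) error            { return nil }
func (n *Node) replicateTo(id uint64, p Proposal) error { return nil }
```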

Primary WAL

Every node in the cluster has a Primary WAL with a record of every accepted proposal in the cluster. Although Raft requires a quorum to be able to accept writes, data integrity is preserved if just one node has an intact Primary WAL. This allows recovery even in the face of catastrophic failure where n-1 nodes lose all data.

Since a WAL stores data sequentially, querying this type of data structure in a random manner is highly inefficient. To guard against that, Committeddb relies on two caches.

Snapshot

The first cache is a snapshot of metadata. Taking a snapshot adds a snapshot record to the Primary WAL and creates a separate file on the filesystem containing the snapshot's contents. These snapshots contain Raft state, cluster configuration, and metadata describing syncables, databases, topics, etc. When the cluster starts up, the latest snapshot, which contains a WAL index, is loaded, and then the Primary WAL is read from that index to catch up on any metadata added after the snapshot was taken.
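
A rough Go sketch of that startup sequence follows. The names (Snapshot, Entry, loadLatestSnapshot, readPrimaryWALFrom, restoreState) are assumed stand-ins, not Committeddb's actual API: load the newest snapshot, restore the metadata it holds, then replay the Primary WAL from the snapshot's WAL index to pick up anything written afterwards.

```go
package sketch

// Snapshot is a stand-in for an on-disk snapshot file.
type Snapshot struct {
	WALIndex uint64 // last Primary WAL index covered by this snapshot
	State    []byte // serialized Raft state, cluster configuration, and metadata
}

// Entry is one record in the Primary WAL.
type Entry struct {
	Index uint64
	Data  []byte
}

func loadLatestSnapshot(dir string) (*Snapshot, error) {
	// read and decode the newest snapshot file in dir
	return nil, nil
}

func readPrimaryWALFrom(dir string, index uint64) ([]Entry, error) {
	// return Primary WAL entries recorded after the given index
	return nil, nil
}

func restoreState(b []byte) {
	// decode and install the metadata captured in the snapshot
}

// startup loads the latest snapshot and then catches up from the Primary WAL.
func startup(snapDir, walDir string, apply func(Entry) error) error {
	snap, err := loadLatestSnapshot(snapDir)
	if err != nil {
		return err
	}

	var from uint64
	if snap != nil {
		restoreState(snap.State) // metadata as of the snapshot
		from = snap.WALIndex     // continue reading the WAL from here
	}

	entries, err := readPrimaryWALFrom(walDir, from)
	if err != nil {
		return err
	}
	for _, e := range entries {
		// apply metadata added after the snapshot was taken
		if err := apply(e); err != nil {
			return err
		}
	}
	return nil
}
```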

Topic WAL

The second cache is the topic WAL. Each topic has its own disk-based WAL containing only the records destined for that topic. This makes syncables more performant because they only need to read through the WAL of the topic they are associated with instead of the entire Primary WAL.
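
Here is a small, hypothetical Go sketch of that idea (none of these names come from Committeddb's codebase): every committed record is appended to the Primary WAL and also to the WAL of its topic, so a syncable reads only the topic log it cares about.

```go
package sketch

import "sync"

// Record is one committed write.
type Record struct {
	Topic string
	Data  []byte
}

// walFile stands in for an append-only log on disk.
type walFile struct {
	mu      sync.Mutex
	entries []Record
}

func (w *walFile) append(r Record) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.entries = append(w.entries, r)
}

// Node holds the Primary WAL plus one WAL per topic.
type Node struct {
	primary   walFile
	topicWALs map[string]*walFile
}

func newNode() *Node {
	return &Node{topicWALs: make(map[string]*walFile)}
}

// apply is called for every committed proposal.
func (n *Node) apply(r Record) {
	n.primary.append(r) // the Primary WAL records every accepted proposal

	tw, ok := n.topicWALs[r.Topic]
	if !ok {
		tw = &walFile{}
		n.topicWALs[r.Topic] = tw
	}
	tw.append(r) // the topic WAL holds only this topic's records
}

// readTopic is what a syncable would use: just the WAL of its own topic.
func (n *Node) readTopic(topic string) []Record {
	if tw, ok := n.topicWALs[topic]; ok {
		return tw.entries
	}
	return nil
}
```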

Tradeoffs

As with any distributed system, there is no silver bullet. The goal of Committeddb is to be a system of record that values durability and streamability of records. Most of the engineering tradeoffs have been made with those two values in mind.

  • Raft is designed to be used in small clusters. The number of nodes in the cluster should be odd, and since the overhead of Raft communication scales poorly, the cluster should have nine or fewer nodes. This means that Committeddb clusters won't scale as well as Kafka clusters, but they will scale competitively against a SQL database or MongoDB.

  • One of the core concepts we wanted to capture was the ability to multiplex multiple topics so that new topics containing a unified, value-added data view could be created. Because of that, a single Raft cluster is run. This means we have stable ordering across the cluster (unlike Kafka, which only has stable ordering within a partition). If write A goes to topic foo, then write B goes to topic bar, then write C goes to topic foo, we remember that ordering and can take advantage of it (see the sketch after this list). This ability limits us to one Raft group per cluster.

  • WALs provide abysmal random querying ability. Instead of making up for this with a structured, query-oriented view of the data, we provide a simple and performant way to sync the data into the multitude of databases that already have strong query stories. Since we focus on preserving your data's integrity and providing strong data-movement semantics, we open up use cases that were hard to support in the past: querying data in the most efficient manner, A/B testing of databases, adding value to data through multiplexing, replaying data against new algorithms, easily moving data to BI tools, etc.
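
As a toy illustration of the multiplexing point above (assumed types and data, not Committeddb code): because there is a single cluster-wide log, a derived topic built from both foo and bar still sees writes A, B, C in exactly the order the cluster accepted them.

```go
package main

import "fmt"

// Write is one entry in the single cluster-wide log.
type Write struct {
	Index uint64 // position in the cluster-wide ordering
	Topic string
	Data  string
}

func main() {
	// The cluster accepted A (foo), then B (bar), then C (foo).
	log := []Write{
		{1, "foo", "A"},
		{2, "bar", "B"},
		{3, "foo", "C"},
	}

	// A unified view multiplexing foo and bar preserves that interleaving.
	for _, w := range log {
		if w.Topic == "foo" || w.Topic == "bar" {
			fmt.Printf("%d %s %s\n", w.Index, w.Topic, w.Data)
		}
	}
}
```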