Distributed Data
Why distribute a database across multiple machines:
- Scalability
- Fault tolerance/high availability, redundancy.
- Latency: place servers geographically close to users.
Scaling to Higher Load
- shared-memory architecture: all the components can be treated as a single machine. Costly.
- high-end machines have hot-swappable components (disks, memory modules, and even CPUs can be replaced without shutting down the machine).
- shared-disk architecture: used for some data warehousing workloads, but contention and the overhead of locking limit its scalability.
- shared-nothing architecture, scale out.
- Each machine or virtual machine running the database software is called a node.
Replication Versus Partitioning
- Replication - provides redundancy
- Partitioning - different partitions can be assigned to different nodes (also known as sharding).
They often go hand in hand: partitioning combined with replication.
Replication
Replication means keeping a copy of the same data on multiple machines that are connected via a network.
Why replication?
- Keep data geographically close to your users, thus reducing latency.
- Allow the system to keep running without downtime even if some of its parts have failed.
- Scale out the number of machines that can serve read queries, increasing read throughput.
This assumes your dataset is small enough that each machine can hold a copy of the entire dataset.
- Three popular approaches to replicating changes between nodes:
- single-leader
- multi-leader
- leaderless
Leaders and Followers
Every write to the database needs to be processed by every replica.
- Leader-based replication(active/passive or master/slave replication)
- One of the replicas is designated the leader; writes are sent to the leader, which first writes the new data to its local storage.
- The other replicas are known as followers (read replicas, slaves, secondaries, or hot standbys). The leader sends the data change to all of its followers as part of a replication log or change stream.
- Clients can query either the leader or any of the followers for reads; writes are only accepted on the leader.
This is a built-in feature of many relational databases: PostgreSQL, MySQL, Oracle Data Guard, and SQL Server's AlwaysOn Availability Groups.
Also used in some nonrelational databases, including MongoDB, RethinkDB, and Espresso.
Not only databases: distributed message brokers such as Kafka and RabbitMQ's highly available queues also use leader-based replication.
Synchronous Versus Asynchronous Replication
- Synchronous replication: the leader waits until follower 1 has confirmed that it received the write before reporting success to the user.
- The followers are guaranteed to have an up-to-date copy of the data.
- But if a follower doesn't respond, the write cannot be processed; the leader must block all writes and wait.
- Any one node outage would cause the whole system to grind to a halt.
- Asynchronous replication: the leader sends the message, but doesn't wait for a response from the follower.
Semi-synchronous
In practice, one of the followers is synchronous and the others are asynchronous. If the synchronous follower becomes unavailable or slow, one of the asynchronous followers is made synchronous. This guarantees that you have an up-to-date copy of the data on at least two nodes: the leader and the synchronous follower.
Asynchronous Replication
Often, leader-based replication is configured to be completely asynchronous: the leader can continue processing writes, but a write is not guaranteed to be durable, even if it has been confirmed to the client.
It is nevertheless widely used.
- Chain replication - successfully implemented in Microsoft Azure Storage.
- Consistency of Replication
- Consensus (getting several nodes to agree on a value).
Setting Up New Followers
- Lock the database (making it unavailable for writes), but this goes against our goal of high availability.
- Done without downtime:
- Take a consistent snapshot of the leader's database at some point in time. Third-party tools are needed in some cases, such as innobackupex for MySQL.
- Copy the snapshot to the new follower node.
- The follower connects to the leader and requests all the data changes that have happened since the snapshot was taken. The snapshot needs to be associated with an exact position in the leader's replication log. That position has various names: log sequence number in PostgreSQL, binlog coordinates in MySQL.
- When the follower has processed the backlog of data changes since the snapshot, we say it has caught up. It can now continue to process data changes from the leader as they happen.
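A minimal sketch of this bootstrap sequence, assuming a hypothetical leader interface (`snapshot()` returning data plus a log position, `changes_since()` streaming later writes); real systems use engine-specific snapshot formats and log positions:

```python
def bootstrap_follower(leader, follower_store):
    # 1. Take a consistent snapshot tied to an exact replication-log position.
    snapshot, snapshot_position = leader.snapshot()

    # 2. Copy the snapshot to the new follower node.
    follower_store.update(snapshot)

    # 3. Catch up: request and apply every change made since the snapshot.
    for position, key, value in leader.changes_since(snapshot_position):
        follower_store[key] = value
        snapshot_position = position

    # 4. The follower has caught up; it keeps streaming changes from here on.
    return snapshot_position
```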
Handling Node Outages
Any node might go down unexpectedly (a fault) or due to planned maintenance (e.g. rebooting to install a kernel security patch).
- Goal: keep the system as a whole running despite individual node failures.
Follower failure: Catch-up recovery
- On its local disk, each follower keeps a log of the data changes it has received from the leader.
- Recover from the log.
Leader failure: Failover
- Failover: one of the followers needs to be promoted to be the new leader, clients need to be reconfigured to send their writes to the new leader, and the other followers need to start consuming data changes from the new leader.
Failover can happen manually or automatically. An automatic failover process:
- Determining that the leader has failed (crashes, power outages, network issues). There is no foolproof way of detecting what has gone wrong, so most systems simply use a timeout: nodes frequently bounce messages back and forth between each other, and if a node doesn't respond for some period of time (say, 30 seconds), it is assumed to be dead. (See the sketch after this list.)
- Choosing a new leader: through an election process (the new leader is chosen by a majority of the remaining replicas), or appointed by a previously elected controller node.
- The best candidate is usually the replica with the most up-to-date data changes from the old leader (to minimize data loss).
- Reconfiguring the system to use the new leader. When the old leader comes back, it becomes a follower and must recognize the new leader.
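A toy sketch of the timeout-based failure detection from the first step; the 30-second threshold and heartbeat interface are assumptions for illustration:

```python
import time

TIMEOUT_SECONDS = 30  # assumed threshold; choosing it well is discussed below

class FailureDetector:
    """Tracks the last heartbeat seen from each node; a node that has been
    silent for longer than the timeout is presumed dead."""

    def __init__(self):
        self.last_heard = {}

    def record_heartbeat(self, node_id):
        self.last_heard[node_id] = time.monotonic()

    def suspected_dead(self, node_id):
        last = self.last_heard.get(node_id)
        return last is None or time.monotonic() - last > TIMEOUT_SECONDS
```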
Failover is fraught with things that can go wrong:
- If asynchronous replication is used, the new leader may not have received all writes from the old leader. When the old leader comes back, what should happen to its unreplicated writes? Simply discarding them is common, but it may violate clients' durability expectations.
- Discarding writes is especially dangerous if other storage systems outside of the database need to be coordinated with the database contents.
- If two nodes both believe they are the leader, this is called "split brain". Dangerous: some systems have a mechanism to shut down one node if two leaders are detected.
- What is the right timeout? A longer timeout means a longer time to recovery; a shorter one risks unnecessary failovers caused by a temporary load spike or a network glitch delaying packets.
Implementation of Replication Logs
Statement-based Replication
- every INSERT, UPDATE, DELETE statement is forwarded to followers.
- Problems:
- Statements that call a nondeterministic function such as NOW() or RAND() will generate a different value on each replica (see the demo below).
- If statements use an autoincrementing column, or depend on existing data, they must execute in exactly the same order on each replica.
- Statements with side effects (e.g. triggers, stored procedures) may result in different side effects on each replica.
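A toy demonstration of the nondeterminism problem, with Python stand-ins for NOW() and RAND(); replaying the statement text re-evaluates the nondeterministic functions on each replica, so the replicas diverge:

```python
import random
import time

def execute(statement, store):
    # A replica "executing" the forwarded statement text re-evaluates the
    # nondeterministic functions locally.
    if statement == "UPDATE t SET ts = NOW(), lucky = RAND()":
        store["ts"] = time.time()          # NOW() differs per replica
        store["lucky"] = random.random()   # RAND() differs per replica

stmt = "UPDATE t SET ts = NOW(), lucky = RAND()"
leader, follower = {}, {}
execute(stmt, leader)    # executed on the leader...
execute(stmt, follower)  # ...and replayed verbatim on a follower
print(leader == follower)  # almost certainly False: the replicas diverged
```

One workaround is for the leader to replace any nondeterministic function calls with a fixed return value before forwarding the statement.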
Write-ahead log (WAL) shipping
How storage engines represent data on disk:
- log-structured storage engines (SSTables and LSM-trees): log segments are compacted and garbage-collected in the background.
- B-trees overwrite individual disk blocks; every modification is first written to a write-ahead log so the indexes can be restored to a consistent state after a crash.
In either case, the log is an append-only sequence of bytes containing all writes to the database.
- When the follower processes this log, it builds a copy of the exact same data structures as on the leader.
- Coupled too closely with the storage engine: the log describes the data at a very low level; a WAL contains details of which bytes were changed in which disk blocks.
- If the replication protocol allows the follower to use a newer software version than the leader, you can perform a zero-downtime upgrade by first upgrading the followers and then performing a failover to make one of the upgraded nodes the new leader.
Logical (row-based) log replication
Use different log formats for replication and for the storage engine, decoupling the two. A logical log is a sequence of records describing writes to database tables at the granularity of a row (sketched below):
- inserted row: the new values of all columns.
- deleted row: enough information to uniquely identify the deleted row (the primary key, or all columns otherwise).
- updated row: at least the new values of all columns that changed.
Since it is decoupled from storage engine internals, the logical log is easier to keep backward compatible (leader and followers can run different versions).
- A logical log format is also easier for external applications to parse.
- Send the contents of a database to an external system
- e.g. data warehouse for offline analysis, or for building custom indexes and caches
- “change data capture”
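A sketch of row-granularity logical log records and a change-data-capture consumer; the record field names here are invented for illustration:

```python
# Hypothetical row-level logical log: one record per affected row.
logical_log = [
    {"op": "insert", "table": "users", "row": {"id": 1, "name": "Alice"}},
    {"op": "update", "table": "users", "key": {"id": 1},
     "changed": {"name": "Alicia"}},                  # only the changed columns
    {"op": "delete", "table": "users", "key": {"id": 1}},  # enough to identify the row
]

# Change data capture: an external consumer (cache, search index, warehouse)
# can apply these records without knowing anything about disk blocks or WAL bytes.
cache = {}
for record in logical_log:
    row_id = (record.get("key") or record["row"])["id"]
    if record["op"] == "insert":
        cache[row_id] = dict(record["row"])
    elif record["op"] == "update":
        cache[row_id].update(record["changed"])
    elif record["op"] == "delete":
        cache.pop(row_id, None)
print(cache)  # {} - the row was inserted, renamed, then deleted
```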
Trigger-based Replication
- Moves replication up to the application layer (not implemented by the database system itself).
- trigger
- Lets you register custom application code that is automatically executed when a data change (write transaction) occurs in the database system.
- It can log the change into a separate table, from which it can be read by an external process.
- Flexible; however, it has greater overheads and is more prone to bugs and limitations than built-in replication.
Problems with Replication Lag
- read-scaling architecture: increase the capacity for serving read-only requests simply by adding more followers.
- Not practical with a fully synchronous configuration (a single node failure would make the whole system unavailable for writing).
- With asynchronous followers, reads may return outdated information if the follower has fallen behind - inconsistency.
- But the inconsistency is temporary - eventual consistency.
- The delay between a write happening on the leader and being reflected on a follower is called the replication lag.
Reading Your Own Writes
- A user wants to read data shortly after making a write, but the leader hasn't sent it to the replica yet.
- Read-after-write consistency is needed (also known as read-your-writes consistency).
- Possible techniques:
- When reading something the user may have modified, read it from the leader; otherwise, read from a follower. This requires some way of knowing whether something might have been modified, without actually querying it.
- If most things in the application are potentially editable by the user, use other criteria to decide whether to read from the leader, e.g. always read from the leader for one minute after the last update, or monitor the replication lag and prevent queries on any follower that is more than one minute behind the leader.
- The client can remember the timestamp of its most recent write and require that any follower serving its reads has caught up to that timestamp (a logical timestamp, or the actual system clock, which requires clock synchronization). See the sketch after this list.
- If your replicas are distributed across multiple datacenters, any request that needs to be served by the leader must be routed to the datacenter that contains the leader.
- cross-device read-after-write consistency
Problems:
- Remembering the timestamp becomes harder; the metadata needs to be centralized so all of the user's devices can see it.
- There is no guarantee that connections from different devices will be routed to the same datacenter.
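A sketch of the timestamp technique from the list above; the follower's `replicated_up_to` attribute is an assumed way of exposing replication progress:

```python
def choose_node_for_read(client_last_write_ts, leader, followers):
    """Serve a read-your-writes-consistent read.

    client_last_write_ts is the (logical) timestamp the client remembered
    from its most recent write; a follower may serve the read only if its
    replication progress has reached that timestamp.
    """
    for follower in followers:
        if follower.replicated_up_to >= client_last_write_ts:
            return follower   # this follower already contains the user's write
    return leader             # otherwise only the leader is guaranteed fresh
```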
Monotonic Reads
- Moving backward in time: a later read is routed to a replica with greater lag than the first.
- Monotonic reads: you will not see time go backward.
- Make sure each user always reads from the same replica; the replica can be chosen based on a hash of the user ID, rather than randomly.
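A minimal sketch of that replica-pinning idea; hashing the user ID gives a sticky but evenly spread assignment:

```python
import hashlib

def replica_for_user(user_id, replicas):
    """Always map the same user to the same replica, so successive reads
    cannot jump to a more lagged node (monotonic reads)."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]

# The same user is routed to the same replica on every read.
print(replica_for_user("user-42", ["replica-a", "replica-b", "replica-c"]))
```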
Consistent Prefix Reads
Reading from replicas with different lags can reorder causally related writes (e.g. an answer appears before its question in a conversation).
- Require that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order.
- Problematic in partitioned databases; one solution is to make sure that any writes that are causally related to each other are written to the same partition.
Solutions for Replication Lag
- Pretending that replication is synchronous when in fact it is asynchronous is a recipe for problems down the line.
- transactions: for a database to provide stronger guarantees so that the application can be simpler.
Multi-Leader Replication
Multi-leader configuration (master/master or active/active replication):
You have a leader in each datacenter; each leader replicates its changes to the leaders in the other datacenters and to its own followers.
Performance
- Every write can be processed in the local datacenter and is replicated asynchronously to the other datacenters, so perceived performance may be better.
- Tolerance of datacenter outages
- Each datacenter can continue operating independently of the others.
- Tolerance of network problems
- Traffic between datacenters usually goes over the public internet, which is less reliable than the local network within a datacenter. With multi-leader replication, a temporary network interruption does not prevent writes from being processed.
Clients with offline operation
When disconnected from the internet, every device keeps its own local copy and accepts writes (changes) offline, syncing with a server and with other devices when it is next online.
Collaborative editing
Make the unit of change very small (e.g. a single keystroke) and avoid locking.
Handling Write Conflicts
Synchronous versus asynchronous conflict detection
Detecting conflicts asynchronously means they are found too late for a user to resolve them at write time. Synchronous detection loses the main advantage of multi-leader replication (allowing each replica to accept writes independently); you might as well use single-leader replication.
Conflict avoidance
Ensure that all writes for a particular record go through the same leader.
But this requires that the datacenter the user writes to never fails and never changes (e.g. when the user moves closer to a different datacenter); otherwise, conflict avoidance breaks down.
Converging toward a consistent state
In a multi-leader configuration there is no defined ordering of writes, so replicas may apply them in different orders; if each replica simply took the last write it saw as the final value, the replicas would end up inconsistent.
- Every replication scheme must ensure that the data is eventually the same in all replicas; thus the database must resolve conflicts in a convergent way.
- Give each write a unique ID (timestamp, long random number, UUID, or a hash of the key and value) and let the write with the highest ID win - last write wins (LWW).
- Give each replica a unique ID, and let writes that originated at a higher-numbered replica always take precedence over writes that originated at a lower-numbered replica.
- Somehow merge the values together (B and C -> B/C).
- Record the conflict in an explicit data structure that preserves all information, and write application code that resolves the conflict at some later time.
Custom conflict resolution logic
The application code for conflict resolution logic:
- On write: as soon as the database system detects a conflict in the log of replicated changes, it calls the conflict handler (e.g. Bucardo).
- On read: all conflicting writes are stored; the next time the data is read, the multiple versions of the data are returned to the application, which may prompt the user or automatically resolve the conflict and write the result back (e.g. CouchDB).
Automatic conflict resolution
- Conflict-free replicated datatypes (CRDTs) are a family of data structures for sets, maps, ordered lists, counters, etc. that automatically resolve conflicts in sensible ways.
- Mergeable persistent data structures track history explicitly, similarly to the Git version control system, and use a three-way merge function.
- Operational transformation is the conflict resolution algorithm behind Etherpad and Google Docs, designed particularly for concurrent editing of an ordered list of items.
Multi-Leader Replication Topologies
- Circular topology
- Star topology
- All-to-all topology
In circular and star topologies, a write may need to pass through several nodes before it reaches all replicas. To prevent infinite replication loops, each node has a unique identifier, and in the replication log each write is tagged with the identifiers of all the nodes it has passed through. When a node receives a write tagged with its own identifier, it ignores it (see the sketch below).
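A sketch of that tagging rule; the node IDs, message shape, and topology links are assumptions for illustration:

```python
NODE_ID = "node-3"        # this replica's unique identifier (assumed)
local_store = {}
neighbor_nodes = []       # outgoing links in the circular/star topology (assumed)

def handle_replicated_write(write):
    """Apply a write and forward it, unless it has already passed through us."""
    if NODE_ID in write["seen_by"]:
        return                               # looped back around the ring: ignore
    local_store[write["key"]] = write["value"]
    write["seen_by"].append(NODE_ID)         # tag the write with our identifier
    for node in neighbor_nodes:
        node.handle_replicated_write(write)  # pass it along the topology
```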
- A problem with circular and star topologies: if one node fails, it can interrupt the flow of replication messages between the other nodes.
The denser the topology, the better the fault tolerance.
All-to-all topologies have a problem too: some network links may be faster than others, with the result that some replication messages may "overtake" others.
We must make sure all nodes process the insert first, and then the update (simply attaching a timestamp is not sufficient, because clocks cannot be trusted to be in sync).
- version vectors can be used
Leaderless replication
Leaderless replication was almost forgotten until Amazon used it for its in-house Dynamo system; Riak and Cassandra follow this Dynamo-style model.
In some implementations, the client directly sends its writes to several replicas, while in others a coordinator node does this on behalf of the client.
Writing to the Database When a Node Is Down
- In a leader-based configuration, continuing to process writes requires performing a failover.
- A leaderless configuration simply ignores the fact that one of the replicas missed the write. When the missed node comes back, reads from it may return stale (outdated) values. So read requests are also sent to several replica nodes in parallel, and version numbers are used to determine which responses are stale and which are up to date.
Read repair and anti-entropy
Two mechanisms let an unavailable node catch up on the writes it missed when it comes back:
- Read repair: when a client reads from several nodes in parallel, it can detect any stale responses, see which node has the newer value, and write the newer value back to the replicas that had fallen behind. This works well for values that are frequently read. (See the sketch below.)
- Anti-entropy process: a background process that constantly looks for differences in the data between replicas and copies any missing data from one replica to another. (This ensures that even data that is rarely read eventually gets updated.)
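A sketch of read repair over versioned values, assuming each replica exposes a hypothetical `get`/`put` interface returning and accepting a (version, value) pair:

```python
def read_with_repair(key, replicas):
    """Query all replicas, return the newest value, and repair stale copies."""
    responses = [(replica, replica.get(key)) for replica in replicas]
    newest_version, newest_value = max(resp for _, resp in responses)
    for replica, (version, _) in responses:
        if version < newest_version:
            # Write the newer value back to the replica that had fallen behind.
            replica.put(key, newest_version, newest_value)
    return newest_value
```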
Quorums for reading and writing
- If there are n replicas, every write must be confirmed by at least w nodes to be considered successful, and we must query at least r nodes for each read. As long as w + r > n (equivalently, r > n - w), we can expect an up-to-date value when reading. (See the sketch after this list.)
- Reads and writes that obey this inequality are called quorum reads and writes.
n, w, and r are typically configurable:
- If w < n, we can still process writes if a node is unavailable.
- If r < n, we can still process reads if a node is unavailable.
- Normally, reads and writes are always sent to all n replicas in parallel; the parameters w and r only determine how many nodes we wait for before considering the operation successful.
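A sketch of quorum reads and writes with n = 3, w = 2, r = 2; the per-node `try_put`/`try_get` interface (returning False/None on failure) is an assumption:

```python
N, W, R = 3, 2, 2   # w + r > n, so every read set overlaps every write set

def quorum_write(nodes, key, version, value):
    # Send the write to all n replicas, but wait only for w acknowledgements.
    acks = sum(1 for node in nodes if node.try_put(key, version, value))
    if acks < W:
        raise IOError(f"write failed: {acks} acks, {W} required")

def quorum_read(nodes, key):
    # Query all n replicas; succeed once r of them respond.
    responses = [node.try_get(key) for node in nodes]   # (version, value) or None
    responses = [resp for resp in responses if resp is not None]
    if len(responses) < R:
        raise IOError(f"read failed: quorum of {R} not reached")
    return max(responses)[1]   # the highest version number is the freshest
```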
Limitations of Quorum Consistency
- Usually w and r are chosen to be a majority (more than n/2) of nodes. But quorums only require that the sets of nodes used by the read and by the write overlap in at least one node.
You may also set w and r to smaller numbers, so that w + r <= n. You are then more likely to read stale values, but you get lower latency and higher availability.
Even with w + r > n, it is wise not to take the parameters as an absolute guarantee; there are edge cases (e.g. sloppy quorums, concurrent writes, partially failed writes).
Monitoring staleness
You need to be aware of the health of your replication.
- In leader-based replication, there are metrics for the replication lag.
- In leaderless replication, there is no fixed order in which writes are applied, which makes monitoring harder. It is not yet common practice, but it would be good to include staleness measurements in the standard set of metrics for databases.
Sloppy Quorums and Hinted Handoff
Databases with appropriately configured quorums can tolerate the failure of individual nodes without the need for failover.
In a large cluster, during a network interruption a client may be able to connect to some database nodes, but not to the nodes it needs to assemble a quorum. There is a trade-off:
- Return errors to all requests for which we cannot reach a quorum of w or r nodes?
- Or accept writes anyway, and write them to some nodes that are reachable - a sloppy quorum.
A sloppy quorum means writes and reads still require w and r successful responses, but those may include nodes that are not among the designated n "home" nodes for a value.
- Hinted handoff: once the network interruption is fixed, any writes that one node temporarily accepted on behalf of another node are sent to the appropriate "home" node.
- Sloppy quorums increase write availability, but even when w + r > n you cannot be sure to read the latest value, because the latest value may have been temporarily written to nodes outside of n; the guarantee only returns once the hinted handoff has completed.
Multi-datacenter operation
Leaderless replication is also suitable for multi-datacenter operation, since it is designed to tolerate conflicting concurrent writes, network interruptions, and latency spikes.
Detecting Concurrent Writes
Dynamo-style databases allow several clients to concurrently write to the same key, so conflicts occur; conflicts can also arise during read repair or hinted handoff.
- Events may arrive in different order at different nodes.
- For eventual consistency, the replicas should converge toward the same value.
Last write wins (discarding concurrent writes)
- Declare that each replica need only store the most "recent" value, and allow "older" values to be overwritten and discarded. But in a leaderless system there is no well-defined order in which the writes occurred, so "recent" is ill-defined.
- So we can force an arbitrary order on them: attach a timestamp to each write, pick the biggest timestamp as the most "recent", and discard the others.
- This achieves the goal of eventual convergence, but at the cost of durability (losing writes are silently discarded).
- LWW is safe only if a key is written once and thereafter treated as immutable; e.g. give every write a unique key such as a UUID.
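A minimal sketch of LWW resolution; it assumes timestamps are comparable across replicas, which is exactly where clock skew makes it unsafe:

```python
def lww_merge(candidates):
    """Resolve concurrent writes by keeping the value with the biggest timestamp.

    candidates is a list of (timestamp, value) pairs gathered from replicas.
    Every losing write is silently discarded: convergence at the cost of
    durability.
    """
    return max(candidates)[1]

# Two concurrent writes to the same key; the "later" timestamp wins.
print(lww_merge([(1700000000.1, "from replica A"), (1700000000.2, "from replica B")]))
```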
The "happens-before" relationship and concurrency
How to decide whether two operations are concurrent or not?
- An increment by B is based on an insertion by A -> B is causally dependent on A.
- Two clients write to the same key without knowing about each other's write -> there is no causal dependency between the operations.
- Two operations are concurrent if neither happens before the other.
For defining concurrency, exact time doesn't matter: we simply call two operations concurrent if they are both unaware of each other.
Capturing the happens-before relationship
The server keeps a version number for every key, increments it on every write, and stores the new version number along with the value written. A client must read a key before writing it, and include the version number from that read with its write; the server can then overwrite all values with that version number or below (the client had seen them), while keeping values with higher version numbers as concurrent siblings.
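A single-replica sketch of this algorithm; the method names are illustrative:

```python
class VersionedKey:
    """One key on one replica: version numbers capture happens-before."""

    def __init__(self):
        self.version = 0
        self.siblings = {}   # version number -> concurrently written value

    def read(self):
        # Return every current sibling, plus the version to quote when writing.
        return self.version, list(self.siblings.values())

    def write(self, value, based_on_version):
        # The write supersedes everything the client had seen (<= based_on_version);
        # values with higher version numbers were written concurrently and survive.
        self.version += 1
        self.siblings = {v: val for v, val in self.siblings.items()
                         if v > based_on_version}
        self.siblings[self.version] = value
        return self.version
```

A client that writes without reading first quotes version 0, so its value becomes a sibling of everything already present rather than overwriting anything.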
Merging concurrently written values
- called siblings in Riak.
- Take the union of the siblings to avoid losing data; but a deleted item would then reappear, so deletions need to be marked with a deletion marker (a tombstone).
Version vectors
Use a version number per replica as well as per key: each replica increments its own version number when processing a write, and also keeps track of the version numbers it has seen from each of the other replicas. The collection of version numbers from all replicas is called a version vector.
- A variant is the dotted version vector (used in Riak).
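A sketch of comparing two version vectors (represented here as dicts mapping replica ID to counter) to tell overwrites apart from concurrent siblings:

```python
def compare(vv_a, vv_b):
    """Decide the relationship between two version vectors."""
    replicas = set(vv_a) | set(vv_b)
    a_ahead = any(vv_a.get(r, 0) > vv_b.get(r, 0) for r in replicas)
    b_ahead = any(vv_b.get(r, 0) > vv_a.get(r, 0) for r in replicas)
    if a_ahead and b_ahead:
        return "concurrent"        # keep both values as siblings
    if a_ahead:
        return "a supersedes b"    # b's writes are all contained in a
    if b_ahead:
        return "b supersedes a"
    return "equal"

print(compare({"r1": 2, "r2": 0}, {"r1": 1, "r2": 1}))  # -> concurrent
```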