
In this article by Robert Strickland, author of the book Cassandra 3.x High Availability – Second Edition, we will discuss how, for any given call, it is possible to achieve either strong consistency or eventual consistency. With strong consistency, we know for certain that the copy of the data Cassandra returns is the latest. With eventual consistency, the data returned may or may not be the latest, or no data may be returned at all if the node is unaware of newly inserted data. Under eventual consistency, it is also possible to see deleted data if the node you’re reading from has not yet received the delete request.


Depending on the read_repair_chance setting and the consistency level chosen for the read operation, Cassandra might block the client and resolve the conflict immediately, or this might occur asynchronously. If data in conflict is never requested, the system will resolve the conflict the next time nodetool repair is run.

How does Cassandra know there is a conflict? Every column has three parts: key, value, and timestamp. Cassandra follows last-write-wins semantics, which means that the column with the latest timestamp always takes precedence.
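To make last-write-wins concrete, here is a minimal sketch of the resolution rule. The dict layout is illustrative only, not Cassandra's internal cell format:

```python
# Minimal sketch of Cassandra's last-write-wins conflict resolution:
# when replicas disagree, the cell with the latest timestamp is kept.
def last_write_wins(replicas):
    return max(replicas, key=lambda cell: cell["timestamp"])

conflicting = [
    {"key": "user:42", "value": "alice@old.example", "timestamp": 1000},
    {"key": "user:42", "value": "alice@new.example", "timestamp": 2000},
]
print(last_write_wins(conflicting)["value"])  # alice@new.example
```

Note that this rule depends on well-synchronized clocks: if a writer's clock lags, its update can silently lose to an older one.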

Now, let’s discuss one of the most important knobs a developer can turn to determine the consistency characteristics of their reads and writes.

Consistency levels

On every read and write operation, the caller must specify a consistency level, which lets Cassandra know what level of consistency to guarantee for that one call. The following table details the various consistency levels and their effects on both read and write operations:

Consistency level: ANY
  Reads: This is not supported for reads.
  Writes: Data must be written to at least one node, but permits writes via hinted handoff. Effectively allows a write to any node, even if all nodes containing the replica are down. A subsequent read might be impossible if all replica nodes are down.

Consistency level: ONE
  Reads: The replica from the closest node will be returned.
  Writes: Data must be written to at least one replica node (both commit log and memtable). Unlike ANY, hinted handoff writes are not sufficient.

Consistency level: TWO
  Reads: The replicas from the two closest nodes will be returned.
  Writes: The same as ONE, except two replicas must be written.

Consistency level: THREE
  Reads: The replicas from the three closest nodes will be returned.
  Writes: The same as ONE, except three replicas must be written.

Consistency level: QUORUM
  Reads: Replicas from a quorum of nodes will be compared, and the replica with the latest timestamp will be returned.
  Writes: Data must be written to a quorum of replica nodes (both commit log and memtable) in the entire cluster, including all data centers.

Consistency level: SERIAL
  Reads: Permits reading uncommitted data as long as it represents the current state. Any uncommitted transactions will be committed as part of the read.
  Writes: Similar to QUORUM, except that writes are conditional, relying on Cassandra's support for lightweight transactions.

Consistency level: LOCAL_ONE
  Reads: Similar to ONE, except that the read will be returned by the closest replica in the local data center.
  Writes: Similar to ONE, except that the write must be acknowledged by at least one node in the local data center.

Consistency level: LOCAL_QUORUM
  Reads: Similar to QUORUM, except that only replicas in the local data center are compared.
  Writes: Similar to QUORUM, except that the quorum must only be met using the local data center.

Consistency level: LOCAL_SERIAL
  Reads: Similar to SERIAL, except only local replicas are used.
  Writes: Similar to SERIAL, except only writes to local replicas must be acknowledged.

Consistency level: EACH_QUORUM
  Reads: The opposite of LOCAL_QUORUM; requires each data center to produce a quorum of replicas, then returns the replica with the latest timestamp.
  Writes: The opposite of LOCAL_QUORUM; requires a quorum of replicas to be written in each data center.

Consistency level: ALL
  Reads: Replicas from all nodes in the entire cluster (including all data centers) will be compared, and the replica with the latest timestamp will be returned.
  Writes: Data must be written to all replica nodes (both commit log and memtable) in the entire cluster, including all data centers.
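The single-data-center levels above reduce to a simple count of required replica acknowledgements. The helper below is a rough model for illustration (the function name is invented, not part of any Cassandra driver):

```python
# A rough model of how many replica acknowledgements each write
# consistency level demands, for a single data center.
def write_acks_required(level: str, replication_factor: int) -> int:
    quorum = replication_factor // 2 + 1  # strict majority of replicas
    table = {
        "ANY": 1,      # a hinted handoff to any node suffices
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": quorum,
        "ALL": replication_factor,
    }
    return table[level]

print(write_acks_required("QUORUM", 3))  # 2
print(write_acks_required("ALL", 5))     # 5
```

Note that at a replication factor of 2, QUORUM and ALL both require two acknowledgements, which is why QUORUM offers no extra fault tolerance there.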

As you can see, there are numerous combinations of read and write consistency levels, all with different ultimate consistency guarantees. To illustrate this point, let’s assume that you would like to guarantee absolute consistency for all read operations. On the surface, it might seem as if you would have to read with a consistency level of ALL, thus sacrificing availability in the case of node failure.

But there are alternatives depending on your use case. There are actually two additional ways to achieve strong read consistency:

  • Write with consistency level of ALL: This has the advantage of allowing the read operation to be performed using ONE, which lowers the latency for that operation. On the other hand, it means the write operation will result in UnavailableException if one of the replica nodes goes offline.
  • Read and write with QUORUM or LOCAL_QUORUM: Since QUORUM and LOCAL_QUORUM both require a majority of nodes, using this level for both the write and the read will result in a full consistency guarantee (in the same data center when using LOCAL_QUORUM), while still maintaining availability during a node failure.

You should carefully consider each use case to determine what guarantees you actually require. For example, there might be cases where a lost write is acceptable, or occasions where a read need not be absolutely current. At times, it might be sufficient to write with a level of QUORUM, then read with ONE to achieve maximum read performance, knowing you might occasionally and temporarily return stale data. Cassandra gives you this flexibility, but it’s up to you to determine how to best employ it for your specific data requirements. A good rule of thumb to attain strong consistency is that the read consistency level plus write consistency level should be greater than the replication factor.
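The rule of thumb above can be sketched as a one-line check: strong consistency holds when the read set and write set must overlap in at least one replica. The helper name is invented for illustration:

```python
# Sketch of the rule of thumb from the text: a read is strongly
# consistent when (replicas read) + (replicas written) exceeds the
# replication factor, because the two sets must then overlap in at
# least one node holding the latest write.
def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    return read_replicas + write_replicas > replication_factor

rf = 3
quorum = rf // 2 + 1  # 2
print(is_strongly_consistent(quorum, quorum, rf))  # True: QUORUM + QUORUM
print(is_strongly_consistent(1, 1, rf))            # False: ONE + ONE is eventual
```

The same check explains the two alternatives listed above: write ALL plus read ONE gives 3 + 1 > 3, just as QUORUM plus QUORUM gives 2 + 2 > 3.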

If you are unsure about which consistency levels to use for your specific use case, it’s typically safe to start with LOCAL_QUORUM (or QUORUM for a single data center) reads and writes. This configuration offers strong consistency guarantees and good performance while allowing for the inevitable replica failure.

It is important to understand that even if you choose levels that provide less stringent consistency guarantees, Cassandra will still perform anti-entropy operations asynchronously in an attempt to keep replicas up to date.

Repairing data

Cassandra employs a multifaceted anti-entropy mechanism that keeps replicas in sync. Data repair operations generally fall into three categories:

  • Synchronous read repair: When a read operation requires comparing multiple replicas, Cassandra will initially request a checksum from the other nodes. If the checksum doesn’t match, the full replica is sent and compared with the local version. The replica with the latest timestamp will be returned and the old replica will be updated. This means that in normal operations, old data is repaired when it is requested.
  • Asynchronous read repair: Each table in Cassandra has a setting called read_repair_chance (as well as its related setting, dclocal_read_repair_chance), which determines how the system treats replicas that are not compared during a read. The default setting of 0.1 means that 10 percent of the time, Cassandra will also repair the remaining replicas during read operations.
  • Manually running repair: A full repair (using nodetool repair) should be run regularly to clean up any data that has been missed as part of the previous two operations. At a minimum, it should be run once every gc_grace_seconds, which is set in the table schema and defaults to 10 days.
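The asynchronous case is simply a probabilistic decision made on each read. The following toy simulation (not Cassandra's actual code path) shows how a read_repair_chance of 0.1 plays out over many reads:

```python
import random

# Toy simulation of the asynchronous read-repair decision: on each read,
# Cassandra 3.x rolls against the table's read_repair_chance to decide
# whether to also repair replicas not involved in that read.
def triggers_async_repair(rng: random.Random,
                          read_repair_chance: float = 0.1) -> bool:
    return rng.random() < read_repair_chance

rng = random.Random(42)  # fixed seed so the ratio below is reproducible
reads = 100_000
repairs = sum(triggers_async_repair(rng) for _ in range(reads))
ratio = repairs / reads
print(ratio)  # roughly the configured 0.1
```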

One might ask what the consequence would be of failing to run a repair operation within the window specified by gc_grace_seconds. The answer relates to Cassandra’s mechanism to handle deletes. As you might be aware, all modifications (or mutations) are immutable, so a delete is really just a marker telling the system not to return that record to any clients. This marker is called a tombstone.

During compaction, Cassandra garbage collects tombstones that are older than gc_grace_seconds. This is why the repair deadline matters: if a node misses a delete (because it was down, for example) and the surviving replicas purge their tombstones before that node is repaired, the node still holds the old value with no tombstone left to refute it, and the deleted data can reappear unexpectedly. In general, deletes should be avoided when possible, as the unfettered buildup of tombstones can cause significant issues.
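The failure mode can be illustrated with a toy model. Timestamps are simplified to integers, and the dict layout is invented for the example; this mimics the mechanism described above, not Cassandra's storage engine:

```python
# Toy model of why missing a repair inside gc_grace_seconds is dangerous.
GC_GRACE = 10  # simplified "gc_grace_seconds" in abstract time units

def resolve(cells):
    """Last-write-wins: keep the cell with the latest timestamp."""
    return max(cells, key=lambda c: c["ts"])

# Replica A saw the delete (a tombstone); replica B was down and missed it.
a = {"value": None, "tombstone": True, "ts": 100}
b = {"value": "stale", "tombstone": False, "ts": 50}

# Repaired in time: the tombstone wins and the delete propagates to B.
print(resolve([a, b])["tombstone"])  # True

# Repaired too late: A's tombstone was purged at compaction once
# gc_grace expired, so only B's stale cell remains and it "reappears".
now = 200
survivors = [c for c in (a, b)
             if not (c["tombstone"] and now - c["ts"] > GC_GRACE)]
print(resolve(survivors)["value"])  # stale
```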

In the course of normal operations, Cassandra repairs old replicas when their records are requested. Read repair is therefore lazy: it occurs only when the affected data is actually read.

With all these options for replication and consistency, it can seem daunting to choose the right combination for a given use case. Let’s take a closer look at this balance to help bring some additional clarity to the topic.

Balancing the replication factor with consistency

There are many considerations when choosing a replication factor, including availability, performance, and consistency. Since our topic is high availability, let’s presume your desire is to maintain data availability in the case of node failure.

It’s important to understand exactly what your failure tolerance is, and this will likely be different depending on the nature of the data. The definition of failure is probably going to vary among use cases as well, as one case might consider data loss a failure, whereas another accepts data loss as long as all queries return.

Achieving the desired availability, consistency, and performance targets requires coordinating your replication factor with your application’s consistency level configurations. In order to assist you in your efforts to achieve this balance, let’s consider a single data center cluster of 10 nodes and examine the impact of various configuration combinations (where RF corresponds to the replication factor):

RF 1, write CL ONE, QUORUM, or ALL, read CL ONE, QUORUM, or ALL
  Consistency: consistent
  Availability: doesn’t tolerate any replica loss
  Use cases: data can be lost and availability is not critical, such as analysis clusters

RF 2, write CL ONE, read CL ONE
  Consistency: eventual
  Availability: tolerates loss of one replica
  Use cases: maximum read performance and low write latencies are required, and sometimes returning stale data is acceptable

RF 2, write CL QUORUM or ALL, read CL ONE
  Consistency: consistent
  Availability: tolerates loss of one replica on reads, but none on writes
  Use cases: read-heavy workloads where some downtime for data ingest is acceptable (improves read latencies)

RF 2, write CL ONE, read CL QUORUM or ALL
  Consistency: consistent
  Availability: tolerates loss of one replica on writes, but none on reads
  Use cases: write-heavy workloads where read consistency is more important than availability

RF 3, write CL ONE, read CL ONE
  Consistency: eventual
  Availability: tolerates loss of two replicas
  Use cases: maximum read and write performance are required, and sometimes returning stale data is acceptable

RF 3, write CL QUORUM, read CL ONE
  Consistency: eventual
  Availability: tolerates loss of one replica on writes and two on reads
  Use cases: read throughput and availability are paramount, while write performance is less important, and sometimes returning stale data is acceptable

RF 3, write CL ONE, read CL QUORUM
  Consistency: eventual
  Availability: tolerates loss of two replicas on writes and one on reads
  Use cases: low write latencies and availability are paramount, while read performance is less important, and sometimes returning stale data is acceptable

RF 3, write CL QUORUM, read CL QUORUM
  Consistency: consistent
  Availability: tolerates loss of one replica
  Use cases: consistency is paramount, while striking a balance between availability and read/write performance

RF 3, write CL ALL, read CL ONE
  Consistency: consistent
  Availability: tolerates loss of two replicas on reads, but none on writes
  Use cases: additional fault tolerance and consistency on reads is paramount, at the expense of write performance and availability

RF 3, write CL ONE, read CL ALL
  Consistency: consistent
  Availability: tolerates loss of two replicas on writes, but none on reads
  Use cases: low write latencies and availability are paramount, but read consistency must be guaranteed at the expense of performance and availability

RF 3, write CL ANY, read CL ONE
  Consistency: eventual
  Availability: tolerates loss of all replicas on writes and two on reads
  Use cases: maximum write and read performance and availability are paramount, and often returning stale data is acceptable (note that hinted writes are less reliable than the guarantees offered at CL ONE)

RF 3, write CL ANY, read CL QUORUM
  Consistency: eventual
  Availability: tolerates loss of all replicas on writes and one on reads
  Use cases: maximum write performance and availability are paramount, and sometimes returning stale data is acceptable

RF 3, write CL ANY, read CL ALL
  Consistency: consistent
  Availability: tolerates loss of all replicas on writes, but none on reads
  Use cases: write throughput and availability are paramount, and clients must all see the same data, even though they might not see all writes immediately

There are also two additional consistency levels, SERIAL and LOCAL_SERIAL, which can be used to read the latest value, even if it is part of an uncommitted transaction. Otherwise, they follow the semantics of QUORUM and LOCAL_QUORUM, respectively.

As you can see, there are numerous possibilities to consider when choosing these values, especially in a scenario involving multiple data centers. This discussion will give you greater confidence as you design your applications to achieve the desired balance.

Summary

In this article, we introduced the foundational concept of consistency. In our discussion, we outlined the importance of the relationship between replication factor and consistency level, and their impact on performance, data consistency, and availability.
