Cassandra's storage architecture is designed to manage large data volumes and revolves around some important factors:
Decentralized systems are systems that extract maximum throughput from each node. Cassandra offers decentralization by keeping an identical configuration on each node; there is no master-slave relationship between nodes. Data is spread across nodes, and each node is capable of serving read/write requests with the same efficiency.
A data center is a physical space where critical application data resides. Logically, a data center is made up of multiple racks, and each rack may contain multiple nodes.
Cassandra replicates data across the nodes based on the configured replication. If the replication factor is 1, one copy of the dataset will be available on one node only. If the replication factor is 2, two copies of each dataset will be made available on different nodes in the cluster. Cassandra still ensures data transparency: to an end user, data is served from one logical cluster. Cassandra offers two types of replication strategies.
Simple strategy is best suited for clusters involving a single data center, where data is replicated across different nodes based on the replication factor in a clockwise direction. With a replication factor of 3, two additional copies of each row will be placed on the next nodes in a clockwise direction.
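The clockwise placement described above can be sketched as follows. This is only an illustrative model of a token ring, not Cassandra's actual implementation; the node names and token values are hypothetical.

```python
import bisect

# Hypothetical nodes on the ring, as (token, name) pairs sorted by token.
ring = [(0, "node1"), (42, "node2"), (85, "node3"), (170, "node4")]

def replicas(row_token, replication_factor):
    """Pick the node owning row_token, then walk clockwise for extra copies."""
    tokens = [t for t, _ in ring]
    # First replica: the next node clockwise from the row's token position.
    start = bisect.bisect_right(tokens, row_token) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]

print(replicas(100, 3))  # ['node4', 'node1', 'node2']
```

With a replication factor of 3, the row lands on its owning node and the two nodes that follow it clockwise, wrapping around the end of the ring.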
Network topology strategy (NTS) is preferred when a cluster is made up of nodes spread across multiple data centers. With NTS, we can configure the number of replicas to be placed within each data center. Data colocation and avoiding a single point of failure are two important factors to prioritize while configuring the replication factor and consistency level. NTS identifies the first node based on the selected schema partitioning and then looks for a node in a different rack (in the same data center). If there is no such node, the replicas are placed on different nodes within the same rack. In this way, data colocation can be guaranteed by keeping a replica of a dataset in the same data center (to serve read requests locally), which also minimizes the risk of network latency. NTS depends on the snitch configuration for proper replica placement across different data centers.
A snitch relies upon the node IP address for grouping nodes within the network topology. Cassandra depends upon this information for routing data requests internally between nodes. The preferred snitch configurations for NTS are RackInferringSnitch and PropertyFileSnitch. We can configure the snitch in cassandra.yaml (the configuration file).
A data partitioning strategy is required to select the node for a given read/write request. Cassandra offers two types of partitioning strategies.
Random partitioning is the recommended partitioning scheme for Cassandra. Each node is assigned a 128-bit token value (initial_token for a node is defined in cassandra.yaml) generated by a one-way hashing (MD5) algorithm. The initial token determines the node's position in the ring, and a data range is assigned to the node. If the token value generated for a row key lies within the assigned range of a node, that particular node is responsible for serving the request. The following diagram is a common graphical representation of a number of nodes placed in a ring, with the data range evenly distributed between them:
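The token derivation described above can be sketched in a few lines. This is an illustration of the idea, not Cassandra's RandomPartitioner itself; the function name and the reduction of the 128-bit MD5 digest to the 0..2**127-1 token range are assumptions for the sketch.

```python
import hashlib

def token_for_key(row_key: bytes) -> int:
    """Derive a token from a row key via one-way MD5 hashing (illustrative)."""
    digest = hashlib.md5(row_key).digest()
    # Fold the 128-bit digest into the 0..2**127-1 token range.
    return int.from_bytes(digest, "big") % (2 ** 127)

# The same key always maps to the same token, so the same node serves it;
# different keys scatter uniformly across the ring.
assert token_for_key(b"user1") == token_for_key(b"user1")
print(token_for_key(b"user1") != token_for_key(b"user2"))
```

Because the hash spreads keys uniformly, each node's range receives a roughly equal share of rows without manual balancing.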
Ordered partitioning is useful when an application requires key distribution in a sorted manner. Here, the token value is the actual row key value. Ordered partitioning also allows you to perform range scans over row keys. However, with ordered partitioning, the key distribution might be uneven and may require load-balancing administration. It is certainly possible that the data for multiple column families gets unevenly distributed and the token range varies from one node to another. Hence, it is strongly recommended not to opt for ordered partitioning unless it is really required.
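The range-scan capability follows directly from keys being stored in sorted order. A minimal sketch of the idea (the key values here are hypothetical, and this is not Cassandra's ByteOrderedPartitioner code):

```python
import bisect

# With ordered partitioning, row keys themselves are kept in sorted order.
sorted_keys = sorted(["user1", "user2", "user3", "user9"])

def range_scan(start, end):
    """Return all row keys in [start, end] by slicing the sorted order."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_right(sorted_keys, end)
    return sorted_keys[lo:hi]

print(range_scan("user1", "user3"))  # ['user1', 'user2', 'user3']
```

A hashed token (as in random partitioning) destroys this ordering, which is why range scans over row keys are only practical with an ordered partitioner.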
Here, we will discuss how the Cassandra process writes a request and stores it on a disk:
As we have mentioned earlier, all nodes in Cassandra are peers; there is no master-slave configuration. Hence, on receiving a write request, a client can select any node to serve as the coordinator. The coordinator node is responsible for delegating the write request to the eligible nodes based on the cluster's partitioning strategy and replication factor. The write is first appended to a commit log and then applied to the corresponding memtable (see the preceding diagram). A memtable is an in-memory table that serves subsequent read requests without any lookup on disk; there is one memtable per column family. Once a memtable is full, its data is flushed to disk asynchronously in the form of SSTables. Once all of a commit log segment's writes have been flushed to disk, the segment is recycled. Periodically, Cassandra performs compaction over SSTables (sorted by row keys) and reclaims unused segments. In case of a node restart (unwanted scenarios such as failover), the commit log is replayed to recover any previous incomplete write requests.
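The write path above can be modeled as a short sketch: append to the commit log, update the memtable, and flush to an SSTable once the memtable fills. The class name, flush threshold, and in-memory stand-ins for the on-disk structures are all hypothetical simplifications.

```python
class ColumnFamilyStore:
    """Toy model of the write path: commit log -> memtable -> SSTable flush."""

    def __init__(self, flush_threshold=2):
        self.commit_log = []      # durable, append-only log (simulated)
        self.memtable = {}        # in-memory table serving recent reads
        self.sstables = []        # immutable on-disk tables (simulated)
        self.flush_threshold = flush_threshold

    def write(self, row_key, columns):
        self.commit_log.append((row_key, columns))             # 1. commit log first
        self.memtable.setdefault(row_key, {}).update(columns)  # 2. then memtable
        if len(self.memtable) >= self.flush_threshold:         # 3. flush when full
            self._flush()

    def _flush(self):
        # SSTables are written sorted by row key; the memtable is then reset.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

store = ColumnFamilyStore()
store.write("1", {"username": "user1"})
store.write("2", {"username": "user2"})      # second write triggers a flush
print(len(store.sstables), store.memtable)   # 1 {}
```

On restart, replaying `commit_log` from the last flushed position would rebuild any memtable contents that never reached an SSTable, which is exactly the recovery role the commit log plays.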
Cassandra provides a default command-line interface that is located at:
Before we proceed with the sample exercise, let’s have a look at the Cassandra schema:
A counter column stores a 64-bit signed integer. To create a counter column family, we simply need to define default_validation_class as CounterColumnType. Counter columns do have some application and technical limitations:
There are still some unresolved issues (https://issues.apache.org/jira/browse/CASSANDRA-4775), and it is recommended to consider the preceding limitations before opting for counter columns.
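One way to picture distributed counter semantics is that each replica tracks only its own contribution (a shard), and the counter's value is the sum of all shards. This is a simplified sketch with hypothetical node names, not Cassandra's internal counter implementation:

```python
# Each node keeps its own shard of the count; reads sum all shards.
shards = {"node1": 0, "node2": 0}

def increment(node, delta):
    """A replica applies an increment only to its local shard."""
    shards[node] += delta

def value():
    """The observable counter value is the sum of every node's shard."""
    return sum(shards.values())

increment("node1", 3)
increment("node2", 2)
print(value())  # 5
```

Because increments commute, replicas can apply them in any order and still converge; the limitations mentioned above stem from increments not being idempotent, so a retried increment can be applied twice.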
You can start a Cassandra server simply by running $CASSANDRA_HOME/bin/cassandra. If started in local mode, there is only one node. Once it has started successfully, you should see logs on your console, as follows:
$CASSANDRA_HOME/bin/cassandra-cli -host localhost -port 9160
This operation will create a keyspace cassandraSample with the node placement strategy SimpleStrategy and a replication factor of 1. By default, if you don't specify placement_strategy and strategy_options, it will opt for NTS, with replication on one data center:
create keyspace cassandraSample
    with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
    and strategy_options = {replication_factor:1};
We can look for available keyspaces by running the following command:
show keyspaces;
This will result in the following output:
update keyspace cassandraSample
    with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
    and strategy_options = {replication_factor:2};
update keyspace cassandraSample
    with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
    and strategy_options = {datacenter1:1};
Strategy options are in the format {datacentername:number of replicas}, and there can be multiple data centers.
use cassandraSample;
Use the following command to create column family users within the cassandraSample keyspace:
create column family users
    with key_validation_class = 'UTF8Type'
    and comparator = 'UTF8Type'
    and default_validation_class = 'UTF8Type';
To create a super column family suysers, you need to run the following command:
create column family suysers
    with key_validation_class = 'UTF8Type'
    and comparator = 'UTF8Type'
    and subcomparator = 'UTF8Type'
    and default_validation_class = 'UTF8Type'
    and column_type = 'Super'
    and column_metadata = [{column_name: name, validation_class: UTF8Type}];
key_validation_class: Defines the datatype for the row key
comparator: Defines the datatype for the column name
default_validation_class: Defines the datatype for the column value
subcomparator: Defines the datatype for subcolumns
// create a column named "username", with a value of "user1", for row key 1
set users[1][username] = user1;
// create a column named "password", with a value of "password1", for row key 1
set users[1][password] = password1;
// create a column named "username", with a value of "user2", for row key 2
set users[2][username] = user2;
// create a column named "password", with a value of "password2", for row key 2
set users[2][password] = password2;
// list all persisted rows within the column family
list users;
// fetch the row with row key value "1" from the users column family
get users[1];
// delete the column "username" for row key 1
del users[1][username];
If you want to change key_validation_class from UTF8Type to BytesType and validation_class for the password column from UTF8Type to BytesType, then type the following command:
update column family users
    with key_validation_class = BytesType
    and comparator = UTF8Type
    and column_metadata = [{column_name: password, validation_class: BytesType}];
truncate users;
drop column family users;
These are some basic operations that should give you a brief idea about how to create/manage the Cassandra schema.
Cassandra is schemaless, but CQL is useful when we need data modeling with a traditional RDBMS flavor. Cassandra provides two variants of CQL (2.0 and 3.0); we will use CQL 3.0 for a quick exercise, repeating operations similar to those we performed with the cassandra-cli interface.
$CASSANDRA_HOME/bin/cqlsh host port cqlversion
$CASSANDRA_HOME/bin/cqlsh localhost 9160 -3
create keyspace cassandrasample
    with strategy_class = 'SimpleStrategy'
    and strategy_options:replication_factor = 1;
To update the keyspace, run:
alter keyspace cassandrasample
    with strategy_class = 'NetworkTopologyStrategy'
    and strategy_options:datacenter = 1;
use cassandrasample;
describe keyspace cassandrasample;
create columnfamily users(user_id varchar PRIMARY KEY, username varchar, password varchar);
insert into users(user_id, username, password) values('1', 'user1', 'password1');
select * from users;
// delete the complete row for user_id '1'
delete from users where user_id='1';
// delete the age column from users for row key '1'
delete age from users where user_id='1';
Here are a few examples:
// add a new column
alter columnfamily users add age int;
// update column metadata
alter columnfamily users alter password type blob;
truncate users;
drop columnfamily users;
drop keyspace cassandrasample;
By default, the CQL shell converts column family and keyspace names to lowercase. You can ensure case sensitivity by wrapping these identifiers within double quotes (" ").
This article showed how to create and manage the Cassandra schema using the cassandra-cli and CQL interfaces.