Cassandra’s storage architecture is designed to manage large data volumes and revolves around three important factors:

  • Decentralized systems
  • Data replication and transparency
  • Data partitioning

Decentralized systems are systems that provide maximum throughput from each node. Cassandra achieves decentralization by giving every node an identical configuration; there is no master-slave relationship between nodes. Data is spread across nodes, and each node is capable of serving read/write requests with the same efficiency.

A data center is a physical space where critical application data resides. Logically, a data center is made up of multiple racks, and each rack may contain multiple nodes.

Cassandra replication strategies

Cassandra replicates data across nodes according to the configured replication factor. A replication factor of 1 means that only one copy of each row exists, on a single node. A replication factor of 2 means that two copies of each row are kept on different nodes in the cluster. Cassandra still ensures data transparency, because the end user is served from one logical cluster. Cassandra offers two types of replication strategies.

Simple strategy

Simple strategy is best suited for clusters involving a single data center. Data is replicated across different nodes based on the replication factor, moving in a clockwise direction around the ring. With a replication factor of 3, two more copies of each row are placed on the next nodes in a clockwise direction.
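The clockwise walk can be sketched in a few lines of Python. This is a minimal illustration and not Cassandra's own code: the function name simple_strategy_replicas and the small integer tokens are hypothetical, chosen only to keep the ring easy to read.

```python
from bisect import bisect_left


def simple_strategy_replicas(ring_tokens, row_token, replication_factor):
    """Return the tokens of the nodes holding a row (hypothetical helper)."""
    tokens = sorted(ring_tokens)
    # The first replica is the first node whose token is >= the row's token,
    # wrapping around to the start of the ring when the end is passed.
    start = bisect_left(tokens, row_token) % len(tokens)
    # A cluster cannot hold more distinct replicas than it has nodes.
    count = min(replication_factor, len(tokens))
    # Remaining replicas are the next nodes in clockwise order.
    return [tokens[(start + i) % len(tokens)] for i in range(count)]


ring = [0, 25, 50, 75]  # hypothetical node tokens on a tiny ring
print(simple_strategy_replicas(ring, 30, 3))  # prints [50, 75, 0]
```

With a replication factor of 3, the row with token 30 lands on the node at 50 and is copied to the next two nodes clockwise, wrapping past the end of the ring.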

Network topology strategy

Network topology strategy (NTS) is preferred when a cluster is made up of nodes spread across multiple data centers. With NTS, we can configure the number of replicas to place within each data center. Data colocation and the absence of a single point of failure are the two priorities to keep in mind while configuring the replication factor and consistency level. NTS identifies the first node based on the selected partitioning scheme and then looks for nodes in a different rack (in the same data center). If there is no such node, the replicas are placed on different nodes within the same rack. In this way, data colocation can be guaranteed by keeping a replica of the dataset in the same data center (to serve read requests locally), which also minimizes the risk of network latency. NTS depends on the snitch configuration for proper replica placement across data centers.
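A simplified sketch of the rack-aware selection described above might look as follows. It is an assumption-laden model rather than the actual NetworkTopologyStrategy implementation: nodes are hypothetical (token, data center, rack) tuples, and nts_replicas is an illustrative name.

```python
def nts_replicas(nodes, row_token, replicas_per_dc):
    """Pick replicas per data center, preferring distinct racks.

    nodes: list of hypothetical (token, data_center, rack) tuples.
    replicas_per_dc: mapping such as {"dc1": 2, "dc2": 1}.
    """
    ordered = sorted(nodes)
    # Start the clockwise walk at the first node with token >= row_token.
    start = next((i for i, n in enumerate(ordered) if n[0] >= row_token), 0)
    walk = ordered[start:] + ordered[:start]

    placement = {}
    for dc, wanted in replicas_per_dc.items():
        chosen, skipped, seen_racks = [], [], set()
        for node in (n for n in walk if n[1] == dc):
            if node[2] not in seen_racks:
                # First node seen on this rack: preferred replica.
                chosen.append(node)
                seen_racks.add(node[2])
            else:
                skipped.append(node)
        # Fall back to nodes on already-used racks if racks run out.
        placement[dc] = (chosen + skipped)[:wanted]
    return placement


cluster = [(0, "dc1", "r1"), (10, "dc1", "r1"), (20, "dc1", "r2"), (30, "dc2", "r1")]
print(nts_replicas(cluster, 5, {"dc1": 2, "dc2": 1}))
```

In this example the two dc1 replicas end up on racks r1 and r2 rather than both on r1, and dc2 receives its own local copy.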

A snitch tells Cassandra which data center and rack each node belongs to, so that nodes can be grouped within the network topology. Cassandra depends upon this information for routing data requests internally between nodes. The preferred snitch configurations for NTS are RackInferringSnitch, which infers the grouping from the node IP address, and PropertyFileSnitch, which reads it from a properties file. We can configure the snitch in cassandra.yaml (the configuration file).
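To illustrate IP-based grouping, the following sketch treats the second octet of an IPv4 address as the data center and the third octet as the rack, which is the convention RackInferringSnitch follows. The function names and the sample addresses are hypothetical.

```python
def infer_location(ip_address):
    """Infer (data_center, rack) from a dotted IPv4 address."""
    octets = ip_address.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted IPv4 address: %r" % ip_address)
    # Second octet identifies the data center, third octet the rack.
    return octets[1], octets[2]


def group_by_location(ip_addresses):
    """Build a {data_center: {rack: [ip, ...]}} view of the cluster."""
    topology = {}
    for ip in ip_addresses:
        dc, rack = infer_location(ip)
        topology.setdefault(dc, {}).setdefault(rack, []).append(ip)
    return topology


print(group_by_location(["10.1.1.5", "10.1.2.6", "10.2.1.7"]))
```

A scheme like this only works when the network addressing plan really mirrors the physical layout; otherwise an explicit mapping, as used by PropertyFileSnitch, is the safer choice.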

Data partitioning

A data partitioning strategy is required to select the node for a given read/write request. Cassandra offers two types of partitioning strategies.

Random partitioning

Random partitioning is the recommended partitioning scheme for Cassandra. The token for a row is generated from its row key by a one-way hashing (MD5) algorithm, which produces a 128-bit value. Each node is assigned an initial token value (initial_token in cassandra.yaml) that determines its position in the ring, and a data range is assigned to the node accordingly. If the token generated for the row key of a read/write request lies within the range assigned to a node, then that node is responsible for serving the request. A cluster is commonly drawn as a ring of nodes, with the data range evenly distributed between them.
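The token lookup can be approximated in Python as shown below. This is a sketch under stated assumptions: it treats the MD5 digest as an unsigned 128-bit integer and spaces node tokens evenly over that range, whereas Cassandra's own token arithmetic differs in detail. The names md5_token, owner, and evenly_spaced_tokens are illustrative.

```python
import hashlib
from bisect import bisect_left


def md5_token(row_key):
    """Map a row key to an integer token using MD5 (simplified)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def evenly_spaced_tokens(node_count, ring_size=2 ** 128):
    """Compute initial tokens that split the ring into equal ranges."""
    return [i * ring_size // node_count for i in range(node_count)]


def owner(node_tokens, row_key):
    """Return the token of the node responsible for the given row key."""
    tokens = sorted(node_tokens)
    # The owner is the first node whose token is >= the row's token,
    # wrapping around to the first node at the end of the ring.
    index = bisect_left(tokens, md5_token(row_key)) % len(tokens)
    return tokens[index]


tokens = evenly_spaced_tokens(4)
for key in ("user1", "user2", "user3"):
    print(key, "->", owner(tokens, key))
```

Because the hash scatters keys uniformly, adjacent row keys such as user1 and user2 usually end up on unrelated nodes, which is what keeps the load balanced.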

Ordered partitioning

Ordered partitioning is useful when an application requires key distribution in a sorted manner. Here, the token value is the actual row key value. Ordered partitioning also allows you to perform range scans over row keys. However, with ordered partitioning, key distribution might be uneven and may require load balancing administration. It is certainly possible that the data for multiple column families may get unevenly distributed and the token range may vary from one node to another. Hence, it is strongly recommended not to opt for ordered partitioning unless it is really required.
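The trade-off can be seen in a small sketch where the token is the row key itself. The helper names, node tokens, and sample keys below are hypothetical and serve only to show why range scans are easy and why the load can become skewed.

```python
from bisect import bisect_left, bisect_right


def range_scan(sorted_keys, start_key, end_key):
    """Return all keys in [start_key, end_key] from a sorted key list."""
    lo = bisect_left(sorted_keys, start_key)
    hi = bisect_right(sorted_keys, end_key)
    return sorted_keys[lo:hi]


def keys_per_node(node_tokens, keys):
    """Count keys owned by each node when the token is the key itself."""
    tokens = sorted(node_tokens)
    counts = {token: 0 for token in tokens}
    for key in keys:
        counts[tokens[bisect_left(tokens, key) % len(tokens)]] += 1
    return counts


keys = sorted(["alice", "amy", "andy", "bob", "tom"])
print(range_scan(keys, "am", "b"))            # prints ['amy', 'andy']
print(keys_per_node(["g", "n", "z"], keys))   # prints {'g': 4, 'n': 0, 'z': 1}
```

Four of the five keys fall on the first node simply because most names start with early letters, which is the kind of hot spot that requires manual load balancing.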

Cassandra write path

Here, we will discuss how Cassandra processes a write request and stores it on disk.

As mentioned earlier, all nodes in Cassandra are peers and there is no master-slave configuration. Hence, when sending a write request, a client can select any node to serve as the coordinator. The coordinator node is responsible for delegating the write request to the eligible nodes based on the cluster’s partitioning strategy and replication factor. On each of those nodes, the write is first appended to a commit log and then applied to the corresponding memtable. A memtable is an in-memory table, which serves subsequent read requests without any lookup on disk. There is one memtable for each column family. Once a memtable is full, its data is flushed asynchronously to disk in the form of SSTables. Once all the data in a commit log segment has been flushed to disk, the segment is recycled. Periodically, Cassandra performs compaction over SSTables (sorted by row keys) and reclaims unused space. If a node restarts (for example, after an unexpected failure), the commit log is replayed to recover any writes that had not yet been flushed.
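The sequence of commit log, memtable, and SSTable can be modelled with a toy class. This is a deliberately simplified sketch: the class name, the row-count flush threshold, and the synchronous flush are assumptions made for readability, whereas a real node flushes asynchronously and manages the commit log in segments.

```python
class ColumnFamilyStore:
    """Toy model of the write path: commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # hypothetical threshold (rows)
        self.commit_log = []  # append-only record of writes not yet flushed
        self.memtable = {}    # in-memory rows for this column family
        self.sstables = []    # immutable tables, each sorted by row key

    def write(self, row_key, columns):
        # 1. Append to the commit log first, for durability.
        self.commit_log.append((row_key, dict(columns)))
        # 2. Apply the write to the in-memory memtable.
        self.memtable.setdefault(row_key, {}).update(columns)
        # 3. Flush once the memtable is full.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as a new SSTable sorted by row key.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        # The flushed writes are durable, so the log can be recycled.
        self.commit_log = []

    def replay(self):
        """Rebuild the memtable from the commit log after a restart."""
        self.memtable = {}
        for row_key, columns in self.commit_log:
            self.memtable.setdefault(row_key, {}).update(columns)


store = ColumnFamilyStore(memtable_limit=2)
store.write("1", {"username": "user1"})
store.memtable = {}   # simulate losing memory in a crash
store.replay()        # the commit log restores the unflushed write
print(store.memtable)
```

Writing to the log before the memtable is what makes the replay step possible: anything acknowledged but not yet flushed can be rebuilt from the log.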

Hands on with the Cassandra command-line interface

Cassandra provides a default command-line interface that is located at:

  • CASSANDRA_HOME/bin/cassandra-cli on Linux
  • CASSANDRA_HOME/bin/cassandra-cli.bat on Windows

Before we proceed with the sample exercise, let’s have a look at the Cassandra schema:

  • Keyspace: A keyspace may contain multiple column families; similarly, a cluster (made up of multiple nodes) can contain multiple keyspaces.
  • Column family: A column family is a collection of rows with defined column metadata. Cassandra offers different ways to define two types of column families, namely, static and dynamic column families.
    • Static column family: A static column family contains a predefined set of columns with metadata. Please note that a predefined set of columns may exist, but the number of columns can vary across multiple rows within the column family.
    • Dynamic column family: A dynamic column family generally defines a comparator type and validation class for all columns instead of individual column metadata. The client application is responsible for providing columns for a particular row key, which means the column names and values may differ across multiple row keys.

  • Column: A column can be attributed as a cell, which contains a name, value, and timestamp.
  • Super column: A super column is similar to a column and contains a name, value, and timestamp, except that a super column value may contain a collection of columns. Super columns cannot be sorted; however, subcolumns within super columns can be sorted by defining a subcomparator. Super columns do have some limitations; for example, secondary indexes over super columns are not possible. Also, it is not possible to read a particular super column without deserializing the subcolumns it wraps. Because of such limitations, usage of super columns is highly discouraged within the Cassandra community. Composite columns can provide the same functionality, and they are covered in detail in later articles.

  • Counter column family: Since version 0.8, Cassandra has supported counter columns. Counter columns are useful for applications that perform the following:
    • Maintain the page count for the website
    • Do aggregation based on a column value from another column family

    A counter column is a 64-bit signed integer. To create a counter column family, we simply need to define default_validation_class as CounterColumnType. Counter columns do have some application and technical limitations:

    • In case of events, such as disk failure, it is not possible to replay a column family containing counters without reinitializing and removing all the data
    • Secondary indexes over counter columns are not supported in Cassandra
    • Frequent insert/delete operations over the counter column in a short period of time may result in inconsistent counter values

    There are still some unresolved issues (https://issues.apache.org/jira/browse/CASSANDRA-4775), and it is recommended to consider the preceding limitations before opting for counter columns.

    You can start a Cassandra server simply by running $CASSANDRA_HOME/bin/cassandra. If started in the local mode, there is only one node. Once the server has started successfully, you should see startup logs on your console.

  • Cassandra-cli: The Cassandra distribution provides, by default, a command-line utility (cassandra-cli), which can be used for basic DDL/DML operations. You can connect to a local or remote Cassandra server instance by specifying the host and port options, as follows:

    $CASSANDRA_HOME/bin/cassandra-cli -host localhost -port 9160

Performing DDL/DML operations on the column family

  • First, we need to create a keyspace using the create keyspace command.

    The following operation creates a keyspace named cassandraSample with SimpleStrategy as the node placement strategy and a replication factor of one. By default, if you don’t specify placement_strategy and strategy_options, it will opt for NTS, where replication will be on one data center:

    create keyspace cassandraSample
    with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
    and strategy_options = {replication_factor:1};

    We can look for available keyspaces by running the following command:

    show keyspaces;

    This lists the keyspaces defined in the cluster along with their configuration.

  • We can always update the keyspace for configurations, such as replication factor. To update the keyspace, do the following:
    • Modify the replication factor: You can update a keyspace for changing the replication factor as well as the placement strategy. For example, to change a replication factor to 2 for cassandraSample, you simply need to execute the following command:

      update keyspace cassandraSample
      with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
      and strategy_options = {replication_factor:2};

    • Modify the placement strategy: You can change the placement strategy to NTS by executing the following command:

      update keyspace cassandraSample
      with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
      and strategy_options = {datacenter1:1};

      Strategy options are in the format {datacentername:number of replicas}, and there can be multiple datacenters.

  • After successfully creating a keyspace, and before proceeding with other DDL operations (for example, column family creation), we need to select (authenticate to) the keyspace we want to work in. We do this with the following command:

    use cassandraSample;

  • Create a column family/super column family as follows:

    Use the following command to create column family users within the cassandraSample keyspace:

    create column family users
    with key_validation_class = 'UTF8Type'
    and comparator = 'UTF8Type'
    and default_validation_class = 'UTF8Type';

    To create a super column family suysers, you need to run the following command:

    create column family suysers
    with key_validation_class = 'UTF8Type'
    and comparator = 'UTF8Type'
    and subcomparator = 'UTF8Type'
    and default_validation_class = 'UTF8Type'
    and column_type = 'Super'
    and column_metadata = [{column_name: name, validation_class: UTF8Type}];

    key_validation_class: It defines the datatype for the row key

    comparator: It defines the datatype for the column name

    default_validation_class: It defines the datatype for the column value

    subcomparator: It defines the datatype for subcolumns.

  • You can create/update a column by using the set method as follows:

    // create a column named "username", with a value of "user1" for row key 1
    set users[1][username] = user1;
    // create a column named "password", with a value of "password1" for row key 1
    set users[1][password] = password1;
    // create a column named "username", with a value of "user2" for row key 2
    set users[2][username] = user2;
    // create a column named "password", with a value of "password2" for row key 2
    set users[2][password] = password2;

  • To fetch all the rows and columns from a column family, execute the following command:

    // to list all persisted rows within a column family
    list users;
    // to fetch the row with row key value "1" from the users column family
    get users[1];

  • You can delete a column as follows:

    // to delete a column "username" for row key 1;
    del users[1][username];

  • To update the column family, do the following:

    If you want to change key_validation_class from UTF8Type to BytesType and validation_class for the password column from UTF8Type to BytesType, then type the following command:

    update column family users
    with key_validation_class = BytesType
    and comparator = UTF8Type
    and column_metadata = [{column_name: password, validation_class: BytesType}];

  • To drop/truncate the column family, follow the ensuing steps:
    1. Delete all the data rows from a column family users, as follows:

      truncate users;

    2. Drop a column family by issuing the following command:

      drop column family users;

      These are some basic operations that should give you a brief idea about how to create/manage the Cassandra schema.
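As a mental model for the set, get, del, and truncate commands above, a column family can be pictured as a map from row keys to columns, where each column carries a name, a value, and a timestamp. The Python class below is a hypothetical illustration of that model only; it is not a client API, and real deletes are more involved than removing a dictionary entry.

```python
import time


class ColumnFamily:
    """Toy column family: row key -> column name -> (value, timestamp)."""

    def __init__(self):
        self.rows = {}

    def set(self, row_key, column, value, timestamp=None):
        # Every column stores a timestamp alongside its name and value.
        if timestamp is None:
            timestamp = int(time.time() * 1_000_000)  # microseconds
        self.rows.setdefault(row_key, {})[column] = (value, timestamp)

    def get(self, row_key):
        # Return only names and values, hiding the timestamps.
        columns = self.rows.get(row_key, {})
        return {name: value for name, (value, _) in columns.items()}

    def delete(self, row_key, column):
        self.rows.get(row_key, {}).pop(column, None)

    def truncate(self):
        # Remove every row but keep the column family itself.
        self.rows.clear()


users = ColumnFamily()
users.set("1", "username", "user1")
users.set("1", "password", "password1")
users.set("2", "username", "user2")
users.delete("1", "username")
print(users.get("1"))  # prints {'password': 'password1'}
print(users.get("2"))  # prints {'username': 'user2'}
```

Note that rows 1 and 2 end up with different sets of columns, which is exactly the flexibility that dynamic column families rely on.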

Cassandra Query Language

Cassandra is schemaless, but CQL is useful when we need data modeling with a traditional RDBMS flavor. Cassandra provides two variants of CQL (2.0 and 3.0). We will use CQL 3.0 for a quick exercise that repeats the operations we performed with the cassandra-cli interface.

  • The command to connect with cql is as follows:

    $CASSANDRA_HOME/bin/cqlsh host port cqlversion

  • You can connect to localhost on port 9160 by executing the following command:

    $CASSANDRA_HOME/bin/cqlsh localhost 9160 -3

  • After successfully connecting to the command-line CQL client, you can create the keyspace as follows:

    create keyspace cassandrasample with strategy_class='SimpleStrategy'
    and strategy_options:replication_factor=1;

  • You can update the keyspace as follows:

    alter keyspace cassandrasample with strategy_class='NetworkTopologyStrategy'
    and strategy_options:datacenter=1;

  • Before creating any column family and storing data, we need to select the keyspace (for example, cassandrasample) against which such DDL/DML operations will run. We can select a keyspace as follows:

    use cassandrasample;

  • We can always run the describe keyspace command to inspect the column families a keyspace contains and its configuration settings. We can describe a keyspace as follows:

    describe keyspace cassandrasample;

  • We will create a users column family with user_id as row key and username and password as columns. To create a column family, such as users, use the following command:

    create columnfamily users(user_id varchar PRIMARY KEY,username
    varchar, password varchar);

  • To store a row in the users column family for row key value 1, we will run the following CQL query:

    insert into users(user_id,username,password)
    values('1','user1','password1');

  • To select all the data from the users column family, we need to execute the following CQL query:

    select * from users;

  • We can delete a row as well as specific columns using the delete operation. The following commands delete a complete row and the age column from the users column family, respectively (the age column is added by the alter columnfamily example that follows):

    // delete complete row for user_id '1'
    delete from users where user_id='1';
    // delete age column from users for row key '1'
    delete age from users where user_id='1';

  • You can update a column family to add columns and to update or drop column metadata.

    Here are a few examples:

    // add a new column
    alter columnfamily users add age int;
    // update column metadata
    alter columnfamily users alter password type blob;

  • Truncating a column family will delete all the data belonging to the corresponding column family, whereas dropping a column family will also remove the column family definition along with the containing data. We can drop/truncate the column family as follows:

    truncate users;
    drop columnfamily users;

  • Dropping a keyspace means instantly removing all the column families and data available within that keyspace. We can drop a keyspace using the following command:

    drop keyspace cassandrasample;

    By default, the CQL shell converts column family and keyspace names to lowercase. You can preserve case by wrapping these identifiers in double quotes (" ").
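The case-folding rule just described can be expressed as a small function. This is an illustrative sketch only: normalize_identifier is a hypothetical name, and the sketch ignores details such as escaped quotes inside a quoted identifier.

```python
def normalize_identifier(name):
    """Return an identifier as it would be stored after case folding."""
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        # Quoted: keep the case exactly as written, minus the quotes.
        return name[1:-1]
    # Unquoted: folded to lowercase.
    return name.lower()


print(normalize_identifier('cassandraSample'))    # prints cassandrasample
print(normalize_identifier('"cassandraSample"'))  # prints cassandraSample
```

This is why the keyspace created as cassandraSample in cassandra-cli and the one created as cassandrasample in CQL are different names unless the quoted form is used consistently.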

Summary

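This article covered Cassandra’s storage architecture, including replication strategies, data partitioning, and the write path, and walked through basic schema operations using cassandra-cli and CQL.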
