
In this article by Raúl Estrada, the author of the book Fast Data Processing Systems with SMACK Stack, we will learn about Apache Cassandra.

We have reached the part where we talk about storage. The C in the SMACK stack refers to Cassandra. The reader may wonder: why not use a conventional database? The answer is that Cassandra is the database that propels giants like Walmart, CERN, Cisco, Facebook, Netflix, and Twitter. Spark makes heavy use of Cassandra's power, and application efficiency is greatly increased by using the Spark-Cassandra connector.

This article has the following sections:

  • A bit of history
  • NoSQL
  • Apache Cassandra installation
  • Authentication and authorization (roles)
  • Backup and recovery
  • The Spark-Cassandra connector


A bit of history

In Greek mythology, there was a priestess who was chastised for her treason against the god Apollo. She asked for the power of prophecy in exchange for a carnal meeting; however, she failed to fulfill her part of the deal. So, she received a punishment: she would have the power of prophecy, but no one would ever believe her forecasts. This priestess's name was Cassandra.

Moving to more recent times, say 50 years ago, the world of computing has seen big changes. In the 1960s, the HDD (Hard Disk Drive) took precedence over magnetic tape, which facilitated data handling. In 1966, IBM created the Information Management System (IMS) for the Apollo space program, from whose hierarchical model IBM DB2 later developed. In the 1970s, a model appeared that fundamentally changed existing data storage methods: the relational data model, devised by Codd as an alternative to IBM's IMS and its mode of organizing and storing data. In 1985, his work presented 12 rules that a database should meet in order to be considered a relational database.

The Web (especially social networks) appeared and demanded the storage of large amounts of data. A Relational Database Management System (RDBMS) scales poorly, and its actual costs grow, with the number of users, the amount of data, and the response time, that is, the time it takes to run a specific query on the database. In the beginning, it was possible to solve this through vertical scaling: the server machine is upgraded with more RAM, faster processors, and larger and faster HDDs. This mitigates the problem, but does not make it disappear.

When the same problem occurs again and the server cannot be upgraded further, the only solution is to add a new server, which itself may hide unplanned costs: OS licenses, the Database Management System (DBMS), and so on, not to mention data replication, transactions, and data consistency under normal use.

One solution to such problems is the use of NoSQL databases. NoSQL was born from the need to process large amounts of data on large hardware platforms built by clustering servers.

The term NoSQL is perhaps not precise; a more appropriate term would be Not Only SQL. It covers several non-relational databases, such as Apache Cassandra, MongoDB, Riak, and Neo4J, which have become more widespread in recent years.

NoSQL

We will read NoSQL as Not Only SQL (SQL, Structured Query Language). NoSQL is a distributed database with an emphasis on scalability, high availability, and ease of administration; the opposite of established relational databases. Don't think of it as a direct replacement for an RDBMS, but rather as an alternative or a complement. The focus is on avoiding unnecessary complexity: a solution for data storage according to today's needs, without a fixed schema. Due to its distributed nature, cloud computing is a great NoSQL sponsor.

A NoSQL database model can be:

  • Key-value/Tuple based

    For example, Redis, Oracle NoSQL (ACID compliant), Riak, Tokyo Cabinet/Tyrant, Voldemort, Amazon Dynamo, and Memcached; used by LinkedIn, Amazon, BestBuy, GitHub, and AOL.

  • Wide Row/Column-oriented-based

    For example, Google BigTable, Apache Cassandra, HBase/Hypertable, and Amazon SimpleDB; used by Amazon, Google, Facebook, and RealNetworks.

  • Document-based

    For example, CouchDB (ACID compliant), MongoDB, TerraStore, and Lotus Notes (possibly the oldest); used by various financial and other institutions: the US Army, SAP, MTV, and SourceForge.

  • Object-based

    For example, db4o, Versant, Objectivity, and NEO; used by Siemens, China Telecom, and the European Space Agency.

  • Graph-based

    For example, Neo4J, InfiniteGraph, VertexDB, and FlockDB; used by Twitter, Nortler, Ericsson, Qualcomm, and Siemens.

  • XML, multivalue, and others

    In Table 4-1, we have a comparison of the mentioned data models:

Model     | Performance | Scalability | Flexibility | Complexity | Functionality
----------|-------------|-------------|-------------|------------|-------------------
key-value | high        | high        | high        | low        | depends
column    | high        | high        | high        | low        | depends
document  | high        | high        | high        | low        | depends
graph     | depends     | depends     | high        | high       | graph theory
RDBMS     | depends     | depends     | low         | moderate   | relational algebra

Table 4-1: Categorization and comparison of NoSQL data models (Scofield and Popescu)

NoSQL or SQL?

This is the wrong question. It would be better to ask: what do we need?

Basically, it all depends on the application's needs. Nothing is black and white. If consistency is essential, use an RDBMS. If we need high availability, fault tolerance, and scalability, then use NoSQL. The recommendation is that, in a new project, you evaluate the best of each world.

It doesn't make sense to force NoSQL where it doesn't fit, because its benefits (scalability, read/write speeds an entire order of magnitude higher, a soft data model) are conditional advantages, achieved only on the set of problems they were designed to solve. It is necessary to carefully weigh, beyond marketing, what exactly is needed, what kind of strategy is required, and how it will be applied to solve our problem. Consider using a NoSQL database only when you decide that it is a better solution than SQL.

The challenges for NoSQL databases are elastic scaling, cost-effectiveness, simplicity, and flexibility.

In Table 4-2, we compare the two models:

NoSQL                  | RDBMS
-----------------------|-------------------------
Schema-less            | Relational schema
Scalable read/write    | Scalable read
Auto high availability | Custom high availability
Limited queries        | Flexible queries
Eventual consistency   | Consistency
BASE                   | ACID

Table 4-2: Comparison of NoSQL and RDBMS

CAP Brewer’s theorem

In 2000, the nineteenth international symposium on principles of distributed computing was held in Portland, Oregon, with Eric Brewer, a professor at UC Berkeley, as keynote speaker.

In his presentation, among other things, he said that there are three basic system requirements that have a special relationship when designing and implementing applications in a distributed environment, and that a distributed system can guarantee at most two of the three properties (which is the basis of his theorem). The three properties are:

  • Consistency: This property says that data read from one node must be the same as data read from any other node (there could be a delay if someone in between is performing an update, but never different data).
  • Availability: This property says that a failure on one node doesn’t mean the loss of its data; the system must be able to display the requested data.
  • Partition tolerance: This property says that in the event of a breakdown in communication between two nodes, the system should still work, meaning the data will still be available.

In Figure 4-1, we show the CAP Brewer’s theorem with some examples.

Figure 4-1: CAP Brewer's theorem

Apache Cassandra installation

In the Facebook laboratories, software not visible to the public is developed; for example, a junction between two concepts coming from the development departments of Google and Amazon. In short, Cassandra is defined as a distributed database. From the beginning, the authors took on the task of creating a massively scalable, decentralized database, optimized for read operations when possible, allowing painless modification of data structures, and, with all this, not difficult to manage. The solution was found by combining two existing technologies: Google's BigTable and Amazon's Dynamo. One of the two authors, A. Lakshman, had earlier worked on BigTable, and he borrowed the data model layout from it, while Dynamo contributed the overall distributed architecture.

Cassandra is written in Java and, for good performance, it requires the latest possible JDK version. Cassandra 1.0 used another open source project, Thrift, for client access; it also came from Facebook and is currently an Apache Software Foundation project. In Cassandra 2.0, Thrift was deprecated in favor of CQL. Initially, Thrift was not made just for Cassandra; it is a software library and code-generation tool for accessing backend services.

Cassandra administration is done with command-line tools or via the JMX console; the default installation also lets us use additional client tools. Since this is a server cluster, it has different administration rules, and it is always good to review the documentation to take advantage of other people's experiences. Cassandra has managed very demanding tasks successfully. It is often used on sites serving a huge number of users (such as Twitter, Digg, Facebook, and Cisco) that relatively often change their complex data models to meet new challenges, and usually do not have to deal with expensive hardware or licenses.

At the time of writing, the Cassandra homepage (http://cassandra.apache.org) says that Apple Inc., for example, has a 75,000-node cluster storing 10 petabytes of data.

Data model

The storage model of Cassandra could be seen as a sorted HashMap of sorted HashMaps.

Cassandra is a database that stores rows in key-value form. In this model, the number of columns is not predefined in advance as in standard relational databases; a single row can contain several columns. The column (Figure 4-2, Column) is the smallest atomic unit of the model. Each column consists of a triplet: a name, a value (stored as a series of bytes without regard to the source type), and a timestamp (used to determine the most recent record).

Figure 4-2: Column

All data triplets are obtained from the client, even the timestamp. Thus, a row consists of a key and a set of data triplets (Figure 4-3). Here is how the super column will look:

Figure 4-3: Super column

In addition, columns can be grouped into so-called column families (Figure 4-4, Column family), which are somehow equivalent to tables and can be indexed:

Figure 4-4: Column family

A higher logical unit is the super column family (as shown in Figure 4-5, Super column family), in which columns contain other columns:

Figure 4-5: Super column family

Above all is the key space (as shown in Figure 4-6, Cluster with key spaces), which is equivalent to a relational schema and is typically used by one application. The data model is simple but at the same time very flexible, and it takes some time to become accustomed to the new way of thinking while rejecting all of SQL's syntax luxury.

The replication factor is unique per keyspace. Moreover, keyspace could span multiple clusters and have different replication factors for each of them. This is used in geo-distributed deployments.

Figure 4-6: Cluster with key spaces

Data storage

Apache Cassandra is designed to process large amounts of data in a short time; this way of storing data is taken from its big brother, Google's BigTable.

Cassandra has a commit log file in which all new data is recorded in order to ensure durability. When data is successfully written to the commit log file, the freshest data is stored in a memory structure called a memtable (Cassandra considers a write successful only once the same information is in the commit log and in the memtable). Data within a memtable is sorted by row key.

When a memtable is full, its contents are copied to the hard drive in a structure called a Sorted String Table (SSTable). The process of copying contents from a memtable into an SSTable is called a flush. Flushes are performed periodically, although a flush can also be carried out manually (for example, before restarting a node) through the nodetool flush command, as shown next.
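As a sketch, a manual flush of a single column family looks like this; hr and employee are hypothetical names matching the examples later in this article:

$ nodetool flush hr employee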

An SSTable provides an immutable, sorted map of row keys to values: data entered in one SSTable cannot be changed, but new data can be entered. The internal structure of an SSTable consists of a series of 64 KB blocks (the block size can be changed); internally, an SSTable keeps a block index used to locate blocks.

One data row is usually stored within several SSTables, so reading a single data row is performed in the background by combining the SSTables and the memtable (which has not yet been flushed). To optimize this process, Cassandra uses a memory structure called a Bloom filter: every SSTable has a Bloom filter that checks whether the requested row key is in the SSTable before looking it up on disk.

In order to reduce row fragmentation across several SSTables, Cassandra performs another background process: compaction, a merge of several SSTables into a single SSTable. Fragmented data is combined based on the row key values. After creating the new SSTable, the old SSTables are labeled as outdated and marked for deletion in the garbage collection process. Compaction has different strategies, size-tiered compaction and leveled compaction, each with its own benefits for different scenarios; the strategy can be set per column family, as sketched below.
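On CQL-based versions, the compaction strategy can be chosen per column family with ALTER TABLE; this is a sketch, and the hr.employee table name is an assumption matching the later examples:

ALTER TABLE hr.employee WITH compaction = { 'class' : 'SizeTieredCompactionStrategy' };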

Installation

To install Cassandra, go to http://www.planetcassandra.org/cassandra/.

Installation is simple. After downloading the compressed files, extract them and change a couple of settings in the configuration files (set the new directory path). Run the startup scripts to activate a single node and the database server. Of course, it is possible to use Cassandra on only one node, but we lose its main power, the distribution. The process of adding new servers to the cluster is called bootstrap, and it is generally not a difficult operation.

Once all the servers are active, they form a ring of nodes, none of which is central; there is no main server. Within the ring, information is propagated to all servers through a gossip protocol. In short, one node transmits information about new instances to only some of its known colleagues, and if one of them already knows about the new node from other sources, the first node's propagation is stopped. Thus, information about a node is propagated in an efficient and rapid way through the network.

For a new node to be activated, it must seed its information to at least one existing server in the cluster so that the gossip protocol can work. The server receives its numeric identifier, and each of the ring nodes stores its data. Which nodes store the information depends on the MD5 hash of the key-value, as shown in Figure 4-7, Nodes within a cluster.

Figure 4-7: Nodes within a cluster

The nodes form a circular stack, that is, a ring, and each record is stored on multiple nodes, so in case one of them fails, the data is still available. Nodes are assigned data according to their identifier's integer range: if the calculated hash value falls into a node's range, the data is saved there. Saving is not performed on only one node (more is better), and an operation is considered a success if the data is correctly stored on as many nodes as required. All this is parameterized. In this way, Cassandra achieves sufficient data consistency and provides greater robustness for the entire system: if one node in the ring fails, it is always possible to retrieve valid information from the other nodes. In the event that a node comes back online, it is necessary to synchronize its data, which is achieved through the read operation.

Data is read from all the ring servers, and a node keeps just the data accepted as valid, that is, the most recent data; the comparison is made according to the timestamps of the records. Nodes that do not have the latest information refresh their data in a low-priority background process.

Although this brief description of the architecture makes it sound full of holes, in reality everything works flawlessly. Indeed, more servers in the game imply a better general situation.

DataStax OpsCenter

In this section, we make the Cassandra installation on a computer with a Windows operating system (to prove that nobody is excluded).

Installing software under the Apache open license can be complicated on a Windows computer, especially if it is new software such as Cassandra. To make things simpler, we will use a distribution package for easy installation, start-up, and work with Cassandra on a Windows computer. The distribution used in this example is called DataStax Community Edition. DataStax contains Apache Cassandra, along with the Cassandra Query Language (CQL) tool and the free version of DataStax OpsCenter for managing and monitoring the Cassandra cluster. We can say that OpsCenter is a kind of DBMS for NoSQL databases.

After downloading the installer from DataStax's official site, the installation process is quite simple; just keep in mind that DataStax supports Windows 7 and Windows Server 2008, and that a Windows computer running DataStax must have the Chrome or Firefox web browser (Internet Explorer is not supported).

When starting DataStax on a Windows computer, DataStax will open as in Figure 4-8, DataStax OpsCenter.

Figure 4-8: DataStax OpsCenter

DataStax consists of a control panel (dashboard), in which we review the events, performance, and capacity of the cluster, and also see how many nodes belong to our cluster (in this case, a single node). In cluster control, we can see the different types of views (ring, physical, list). Adding a new key space (the equivalent of creating a database in a classic DBMS) is done through the CQL shell using CQL, or using DataStax data modeling. Also, using the data explorer, we can view the column families and the database.

Creating a key space

The main tool for managing Cassandra, the CQL shell, runs in a console interface; this tool is used to add new key spaces, in which we will create column families. A key space is created as follows:

cqlsh> create keyspace hr with strategy_class='SimpleStrategy' and strategy_options:replication_factor=1;

After opening the CQL shell, the create keyspace command makes a new key space; the strategy_class='SimpleStrategy' parameter names the replication strategy class used for the new key space. The strategy_options:replication_factor=1 option determines how many copies of each row are kept in the cluster: a replication factor of 1 produces only one copy of each row, while a replication factor of 2 keeps two replicas of each row, each on a different node.

cqlsh> use hr;
cqlsh:hr> create columnfamily employee (sid int primary key, 
... name varchar,
... last_name varchar);

There are two replication strategies, SimpleStrategy and NetworkTopologyStrategy, whose syntax is as follows:

{ 'class' : 'SimpleStrategy', 'replication_factor' : <integer> };

{ 'class' : 'NetworkTopologyStrategy'[, '<data center>' : <integer>, '<data center>' : <integer>] . . . };

When NetworkTopologyStrategy is configured as the replication strategy, we set up one or more virtual data centers, as in the following sketch.
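For example, a geo-distributed key space can keep three replicas in one data center and two in another. This is a sketch; the data center names DC1 and DC2 are assumptions and must match the names known to the cluster's snitch:

cqlsh> create keyspace hr_geo
   ... with replication = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3, 'DC2' : 2 };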

To create a new column family, we use the create columnfamily command: select the desired key space and, as in the example shown earlier, create the new table, defining sid, an integer, as the primary key, and other attributes such as name and last_name.

To make a data entry in column family, we use the insert command:

insert into <table_name> (<attribute_1>, <attribute_2>, ... <attribute_n>) values (<value_1>, <value_2>, ... <value_n>);

When filling data tables we use the common SQL syntax:

cqlsh:hr> insert into employee (sid, name, last_name) values (1, 'Raul', 'Estrada');

So we enter data values. With the select command, we can review our insert:

cqlsh:hr> select * from employee;

 sid | name | last_name
-----+------+-----------
   1 | Raul |   Estrada

Authentication and authorization (roles)

In Cassandra, authentication and authorization must be configured in the cassandra.yaml file and two additional files. The first file assigns users rights over key spaces and column families, while the second assigns passwords to users. These files are called access.properties and passwd.properties, and they are located in the Cassandra installation directory. They can be opened with our favorite text editor in order to be configured successfully.

Setting up a simple authentication and authorization

The following steps are:

  1. In the access.properties file, we add the access rights of users and their permissions to read and write certain key spaces and column families. Syntax:
    keyspace.columnfamily.permits = users

    Example 1:
    hr <rw> = restrada

    Example 2:
    hr.cars <ro> = restrada, raparicio

    In example 1, we give full rights over the key space hr to restrada, while in example 2 we give read-only rights over the column family cars to the users restrada and raparicio.

  2. In the passwd.properties file, user names are matched to passwords; on the left side of the equals sign we write the username, and on the right side the password:
    Example:
    restrada = Swordfish01
    
  3. After we change the files, and before restarting Cassandra, it is necessary to type the following command in the terminal in order to reflect the changes in the database:
    $ cd <installation_directory>
    $ sh bin/cassandra -f -Dpasswd.properties=conf/passwd.properties -Daccess.properties=conf/access.properties
    

Note: The third step of setting up authentication and authorization doesn't work on Windows computers and is only needed on Linux distributions. Also, note that user authentication and authorization should not be solved through Cassandra, for safety reasons; in the latest Cassandra versions this function is not included.

Backup

Because Cassandra is a distributed NoSQL database, the data written to a single node is copied to other nodes; the exact number of copies depends on the replication factor established when we create a new key space.

But like any other standard SQL database, Cassandra can create a backup on the local computer. Cassandra creates a copy of the database using snapshots. It is possible to make a snapshot of all the key spaces, or of just one column family. It is also possible to make a snapshot of the entire cluster using the parallel SSH tool (pssh).

If the user decides to snapshot the entire cluster, it can be reinitialized, and an incremental backup can be used on each node.

Incremental backups provide a way to back up each node separately, by setting the incremental_backups flag to true in cassandra.yaml.

When incremental backups are enabled, Cassandra hard-links each flushed SSTable to a backups directory under the keyspace data directory. This allows storing backups offsite without transferring entire snapshots.
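Enabling this is a one-line change in cassandra.yaml on each node; a minimal sketch:

# cassandra.yaml: hard-link each flushed SSTable into a backups directory
incremental_backups: true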

To snapshot a key space, we use the nodetool command:

Syntax:

nodetool snapshot -cf <column_family> -t <snapshot_name> <keyspace>

Example:

nodetool snapshot -cf cars -t snapshot1 hr

The snapshot is stored in the Cassandra installation directory:

C:\Program Files\DataStax Community\data\data\en\example\snapshots
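Snapshots are not deleted automatically. When they are no longer needed, the nodetool clearsnapshot command removes the snapshots of a key space; a sketch following the earlier example:

nodetool clearsnapshot hr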

Compression

Compression increases the capacity of the cluster nodes by reducing the data size on disk; it also enhances the server's disk performance.

Compression in Cassandra works best when compressing a column family with many columns, when each row has the same columns, or when we have many common columns with the same data. A good example is a column family that contains user information such as user name and password, because it is likely to have the same data repeated. The more the same data is repeated across rows, the higher the compression ratio.

Column family compression is made with the Cassandra-CLI tool. It is possible to update existing columns families or create a new column family with specific compression conditions, for example, the compression shown here:

CREATE COLUMN FAMILY users WITH comparator = 'UTF8Type'
AND key_validation_class = 'UTF8Type'
AND column_metadata = [
(column_name: name, validation_class: UTF8Type)
(column_name: email, validation_class: UTF8Type)
(column_name: country, validation_class: UTF8Type)
(column_name: birth_date, validation_class: LongType)
]
AND compression_options=(sstable_compression:SnappyCompressor, chunk_length_kb:64);

We will see this output:

Waiting for schema agreement....
... schemas agree across the cluster

After opening the Cassandra-CLI, we need to choose the key space where the new column family will live. When creating a column family, it is necessary to state that the comparator (UTF8Type) and key_validation_class are of the same type; this ensures that executing the command won't raise an exception (generated by a bug). After listing the column names, we set compression_options, which has two possible classes: SnappyCompressor, which provides faster data compression, or DeflateCompressor, which provides a higher compression ratio. The chunk_length_kb option adjusts the compression chunk size in kilobytes.
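On CQL-based versions, a roughly equivalent setting can be expressed with ALTER TABLE; this is a hedged sketch, since the property names vary between Cassandra releases:

ALTER TABLE users WITH compression = { 'sstable_compression' : 'DeflateCompressor', 'chunk_length_kb' : 64 };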

Recovery

Recovering a key space snapshot requires all the snapshots made for a certain column family. If incremental backups are used, it is also necessary to provide the incremental backups created after the snapshot. There are multiple ways to perform a recovery from a snapshot: we can use the SSTable loader tool (used exclusively on Linux distributions) or recreate the data with the node-restart method described next.

Restart node

If the recovery is for one node, we must first shut down that node. If the recovery is for the entire cluster, it is necessary to restart each node in the cluster. Here is the procedure:

  1. Shut down the node.
  2. Delete all the log files in: C:\Program Files\DataStax Community\logs
  3. Delete all the .db files within the specified key space and column family: C:\Program Files\DataStax Community\data\data\en\cars
  4. Locate all the snapshots related to the column family: C:\Program Files\DataStax Community\data\data\en\cars\snapshots\1351279613842
  5. Copy them to: C:\Program Files\DataStax Community\data\data\en\cars
  6. Restart the node.

Printing schema

Through DataStax OpsCenter or the Apache Cassandra CLI, we can obtain the schemas (key spaces) with their associated column families, but there is no way to export or print them.

Apache Cassandra is not an RDBMS, and it is not possible to obtain a relational model schema from the key space database.

Logs

Apache Cassandra and DataStax OpsCenter both use the Apache log4j logging service API. In the directory where DataStax is installed, under apache-cassandra and opscenter, is the conf directory, where the files log4j-server.properties and log4j-tools.properties are located for Apache Cassandra, and log4j.properties for OpsCenter.

The parameters of the log4j files can be modified using a text editor. Log files are stored as plain text in the DataStax Community\logs directory; here it is possible to change the directory location where the log files are stored.

Configuring log4j

log4j configuration files are divided into several parts, in which the parameters are set that specify how the collected data is processed and written to the log files.

For the RootLogger:

# RootLogger level
log4j.rootLogger = INFO, stdout, R

This section defines the logging level, that is, which events are recorded in the log file.

As we can see in Table 4-3, log level can be:

Level | Record
------|---------------------------------------------------------------------------------
ALL   | The lowest level; all events are recorded in the log file
DEBUG | Detailed information about events
ERROR | Information about runtime errors or unexpected events
FATAL | Critical error information
INFO  | Information about the state of the system
OFF   | The highest level; log file recording is off
TRACE | Detailed debug information
WARN  | Information about potentially adverse events (unwanted/unexpected runtime errors)

Table 4-3: log4j log levels

For standard output (stdout):

# stdout
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout = org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern = %5p %d{HH:mm:ss,SSS} %m%n

Through the StandardOutputWriter class, we define the appearance of the data in the log file. The ConsoleAppender class is used to enter data in the log file, and the ConversionPattern class defines the appearance of the data written to the log file. The previous configuration determines how the data stored in the log file will look.

Log file rotation

In this example, we rotate the log when it reaches 20 Mb and we retain just 50 log files.

# rolling log file 
log4j.appender.R=org.apache.log4j.RollingFileAppender
log4j.appender.R.maxFileSize=20MB
log4j.appender.R.maxBackupIndex=50
log4j.appender.R.layout=org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern=%5p [%t] %d{ISO8601} %F (line %L) %m%n

This part sets up the log files. The RollingFileAppender class inherits from FileAppender, and its role is to make a log file backup when the log reaches a given size (in this case, 20 MB). The RollingFileAppender class has several methods; these two are the most used:

  • public void setMaxFileSize( String value ): defines the file size; it can take a value from 0 to 2^63 using the abbreviations KB, MB, or GB, and the value is automatically converted (in the example, the file size is limited to 20 MB).
  • public void setMaxBackupIndex( int maxBackups ): defines how many backup files are kept before the oldest log file is deleted (in this case, 50 log files are retained).

To set the parameters of the location where the log files will be stored, use:

# Edit the next line to point to your logs directory 
log4j.appender.R.File=C:/Program Files (x86)/DataStax Community/logs/cassandra.log

User activity log

The log4j API has the ability to store user activity logs. In production, it is not recommended to use the DEBUG or TRACE log levels.

Transaction log

As mentioned earlier, any new data is first stored in the commit log file. Within the cassandra.yaml configuration file, we can set the location where the commit log files will be stored:

# commit log
commitlog_directory: "C:/Program Files (x86)/DataStax Community/data/commitlog"

SQL dump

It is not possible to make a database SQL dump; we can only snapshot the DB.

CQL

CQL is a language like SQL; CQL stands for Cassandra Query Language. With this language, we make queries on a key space. There are several ways to interact with a key space; in the previous section, we showed how to do it using a shell called the CQL shell.

Since CQL is the first way to interact with Cassandra, in Table 4-4, Shell command summary, we see the main commands that can be used in the CQL shell:

Command     | Description
------------|---------------------------------------------------------------------------------------------
CAPTURE     | Captures command output and appends it to a file.
CONSISTENCY | Shows the current consistency level, or given a level, sets it.
COPY        | Imports and exports CSV (comma-separated values) data to and from Cassandra.
DESCRIBE    | Provides information about the connected Cassandra cluster, or about the data objects stored in the cluster.
EXPAND      | Formats the output of a query vertically.
EXIT        | Terminates cqlsh.
PAGING      | Enables or disables query paging.
SHOW        | Shows the Cassandra version, host, or tracing information for the current cqlsh client session.
SOURCE      | Executes a file containing CQL statements.
TRACING     | Enables or disables request tracing.

Table 4-4: Shell command summary
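A short cqlsh session exercising some of these commands could look like the following sketch, where the hr key space follows the earlier examples and schema.cql is a hypothetical file of CQL statements:

cqlsh> CONSISTENCY
cqlsh> CONSISTENCY QUORUM
cqlsh> DESCRIBE KEYSPACE hr
cqlsh> SOURCE 'schema.cql'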

For more detailed information about shell commands, visit:

http://docs.datastax.com/en/cql/3.1/cql/cql_reference/cqlshCommandsTOC.html

CQL commands

CQL is very similar to SQL, as we have already seen in this article. Table 4-5, CQL command summary, lists the language commands.

CQL, like SQL, is based on sentences (statements). These sentences are for data manipulation and work within their logical container, the key space.

Like SQL statements, they must end with a semicolon (;).

Command          | Description
-----------------|--------------------------------------------------------------------------------------------
ALTER KEYSPACE   | Change property values of a keyspace.
ALTER TABLE      | Modify the column metadata of a table.
ALTER TYPE       | Modify a user-defined type. Cassandra 2.1 and later.
ALTER USER       | Alter existing user options.
BATCH            | Write multiple DML statements.
CREATE INDEX     | Define a new index on a single column of a table.
CREATE KEYSPACE  | Define a new keyspace and its replica placement strategy.
CREATE TABLE     | Define a new table.
CREATE TRIGGER   | Register a trigger on a table.
CREATE TYPE      | Create a user-defined type. Cassandra 2.1 and later.
CREATE USER      | Create a new user.
DELETE           | Remove entire rows or one or more columns from one or more rows.
DESCRIBE         | Provide information about the connected Cassandra cluster, or about the data objects stored in the cluster.
DROP INDEX       | Drop the named index.
DROP KEYSPACE    | Remove the keyspace.
DROP TABLE       | Remove the named table.
DROP TRIGGER     | Remove the registration of a trigger.
DROP TYPE        | Drop a user-defined type. Cassandra 2.1 and later.
DROP USER        | Remove a user.
GRANT            | Provide access to database objects.
INSERT           | Add or update columns.
LIST PERMISSIONS | List permissions granted to a user.
LIST USERS       | List existing users and their superuser status.
REVOKE           | Revoke user permissions.
SELECT           | Retrieve data from a Cassandra table.
TRUNCATE         | Remove all data from a table.
UPDATE           | Update columns in a row.
USE              | Connect the client session to a keyspace.

Table 4-5: CQL command summary
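For instance, a secondary index makes it possible to filter on a non-key column. The following sketch uses the employee table created earlier; the index name is an assumption:

cqlsh:hr> create index employee_name_idx on employee (name);
cqlsh:hr> select * from employee where name = 'Raul';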

For more detailed information about CQL commands, visit:

http://docs.datastax.com/en/cql/3.1/cql/cql_reference/cqlCommandsTOC.html

DBMS Cluster

The idea of Cassandra is a database working in a cluster, that is, databases on multiple nodes. Although Cassandra is primarily intended for Linux distributions and clusters are usually built on Linux servers, Cassandra also offers the possibility of building clusters on Windows computers.

The first task that must be done prior to setting up the cluster on Windows computers is opening the firewall for the Cassandra DBMS and DataStax OpsCenter. The ports that must be open for Cassandra are 7000 and 9160; for OpsCenter, the ports are 7199, 8888, 61620, and 61621. These ports are the defaults when we install Cassandra and OpsCenter; however, unless it is necessary, we need not specify new ports.

Immediately after installing Cassandra and OpsCenter on a Windows computer, it is necessary to stop the DataStax OpsCenter service and the DataStax OpsCenter agent, as in Figure 4-9, Microsoft Windows display services.

Figure 4-9: Microsoft Windows display services

One of Cassandra's advantages is that it automatically distributes the incoming data among the computers of the cluster using a hashing algorithm. To perform this successfully, it is necessary to assign a token to each computer in the cluster. The token is a numeric identifier that indicates the computer's position in the cluster and the range of data in the cluster that the computer is responsible for. Tokens can be generated with the Python interpreter that comes with the Cassandra installation, located in DataStax's installation directory. In the token-generating code, the variable num = 2 refers to the number of computers in the cluster:

$ python -c "num=2; print \"\n\".join([\"token %d: %d\" % (i, i*(2**127)/num) for i in range(0, num)])"

We will see an output like this:

token 0: 0
token 1: 85070591730234615865843651857942052864

It is necessary to preserve the token values, because they will be required in the following steps. We now need to configure the cassandra.yaml file, which we already met in the authentication and authorization section. The cassandra.yaml file must be configured separately on each computer in the cluster. After opening the file, you need to make the following changes (a combined sketch follows the list):

initial_token

On each computer in the cluster, copy one of the generated tokens. Start from token 0 and assign each computer a unique token.

listen_address

In this section, we enter the IP address of the computer being configured.

seeds

Here you need to enter the IP address of the primary (main) node in the cluster.
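Put together, the relevant fragment of cassandra.yaml on a second node might look like the following sketch; the IP addresses are assumptions, and the seed list syntax can vary between Cassandra versions:

initial_token: 85070591730234615865843651857942052864
listen_address: 192.168.1.102
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "192.168.1.101"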

Once the file is modified and saved, you must restart the DataStax Community Server, as we already saw. This should be done only on the primary node. After that, it is possible to check whether the cluster nodes communicate, using nodetool. In nodetool, enter the following command:

nodetool -h localhost ring

If the cluster works, we will see the following result:

Address  DC           Rack   Status  State   Load      Owns   Token
-        datacenter1  rack1  Up      Normal  13.41 Kb  50.0%  0
-        datacenter1  rack1  Up      Normal  6.68 Kb   50.0%  85070591730234615865843651857942052864

If the cluster is operating normally, select which computer will be the primary for OpsCenter (it need not be the primary node). Then, on that computer, open opscenter.conf, which can be found in DataStax's installation directory. In that file, you need to find the webserver interface section and set the interface parameter to the value 0.0.0.0. After that, in the agent section, change the incoming_interface parameter to your computer's IP address.

In DataStax's installation directory (on each computer in the cluster), we must configure the address.yaml file. Within these files, set the stomp_interface and local_interface parameters to the IP address of the computer where the file is configured.

Now, on the primary computer, run the DataStax OpsCenter Community and DataStax OpsCenter agent services. After that, run the DataStax OpsCenter agent service on all the other nodes.

At this point, it is possible to open DataStax OpsCenter with an Internet browser, and OpsCenter should look like Figure 4-10, Display cluster in OpsCenter.

Figure 4-10: Display cluster in OpsCenter

Deleting the database

In Apache Cassandra, there are several ways to delete the database (key space) or parts of the database (a column family, individual rows within a column family, and so on).

Although the easiest way to make a deletion is using the DataStax OpsCenter data modeling tool, there are commands that can be executed through the Cassandra-CLI or the CQL shell.

CLI delete commands

In Table 4-6, we have the CLI delete commands:

CLI command       | Function
------------------|----------------------------------------------------------------------------------
del               | Delete a super column, a column from the column family, or rows within certain columns
drop columnfamily | Delete a column family and all the data contained in it
drop keyspace     | Delete the key space, all its column families, and all the data contained in them
truncate          | Delete all the data from the selected column family

Table 4-6: CLI delete commands

CQL shell delete commands

In Table 4-7, we have the CQL shell delete commands:

CQL shell command | Function
------------------|----------------------------------------------------------------------------------
alter_drop        | Delete a specified column from the column family
delete            | Delete one or more columns from one or more rows of the selected column family
delete_columns    | Delete columns from the column family
delete_where      | Delete individual rows
drop_table        | Delete the selected column family and all the data contained in it
drop_columnfamily | Delete a column family and all the data contained in it
drop_keyspace     | Delete the key space, all its column families, and all the data contained in them
truncate          | Delete all data from the selected column family

Table 4-7: CQL shell delete commands

DB and DBMS optimization

Cassandra optimization is specified in the cassandra.yaml file; the following properties are used to adjust performance and specify the use of system resources such as disk I/O, memory, and CPU usage.

  • column_index_size_in_kb:

    Initial value: 64 Kb

    Range of values: –

    Adds a column index to a row when the data reaches this size (64 KB by default).

  • commitlog_segment_size_in_mb

    Initial value: 32 Mb

    Range of values: 8-1024 Mb

    Determines the size of a commit log segment. Commit log segments are archived, deleted, or recycled after all their data has been flushed to SSTables.

  • commitlog_sync

    Initial value: –

    Range of values: –

    Determines the method Cassandra uses to acknowledge writes. It is closely correlated with commitlog_sync_period_in_ms, which controls how often the commit log is synchronized with the disk.

  • commitlog_sync_period_in_ms

    Initial value: 1000 ms

    Range of values: –

    Decides how often the commit log is synced to disk when commitlog_sync is in periodic mode.

  • commitlog_total_space_in_mb

    Initial value: 4096 MB

    Range of values: –

    When the total size of the commit log reaches this value, Cassandra removes the oldest parts of the commit log. This reduces the amount of data to replay and facilitates startup.

  • compaction_preheat_key_cache

    Initial value: true

    Range of values: true / false

    When this value is set to true, cached row keys are tracked during compaction, and after compaction they are re-cached at their new location in the compacted SSTable.

  • compaction_throughput_mb_per_sec

    Initial value: 16

    Range of values: 0-32

    Throttles compaction to the given total throughput across the system. The faster data is inserted, the faster compaction must run.

  • concurrent_compactors

    Initial value: 1 per CPU core

    Range of values: depends on the number of CPU cores

    Adjusts the number of simultaneous compaction processes on the node.

  • concurrent_reads

    Initial value: 32

    Range of values: –

    Sets the number of concurrent reads. When there is more data than can fit in memory, reading data from disk becomes the bottleneck.

  • concurrent_writes

    Initial value: 32

    Range of values: –

    Sets the number of concurrent writes. Making inserts in Cassandra is not bound by I/O limitations; concurrent inserts depend on the number of CPU cores. The recommended value is eight per CPU core.

  • flush_largest_memtables_at

    Initial value: 0.75

    Range of values: –

    Flushes the largest memtable when this fraction of the Java heap is used, in order to free memory. This parameter can be used as an emergency measure to prevent out-of-memory errors.

  • in_memory_compaction_limit_in_mb

    Initial value: 64

    Range of values: –

    Size limit for rows compacted in memory. Rows larger than this limit use a slower, disk-based compaction method.

  • index_interval

    Initial value: 128

    Value range: 128-512

    Controls the sampling of entries from the row index, trading space for time: the larger the interval, the fewer index samples are kept and the less memory is used, but lookups become less efficient. In technical terms, the interval corresponds to the number of index entries skipped between samples.

  • memtable_flush_queue_size

    Initial value: 4

    Range of values: at minimum, the maximum number of secondary indexes created on a single column family

    Indicates the number of full memtables allowed to wait for a flush, that is, to wait for a writer thread.

  • memtable_flush_writers

    Initial value: 1 (according to the data map)

    Range of values: –

    Number of memtable flush writer threads. These threads are blocked by disk I/O, and each thread holds a memtable in memory while it is blocked.

  • memtable_total_space_in_mb

    Initial value: 1/3 Java Heap

    Range of values: –

    Total amount of memory used for all the column family memtables on the node.

  • multithreaded_compaction

    Initial value: false

    Range of values: true/false

    Enables multithreaded compaction; useful only on nodes using solid-state disks.

  • reduce_cache_capacity_to

    Initial value: 0.6

    Range of values: –

    Used in combination with reduce_cache_size_at. When the Java heap reaches the value of reduce_cache_size_at, this value is the percentage to which the total cache size is reduced (in this case, the cache is reduced to 60% of its size). Used to avoid unexpected out-of-memory errors.

  • reduce_cache_size_at

    Initial value: 0.85

    Range of values: up to 1.0 (disabled)

    When the Java heap occupancy, after a full garbage collection sweep, reaches the percentage stated in this variable (85%), Cassandra reduces the size of the cache to the value of the reduce_cache_capacity_to variable.

  • stream_throughput_outbound_megabits_per_sec

    Initial value: off, that is, 400 Mbps (50 MB/s)

    Range of values: –

    Throttles outbound streaming file transfers on a node to the given throughput in Mbps. This is necessary because Cassandra does mostly sequential I/O when it streams data during bootstrap or repair, which can saturate the network and affect Remote Procedure Call performance.

Bloom filter

Every SSTable has a Bloom filter. On data requests, the Bloom filter checks whether the requested row exists in the SSTable before any disk I/O is performed. If the Bloom filter false-positive chance is set too low, it may consume large amounts of memory; a higher value means less memory use. The Bloom filter range of values is from 0.000744 to 1.0. It is recommended to keep the value below 0.1.

The value of the Bloom filter column family is adjusted through the CQL shell as follows:

ALTER TABLE <column_family> WITH bloom_filter_fp_chance = 0.01;

Data cache

Apache Cassandra has two caches by which it achieves highly efficient data caching. These are:

  • key cache (default: enabled): caches the primary key index of each column family
  • row cache (default: disabled): holds entire rows in memory, so that reads can be served without using the disk

If both the key cache and the row cache are set, a data query proceeds as shown in Figure 4-11, Apache Cassandra cache.

Figure 4-11: Apache Cassandra cache

When information is requested, the row cache is checked first; if the information is available there, the row cache returns the result without reading from the disk.

If the request cannot be satisfied from the row cache, Cassandra checks whether the data can be located through the key cache, which is more efficient than reading from disk; the retrieved data is finally written to the row cache.

As the key cache stores the key locations of an individual column family, any increase in the key cache has a positive impact on reading data for that column family. If the situation permits, a combination of key cache and row cache increases efficiency.

It is recommended that the size of the key cache be set in relation to the size of the Java heap.

The row cache is used in situations where data access patterns follow a normal (Gaussian) distribution, rows contain often-read data, and queries often return data from most or all of the columns.

Within the cassandra.yaml file, we have the following options to configure the data cache:

  • key_cache_size_in_mb

    Initial value: empty, meaning auto (the minimum of 5% of the heap, in MB, or 100 MB)

    Range of values: blank or 0 (disabled key cache)

    Variable that defines the key cache size per node

  • row_cache_size_in_mb

    Initial value: 0 (disabled)

    Range of values: –

    Variable that defines the row cache size per node

  • key_cache_save_period

    Initial value: 14400 (i.e. 4 hours)

    Range of values: –

    Variable that defines the save frequency of key cache to disk

  • row_cache_save_period

    Initial value: 0 (disabled)

    Range of values: –

    Variable that defines the save frequency of row cache to disk

  • row_cache_provider

    Initial value: SerializingCacheProvider

    Range of values: ConcurrentLinkedHashCacheProvider or SerializingCacheProvider

    Variable that defines the implementation of row cache

Java heap tune up

Apache Cassandra interacts with the operating system through the Java virtual machine, so the Java heap size plays an important role. When starting Cassandra, the size of the Java heap is set automatically based on the total amount of RAM (Table 4-8, Determination of the Java heap relative to the amount of RAM). The Java heap size can be manually adjusted by changing the values of the following variables in the file cassandra-env.sh, located in the apache-cassandra\conf directory.

# MAX_HEAP_SIZE="4G"
# HEAP_NEWSIZE="800M"
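For example, on a machine with 8 GB of RAM and a quad-core CPU, fixed values might look like the following sketch; the figures follow Table 4-8 and the common rule of thumb of roughly 100 MB of HEAP_NEWSIZE per CPU core, both of which should be verified for your version:

MAX_HEAP_SIZE="2G"
HEAP_NEWSIZE="400M"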

Total system memory | Java heap size
--------------------|------------------------------------------------------
< 2 GB              | Half of the system memory
2 GB – 4 GB         | 1 GB
> 4 GB              | One quarter of the system memory, no more than 8 GB

Table 4-8: Determination of the Java heap relative to the amount of RAM

Java garbage collection tune up

Apache Cassandra has a GC Inspector, which is responsible for collecting information on each garbage collection process longer than 200 ms. Garbage collection processes that occur frequently and take a lot of time (such as concurrent mark-sweep, which takes several seconds) indicate that there is great pressure on garbage collection and on the JVM. The recommendations to address these issues include:

  • Add new nodes
  • Reduce the cache size
  • Adjust items related to the JVM garbage collection

Views, triggers, and stored procedures

By definition, in an RDBMS, a view represents a virtual table that acts like a real (created) table but does not itself contain any data; its contents are the result of a SELECT query. A view consists of a combination of rows and columns from one or more tables.

Correspondingly, in Cassandra, all the data for a key-value row is placed in one column family. Since in NoSQL there are no JOIN commands and no flexible queries, the SELECT command lists the actual data; there is no display option for a virtual table, that is, a view.

Since Cassandra does not belong to the RDBMS group, there is no possibility of creating triggers and stored procedures; referential integrity (RI) restrictions can be set only in the application code.

Also, as Cassandra does not belong to the RDBMS group, we cannot apply Codd’s rules.

Client-server architecture

At this point, we have probably already noticed that Apache Cassandra runs on a client-server architecture.

By definition, the client-server architecture allows distributed applications, since the tasks are divided into two main parts:

  • On one hand, the service providers: the servers.
  • On the other hand, the service petitioners: the clients.

In this architecture, several clients are allowed to access the server; the server is responsible for fulfilling requests and handles each one according to its own rules.

So far, we have only used one client, managed from the same machine, that is, from the same data network.

The CQL shell allows us to connect to Cassandra, access a key space, and send CQL statements to the Cassandra server.

This is the most immediate method, but in daily practice, it is common to access the key spaces from different execution contexts (other systems and other programming languages).

Thus, we require clients other than the CQL shell; to build them in the Apache Cassandra context, we require connection drivers.

Drivers

A driver is just a software component that allows access to a key space to run CQL statements. Fortunately, there are already a lot of drivers for creating Cassandra clients in almost any modern programming language; you can see an extensive list at this URL: http://wiki.apache.org/cassandra/ClientOptions.

Typically, in a client-server architecture, many different clients access the server, distributed across different networks. Our implementation needs will dictate the required clients.
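As an illustration, here is a minimal client sketch built with the DataStax Python driver (cassandra-driver); the contact point and the key space follow the earlier examples, and a running local node is assumed:

# Minimal client sketch using the DataStax Python driver.
# Install it first: pip install cassandra-driver
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])  # contact point; list several nodes in a real cluster
session = cluster.connect('hr')   # the key space created earlier

# Send a CQL statement to the server, just as we did in cqlsh
for row in session.execute('SELECT sid, name, last_name FROM employee'):
    print(row.sid, row.name, row.last_name)

cluster.shutdown()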

Summary

NoSQL is not just hype, or a young technology; it is an alternative, with known limitations and capabilities. It is not an RDBMS killer. It's more like a younger brother who is slowly growing up and taking on some of the burden. Acceptance is increasing, and it will get even better as NoSQL solutions mature. Skepticism may be justified, but only for concrete reasons.

Since Cassandra is an easy and free working environment, suitable for application development, it is recommended, especially together with the additional utilities that ease and accelerate database administration.

Cassandra has some faults (for example, user authentication and authorization are still insufficiently supported in Windows environments) and is preferably used when there is a need to store large amounts of data.

For start-up companies that need to manipulate large amounts of data with the aim of reducing costs, implementing Cassandra in a Linux environment is a must.
