Cassandra High Performance Cookbook

Over 150 recipes to design and optimize large scale Apache Cassandra deployments

Introduction

Cassandra's popularity has led to several pieces of software that have developed around it. Some of these are libraries and utilities that make working with Cassandra easier. Other software applications have been built completely around Cassandra to take advantage of its scalability. This article describes some of these utilities.

Building Cassandra from source

The Cassandra code base is active and typically has multiple branches. It is good practice to run official releases, but at times it may be necessary to use a feature or a bug fix that has not yet been released. Building and running Cassandra from source allows for a greater level of control over the environment. With the source code at hand, it is also possible to trace and understand the context of warning or error messages you may encounter. This recipe shows how to check out Cassandra code from Subversion (SVN) and build it.

How to do it...

Visit http://svn.apache.org/repos/asf/cassandra/branches with a web browser. Multiple sub folders will be listed:
/cassandra-0.5/
/cassandra-0.6/
Each folder represents a branch. To check out the 0.6 branch:

$ svn co http://svn.apache.org/repos/asf/cassandra/branches/cassandra-0.6/

Trunk is where most new development happens. To check out trunk:
$ svn co http://svn.apache.org/repos/asf/cassandra/trunk/

To build the release tar, move into the folder created and run:
$ ant release
This creates a release tar in build/apache-cassandra-0.6.5-bin.tar.gz, a release jar, and an unzipped version in build/dist.
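
The checkout and build steps above can be collected into a small script. The following is a minimal sketch rather than an official procedure: the function names branch_url and checkout_and_build are hypothetical, and it assumes that svn and ant are installed and on the PATH.

```shell
#!/bin/sh
# Base URL of the Cassandra branches, as given in this recipe.
BRANCHES_URL="http://svn.apache.org/repos/asf/cassandra/branches"

# branch_url NAME: print the repository URL for a branch (hypothetical helper).
branch_url() {
    printf '%s/%s/\n' "$BRANCHES_URL" "$1"
}

# checkout_and_build NAME: check out a branch into ./NAME and build the release tar.
# Assumes svn and ant are installed; stops at the first failing step.
checkout_and_build() {
    svn co "$(branch_url "$1")" "$1" &&
    ( cd "$1" && ant release )
}
```

For example, checkout_and_build cassandra-0.6 would check out the 0.6 branch into a cassandra-0.6 folder and run ant release inside it.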

How it works...

Subversion (SVN) is a revision control system commonly used to manage software projects. Subversion repositories are commonly accessed over HTTP, which allows for simple browsing. This recipe uses the command-line client to check out code from the repository.

Building the contrib stress tool for benchmarking

Stress is an easy-to-use command-line tool for stress testing and benchmarking Cassandra. It can be used to generate a large quantity of requests in short periods of time, and it can also be used to generate a large amount of data to test performance with. This recipe shows how to build it from the Cassandra source.

Getting ready

Before running this recipe, complete the Building Cassandra from source recipe discussed above.

How to do it...

From the source directory, run ant. Then, change to the contrib/stress directory and run ant again.

$ cd <cassandra_src>
$ ant jar
$ cd contrib/stress
$ ant jar
...
BUILD SUCCESSFUL
Total time: 0 seconds

How it works...

The build process compiles code into the stress.jar file.

Inserting and reading data with the stress tool

The stress tool is a multithreaded load tester specifically for Cassandra. It is a command-line program with a variety of knobs that control its operation. This recipe shows how to run the stress tool.

Before you begin...

Complete the previous recipe, Building the contrib stress tool for benchmarking, before starting this one.

How to do it...

Run the <cassandra_src>/bin/stress command to execute 10,000 insert operations.

$ bin/stress -d 127.0.0.1,127.0.0.2,127.0.0.3 -n 10000 --operation INSERT
Keyspace already exists.
total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
10000,1000,1000,0.0201764,3

How it works...

The stress tool is an easy way to do load testing against a cluster. It can insert or read data and report on the performance of those operations. This is also useful in staging environments where significant volumes of disk data are needed to test at scale. Generating data is also useful to practice administration techniques such as joining new nodes to a cluster.
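
When comparing several runs, it can help to reduce the stress output to a single summary line. The following is a rough sketch, not part of the stress tool: the function name summarize_stress is hypothetical, and it assumes the five-column CSV layout shown in this recipe with elapsed_time reported in seconds.

```shell
# summarize_stress: read stress output on stdin and report the final totals.
# Assumes the five-column CSV layout shown in this recipe and that
# elapsed_time is in seconds. Non-numeric lines (header, messages) are skipped.
summarize_stress() {
    awk -F, '
        # Remember the most recent data row; the last one holds the final totals.
        NF == 5 && $1 ~ /^[0-9]+$/ { total = $1; elapsed = $5 }
        END {
            if (elapsed > 0)
                printf "total=%d elapsed=%ds overall_rate=%.1f ops/sec\n", total, elapsed, total / elapsed
        }'
}
```

The function is meant to sit at the end of a pipeline, for example bin/stress -d 127.0.0.1 -n 10000 --operation INSERT | summarize_stress.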

There's more...

It is best to run the load-testing tool on a different node from the system being tested, and to remove anything else that causes unnecessary contention.

Running the Yahoo! Cloud Serving Benchmark

The Yahoo! Cloud Serving Benchmark (YCSB) provides a common basis for comparing NoSQL systems. It works by generating random workloads with varying proportions of insert, get, delete, and other operations, and then uses multiple threads to execute them. This recipe shows how to build and run YCSB.

Information on the YCSB can be found here:

http://research.yahoo.com/Web_Information_Management/YCSB
https://github.com/brianfrankcooper/YCSB/wiki/
https://github.com/joaquincasares/YCSB

How to do it...

Use the git tool to obtain the source code.
$ git clone git://github.com/brianfrankcooper/YCSB.git

Build the code using ant.
$ cd YCSB/
$ ant

Copy the JAR files from your <cassandra_home>/lib directory to the YCSB classpath.
$ cp $HOME/apache-cassandra-0.7.0-rc3-1/lib/*.jar db/cassandra-0.7/lib/
$ ant dbcompile-cassandra-0.7

Use the Cassandra CLI to create the required keyspace and column family.
[default@unknown] create keyspace usertable with replication_factor=3;
[default@unknown] use usertable;
[default@unknown] create column family data;

Create a small shell script run.sh to launch the test with different parameters.
CP=build/ycsb.jar
for i in db/cassandra-0.7/lib/*.jar ; do
CP=$CP:${i}
done

java -cp $CP com.yahoo.ycsb.Client -t \
  -db com.yahoo.ycsb.db.CassandraClient7 \
  -P workloads/workloadb \
  -p recordcount=10 \
  -p hosts=127.0.0.1,127.0.0.2 \
  -p operationcount=10 \
  -s
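
The classpath loop in run.sh appends the literal pattern when the lib directory holds no JAR files. A slightly more defensive variant is sketched below; the function name build_classpath is hypothetical and is not part of YCSB.

```shell
# build_classpath BASE DIR: print BASE followed by every .jar in DIR, colon-separated.
build_classpath() {
    classpath="$1"
    for jar in "$2"/*.jar; do
        # Skip the unexpanded pattern when DIR holds no JAR files.
        [ -e "$jar" ] || continue
        classpath="$classpath:$jar"
    done
    printf '%s\n' "$classpath"
}
```

In run.sh this could replace the loop with CP=$(build_classpath build/ycsb.jar db/cassandra-0.7/lib).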

Run the script and pipe the output to the more command to control pagination:
$ sh run.sh | more
YCSB Client 0.1
Command line: -t -db com.yahoo.ycsb.db.CassandraClient7 -P workloads/workloadb -p recordcount=10 -p hosts=127.0.0.1,127.0.0.2 -p operationcount=10 -s
Loading workload...
Starting test.
data
0 sec: 0 operations;
0 sec: 10 operations; 64.52 current ops/sec; [UPDATE AverageLatency(ms)=30] [READ AverageLatency(ms)=3]
[OVERALL], RunTime(ms), 152.0
[OVERALL], Throughput(ops/sec), 65.78947368421052
[UPDATE], Operations, 1
[UPDATE], AverageLatency(ms), 30.0
[UPDATE], MinLatency(ms), 30
[UPDATE], MaxLatency(ms), 30
[UPDATE], 95thPercentileLatency(ms), 30
[UPDATE], 99thPercentileLatency(ms), 30
[UPDATE], Return=0, 1
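
The two [OVERALL] lines are related: the throughput is the operation count divided by the run time. As a quick check of the sample output, 10 operations in 152 ms works out to roughly 65.79 operations per second, which agrees with the reported figure. The helper below is a small sketch of that arithmetic; the name throughput is hypothetical.

```shell
# throughput OPS RUNTIME_MS: operations per second from a count and a run time in ms.
throughput() {
    awk -v ops="$1" -v ms="$2" 'BEGIN { printf "%.2f\n", ops * 1000 / ms }'
}

throughput 10 152   # values taken from the sample run above
```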

How it works...

YCSB has many configuration knobs. An important option is -P, which chooses the workload file. The workload describes the proportions of read, write, and update operations. The -p option overrides individual properties from the workload file. YCSB is designed to test how performance changes as a system scales out, that is, as the number of nodes grows or shrinks.
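
Because each -p option overrides one property, a test script can assemble the overrides from a list of key=value pairs. The following is a minimal sketch under that assumption; ycsb_props is a hypothetical helper, not a YCSB command.

```shell
# ycsb_props KEY=VALUE...: print each property as a "-p KEY=VALUE" argument.
ycsb_props() {
    out=""
    for kv in "$@"; do
        out="$out -p $kv"
    done
    # Strip the leading space before printing.
    printf '%s\n' "${out# }"
}
```

The result can be spliced into the command from run.sh, for example java -cp $CP com.yahoo.ycsb.Client -t -db com.yahoo.ycsb.db.CassandraClient7 -P workloads/workloadb $(ycsb_props recordcount=1000 operationcount=1000) -s, so that the same workload file is reused with different sizes.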