Getting started with Apache Cassandra

The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together a fully distributed design and a ColumnFamily-based data model. This article contains recipes that help users hit the ground running with Cassandra. Several recipes show how to set up Cassandra and include cursory explanations of the key configuration files. Others show how to connect to Cassandra and execute commands from both the application programming interface and the command-line interface. Java profiling tools such as JConsole are also described. Together, the recipes should help the user understand the basics of running and working with Cassandra.

A simple single node Cassandra installation

Cassandra is a highly scalable distributed database. While it is designed to run on multiple production class servers, it can be installed on desktop computers for functional testing and experimentation. This recipe shows how to set up a single instance of Cassandra.

Getting ready

Visit http://cassandra.apache.org in your web browser and find a link to the latest binary release. New releases happen often. For reference, this recipe will assume apache-cassandra-0.7.2-bin.tar.gz was the name of the downloaded file.

How to do it...

Download a binary version of Cassandra:
$ mkdir -p $HOME/downloads
$ cd $HOME/downloads
$ wget <url_from_getting_ready>/apache-cassandra-0.7.2-bin.tar.gz

Choose a base directory that the user who will run Cassandra has read and write access to:
Default Cassandra storage locations
Cassandra defaults to wanting to save data in /var/lib/cassandra and logs in /var/log/cassandra. These locations will likely not exist and will require root-level privileges to create. To avoid permission issues, carry out the installation in user-writable directories.

Create a cassandra directory in your home directory. Inside the cassandra directory, create commitlog, log, saved_caches, and data subdirectories:
$ mkdir $HOME/cassandra/
$ mkdir $HOME/cassandra/{commitlog,log,data,saved_caches}
$ cd $HOME/cassandra/
$ cp $HOME/downloads/apache-cassandra-0.7.2-bin.tar.gz .
$ tar -xf apache-cassandra-0.7.2-bin.tar.gz

Use the echo command to display the path to your home directory. You will need this when editing the configuration file:
$ echo $HOME
/home/edward
This tar file extracts to the apache-cassandra-0.7.2 directory. Open the conf/cassandra.yaml file inside it in your text editor and make changes to the following sections:

data_file_directories:
- /home/edward/cassandra/data
commitlog_directory: /home/edward/cassandra/commitlog
saved_caches_directory: /home/edward/cassandra/saved_caches

Edit the $HOME/cassandra/apache-cassandra-0.7.2/conf/log4j-server.properties file to change the directory where logs are written:
log4j.appender.R.File=/home/edward/cassandra/log/system.log

Start the Cassandra instance and confirm it is running by connecting with nodetool:
$ $HOME/cassandra/apache-cassandra-0.7.2/bin/cassandra
INFO 17:59:26,699 Binding thrift service to /127.0.0.1:9160
INFO 17:59:26,702 Using TFramedTransport with a max frame size of 15728640 bytes.

$ $HOME/cassandra/apache-cassandra-0.7.2/bin/nodetool --host 127.0.0.1 ring
Address Status State Load Token
127.0.0.1 Up Normal 385 bytes 398856952452...
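
The ring check can also be scripted. The following is a minimal sketch, assuming the column layout shown in the sample output above (address first, status second); the helper name count_up_nodes is hypothetical and not part of Cassandra. It counts the nodes reported as Up:

```shell
#!/bin/sh
# Count nodes reported as "Up" in `nodetool ring` output read from stdin.
# Assumes the layout shown in this recipe: Address Status State Load Token.
count_up_nodes() {
    awk '$2 == "Up" { n++ } END { print n + 0 }'
}

# Sample copied from the recipe; in practice, pipe real nodetool output instead
sample_ring='Address Status State Load Token
127.0.0.1 Up Normal 385 bytes 398856952452...'

printf '%s\n' "$sample_ring" | count_up_nodes
```

Against a running instance, the same function could be fed with <cassandra_home>/bin/nodetool --host 127.0.0.1 ring | count_up_nodes.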

How it works...

Cassandra comes as a compiled Java application in a tar file. By default, it is configured to store data inside /var. Changing the directory options in the cassandra.yaml configuration file makes Cassandra use the directories created in this recipe instead.
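
The path changes can be made with sed rather than by hand. This is only a sketch: the function name relocate_storage is hypothetical, and the scratch file below is a stand-in that assumes the shipped defaults sit under /var/lib/cassandra, as described earlier in this recipe. Check your own cassandra.yaml before running anything like it on a real file.

```shell
#!/bin/sh
# relocate_storage FILE NEW_BASE
# Replace the default /var/lib/cassandra prefix with NEW_BASE in FILE.
relocate_storage() {
    file=$1
    new_base=$2
    # Use | as the sed delimiter because the paths contain slashes
    sed "s|/var/lib/cassandra|$new_base|g" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Demonstrate on a scratch file holding only the three relevant settings
workdir=$(mktemp -d)
cat > "$workdir/cassandra.yaml" <<'EOF'
data_file_directories:
    - /var/lib/cassandra/data
commitlog_directory: /var/lib/cassandra/commitlog
saved_caches_directory: /var/lib/cassandra/saved_caches
EOF

relocate_storage "$workdir/cassandra.yaml" /home/edward/cassandra
cat "$workdir/cassandra.yaml"
```

Working on a copy first makes it easy to compare the result with the original before replacing it.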

YAML: YAML Ain't Markup Language
YAML™ (rhymes with "camel") is a human-friendly, cross-language, Unicode-based data serialization language designed around the common native data types of agile programming languages. It is broadly useful for programming needs ranging from configuration files and Internet messaging to object persistence and data auditing. See http://www.yaml.org for more information.

After startup, Cassandra detaches from the console and runs as a daemon. It opens several ports, including the Thrift port 9160 and the JMX port 8080. In Cassandra 0.8 and later, the default JMX port is 7199. The nodetool program communicates with the JMX port to confirm that the server is alive.
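
Because the default JMX port depends on the release, a script that calls nodetool may need to choose the port from the version. The helper below is a hypothetical sketch of that choice (default_jmx_port is not a Cassandra tool); it only prints the nodetool command line it would use and does not contact a server:

```shell
#!/bin/sh
# Pick the default JMX port for a version string such as "0.7.2".
# Releases before 0.8 default to 8080; 0.8 and later default to 7199.
default_jmx_port() {
    major=${1%%.*}      # text before the first dot
    rest=${1#*.}        # text after the first dot
    minor=${rest%%.*}   # second version component
    if [ "$major" -eq 0 ] && [ "$minor" -lt 8 ]; then
        echo 8080
    else
        echo 7199
    fi
}

for version in 0.7.2 0.8.0 1.0.5; do
    echo "Cassandra $version: nodetool --host 127.0.0.1 --port $(default_jmx_port "$version") ring"
done
```

If the port has been changed in cassandra-env.sh, that value takes precedence over whatever this helper returns.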

There's more...

Because of the distributed design, many features require multiple running instances of Cassandra. For example, you cannot experiment with a Replication Factor, the setting that controls how many nodes each piece of data is stored on, larger than one. Replication Factor dictates which Consistency Level settings can be used. With one node, the highest Consistency Level is ONE.
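
To see why a single node limits the useful Consistency Levels, it helps to work out how many replicas a QUORUM operation needs. The sketch below uses the usual majority rule, floor(RF / 2) + 1; the function name quorum_size is hypothetical. With a Replication Factor of one, a quorum is a single replica, which is no stronger than ONE:

```shell
#!/bin/sh
# Replicas that must respond for a QUORUM operation,
# using the majority rule floor(RF / 2) + 1.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

for rf in 1 2 3 5; do
    echo "replication factor $rf -> quorum of $(quorum_size "$rf")"
done
```

A Replication Factor of three is the smallest value where a quorum (two replicas) tolerates the loss of one replica.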

Reading and writing test data using the command-line interface

The command-line interface (CLI) gives users an interactive tool to communicate with the Cassandra server and execute the same operations that can be done from client code. This recipe takes you through all the steps required to insert and read data.

How to do it...

Start the Cassandra CLI and connect to an instance:
$ <cassandra_home>/bin/cassandra-cli
[default@unknown] connect 127.0.0.1/9160;
Connected to: "Test Cluster" on 127.0.0.1/9160

New clusters do not have any preexisting keyspaces or column families. These need to be created so data can be stored in them:
[default@unknown] create keyspace testkeyspace;
[default@unknown] use testkeyspace;
Authenticated to keyspace: testkeyspace
[default@testkeyspace] create column family testcolumnfamily;

Insert and read back data using the set and get commands:
[default@testkeyspace] set testcolumnfamily['thekey']['thecolumn']='avalue';
Value inserted.
[default@testkeyspace] assume testcolumnfamily validator as ascii;
[default@testkeyspace] assume testcolumnfamily comparator as ascii;
[default@testkeyspace] get testcolumnfamily['thekey'];
=> (column=thecolumn, value=avalue, timestamp=1298580528208000)

How it works...

The CLI is a helpful interactive facade on top of the Cassandra API. After connecting, users can carry out administrative or troubleshooting tasks.
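
The same statements can be kept in a file and replayed, which is handy when a test keyspace has to be rebuilt often. The sketch below makes two assumptions: CASSANDRA_HOME is a hypothetical environment variable pointing at the extracted release, and your cassandra-cli accepts a --file option (check the CLI's help output for your version). If the CLI is not found, the script only writes the file:

```shell
#!/bin/sh
# Save the CLI statements from this recipe so they can be replayed later.
commands_file=$(mktemp)
cat > "$commands_file" <<'EOF'
create keyspace testkeyspace;
use testkeyspace;
create column family testcolumnfamily;
set testcolumnfamily['thekey']['thecolumn']='avalue';
get testcolumnfamily['thekey'];
EOF

# CASSANDRA_HOME is an assumed variable, not something Cassandra sets for you
cli="${CASSANDRA_HOME:-}/bin/cassandra-cli"
if [ -x "$cli" ]; then
    # --file is assumed to be available in your release
    "$cli" --host 127.0.0.1 --port 9160 --file "$commands_file"
else
    echo "cassandra-cli not found; statements saved in $commands_file"
fi
```

Every statement ends with a semicolon, just as it must at the interactive prompt.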

Running multiple instances on a single machine

Cassandra is typically deployed on clusters of multiple servers. While it can be run on a single node, simulating a production cluster of multiple nodes is best done by running multiple instances of Cassandra. This recipe is similar to A simple single node Cassandra installation earlier in this article. However, in order to run multiple instances on a single machine, we create a separate set of directories and a modified set of configuration files for each node.

How to do it...

Ensure your system has proper loopback address support. Each system should have the entire range of 127.0.0.1-127.255.255.255 configured as localhost for loopback. Confirm this by pinging 127.0.0.1 and 127.0.0.2:
$ ping -c 1 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_req=1 ttl=64 time=0.051 ms
$ ping -c 1 127.0.0.2
PING 127.0.0.2 (127.0.0.2) 56(84) bytes of data.
64 bytes from 127.0.0.2: icmp_req=1 ttl=64 time=0.083 ms

Use the echo command to display the path to your home directory. You will need this when editing the configuration file:
$ echo $HOME
/home/edward

Create an hpcas directory in your home directory. Inside the hpcas directory, create commitlog, log, saved_caches, and data subdirectories:
$ mkdir $HOME/hpcas/
$ mkdir $HOME/hpcas/{commitlog,log,data,saved_caches}
$ cd $HOME/hpcas/
$ cp $HOME/downloads/apache-cassandra-0.7.2-bin.tar.gz .
$ tar -xf apache-cassandra-0.7.2-bin.tar.gz

After downloading and extracting the binary distribution, rename the extracted directory by appending '1' to the end of its name:
$ mv apache-cassandra-0.7.2 apache-cassandra-0.7.2-1

Open apache-cassandra-0.7.2-1/conf/cassandra.yaml in a text editor. Change the default storage locations and IP addresses so that multiple instances can run on the same machine without clashing with each other:
data_file_directories:
- /home/edward/hpcas/data/1
commitlog_directory: /home/edward/hpcas/commitlog/1
saved_caches_directory: /home/edward/hpcas/saved_caches/1
listen_address: 127.0.0.1
rpc_address: 127.0.0.1

Each instance will have a separate log file, which aids troubleshooting. Edit conf/log4j-server.properties:
log4j.appender.R.File=/home/edward/hpcas/log/system1.log

Cassandra uses JMX (Java Management Extensions), which lets you configure an explicit port but always binds to all interfaces on the system. As a result, each instance requires its own management port. Edit conf/cassandra-env.sh:
JMX_PORT=8001

Start this instance:
$ ~/hpcas/apache-cassandra-0.7.2-1/bin/cassandra
INFO 17:59:26,699 Binding thrift service to /127.0.0.1:9160
INFO 17:59:26,702 Using TFramedTransport with a max frame size of 15728640 bytes.

$ ~/hpcas/apache-cassandra-0.7.2-1/bin/nodetool --host 127.0.0.1 --port 8001 ring
Address Status State Load Token
127.0.0.1 Up Normal 385 bytes 398856952452...

At this point your cluster consists of a single node. To join other nodes to the cluster, extract the binary again and carry out the preceding steps, replacing '1' with '2', '3', '4', and so on:
$ cd $HOME/hpcas/
$ tar -xf apache-cassandra-0.7.2-bin.tar.gz
$ mv apache-cassandra-0.7.2 apache-cassandra-0.7.2-2

Open ~/hpcas/apache-cassandra-0.7.2-2/conf/cassandra.yaml in a text editor:
data_file_directories:
- /home/edward/hpcas/data/2
commitlog_directory: /home/edward/hpcas/commitlog/2
saved_caches_directory: /home/edward/hpcas/saved_caches/2
listen_address: 127.0.0.2
rpc_address: 127.0.0.2

Edit ~/hpcas/apache-cassandra-0.7.2-2/conf/log4j-server.properties:
log4j.appender.R.File=/home/edward/hpcas/log/system2.log

Edit ~/hpcas/apache-cassandra-0.7.2-2/conf/cassandra-env.sh:
JMX_PORT=8002

Start this instance:
$ ~/hpcas/apache-cassandra-0.7.2-2/bin/cassandra

How it works...

The Thrift port has to be the same for all instances in a cluster. Thus, it is impossible to run multiple nodes in the same cluster on one IP address. However, computers have multiple loopback addresses: 127.0.0.1, 127.0.0.2, and so on. These addresses do not usually need to be configured explicitly. Each instance also needs its own storage directories. Following this recipe you can run as many instances on your computer as you wish, or even multiple distinct clusters. You are only limited by resources such as memory, CPU time, and hard disk space.
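
The per-instance values follow a simple pattern, so they can be generated rather than typed. The sketch below prints a summary of the settings for each instance number; it does not edit any files. The function name instance_settings is hypothetical, and the 8000 + N rule simply mirrors the JMX_PORT values 8001 and 8002 used above:

```shell
#!/bin/sh
# instance_settings N BASE
# Print a summary of the settings this recipe uses for instance number N.
instance_settings() {
    n=$1
    base=$2
    echo "data directory: $base/data/$n"
    echo "commitlog_directory: $base/commitlog/$n"
    echo "saved_caches_directory: $base/saved_caches/$n"
    echo "listen_address: 127.0.0.$n"
    echo "rpc_address: 127.0.0.$n"
    echo "log file: $base/log/system$n.log"
    echo "JMX_PORT=$((8000 + n))"
}

for n in 1 2 3; do
    echo "# instance $n"
    instance_settings "$n" /home/edward/hpcas
done
```

Each value still has to be copied into the matching cassandra.yaml, log4j-server.properties, and cassandra-env.sh file for that instance.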