Getting Started with Apache Cassandra

Cassandra High Performance Cookbook

Over 150 recipes to design and optimize large scale Apache Cassandra deployments

Introduction

The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together a fully distributed design and a ColumnFamily-based data model. This article contains recipes that let you hit the ground running with Cassandra. Several recipes show how to set up Cassandra and include brief explanations of the key configuration files. Others show how to connect to Cassandra and execute commands from both the application programming interface and the command-line interface. Java profiling tools such as JConsole are also described. Together, these recipes should help you understand the basics of running and working with Cassandra.

A simple single node Cassandra installation

Cassandra is a highly scalable distributed database. While it is designed to run on multiple production class servers, it can be installed on desktop computers for functional testing and experimentation. This recipe shows how to set up a single instance of Cassandra.

Getting ready

Visit http://cassandra.apache.org in your web browser and find a link to the latest binary release. New releases happen often. For reference, this recipe will assume apache-cassandra-0.7.2-bin.tar.gz was the name of the downloaded file.

How to do it…

  1. Download a binary version of Cassandra:

    $ mkdir -p $HOME/downloads
    $ cd $HOME/downloads
    $ wget <url_from_getting_ready>/apache-cassandra-0.7.2-bin.tar.gz

  2. Choose a base directory to which the user running Cassandra has read and write access:

    Default Cassandra storage locations
    Cassandra defaults to wanting to save data in /var/lib/cassandra and logs in /var/log/cassandra. These locations will likely not exist and will require root-level privileges to create. To avoid permission issues, carry out the installation in user-writable directories.

  3. Create a cassandra directory in your home directory. Inside the cassandra directory, create commitlog, log, saved_caches, and data subdirectories:

    $ mkdir $HOME/cassandra/
    $ mkdir $HOME/cassandra/{commitlog,log,data,saved_caches}
    $ cd $HOME/cassandra/
    $ cp $HOME/downloads/apache-cassandra-0.7.2-bin.tar.gz .
    $ tar -xf apache-cassandra-0.7.2-bin.tar.gz

  4. Use the echo command to display the path to your home directory. You will need this when editing the configuration file:

    $ echo $HOME
    /home/edward

    The tar file extracts to the apache-cassandra-0.7.2 directory. Open the conf/cassandra.yaml file inside it in your text editor and change the following settings to point at the directories you created:

    data_file_directories:
    - /home/edward/cassandra/data
    commitlog_directory: /home/edward/cassandra/commitlog
    saved_caches_directory: /home/edward/cassandra/saved_caches

  5. Edit the $HOME/cassandra/apache-cassandra-0.7.2/conf/log4j-server.properties file to change the directory where logs are written:

    log4j.appender.R.File=/home/edward/cassandra/log/system.log

  6. Start the Cassandra instance and confirm it is running by connecting with nodetool:


    $ $HOME/cassandra/apache-cassandra-0.7.2/bin/cassandra
    INFO 17:59:26,699 Binding thrift service to /127.0.0.1:9160
    INFO 17:59:26,702 Using TFramedTransport with a max frame size of 15728640 bytes.

    $ $HOME/cassandra/apache-cassandra-0.7.2/bin/nodetool --host 127.0.0.1 ring
    Address Status State Load Token
    127.0.0.1 Up Normal 385 bytes 398856952452...
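
Steps 3 to 5 can also be scripted. The following is a minimal sketch rather than part of the original recipe: relocate_cassandra_paths is a hypothetical helper, and it assumes the stock configuration files still reference /var/lib/cassandra and /var/log/cassandra as described above. It also assumes the target path contains no | or & characters, since those are special to sed here.

```shell
#!/bin/sh
# Hypothetical helper: point a freshly extracted Cassandra at user-writable
# directories. Assumes the stock files still reference /var/lib/cassandra
# and /var/log/cassandra.
#   $1 = conf directory, e.g. $HOME/cassandra/apache-cassandra-0.7.2/conf
#   $2 = base directory, e.g. $HOME/cassandra
relocate_cassandra_paths() {
    conf_dir=$1
    base_dir=$2

    # Create the storage and log directories from step 3.
    mkdir -p "$base_dir/commitlog" "$base_dir/log" \
             "$base_dir/data" "$base_dir/saved_caches" || return 1

    # Rewrite the storage prefix in cassandra.yaml (no "sed -i", for portability).
    sed "s|/var/lib/cassandra|$base_dir|g" "$conf_dir/cassandra.yaml" \
        > "$conf_dir/cassandra.yaml.tmp" &&
        mv "$conf_dir/cassandra.yaml.tmp" "$conf_dir/cassandra.yaml" || return 1

    # Rewrite the log prefix in log4j-server.properties.
    sed "s|/var/log/cassandra|$base_dir/log|g" "$conf_dir/log4j-server.properties" \
        > "$conf_dir/log4j-server.properties.tmp" &&
        mv "$conf_dir/log4j-server.properties.tmp" "$conf_dir/log4j-server.properties"
}

# Usage:
#   relocate_cassandra_paths "$HOME/cassandra/apache-cassandra-0.7.2/conf" "$HOME/cassandra"
```

Review the edited files before starting Cassandra; if your release ships different defaults, the substitutions will silently match nothing.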

How it works…

Cassandra ships as a compiled Java application in a tar file. By default, it is configured to store data under /var. Changing the options in the cassandra.yaml configuration file makes Cassandra use the directories created in this recipe instead.

YAML: YAML Ain’t Markup Language
YAML™ (rhymes with “camel”) is a human-friendly, cross-language, Unicode-based data serialization language designed around the common native data types of agile programming languages. It is broadly useful for programming needs ranging from configuration files and Internet messaging to object persistence and data auditing. See http://www.yaml.org for more information.

After startup, Cassandra detaches from the console and runs as a daemon. It opens several ports, including the Thrift port on 9160 and the JMX port on 8080. From Cassandra 0.8 onward, the default JMX port is 7199. The nodetool program communicates with the JMX port to confirm that the server is alive.
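
Because the default JMX port depends on the release, a script that calls nodetool may want to pick the port from the version string. The sketch below is illustrative only: default_jmx_port is a hypothetical helper, and it assumes the switch from 8080 to 7199 applies to every release from 0.8 onward.

```shell
#!/bin/sh
# Hypothetical helper: print the default JMX port for a Cassandra version
# string such as "0.7.2". Assumes 8080 before 0.8 and 7199 from 0.8 onward.
default_jmx_port() {
    version=$1
    major=${version%%.*}   # text before the first dot
    rest=${version#*.}     # text after the first dot
    minor=${rest%%.*}      # text before the next dot
    if [ "$major" -gt 0 ] || [ "$minor" -ge 8 ]; then
        echo 7199
    else
        echo 8080
    fi
}

# Usage:
#   bin/nodetool --host 127.0.0.1 --port "$(default_jmx_port 0.7.2)" ring
```

If the port has been overridden in cassandra-env.sh, use that value instead of the default.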

There’s more…

Because of Cassandra's distributed design, many features can only be exercised with multiple instances running. For example, you cannot experiment with a Replication Factor (the setting that controls how many nodes each piece of data is stored on) larger than one. The Replication Factor also dictates which Consistency Level settings are meaningful. With one node, the highest Consistency Level is ONE.
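
To see why a single node limits what you can test, it helps to work out how many replicas a QUORUM operation needs. The sketch below uses the usual majority rule, floor(RF / 2) + 1; quorum_size is a hypothetical helper name, not a Cassandra command.

```shell
#!/bin/sh
# Number of replicas that must respond for a QUORUM read or write:
# floor(RF / 2) + 1. Shell arithmetic already uses integer division.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Print the quorum size for a few Replication Factors.
for rf in 1 2 3 5; do
    echo "RF=$rf QUORUM=$(quorum_size "$rf")"
done
```

With a Replication Factor of 1, a quorum is a single replica, so on one node the stronger levels behave no differently from ONE.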

Reading and writing test data using the command-line interface

The command-line interface (CLI) is an interactive tool for communicating with the Cassandra server and executing the same operations that can be performed from client code. This recipe takes you through all the steps required to insert and read data.

How to do it…

  1. Start the Cassandra CLI and connect to an instance:

    $ <cassandra_home>/bin/cassandra-cli
    [default@unknown] connect 127.0.0.1/9160;
    Connected to: "Test Cluster" on 127.0.0.1/9160

  2. New clusters do not have any preexisting keyspaces or column families. These need to be created so data can be stored in them:

    [default@unknown] create keyspace testkeyspace;
    [default@unknown] use testkeyspace;
    Authenticated to keyspace: testkeyspace
    [default@testkeyspace] create column family testcolumnfamily;

  3. Insert and read back data using the set and get commands:

    [default@testk..] set testcolumnfamily['thekey']['thecolumn']='avalue';
    Value inserted.
    [default@testkeyspace] assume testcolumnfamily validator as ascii;
    [default@testkeyspace] assume testcolumnfamily comparator as ascii;
    [default@testkeyspace] get testcolumnfamily['thekey'];
    => (column=thecolumn, value=avalue, timestamp=1298580528208000)
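
The same statements can be saved to a file and replayed non-interactively. This is a sketch under assumptions: write_cli_script is a hypothetical helper, the file path is arbitrary, and the -host, -port, and -f options should be confirmed against the help output of your cassandra-cli version before relying on them.

```shell
#!/bin/sh
# Hypothetical helper: write the CLI statements from this recipe to the file
# named by $1 so they can be replayed without typing them interactively.
write_cli_script() {
    cat > "$1" <<'EOF'
create keyspace testkeyspace;
use testkeyspace;
create column family testcolumnfamily;
set testcolumnfamily['thekey']['thecolumn']='avalue';
assume testcolumnfamily validator as ascii;
assume testcolumnfamily comparator as ascii;
get testcolumnfamily['thekey'];
EOF
}

# Usage (option names are assumptions; verify them for your version):
#   write_cli_script /tmp/testdata.txt
#   <cassandra_home>/bin/cassandra-cli -host 127.0.0.1 -port 9160 -f /tmp/testdata.txt
```

Keeping the statements in a file makes it easy to recreate the test keyspace after wiping the data directories.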

How it works…

The CLI is a helpful interactive facade on top of the Cassandra API. After connecting, users can carry out administrative or troubleshooting tasks.

Running multiple instances on a single machine

Cassandra is typically deployed on clusters of multiple servers. While it can be run on a single node, a multi-node production cluster is best simulated by running multiple instances of Cassandra. This recipe is similar to "A simple single node Cassandra installation" earlier in this article. However, to run multiple instances on a single machine, we create a separate set of directories and a modified set of configuration files for each node.

How to do it…

  1. Ensure your system has proper loopback address support. Each system should have the entire range of 127.0.0.1-127.255.255.255 configured as localhost for loopback. Confirm this by pinging 127.0.0.1 and 127.0.0.2:

    $ ping -c 1 127.0.0.1
    PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
    64 bytes from 127.0.0.1: icmp_req=1 ttl=64 time=0.051 ms
    $ ping -c 1 127.0.0.2
    PING 127.0.0.2 (127.0.0.2) 56(84) bytes of data.
    64 bytes from 127.0.0.2: icmp_req=1 ttl=64 time=0.083 ms

  2. Use the echo command to display the path to your home directory. You will need this when editing the configuration file:

    $ echo $HOME
    /home/edward

  3. Create a directory named hpcas in your home directory. Inside the hpcas directory, create commitlog, log, saved_caches, and data subdirectories:

    $ mkdir $HOME/hpcas/
    $ mkdir $HOME/hpcas/{commitlog,log,data,saved_caches}
    $ cd $HOME/hpcas/
    $ cp $HOME/downloads/apache-cassandra-0.7.2-bin.tar.gz .
    $ tar -xf apache-cassandra-0.7.2-bin.tar.gz

  4. After extracting the binary distribution, rename the extracted directory by appending ‘1’ to the end of its name:

    $ mv apache-cassandra-0.7.2 apache-cassandra-0.7.2-1

    Open apache-cassandra-0.7.2-1/conf/cassandra.yaml in a text editor. Change the default storage locations and IP addresses so that multiple instances can run on the same machine without clashing with each other:

    data_file_directories:
    - /home/edward/hpcas/data/1
    commitlog_directory: /home/edward/hpcas/commitlog/1
    saved_caches_directory: /home/edward/hpcas/saved_caches/1
    listen_address: 127.0.0.1
    rpc_address: 127.0.0.1

    Each instance will have a separate logfile. This will aid in troubleshooting. Edit conf/log4j-server.properties:

    log4j.appender.R.File=/home/edward/hpcas/log/system1.log

    Cassandra uses JMX (Java Management Extensions), which allows you to configure an explicit port but always binds to all interfaces on the system. As a result, each instance will require its own management port. Edit cassandra-env.sh:

    JMX_PORT=8001

  5. Start this instance:

    $ ~/hpcas/apache-cassandra-0.7.2-1/bin/cassandra

    INFO 17:59:26,699 Binding thrift service to /127.0.0.1:9160
    INFO 17:59:26,702 Using TFramedTransport with a max frame size of 15728640 bytes.

    $ ~/hpcas/apache-cassandra-0.7.2-1/bin/nodetool --host 127.0.0.1 --port 8001 ring

    Address Status State Load Token
    127.0.0.1 Up Normal 385 bytes 398856952452...

    At this point, your cluster consists of a single node. To join other nodes to the cluster, extract the tar file again and carry out the preceding steps, replacing ‘1’ with ‘2’, ‘3’, ‘4’, and so on:

    $ mv apache-cassandra-0.7.2 apache-cassandra-0.7.2-2

  6. Open ~/hpcas/apache-cassandra-0.7.2-2/conf/cassandra.yaml in a text editor:

    data_file_directories:
    - /home/edward/hpcas/data/2
    commitlog_directory: /home/edward/hpcas/commitlog/2
    saved_caches_directory: /home/edward/hpcas/saved_caches/2
    listen_address: 127.0.0.2
    rpc_address: 127.0.0.2

  7. Edit ~/hpcas/apache-cassandra-0.7.2-2/conf/log4j-server.properties:

    log4j.appender.R.File=/home/edward/hpcas/log/system2.log

  8. Edit ~/hpcas/apache-cassandra-0.7.2-2/conf/cassandra-env.sh:

    JMX_PORT=8002

  9. Start this instance:

    $ ~/hpcas/apache-cassandra-0.7.2-2/bin/cassandra
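
Since every additional instance repeats the same edits with a different number, the per-instance changes lend themselves to a script. The following is a minimal sketch, not the author's tooling: edit_in_place and configure_instance are hypothetical helpers, and the sed patterns assume the stock files contain /var/lib/cassandra/data, top-level listen_address and rpc_address keys, a log4j.appender.R.File line, and a JMX_PORT= line.

```shell
#!/bin/sh
# Apply sed expressions to a file in place without relying on "sed -i",
# whose syntax differs between GNU and BSD sed.
edit_in_place() {
    target=$1; shift
    sed "$@" "$target" > "$target.tmp" && mv "$target.tmp" "$target"
}

# Hypothetical helper: apply this recipe's per-instance edits.
#   $1 = conf directory of the instance
#   $2 = base directory, e.g. $HOME/hpcas
#   $3 = instance number (1, 2, 3, ...)
configure_instance() {
    conf_dir=$1; base_dir=$2; n=$3

    # Storage locations and addresses: instance N uses .../N and 127.0.0.N.
    edit_in_place "$conf_dir/cassandra.yaml" \
        -e "s|/var/lib/cassandra/data|$base_dir/data/$n|" \
        -e "s|^commitlog_directory:.*|commitlog_directory: $base_dir/commitlog/$n|" \
        -e "s|^saved_caches_directory:.*|saved_caches_directory: $base_dir/saved_caches/$n|" \
        -e "s|^listen_address:.*|listen_address: 127.0.0.$n|" \
        -e "s|^rpc_address:.*|rpc_address: 127.0.0.$n|" || return 1

    # One log file per instance.
    edit_in_place "$conf_dir/log4j-server.properties" \
        -e "s|^log4j.appender.R.File=.*|log4j.appender.R.File=$base_dir/log/system$n.log|" || return 1

    # One JMX port per instance: 8001, 8002, ...
    edit_in_place "$conf_dir/cassandra-env.sh" \
        -e "s|^JMX_PORT=.*|JMX_PORT=$((8000 + $n))|"
}

# Usage:
#   configure_instance "$HOME/hpcas/apache-cassandra-0.7.2-2/conf" "$HOME/hpcas" 2
```

Check the resulting files by hand the first time; a pattern that does not match the shipped defaults leaves the file unchanged without reporting an error.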

How it works…

The Thrift port has to be the same for all instances in a cluster. Thus, it is impossible to run multiple nodes in the same cluster on one IP address. However, computers have multiple loopback addresses: 127.0.0.1, 127.0.0.2, and so on. These addresses do not usually need to be configured explicitly. Each instance also needs its own storage directories. Following this recipe you can run as many instances on your computer as you wish, or even multiple distinct clusters. You are only limited by resources such as memory, CPU time, and hard disk space.
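
The numbering convention in this recipe (instance N listens on 127.0.0.N and exposes JMX on port 8000+N) makes it easy to work out how to reach any instance. As a small illustration, the hypothetical nodetool_command helper below prints, rather than runs, the check for each instance; it assumes you run the command from inside that instance's directory.

```shell
#!/bin/sh
# Hypothetical helper: print the nodetool command for instance N, using the
# 127.0.0.N address and 8000+N JMX port convention from this recipe.
nodetool_command() {
    n=$1
    echo "bin/nodetool --host 127.0.0.$n --port $((8000 + $n)) ring"
}

# Print (rather than run) the check for the first three instances.
for i in 1 2 3; do
    nodetool_command "$i"
done
```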
