
In this article by Shilpi Saxena, author of the book Real-time Analytics with Storm and Cassandra, we will cover the following topics:

  • What’s possible with data analysis?
  • Real-time analytics: why it is becoming the need of the hour
  • Why Storm: the power of high-speed distributed computation

We will get you to think about some interesting problems along the lines of air traffic control (ATC), credit card fraud detection, and so on.

First and foremost, you will understand what big data is. Big data is the buzzword of the software industry, but it is much more than buzz: in reality, it is a truly huge amount of data.


What is big data?

Big data is characterized by volume, velocity, and variety. These are described as follows:

  • Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information (for example, converting the 12 terabytes of tweets created each day into improved product sentiment analysis, or converting 350 billion annual meter readings to better predict power consumption).
  • Velocity: Sometimes, 2 minutes is too late. For time-sensitive processes, such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value (for example, scrutinizing 5 million trade events created each day to identify potential fraud, or analyzing 500 million call detail records daily in real time to predict customer churn faster).
  • Variety: Big data is any type of data, structured and unstructured, such as text, sensor data, audio, video, click streams, log files, and many more. New insights are found when these data types are analyzed together (for example, monitoring hundreds of live video feeds from surveillance cameras to target points of interest, or exploiting the 80 percent growth in image, video, and document data to improve customer satisfaction).

Now that I have described big data, let’s have a quick look at where this data is generated and how it comes into existence. The following figure gives a quick snapshot of everything that can happen in one second in the world of the Internet and social media. To gain meaningful insight, we need the power to process all of this data at the same rate at which it is generated, as shown:

[Figure: What can happen in one second on the Internet and social media]

This power of computation comes from the combination of Storm and Cassandra. This technological combo lets us cater to the following use cases:

  • Credit card fraud detection
  • Security breaches
  • Bandwidth allocation
  • Machine failures
  • Supply chain
  • Personalized content
  • Recommendations

Getting acquainted with a few problems that require a distributed computing solution

Let’s do a deep dive and identify some of the problems that require distributed solutions.

Real-time business solution for credit or debit card fraud detection

Let’s get acquainted with the problem depicted in the following figure. When we make a transaction using plastic money and swipe a debit or credit card for payment, the bank has less than 5 seconds to validate or reject the transaction. In those 5 seconds, the transaction details have to be encrypted and travel over a secure network from the servicing bank to the issuing bank; at the issuing bank, the entire fuzzy logic for accepting or declining the transaction has to be computed; and the result has to travel back over the secure network:

[Figure: Credit or debit card fraud detection in real time]

Challenges such as network latency and delay can be optimized only to some extent; to complete the preceding transaction in less than 5 seconds, one has to design an application that is able to churn through a considerable amount of data and generate results within 1 to 2 seconds.

Aircraft Communications Addressing and Reporting System

This is another typical use case that cannot be implemented without a reliable real-time processing system in place. These systems use satellite communication (SATCOM) and, as shown in the following figure, gather voice and packet data from all phases of flight in real time, and are able to generate analytics and alerts on the same data in real time.

[Figure: Aircraft Communications Addressing and Reporting System in action]

Let’s take an example from the preceding figure. A flight encounters some really hazardous weather, say electrical storms, on its route. That information is sent through satellite links and voice or data gateways to the air traffic controller, which detects it in real time and raises alerts to divert the routes of all other flights passing through that area.

Healthcare

This is another very important domain where real-time analytics over high-volume, high-velocity data equips healthcare professionals with accurate and timely information, enabling them to take informed, life-saving actions in real time.

[Figure: Real-time analytics in healthcare]

The preceding figure depicts a use case where doctors can take informed action to handle a patient’s medical situation. Data is collated from the historical patient database, the drug database, and the patient’s records. Once the data is collected, it is processed, and the patient’s live statistics and key parameters are plotted against the collated data. This data can then be used to generate reports and alerts that aid healthcare professionals in real time.

Other applications

There are a variety of other applications where the power of real-time computing can optimize operations or help people make informed decisions. It has become a great utility and aid in the following industries:

  • Manufacturing
  • Application performance monitoring
  • Customer relationship management
  • Transportation industry
  • Network optimization

Complexity of existing solutions

Now that we understand the power that real-time solutions can bring to various industry verticals, let’s explore what options we have to process vast amounts of data generated at a very fast pace.

The Hadoop solution

The Hadoop solution is a tried, tested, and proven solution in the industry, in which MapReduce jobs are executed in a clustered setup to process data and generate results.

MapReduce is a programming paradigm in which we process large datasets by using a mapper function that processes key-value pairs and generates intermediate output, again in the form of key-value pairs. A reduce function then operates on the mapper output, merges the values associated with the same intermediate key, and generates the result.

[Figure: A word count MapReduce job]

In the preceding figure, we demonstrate a simple word count MapReduce job, where:

  • There is a huge big data store, which can run to petabytes or even zettabytes
  • Blocks of the input data are split and replicated onto the nodes of the Hadoop cluster
  • Each mapper job counts the number of words in the data blocks allocated to it
  • Once the mapper is done, the words (which are actually the keys) and their counts are sent to the reducers
  • The reducers combine the mapper output and generate the final results, as sketched in the code that follows this list
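
To make the flow concrete, here is a minimal sketch of such a word count job written against the standard Hadoop MapReduce Java API; the input and output paths are passed as command-line arguments and are purely illustrative:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in the data block allocated to it
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: merges all counts received for the same word (the intermediate key)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local aggregation on the mapper side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```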

Hadoop, as we know, did provide a solution for processing and generating results out of humongous volumes of data, but it is predominantly a batch processing system and has almost no utility in real-time use cases.

A custom solution

Here, we talk about a solution of the kind Twitter used before the advent of Storm. A simplistic version of the problem could be that you need a real-time count of the tweets by each user; Twitter solved the problem with the mechanism shown in the following figure:

[Figure: Twitter’s queue-and-worker solution for real-time tweet counts]

Here is how the preceding mechanism works in detail:

  • They created a firehose, or queue, onto which all the tweets are pushed.
  • A first set of worker nodes reads from the queue, deciphers the tweet JSON, and maintains the count of tweets by each user. In this first set of workers, the tweets are distributed evenly among the workers, so the sharding is random.
  • These workers assimilate their first-level counts into the next set of queues.
  • From these queues, a second level of workers picks up the counts. Here, the sharding is not random; an algorithm is in place that ensures that the tweet count of one user always goes to the same worker (see the small sketch after this list). The counts are then dumped into a data store.
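
As a tiny illustration of that second-level sharding, the routing decision can be as simple as hashing the user ID so that a given user always maps to the same worker; the class and method names below are purely illustrative and are not Twitter’s actual code:

```java
import java.util.HashMap;
import java.util.Map;

public class ShardedTweetCounter {

    private final int numWorkers;
    private final Map<String, Long>[] countsPerWorker;

    @SuppressWarnings("unchecked")
    public ShardedTweetCounter(int numWorkers) {
        this.numWorkers = numWorkers;
        this.countsPerWorker = new HashMap[numWorkers];
        for (int i = 0; i < numWorkers; i++) {
            countsPerWorker[i] = new HashMap<>();
        }
    }

    // Hashing the user ID guarantees that all tweets of one user
    // are counted by the same worker (non-random sharding)
    private int workerFor(String userId) {
        return Math.abs(userId.hashCode() % numWorkers);
    }

    public void onTweet(String userId) {
        countsPerWorker[workerFor(userId)].merge(userId, 1L, Long::sum);
    }
}
```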

The queue-worker solution has the following drawbacks:

  • It is very complex and specific to the use case
  • Redeployment and reconfiguration are a huge task
  • Scaling is very tedious
  • The system is not fault tolerant

Paid solution

Well, this is always an option. A lot of big companies have invested in products that let us do this kind of computing, but they come at a heavy license cost. A few solutions, to name some, come from companies such as:

  • IBM
  • Oracle
  • Vertica
  • GigaSpaces

Open real-time processing tools

There are a few other technologies that have some traits and features similar to Apache Storm, such as S4 from Yahoo!, but S4 lacks guaranteed processing. Spark is essentially a batch processing system with some micro-batching features, which can be utilized for near-real-time processing.

So, finally, after evaluating all these options, we still find Storm to be the best open source candidate to handle these use cases.
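
To give a feel for what the same tweet-counting pipeline looks like in Storm, here is a minimal topology sketch. It uses the Storm 0.9.x backtype.storm API; TweetSpout is a placeholder spout (not shown here) that is assumed to emit tuples with a single user field:

```java
import java.util.HashMap;
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;

public class TweetCountTopology {

    // Bolt that keeps a running tweet count per user; fieldsGrouping on "user"
    // guarantees that all tweets of one user reach the same bolt instance
    public static class UserCountBolt extends BaseBasicBolt {
        private final Map<String, Long> counts = new HashMap<>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String user = tuple.getStringByField("user");
            counts.merge(user, 1L, Long::sum);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal bolt in this sketch; nothing is emitted downstream
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // TweetSpout (placeholder) reads tweets off the firehose/queue
        builder.setSpout("tweet-spout", new TweetSpout(), 2);
        builder.setBolt("count-bolt", new UserCountBolt(), 4)
               .fieldsGrouping("tweet-spout", new Fields("user"));

        Config conf = new Config();
        conf.setNumWorkers(2);

        // LocalCluster runs the topology in-process; on a real cluster,
        // StormSubmitter.submitTopology would be used instead
        new LocalCluster().submitTopology("tweet-count", conf, builder.createTopology());
    }
}
```

The fieldsGrouping call gives us, for free, the non-random sharding we had to hand-build in the custom solution, and the framework takes care of redeployment, scaling, and fault tolerance.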

Storm persistence

Storm processes streaming data at a very high velocity. Cassandra complements Storm’s processing ability by providing support for writing to and reading from NoSQL at a very high rate. There are a variety of APIs available for connecting to Cassandra.

In general, the APIs we are talking about are wrappers written over the core Thrift API; they offer various CRUD operations over a Cassandra cluster using programmer-friendly packages.

  • Thrift protocol: The most basic and core of all the APIs for accessing Cassandra, this is the RPC protocol, which provides a language-neutral interface and thus the flexibility to communicate using Python, Java, and so on. Please note that almost all the other APIs we discuss use Thrift under the hood. It is simple to use and provides basic functionality out of the box, such as ring discovery and native access. Complex features such as retries, connection pooling, and so on are not supported out of the box. A variety of libraries have extended Thrift and added these much-required features; we’d like to touch upon a few widely used ones in this article.
  • Hector: This has the privilege of being one of the most stable and extensively used APIs for Java-based client applications to access Cassandra. As said earlier, it uses Thrift underneath, so it cannot essentially offer any feature or functionality not supported by the Thrift protocol. The reason for its widespread use is the number of essential features that are ready to use and available out of the box:
    • It has an implementation for connection pooling
    • It has a ring discovery feature with an add-on of automatic failover support
    • It has a retry mechanism for downed hosts in the Cassandra ring
  • DataStax Java Driver: This one is again a recent addition to the stack of client access options for Cassandra, and hence gels well with newer versions of Cassandra (see the short connection sketch after this list). Here are its salient features:
    • Connection pooling
    • Reconnection policies
    • Load balancing
    • Cursor support
  • Astyanax: This is a very recent addition to the bouquet of Cassandra client APIs and has been developed by Netflix, which definitely makes it more fabled than the others. Let’s have a look at its credentials to see where it qualifies:
      • It supports all the Hector functions and is much easier to use
      • It promises better connection pooling than Hector
      • It has better failover handling than Hector
      • It provides some out-of-the-box, database-like features (now that’s big news). At the API level, it provides functionality called Recipes, which offers:
        • Parallel all-row query execution
        • Messaging queue functionality
        • Object store
        • Pagination
      • It has numerous frequently required utilities, such as the following:
        • JSON Writer
        • CSV importer
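
As a quick taste of the DataStax Java Driver mentioned above, here is a minimal connect-and-query sketch; the contact point, keyspace, table, and column names are illustrative placeholders, and the API shown is the 2.x-era driver that was current for this book:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraReadExample {
    public static void main(String[] args) {
        // Contact point and keyspace are placeholders for this sketch
        Cluster cluster = Cluster.builder()
                                 .addContactPoint("127.0.0.1")
                                 .build();
        Session session = cluster.connect("analytics");

        // Simple parameterized read; connection pooling, load balancing,
        // and reconnection are handled by the driver itself
        ResultSet rs = session.execute(
                "SELECT user_id, tweet_count FROM tweet_counts WHERE user_id = ?",
                "user42");
        for (Row row : rs) {
            System.out.println(row.getString("user_id") + " -> " + row.getLong("tweet_count"));
        }

        cluster.close();
    }
}
```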

Summary

In this article, we reviewed what big data is, how it is analyzed, the applications in which it is used, the complexity of existing solutions, and how Storm together with Cassandra addresses real-time processing and persistence.
