
In this article, we will implement the Batch and Service layers to complete the architecture.

There are some key concepts underlying this big data architecture:

Immutable state
Abstraction and composition
Constrain complexity

Immutable state is the key, in that it provides true fault-tolerance for the architecture. If a failure is experienced at any level, we can always rebuild the data from the original immutable data. This is in contrast to many existing data systems, where the paradigm is to act on mutable data. This approach may seem simple and logical; however, it exposes the system to a particular kind of risk in which the state is lost or corrupted. It also constrains the system, in that you can only work with the current view of the data; it isn't possible to derive new views of the data. When the architecture is based on a fundamentally immutable state, it becomes both flexible and fault-tolerant.

Abstractions allow us to remove complexity in some cases, and in others they can introduce complexity. It is important to achieve an appropriate set of abstractions that increase our productivity and remove complexity, but at an appropriate cost. It must be noted that all abstractions leak, meaning that when failures occur at a lower abstraction, they will affect the higher-level abstractions. It is therefore often important to be able to make changes within the various layers and to understand more than one layer of abstraction. The designs we choose to implement our abstractions must therefore not prevent us from reasoning about or working at the lower levels of abstraction when required. Open source projects are often good at this, because the code of the lower-level abstractions is accessible, but even with source code available, it is easy to make an abstraction so convoluted that it becomes a risk. In a big data solution, we have to work at higher levels of abstraction in order to be productive and deal with the massive complexity, so we need to choose our abstractions carefully. In the case of Storm, Trident represents an appropriate abstraction for dealing with the data-processing complexity, but the lower-level Storm API on which Trident is based isn't hidden from us. We are therefore able to easily reason about Trident based on an understanding of lower-level abstractions within Storm.

Another key issue to consider when dealing with complexity and productivity is composition. Composition within a given layer of abstraction allows us to quickly build out a solution that is well tested and easy to reason about. Composition is fundamentally decoupled, while abstraction contains some inherent coupling to the lower-level abstractions—something that we need to be aware of.

Finally, a big data solution needs to constrain complexity. Complexity always equates to risk and cost in the long run, both from a development perspective and from an operational perspective. Real-time solutions will always be more complex than batch-based systems; they also lack some of the qualities we require in terms of performance. Nathan Marz's Lambda architecture attempts to address this by combining the qualities of each type of system to constrain complexity and deliver a truly fault-tolerant architecture.

We divided this flow into preprocessing and "at time" phases, using streams and DRPC streams respectively. We also introduced time windows that allowed us to segment the preprocessed data. In this article, we complete the entire architecture by implementing the Batch and Service layers.

The Service layer is simply a store of a view of the data. In this case, we will store this view in Cassandra, as it is a convenient place to access the state alongside Trident's state. The preprocessed view has the same structure as the one created by Trident, namely the counted elements of the TF-IDF formula (D, DF, and TF). In the batch case, however, the dataset is much larger, as it includes the entire history.
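
As a rough illustration of how the three counted elements combine at query time, the following Java sketch computes a TF-IDF score from tf, df, and d. The class name, the method name, and the 1 + df smoothing term are assumptions made for this example; the exact formula used elsewhere in the architecture may differ.

```java
final class TfIdfSketch {

    // tf: occurrences of the term within one document
    // df: number of documents that contain the term
    // d:  total number of documents seen
    // The "1 + df" smoothing is an assumption for this sketch; it avoids
    // division by zero for terms that have not been counted yet.
    static double tfIdf(long tf, long df, long d) {
        return tf * Math.log((double) d / (1.0 + df));
    }

    public static void main(String[] args) {
        // A term seen 3 times in a document, present in 1 of 10 documents.
        System.out.println(tfIdf(3, 1, 10));
    }
}
```

Because the batch and real-time views hold the same counted elements, a score of this kind can be computed after the two sets of counts have been merged.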

The Batch layer is implemented in Hadoop using MapReduce to calculate the preprocessed view of the data. MapReduce is extremely powerful, but like the lower-level Storm API, is potentially too low-level for the problem at hand for the following reasons:

We need to describe the problem as a data pipeline; MapReduce isn't congruent with such a way of thinking
Productivity

We would like to think of a data pipeline in terms of streams of data, tuples within the stream, and predicates acting on those tuples. This allows us to easily describe a solution to a data processing problem, and it also promotes composability: predicates are fundamentally composable, and pipelines themselves can be composed to form larger, more complex pipelines. Cascading provides such an abstraction for MapReduce in the same way as Trident does for Storm.

With these tools, approaches, and considerations in place, we can now complete our real-time big data architecture. There are a number of elements that we will update, and a number of elements that we will add. The following figure illustrates the final architecture, where the elements in light grey will be updated from the existing recipe, and the elements in dark grey will be added in this article:

[Figure: the final architecture]

Implementing TF-IDF in Hadoop

TF-IDF is a well-known problem in the MapReduce communities; it is well-documented and implemented, and it is interesting in that it is sufficiently complex to be useful and instructive at the same time. Cascading has a series of tutorials on TF-IDF at http://www.cascading.org/2012/07/31/cascading-for-the-impatient-part-5/, which documents this implementation well. For this recipe, we shall use a Clojure Domain Specific Language (DSL) called Cascalog that is implemented on top of Cascading. Cascalog has been chosen because it provides a set of abstractions that are very semantically similar to the Trident API and are very terse while still remaining very readable and easy to understand.

Getting ready

Before you begin, please ensure that you have installed Hadoop by following the instructions at http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/.

How to do it…

Start by creating the project using the lein command:
```
lein new tfidf-cascalog
```

Next, you need to edit the project.clj file to include the dependencies:

(defproject tfidf-cascalog "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.4.0"]
                 [cascalog "1.10.1"]
                 [org.apache.cassandra/cassandra-all "1.1.5"]
                 [clojurewerkz/cassaforte "1.0.0-beta11-SNAPSHOT"]
                 [quintona/cascading-cassandra "0.0.7-SNAPSHOT"]
                 [clj-time "0.5.0"]
                 [cascading.avro/avro-scheme "2.2-SNAPSHOT"]
                 [cascalog-more-taps "0.3.0"]
                 [org.apache.httpcomponents/httpclient "4.2.3"]]
  :profiles {:dev {:dependencies [[org.apache.hadoop/hadoop-core "0.20.2-dev"]
                                  [lein-midje "3.0.1"]
                                  [cascalog/midje-cascalog "1.10.1"]]}})

It is always a good idea to validate your dependencies; to do this, execute lein deps and review any errors. In this particular case, cascading-cassandra has not been deployed to Clojars, so you will receive an error message. Simply download the source from https://github.com/quintona/cascading-cassandra and install it into your local repository using Maven.

It is also good practice to understand your dependency tree. This is important to not only prevent duplicate classpath issues, but also to understand what licenses you are subject to. To do this, simply run lein pom, followed by mvn dependency:tree. You can then review the tree for conflicts. In this particular case, you will notice that there are two conflicting versions of Avro. You can fix this by adding the appropriate exclusions:
```
[org.apache.cassandra/cassandra-all "1.1.5"
           :exclusions [org.apache.cassandra.deps/avro]]
```
We then need to create the Clojure-based Cascalog queries that will process the document data. We first need to create the query that will create the "D" view of the data; that is, the D portion of the TF-IDF function. This is achieved by defining a Cascalog function, composed of a set of predicates, that will output a key and a value:
```
(defn D [src]
  (let [src  (select-fields src ["?doc-id"])]
    (<- [?key ?d-str]
        (src ?doc-id)
        (c/distinct-count ?doc-id :> ?n-docs)
        (str "twitter" :> ?key)
        (str ?n-docs :> ?d-str))))
```
You can define this and any of the following functions in the REPL, or add them to core.clj in your project. If you want to use the REPL, simply use lein repl from within the project folder. The required namespace (the use statement), require, and import definitions can be found in the source code bundle.

We then need to add similar functions to calculate the TF and DF values:

(defn DF [src]
  (<- [?key ?df-count-str]
      (src ?doc-id ?time ?df-word)
      (c/distinct-count ?doc-id ?df-word :> ?df-count)
      (str ?df-word :> ?key)
      (str ?df-count :> ?df-count-str)))

(defn TF [src]
  (<- [?key ?tf-count-str]
      (src ?doc-id ?time ?tf-word)
      (c/count ?tf-count)
      (str ?doc-id ?tf-word :> ?key)
      (str ?tf-count :> ?tf-count-str)))

This Batch layer is only interested in calculating views for all the data leading up to, but not including, the current hour. This is because the data for the current hour will be provided by Trident when it merges this batch view with the view it has calculated. In order to achieve this, we need to filter out all the records that are within the current hour. The following function makes that possible:
```
(deffilterop timing-correct? [doc-time]
  (let [now     (local-now)
        minutes (in-minutes (interval (from-long doc-time) now))]
    ;; keep only records that are at least 60 minutes old
    (>= minutes 60)))
```
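
To make the filtering rule easier to reason about outside Cascalog, here is a minimal plain-Java sketch of the same check. The class and method names are hypothetical. The first method mirrors the filter above, which keeps records that are at least 60 minutes old. The second is an alternative reading of "not including the current hour" that cuts off at the start of the current clock hour (in UTC); it is shown only for comparison and is not part of the recipe.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

final class TimingFilterSketch {

    // Mirrors the Cascalog filter: true when the record is at least 60 minutes old.
    static boolean atLeastAnHourOld(long docTimeMillis, Instant now) {
        Duration age = Duration.between(Instant.ofEpochMilli(docTimeMillis), now);
        return age.toMinutes() >= 60;
    }

    // Alternative (assumption): true when the record predates the current clock hour.
    static boolean beforeCurrentHour(long docTimeMillis, Instant now) {
        Instant startOfHour = now.truncatedTo(ChronoUnit.HOURS);
        return Instant.ofEpochMilli(docTimeMillis).isBefore(startOfHour);
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2013-01-01T10:30:00Z");
        long doc = Instant.parse("2013-01-01T09:45:00Z").toEpochMilli();
        // The two rules disagree for a record written 45 minutes ago in the previous hour.
        System.out.println(atLeastAnHourOld(doc, now));
        System.out.println(beforeCurrentHour(doc, now));
    }
}
```

Whichever rule is chosen, the real-time layer needs to apply the matching boundary so that no record is counted twice or dropped when the views are merged.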
Each of the preceding query definitions requires a clean stream of words. The text contained in the source documents isn't clean; it still contains stop words. In order to filter these and emit a clean set of words for these queries, we can compose a function that splits the text into words and filters them based on a list of stop words and the time function defined previously:
```
(defn etl-docs-gen [rain stop]
  (<- [?doc-id ?time ?word]
      (rain ?doc-id ?time ?line)
      (split ?line :> ?word-dirty)
      ((c/comp s/trim s/lower-case) ?word-dirty :> ?word)
      (stop ?word :> false)
      (timing-correct? ?time)))
```

We will be storing the outputs from our queries to Cassandra, which requires us to define a set of taps for these views:

(defn create-tap [rowkey cassandra-ip]
  (let [keyspace      storm_keyspace
        column-family "tfidfbatch"
        scheme        (CassandraScheme. cassandra-ip
                                        "9160"
                                        keyspace
                                        column-family
                                        rowkey
                                        {"cassandra.inputPartitioner"
                                         "org.apache.cassandra.dht.RandomPartitioner"
                                         "cassandra.outputPartitioner"
                                         "org.apache.cassandra.dht.RandomPartitioner"})
        tap           (CassandraTap. scheme)]
    tap))

(defn create-d-tap [cassandra-ip]
  (create-tap "d" cassandra-ip))

(defn create-df-tap [cassandra-ip]
  (create-tap "df" cassandra-ip))

(defn create-tf-tap [cassandra-ip]
  (create-tap "tf" cassandra-ip))

The way this schema is created means that it will use a static row key and persist name-value pairs from the tuples as column:value within that row. This is congruent with the approach used by the Trident Cassandra adaptor. This is a convenient approach, as it will make our lives easier later.
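
The layout described above can be pictured with a small in-memory model. This is only a sketch of the shape of the data, using nested Java maps in place of Cassandra; the class and method names are hypothetical, and no Cassandra API is involved.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class RowLayoutSketch {

    // row key ("d", "df", or "tf") -> (column name -> column value)
    private final Map<String, SortedMap<String, String>> rows = new HashMap<>();

    // Each tuple's name-value pair becomes one column within the static row.
    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Returns the stored value, or null when the row or column is absent.
    String get(String rowKey, String column) {
        SortedMap<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }

    public static void main(String[] args) {
        RowLayoutSketch view = new RowLayoutSketch();
        view.put("d", "twitter", "5");   // total document count
        view.put("df", "rain", "2");     // documents containing "rain"
        System.out.println(view.get("df", "rain"));
    }
}
```

With this shape, looking up a count is a single row read followed by a column lookup, regardless of which layer wrote the value.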

We can complete the implementation by providing a function that ties everything together and executes the queries:

(defn execute [in stop cassandra-ip]
  (cc/connect! cassandra-ip)
  (sch/set-keyspace storm_keyspace)
  (let [input (tap/hfs-tap (AvroScheme. (load-schema)) in)
        stop (hfs-delimited stop :skip-header? true)
        src  (etl-docs-gen input stop)]
    (?- (create-d-tap cassandra-ip)
        (D src))
    (?- (create-df-tap cassandra-ip)
        (DF src))
    (?- (create-tf-tap cassandra-ip)
        (TF src))))

Next, we need to get some data to test with. I have created some test data, which is available at https://bitbucket.org/qanderson/tfidf-cascalog. Simply download the project and copy the contents of src/data to the data folder in your project structure.

We can now test this entire implementation. To do this, we need to insert the data into Hadoop:

hadoop fs -copyFromLocal ./data/document.avro data/document.avro
hadoop fs -copyFromLocal ./data/en.stop data/en.stop

Then launch the execution from the REPL:


=> (execute "data/document"  "data/en.stop" "127.0.0.1")

How it works…

There are many excellent guides on the Cascalog wiki (https://github.com/nathanmarz/cascalog/wiki), but for completeness's sake, the nature of a Cascalog query will be explained here. Before that, however, a review of Cascading pipelines is required.

The following is quoted from the Cascading documentation (http://docs.cascading.org/cascading/2.1/userguide/htmlsingle/):

Pipe assemblies define what work should be done against tuple streams, which are read from tap sources and written to tap sinks. The work performed on the data stream may include actions such as filtering, transforming, organizing, and calculating. Pipe assemblies may use multiple sources and multiple sinks, and may define splits, merges, and joins to manipulate the tuple streams.

This concept is embodied in Cascalog through the definition of queries. A query takes a set of inputs and applies a list of predicates across the fields in each tuple of the input stream. Queries are composed through the application of many predicates. Queries can also be composed to form larger, more complex queries. In either event, these queries are reduced down into a Cascading pipeline. Cascalog therefore provides an extremely terse and powerful abstraction on top of Cascading; moreover, it enables an excellent development workflow through the REPL. Queries can be easily composed and executed against smaller representative datasets within the REPL, providing the idiomatic API and development workflow that makes Clojure beautiful.

If we unpack the query we defined for DF, we will find the following code:

(defn DF [src]
  (<- [?key ?df-count-str]
      (src ?doc-id ?time ?df-word)
      (c/distinct-count ?doc-id ?df-word :> ?df-count)
      (str ?df-word :> ?key)
      (str ?df-count :> ?df-count-str)))

The <- macro defines a query, but does not execute it. The initial vector, [?key ?df-count-str], defines the output fields, which is followed by a list of predicate functions. Each predicate can be one of the following three types:

Generators: A source of data where the underlying source is either a tap or another query.
Operations: Implicit relations that take in input variables defined elsewhere and either act as a function that binds new variables or a filter. Operations typically act within the scope of a single tuple.
Aggregators: Functions that act across tuples to create aggregate representations of data. For example, count and sum.

The :> keyword is used to separate input variables from output variables. If no :> keyword is specified, the variables are considered as input variables for operations and output variables for generators and aggregators.

The (src ?doc-id ?time ?df-word) predicate function names the first three values within the input tuple, whose names are applicable within the query scope. Therefore, if the tuple ("doc1" 123324 "This") arrives in this query, the variables would effectively bind as follows:

?doc-id: "doc1"
?time: 123324
?df-word: "This"

Each predicate within the scope of the query can use any bound value or add new bound variables to the scope of the query. The final set of bound values that are emitted is defined by the output vector.
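
To make the result of the DF query concrete, the following plain-Java sketch computes the same kind of value over an in-memory list of (doc-id, time, word) tuples: for each word, the number of distinct documents that contain it. It is an illustration of the semantics only, not of how Cascalog or Cascading execute the query, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class DfSketch {

    // Each tuple is {docId, time, word}, matching the fields bound by the generator.
    static Map<String, Integer> documentFrequency(List<String[]> tuples) {
        // Collect the distinct document ids seen for each word.
        Map<String, Set<String>> docsByWord = new TreeMap<>();
        for (String[] tuple : tuples) {
            String docId = tuple[0];
            String word = tuple[2];
            docsByWord.computeIfAbsent(word, w -> new HashSet<>()).add(docId);
        }
        // The distinct count per word is the document frequency.
        Map<String, Integer> df = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : docsByWord.entrySet()) {
            df.put(entry.getKey(), entry.getValue().size());
        }
        return df;
    }
}
```

A word that appears several times in the same document contributes only once, which is the difference between this distinct count and the plain count used for TF.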

We defined three queries, each calculating a portion of the value required for the TF-IDF algorithm. These are fed from two taps, which are files stored in the Hadoop filesystem. The document file is stored using Apache Avro, which provides a high-performance and dynamic serialization layer. Avro takes a record definition and enables serialization/deserialization based on it. The record structure, in this case, is for a document and is defined as follows:

{"namespace": "storm.cookbook",
 "type": "record",
 "name": "Document",
 "fields": [
     {"name": "docid", "type": "string"},
     {"name": "time",  "type": "long"},
     {"name": "line", "type": "string"}
 ]
}

Both the stop words and documents are fed through an ETL function that emits a clean set of words that have been filtered. The words are derived by splitting the line field using a regular expression:

(defmapcatop split [line]
  ;; split on brackets, backslashes, parentheses, commas, periods, and whitespace
  (s/split line #"[\[\]\\\(\),.)\s]+"))

The ETL function is also a query, which serves as a source for our downstream queries, and defines the [?doc-id ?time ?word] output fields.
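
The tokenizing and cleaning steps can be tried out in isolation with a short Java sketch. It assumes a delimiter set of brackets, backslashes, parentheses, commas, periods, and whitespace, and it additionally drops empty tokens, which is a choice made for this example. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

final class SplitSketch {

    // Delimiters: [ ] \ ( ) , . and any whitespace (an assumption for this sketch).
    private static final Pattern DELIMITERS = Pattern.compile("[\\[\\]\\\\(),.\\s]+");

    // Splits a line into words, normalizes them, and removes stop words.
    static List<String> words(String line, Set<String> stopWords) {
        List<String> result = new ArrayList<>();
        for (String raw : DELIMITERS.split(line)) {
            String word = raw.trim().toLowerCase(Locale.ROOT);
            // A leading delimiter yields an empty first token, so skip empties.
            if (!word.isEmpty() && !stopWords.contains(word)) {
                result.add(word);
            }
        }
        return result;
    }
}
```

Normalizing to lower case before the stop-word check matters, because a stop list is usually stored in lower case while the source text is not.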

The output tap, or sink, is based on the Cassandra scheme. A query defines predicate logic, not the source and destination of data. The sink ensures that the outputs of our queries are sent to Cassandra. The ?- macro executes a query, and it is only at execution time that a query is bound to its source and destination, again allowing for extreme levels of composition. The following, therefore, executes the TF query and outputs to Cassandra:

(?- (create-tf-tap cassandra-ip)
        (TF src))

There's more…

The Avro test data was created using the test data from the Cascading tutorial at http://www.cascading.org/2012/07/31/cascading-for-the-impatient-part-5/. Within this tutorial is the rain.txt tab-separated data file. A new column called time was added, which holds the Unix epoch time in milliseconds. The updated text file was then processed using some basic Java code that leverages Avro:

Schema schema = Schema.parse(
      SandboxMain.class.getResourceAsStream("/document.avsc"));
File file = new File("document.avro");
DatumWriter<GenericRecord> datumWriter =
      new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter =
      new DataFileWriter<GenericRecord>(datumWriter);
dataFileWriter.create(schema, file);
BufferedReader reader = new BufferedReader(new InputStreamReader(
      SandboxMain.class.getResourceAsStream("/rain.txt")));
String line = null;

try {
   while ((line = reader.readLine()) != null) {
      // rain.txt is tab-separated: docid, time, line
      String[] tokens = line.split("\t");
      GenericRecord docEntry = new GenericData.Record(schema);
      docEntry.put("docid", tokens[0]);
      docEntry.put("time", Long.parseLong(tokens[1]));
      docEntry.put("line", tokens[2]);
      dataFileWriter.append(docEntry);
   }
} catch (IOException e) {
   e.printStackTrace();
}
dataFileWriter.close();

Persisting documents from Storm

In the previous recipe, we looked at deriving precomputed views of our data taking some immutable data as the source. In that recipe, we used statically created data. In an operational system, we need Storm to store the immutable data into Hadoop so that it can be used in any preprocessing that is required.

How to do it…

As each tuple is processed in Storm, we must generate an Avro record based on the document record definition and append it to the data file within the Hadoop filesystem.

We must create a Trident function that takes each document tuple and stores the associated Avro record.

Within the tfidf-topology project created earlier, inside the storm.cookbook.tfidf.function package, create a new class named PersistDocumentFunction that extends BaseFunction. Within the prepare function, initialize the Avro schema and document writer:

public void prepare(Map conf, TridentOperationContext context) {
      try {
         String path = (String) conf.get("DOCUMENT_PATH");
         schema = Schema.parse(PersistDocumentFunction.class
               .getResourceAsStream("/document.avsc"));
         File file = new File(path);
         DatumWriter<GenericRecord> datumWriter =
               new GenericDatumWriter<GenericRecord>(schema);
         dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
         if(file.exists())
            dataFileWriter.appendTo(file);
         else
            dataFileWriter.create(schema, file);
      } catch (IOException e) {
         throw new RuntimeException(e);
      }

   }

As each tuple is received, coerce it into an Avro record and add it to the file:

public void execute(TridentTuple tuple, TridentCollector collector) {
      GenericRecord docEntry = new GenericData.Record(schema);
      docEntry.put("docid", tuple.getStringByField("documentId"));
      docEntry.put("time", Time.currentTimeMillis());
      docEntry.put("line", tuple.getStringByField("document"));
      try {
         dataFileWriter.append(docEntry);
         dataFileWriter.flush();
      } catch (IOException e) {
         LOG.error("Error writing to document record: " + e);
         throw new RuntimeException(e);
      }

   }

Next, edit the TermTopology.build topology and add the function to the document stream:

documentStream.each(new Fields("documentId","document"),
            new PersistDocumentFunction(), new Fields());

Finally, include the document path into the topology configuration:
```
conf.put("DOCUMENT_PATH", "document.avro");
```

How it works…

There are various logical streams within the topology, and the input for the topology, which contains only URLs, is not in the appropriate state for the recipes in this article. We therefore need to select the correct stream from which to consume tuples, coerce these into Avro records, and serialize them into a file.

The previous recipe will then periodically consume this file. Within the context of the topology definition, include the following code:

Stream documentStream = getUrlStream(topology, spout)
            .each(new Fields("url"),
                  new DocumentFetchFunction(mimeTypes),
                  new Fields("document", "documentId", "source"));

      documentStream.each(new Fields("documentId","document"),
            new PersistDocumentFunction(), new Fields());

The function should consume tuples from the document stream, whose tuples are populated with already-fetched documents.