
Stream me up, Scotty!


Note: The following is an excerpt from the book Scala and Spark for Big Data Analytics, Chapter 9, Stream me up, Scotty – Spark Streaming, written by Md. Rezaul Karim and Sridhar Alla. It explores the big three stream processing paradigms that are in use today.

In today’s world of interconnected devices and services, it is hard to spend even a few hours a day without our smartphones, whether to check Facebook, hail an Uber ride, tweet about the burger we just bought, or look up the latest news or sports updates on our favorite team. We depend on our phones and the Internet for a lot of things, whether it is to get work done, just browse, or e-mail a friend. There is simply no way around this phenomenon, and the number and variety of applications and services will only grow over time.


As a result, smart devices are everywhere, and they generate a lot of data all the time. This phenomenon, broadly referred to as the Internet of Things, has changed the dynamics of data processing forever. Whenever you use any of the services or apps on your iPhone, Android, or Windows phone, real-time data processing is at work in some shape or form. Since so much depends on the quality and value of these apps, there is a lot of emphasis on how the various startups and established companies are tackling the complex challenges of SLAs (Service Level Agreements), usefulness, and the timeliness of the data.

One of the paradigms being researched and adopted by organizations and service providers is to build highly scalable, near real-time or real-time processing frameworks on cutting-edge platforms and infrastructure. Everything must be fast, and also reactive to changes and failures. You would not like it if your Facebook feed updated only once every hour, or if you received email only once a day; so it is imperative that data flow, processing, and usage are all as close to real time as possible. Many of the systems we are interested in monitoring or implementing generate a lot of data as an indefinite, continuous stream of events.

As in any data processing system, we have the same fundamental challenges of data collection, storage, and processing. The additional complexity, however, comes from the real-time needs of the platform. In order to collect such indefinite streams of events and then process them all to generate actionable insights, we need highly scalable, specialized architectures that can deal with tremendous rates of events. Many such systems have been built over the years, from AMQ and RabbitMQ to Storm, Kafka, Spark, Flink, Gearpump, Apex, and so on.

Modern systems built to deal with such large amounts of streaming data come with very flexible and scalable technologies that are not only very efficient but also help realize the business goals much better than before. Using such technologies, it is possible to consume data from a variety of data sources and then use it in a variety of use cases almost immediately or at a later time as needed.

Let us talk about what happens when you book an Uber ride on your smartphone to go to the airport. With a few touches on the screen, you are able to select a pickup point, choose a credit card, make the payment, and book the ride. Once the transaction is done, you get to monitor the progress of your car in real time on a map on your phone. As the car makes its way toward you, you can see exactly where it is, and you can also decide to pick up coffee at the local Starbucks while you wait for the car to arrive.

You could also make informed decisions regarding the car and the subsequent trip to the airport by looking at the expected time of arrival of the car. If it looks like the car is going to take quite a bit of time picking you up, and if this poses a risk to the flight you are about to catch, you could cancel the ride and hop in a taxi that just happens to be nearby.

Alternatively, if it so happens that the traffic situation is not going to let you reach the airport on time, thus posing a risk to the flight you are due to catch, you also get to make a decision regarding rescheduling or canceling your flight.

Now, in order to understand how real-time streaming architectures such as Uber's Apollo work to provide such invaluable information, we need to understand the basic tenets of streaming architectures. On the one hand, it is very important for a real-time streaming architecture to be able to consume extreme amounts of data at very high rates; on the other hand, it must also ensure reasonable guarantees that the data being ingested is actually processed. The following diagram shows a generic stream processing system, with a producer putting events into a messaging system while a consumer reads from the messaging system.
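To make that flow concrete, here is a minimal, hypothetical Scala sketch of a producer putting events into a messaging system while a consumer reads from it. An in-memory blocking queue stands in for a real messaging system such as Kafka or RabbitMQ, and all names are illustrative rather than taken from the book.

    import java.util.concurrent.{ArrayBlockingQueue, Executors, TimeUnit}

    object ProducerConsumerSketch {
      def main(args: Array[String]): Unit = {
        // The queue stands in for the messaging system between producer and consumer
        val messagingSystem = new ArrayBlockingQueue[String](1024)
        val pool = Executors.newFixedThreadPool(2)

        // Producer: puts events into the messaging system
        pool.execute(new Runnable {
          def run(): Unit = (0 until 5).foreach(i => messagingSystem.put(s"event-$i"))
        })

        // Consumer: reads events from the messaging system and processes them
        pool.execute(new Runnable {
          def run(): Unit = (0 until 5).foreach(_ => println(s"consumed ${messagingSystem.take()}"))
        })

        pool.shutdown()
        pool.awaitTermination(5, TimeUnit.SECONDS)
      }
    }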

Processing of real-time streaming data can be categorized into the following three essential paradigms:

  • At least once processing
  • At most once processing
  • Exactly once processing

Let’s look at what these three stream processing paradigms mean for our business use cases. While exactly once processing of real-time events is the ultimate nirvana, it is very difficult to always achieve this goal in every scenario. We have to compromise on the exactly once guarantee in cases where its benefit is outweighed by the complexity of the implementation.

Stream Processing Paradigm 1: At least once processing

The at least once processing paradigm involves a mechanism that saves the position of the last event received only after the event has actually been processed and the results persisted somewhere, so that, if there is a failure and the consumer restarts, it will read the old events again and process them. However, since there is no guarantee that those re-read events were fully processed before the failure (they may have been processed partially or not at all), fetching them again causes potential duplication. The result is that events get processed at least once.

At least once processing is ideally suited to any application that involves updating an instantaneous ticker or gauge to show current values. Any cumulative sum, counter, or other computation that depends on the accuracy of aggregations (sum, groupBy, and so on) does not fit this kind of processing, simply because duplicate events will cause incorrect results. The sequence of operations for the consumer is as follows:

  1. Save results
  2. Save offsets

Consider what happens if there is a failure and the consumer restarts. Since the events have already been processed but the offsets have not been saved, the consumer reads again from the previously saved offsets, causing duplicates; event 0, for example, would be processed twice.
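As a rough sketch of that ordering (not code from the book), the following Scala snippet saves results before offsets; a crash between the two steps means the events are read and processed again after a restart, which is exactly the at least once behavior. The event source, result store, and offset variable are simplified in-memory stand-ins.

    final case class Event(offset: Long, payload: String)

    object AtLeastOnceConsumer {
      // In-memory stand-ins for an external results store and offset store
      private val resultStore = scala.collection.mutable.ListBuffer.empty[String]
      private var savedOffset: Long = -1L

      private def process(e: Event): String = s"processed(${e.payload})"

      def consume(events: Seq[Event]): Unit =
        events.filter(_.offset > savedOffset).foreach { e =>
          resultStore += process(e) // 1. Save results
          savedOffset = e.offset    // 2. Save offsets
          // A crash between the two steps above leaves the offset unsaved, so the
          // event is re-read and re-processed on restart (duplicates, never loss)
        }
    }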

Stream Processing Paradigm 2: At most once processing

The at-most-once processing paradigm involves a mechanism to save the position of the last event received before the event is actually processed and results persisted somewhere so that, if there is a failure and the consumer restarts, the consumer will not try to read the old events again. However, since there is no guarantee that the received events were all processed, this causes potential loss of events as they are never fetched again. This results in the behavior that the events are processed at most once or not processed at all.

At most once processing is ideally suited to any application that involves updating an instantaneous ticker or gauge to show current values, as well as to any cumulative sum, counter, or other aggregation, provided that accuracy is not mandatory and the application does not need absolutely every event. Any lost events will cause incorrect or missing results.

The sequence of operations for the consumer is as follows:

  1. Save offsets
  2. Save results

Consider what happens if there is a failure and the consumer restarts. Since the offsets were saved but the corresponding events were not processed, the consumer reads from the saved offsets, causing a gap in the events consumed; event 0, for example, would never be processed.
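A matching hypothetical sketch for at most once processing simply swaps the two steps: offsets are saved before results, so a crash between them loses the event instead of duplicating it (the Event case class from the previous sketch is reused).

    object AtMostOnceConsumer {
      // In-memory stand-ins for an external results store and offset store
      private val resultStore = scala.collection.mutable.ListBuffer.empty[String]
      private var savedOffset: Long = -1L

      private def process(e: Event): String = s"processed(${e.payload})"

      def consume(events: Seq[Event]): Unit =
        events.filter(_.offset > savedOffset).foreach { e =>
          savedOffset = e.offset    // 1. Save offsets
          resultStore += process(e) // 2. Save results
          // A crash between the two steps above means the offset is already saved,
          // so the event is never re-read on restart (loss, never duplicates)
        }
    }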

Stream Processing Paradigm 3: Exactly once processing

The exactly once processing paradigm is similar to the at least once paradigm: it involves a mechanism that saves the position of the last event received only after the event has actually been processed and the results persisted somewhere, so that, if there is a failure and the consumer restarts, it will read the old events again and process them. Since there is no guarantee that those re-read events were fully processed before the failure, fetching them again causes potential duplication, just as in the at least once case. However, unlike the at least once paradigm, the duplicate events are not processed again; they are detected and dropped, which is what yields exactly once behavior. The exactly once paradigm is suitable for any application that needs accurate counters or aggregations, or that, in general, needs every event processed only once and also definitely once (without loss).

The sequence of operations for the consumer is as follows:

  1. Save results
  2. Save offsets

Consider what happens if there is a failure and the consumer restarts. Since the events have already been processed but the offsets have not been saved, the consumer reads again from the previously saved offsets and fetches duplicates; however, event 0 is processed only once, because the consumer detects and drops the duplicate copy of event 0.

How does the exactly once paradigm drop duplicates? There are two techniques which can help here:

  1. Idempotent updates
  2. Transactional updates

Idempotent updates involve saving results based on some unique ID/key generated so that, if there is a duplicate, the generated unique ID/key will already be in the results (for instance, a database) so that the consumer can drop the duplicate without updating the results. This is complicated as it’s not always possible or easy to generate unique keys. It also requires additional processing on the consumer end. Another point is that the database can be separate for results and offsets.
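As an illustrative sketch of idempotent updates (not the book's code), the snippet below keys each result by a unique event ID and skips the write when the key already exists; in a real system, the keyed map would typically be a database table whose primary key is the event ID.

    object IdempotentUpdates {
      // The unique event ID acts as the primary key of the results store
      private val results = scala.collection.mutable.Map.empty[String, String]

      def saveResult(eventId: String, result: String): Unit =
        if (!results.contains(eventId)) {
          results.put(eventId, result)
        } else {
          // Duplicate delivery: the key is already present, so the event is
          // dropped without touching the stored result
        }
    }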

Transactional updates save results in batches that have a transaction beginning and a transaction commit phase within so that, when the commit occurs, we know that the events were processed successfully. Hence, when duplicate events are received, they can be dropped without updating results. This technique is even more complicated than the idempotent updates as now we need some transactional data store. Another point is that the database must be the same for results and offsets.
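The following hypothetical sketch shows the shape of a transactional update: the batch's results and its offset are written inside one transaction against the same store, so they become visible atomically. The TransactionalStore trait and its methods (beginTransaction, commit, rollback) are placeholders for whatever your transactional data store actually provides.

    // Placeholder abstraction for a store that supports atomic commits
    trait TransactionalStore {
      def beginTransaction(): Unit
      def saveResults(results: Seq[String]): Unit
      def saveOffset(offset: Long): Unit
      def commit(): Unit
      def rollback(): Unit
    }

    object TransactionalUpdates {
      // Processes a batch of (offset, payload) pairs and commits results
      // and offsets together, against the same store
      def processBatch(store: TransactionalStore, batch: Seq[(Long, String)]): Unit = {
        store.beginTransaction()
        try {
          store.saveResults(batch.map { case (_, payload) => s"processed($payload)" })
          store.saveOffset(batch.map(_._1).max)
          store.commit() // results and offsets become visible atomically
        } catch {
          case e: Exception =>
            store.rollback() // nothing was saved, so the batch can be replayed safely
            throw e
        }
      }
    }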

You should look at the use case you are trying to build and see whether at least once processing or at most once processing can be used while still achieving an acceptable level of performance and accuracy.

If you enjoyed this excerpt, be sure to check out the book it appears in, Scala and Spark for Big Data Analytics. You may also like this exclusive interview on why Spark is ideal for stream processing with Romeo Kienzler, Chief Data Scientist in the IBM Watson IoT worldwide team and author of Mastering Apache Spark, 2nd Edition.
