
[box type="note" align="" class="" width=""]In this article by Shilpi Saxena and Saurabh Gupta, from their book Practical Real-Time Data Processing and Analytics, we shall explore what a near real-time architecture looks like and how an NRT app works.[/box]

It’s very important to understand the key aspects where traditional monolithic application systems fall short of serving the need of the hour:

  • Backend DB: A single, monolithic point of data access.
  • Ingestion flow: The pipelines are complex and tend to induce latency into the end-to-end flow.
  • Failure recovery: The systems are failure prone, and the recovery approach is difficult and complex.
  • Synchronization and state capture: It’s very difficult to capture and maintain the state of facts and transactions in the system. Diversely distributed systems and real-time failures further complicate the design and maintenance of such systems.

The answer to the above issues is an architecture that supports streaming and thus gives its end users access to actionable insights in real time over ever-flowing streams of real-time fact data. Such an architecture has the following characteristics:

  • Local state and consistency of the system for large-scale, high-velocity systems
  • Data doesn’t arrive at intervals; it keeps flowing in, streaming all the time
  • No single state of truth in the form of a backend database; instead, the applications subscribe to or tap into a stream of fact data

Before we delve further, it’s worthwhile to understand the notion of time:

Looking at this figure, it’s easy to correlate the SLAs with each type of implementation (batch, near real-time, and real-time) and the kinds of use cases each implementation caters to.

For instance, batch implementations have SLAs ranging from a couple of hours to days, and such solutions are predominantly deployed for canned/pre-generated reports and trends. Near real-time solutions have SLAs on the order of a few seconds to hours and cater to situations requiring ad-hoc queries, mid-resolution aggregators, and so on. Real-time applications are the most mission-critical in terms of SLA and resolution: every event counts, and the results have to be returned within milliseconds to seconds.

Near real-time (NRT) architecture

In its essence, NRT Architecture consists of four main components/layers, as depicted in the following figure:

  • The message transport pipeline
  • The stream processing component
  • The low-latency data store
  • Visualization and analytical tools

The first step is to collect data from the source and feed it to the “data pipeline”, which is a logical pipeline that collects the continuous events or streaming data from various producers and delivers it to the consuming stream processing applications. These applications transform, collate, correlate, aggregate, and perform a variety of other operations on this live streaming data, and finally store the results in the low-latency data store. A variety of analytical, business intelligence, and visualization tools and dashboards then read this data from the data store and present it to the business user.
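The four layers can be sketched end to end with simple in-memory stand-ins. This is only an illustrative sketch: `pipeline`, `data_store`, and the event shape are all hypothetical, and a real deployment would use a message broker such as Kafka for the pipeline and a low-latency store such as Cassandra or Redis in place of the plain dictionary.

```python
import queue
import threading

pipeline = queue.Queue()   # stands in for the message transport pipeline
data_store = {}            # stands in for the low-latency data store

def producer(events):
    """Collection agent: push raw events onto the pipeline."""
    for event in events:
        pipeline.put(event)
    pipeline.put(None)     # sentinel to end the demo stream

def stream_processor():
    """Consume, transform, and aggregate each event, then persist it."""
    while True:
        event = pipeline.get()
        if event is None:
            break
        user, amount = event                                  # transform
        data_store[user] = data_store.get(user, 0) + amount   # aggregate

worker = threading.Thread(target=stream_processor)
worker.start()
producer([("alice", 10), ("bob", 5), ("alice", 7)])
worker.join()
print(data_store)   # {'alice': 17, 'bob': 5} — the view served to dashboards
```

The visualization layer would read `data_store` continuously rather than once, but the producer → pipeline → processor → store flow is the same.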

Data collection

This is the beginning of the journey for all data processing, be it batch or real time. The foremost and most forthright challenge is getting the data from its source into our systems for processing. We can look at the processing unit as a black box, with data sources acting as publishers and downstream consumers as subscribers. This is captured in the following diagram:

The key criteria for data collection tools, in the general context of big data and real-time processing specifically, are as follows:

  • Performance and low latency
  • Scalability
  • Ability to handle structured and unstructured data

Apart from this, the data collection tool should be able to cater to data from a variety of sources, such as:

  • Data from traditional transactional systems: There are two straightforward approaches here:
  • Duplicate the ETL process of these traditional systems and tap the data from the source
  • Tap the data from these ETL systems

The third, and better, approach is to adopt a virtual data lake architecture for data replication.

  • Structured data from IoT/ Sensors/Devices, or CDRs: This is the data that comes at a very high velocity and in a fixed format – the data can be from a variety of sensors and telecom devices.
  • Unstructured data from media files, text data, social media, and so on: This is the most complex of all incoming data where the complexity is due to the dimensions of volume, velocity, variety, and structure.
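The contrast between these source types shows up directly in the parsing logic. The sketch below is illustrative only: the pipe-delimited CDR layout and the JSON field names are assumptions, not a real telecom or social media format.

```python
import json

def parse_cdr(line):
    """Structured: fixed-format, pipe-delimited call detail record."""
    caller, callee, duration = line.split("|")
    return {"caller": caller, "callee": callee, "duration": int(duration)}

def parse_social(raw):
    """Semi/unstructured: fields may be missing, so parse defensively."""
    doc = json.loads(raw)
    return {"user": doc.get("user", "unknown"), "text": doc.get("text", "")}

cdr = parse_cdr("9198|9177|62")
post = parse_social('{"user": "@dev", "text": "streaming!"}')
```

Structured feeds let the collector rely on a fixed schema at very high velocity; unstructured feeds force it to tolerate variety and absent fields.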

Stream processing

The stream processing component itself consists of three main sub-components:

  • The broker, which collects and holds the events or data streams from the data collection agents
  • The processing engine, which transforms, correlates, and aggregates the data, and performs the other necessary operations
  • The distributed cache, which serves as a mechanism for maintaining a common dataset across all distributed components of the processing engine
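The distributed cache’s role is easiest to see in an enrichment step. Here a plain dictionary stands in for the cache (in production this would be something like Redis or Hazelcast, shared by every processing-engine worker); the SKU reference data is made up for illustration.

```python
# Shared reference data every worker can read without hitting a backend DB.
reference_cache = {"SKU-1": "electronics", "SKU-2": "grocery"}

def enrich(event):
    """Processing-engine step: attach reference data from the shared cache."""
    event["category"] = reference_cache.get(event["sku"], "unknown")
    return event

enriched = enrich({"sku": "SKU-1", "qty": 3})
```

Because every worker reads the same cache, the enrichment result is consistent no matter which distributed component handles a given event.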

The same aspects of the stream processing component are zoomed out and depicted in the diagram as follows:

There are a few key attributes that the stream processing component should cater to:

  • Distributed components, thus offering resilience to failures
  • Scalability, to cater to the growing needs of the application or sudden surges in traffic
  • Low latency, to meet the overall SLAs expected from such applications
  • Easy operationalization of use cases, to be able to support evolving use cases
  • Built for failure: the system should be able to recover from inevitable failures without any event loss, and should be able to reprocess from the point where it failed
  • Easy integration points with off-heap/distributed caches and data stores
  • A wide variety of operations, extensions, and functions to meet the business requirements of the use case
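The “reprocess from the point it failed” attribute is usually implemented with committed offsets. The sketch below is a minimal, assumed illustration of that idea: the `checkpoint` dict stands in for durable offset storage (in practice, Kafka consumer offsets or ZooKeeper), and `upper()` stands in for real processing work.

```python
checkpoint = {"offset": 0}   # would be durable storage in production
processed = []

def process_from_checkpoint(stream):
    """Resume at the last committed offset so no event is lost or repeated."""
    for offset in range(checkpoint["offset"], len(stream)):
        processed.append(stream[offset].upper())   # the actual work
        checkpoint["offset"] = offset + 1          # commit only after success

stream = ["a", "b", "c", "d"]
process_from_checkpoint(stream[:2])   # simulated crash after two events
process_from_checkpoint(stream)       # restart: resumes cleanly at offset 2
```

Committing the offset only after an event is successfully processed is what guarantees the restart neither skips nor duplicates events.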

Analytical layer – serve it to the end user

The analytical layer is the most creative and interesting of all the components of an NRT application. So far, all we have talked about is backend processing, but this is the layer where we actually present the output/insights to the end user, graphically and visually, in the form of actionable items.

A few of the challenges these visualization systems should be capable of handling are:

  • Need for speed
  • Understanding the data and presenting it in the right context
  • Dealing with outliers

The figure depicts the flow of information from the event producers to the collection agents, then to the brokers and the processing engine (transformation, aggregation, and so on), and then to long-term storage. From the storage unit, the visualization tools draw the insights and present them in the form of graphs, alerts, charts, Excel sheets, dashboards, or maps to the business owners, who can assimilate the information and take action based upon it.

The above was an excerpt from the book Practical Real-Time Data Processing and Analytics.
