
Batch: a Special Case of Streaming

Last week, the Apache Flink team announced that Alibaba has decided to contribute Blink, its fork of Flink, back to the Apache Flink project.

A unified approach to Batch and Streaming

Apache Flink has long followed the philosophy of taking a unified approach to batch and streaming data processing. Its core building block is “continuous processing of unbounded data streams.” Offline processing of bounded data sets follows naturally from this model: a bounded data set is simply a stream that happens to end.
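
As a minimal sketch of that idea (assuming Flink's Java DataStream API, with a hypothetical local file input.txt standing in for a bounded data set), the same word-count pipeline can consume either a bounded or an unbounded source; only the source line changes:

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class UnifiedWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The same pipeline works on an unbounded source ...
        //   env.socketTextStream("localhost", 9999)
        // ... or on a bounded one: a file is a stream that happens to end.
        env.readTextFile("input.txt")
           .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
               for (String word : line.split("\\s+")) {
                   out.collect(Tuple2.of(word, 1));
               }
           })
           .returns(Types.TUPLE(Types.STRING, Types.INT)) // type hint for the lambda
           .keyBy(0)   // key by word
           .sum(1)     // running count per word
           .print();

        env.execute("Unified word count");
    }
}
```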

Treating batch as a special case of streaming is an approach shared by projects such as Flink and Beam. It is a powerful way of building data applications that generalize across real-time and offline processing, and it reduces the complexity of data infrastructures.

However, “batch is just a special case of streaming” does not mean that any stream processor is automatically the right tool for batch use cases: pure stream processing systems are slow at batch processing workloads. A stream processor that shuffles through message queues in order to analyze large amounts of already-available data will not perform well.

Unified APIs such as Apache Beam delegate to different runtimes depending on whether the data is continuous/unbounded or finite/bounded. For example, Google Cloud Dataflow has distinct implementations of its batch and streaming runtimes, in order to get the desired performance and resilience in each case. Apache Flink's streaming API can handle both bounded and unbounded use cases, but Flink also offers a separate DataSet API and runtime stack that is faster for batch use cases.
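
To illustrate the Beam side of this (a hedged sketch, again assuming a hypothetical local file input.txt), the pipeline below is runner-agnostic: the runner, and therefore the batch or streaming runtime behind it, is chosen at launch time with a flag such as --runner=FlinkRunner or --runner=DataflowRunner:

```java
import java.util.Arrays;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class RunnerAgnosticWordCount {
    public static void main(String[] args) {
        // The runner is picked from the command line; the pipeline is unchanged.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply(TextIO.read().from("input.txt"))  // bounded source
         .apply(FlatMapElements.into(TypeDescriptors.strings())
                 .via((String line) -> Arrays.asList(line.split("\\s+"))))
         .apply(Count.perElement())
         .apply(MapElements.into(TypeDescriptors.strings())
                 .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
         .apply(TextIO.write().to("counts"));

        p.run().waitUntilFinish();
    }
}
```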

What can be improved?

To make Flink’s experience on bounded data (batch) state-of-the-art, a few enhancements are required.

A truly unified runtime operator stack

Currently, the bounded and unbounded operators have different network and threading models that do not mix and match. Continuous streaming operators are the foundation of a unified stack; when operating on bounded data without latency constraints, the API or the query optimizer can select from a larger set of operators.
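
The following is a purely hypothetical sketch (not Flink's actual internals) of why a unified operator stack helps: once every operator lives in one stack, a planner is free to substitute a batch-only operator whenever it knows the input is bounded.

```java
// Hypothetical planner sketch; class names are illustrative, not Flink's.
interface AggregationOperator {
    // Operator contract elided for brevity.
}

// Works incrementally on unbounded input; must emit updates continuously.
class ContinuousAggregator implements AggregationOperator {}

// Sorts or spills before aggregating; valid only when the input ends.
class SortBasedAggregator implements AggregationOperator {}

class Planner {
    AggregationOperator chooseAggregator(boolean inputIsBounded) {
        // Bounded input without latency constraints admits a larger set of
        // operators; unbounded input restricts the choice to continuous ones.
        return inputIsBounded ? new SortBasedAggregator()
                              : new ContinuousAggregator();
    }
}
```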

Exploiting bounded streams to reduce the scope of fault tolerance

When the input data is bounded, it is possible to completely buffer data during shuffles and to replay that data after a failure. This makes recovery fine-grained and much more efficient.
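
A hypothetical sketch of the recovery idea (not Flink's implementation): if a shuffle's output is fully materialized, a failed downstream task can be restarted alone and replay from the buffer, instead of re-running the upstream stages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch; a real engine would spill the buffer to disk.
class BufferedShuffle<T> {
    private final List<T> buffer = new ArrayList<>();

    void write(T record) {
        buffer.add(record);
    }

    // Because the input is bounded, the full shuffle output can be kept
    // and re-read from the start after a failure.
    List<T> replay() {
        return List.copyOf(buffer);
    }
}

class Recovery {
    // Only the failed consumer re-runs; the producer's work is preserved.
    static <T> void restartConsumer(BufferedShuffle<T> shuffle, Consumer<T> consumerTask) {
        shuffle.replay().forEach(consumerTask);
    }
}
```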

Exploiting bounded stream operator properties for scheduling

A continuous, unbounded streaming application needs all of its operators running at the same time. An application on bounded data can instead schedule operations one after another, depending on how the operators consume data, which increases resource efficiency.
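
A hypothetical sketch of what such scheduling could look like (not Flink's scheduler): when stages are separated by fully buffered shuffles, they can run one after another in the same slots, so the peak resource footprint is the largest single stage rather than the sum of all stages.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch; "slots" stands in for a fixed pool of task resources.
class StagedScheduler {
    private final ExecutorService slots = Executors.newFixedThreadPool(4);

    void runStagesSequentially(List<Runnable> stages) throws Exception {
        for (Runnable stage : stages) {
            // The next stage is scheduled only after the previous one has
            // finished producing (and buffering) its output.
            slots.submit(stage).get();
        }
        slots.shutdown();
    }
}
```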

Enabling these special case optimizations for the DataStream API

Currently, only the Table API activates these optimizations when working on bounded data; the goal is to enable them for the DataStream API as well.

Performance and coverage for SQL

In order to be competitive with the best batch engines, Flink needs more coverage and better performance for SQL query execution. While Flink's core data plane is high-performance, the speed of SQL execution ultimately depends on optimizer rules, a rich set of operators, and features like code generation.
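
As a concrete example of where those pieces come into play, a SQL query on bounded data runs through the Table/SQL planner, which applies optimizer rules, picks physical operators, and (in Blink's processor) generates code. A minimal sketch, assuming a 1.x-era Flink Table API and hypothetical Order data:

```java
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;
import org.apache.flink.types.Row;

public class SqlOnBoundedData {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        BatchTableEnvironment tEnv = BatchTableEnvironment.create(env);

        // Register a small bounded data set as a table (illustrative data).
        Table orders = tEnv.fromDataSet(
                env.fromElements(
                        new Order("books", 12.0),
                        new Order("games", 31.5),
                        new Order("books", 8.0)),
                "category, amount");
        tEnv.registerTable("Orders", orders);

        // The planner decides join/aggregation strategies for this query.
        Table result = tEnv.sqlQuery(
                "SELECT category, SUM(amount) AS total FROM Orders GROUP BY category");

        tEnv.toDataSet(result, Row.class).print();
    }

    // Public POJO so Flink can extract its type information.
    public static class Order {
        public String category;
        public double amount;
        public Order() {}
        public Order(String category, double amount) {
            this.category = category;
            this.amount = amount;
        }
    }
}
```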

Merging Blink and Flink

Blink’s code is currently available as a branch in the Apache Flink repository. The challenge is to merge this large amount of change while keeping the merge process as non-disruptive as possible.

The merge plan focuses on the bounded/batch processing features and takes the following approach to ensure a smooth integration:

  • Merging Blink's SQL/Table API query processor enhancements is the easier part, because both Flink and Blink expose the same APIs: SQL and the Table API. Following some restructuring of the Table/SQL module, the team plans to merge Blink's query planner (optimizer) and runtime (operators) as an additional query processor next to the current SQL runtime.
  • Initially, users will be able to select which query processor to use (see the sketch after this list). After a transition period, the current processor will be deprecated and eventually dropped.
  • The Flink community is working on refactoring its current scheduler and adding support for pluggable scheduling and fail-over strategies.
  • Once this is done, the team can add Blink’s scheduling and recovery strategies as a new scheduling strategy that will be used by the new query processor. The new scheduling strategy will also be used for bounded DataStream programs.
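
For context on the first two bullet points: this side-by-side setup is what Flink 1.9 eventually shipped, with users picking a query processor explicitly through EnvironmentSettings (Blink's planner later became the default in Flink 1.11):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PlannerSelection {
    public static void main(String[] args) {
        // Choose Blink's query processor explicitly; the legacy, pre-merge
        // processor was selected with .useOldPlanner() instead.
        EnvironmentSettings settings = EnvironmentSettings.newInstance()
                .useBlinkPlanner()
                .inBatchMode()
                .build();

        TableEnvironment tEnv = TableEnvironment.create(settings);
        // ... register tables and run SQL as usual.
    }
}
```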

To learn more, check out the Apache Flink project's official post.
