Last week, the team at Apache announced that Alibaba has decided to contribute Blink, its fork of Flink, back to the Apache Flink project.
Apache Flink has been following the philosophy of taking a unified approach to batch and streaming data processing. The core building block is “continuous processing of unbounded data streams.” With this continuous processing, users can also do offline processing of bounded data sets.
Batch processing is considered a special case of streaming and is supported by projects such as Flink and Beam. This approach is known as a powerful way of building data applications that generalize across real-time and offline processing, and it further reduces the complexity of data infrastructures.
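The idea that batch is just a bounded stream can be sketched in plain Java (an illustration only, not Flink code): the same per-record processing loop serves both cases, because a bounded data set is simply a stream whose source eventually reports that no more elements are coming.

```java
import java.util.Iterator;
import java.util.List;

public class BoundedStreamSketch {
    // One processing loop for both cases: a stream is modeled as an
    // iterator, and a bounded data set is a stream that eventually ends.
    static long countEvents(Iterator<String> source) {
        long count = 0;
        while (source.hasNext()) {
            source.next();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // A bounded "batch" input processed by the streaming loop.
        Iterator<String> bounded = List.of("a", "b", "c").iterator();
        System.out.println(countEvents(bounded)); // prints 3
    }
}
```

An unbounded source would plug into the same loop; it would just never report `hasNext() == false`.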
“Batch is just a special case of streaming does not mean that any stream processor is now the right tool for your batch processing use cases.” Pure stream processing systems are slow at batch processing workloads. No one would use a stream processor that shuffles data through message queues to analyze large amounts of already-available data.
Unified APIs such as Apache Beam delegate to different runtimes based on whether the data is continuous/unbounded or fixed/bounded. For example, the batch and streaming runtime implementations of Google Cloud Dataflow are different, in order to get the desired performance and resilience in each case. Apache Flink has a streaming API that can handle bounded and unbounded use cases, and also offers a separate DataSet API and runtime stack that is faster for batch use cases.
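The delegation pattern described above can be sketched in plain Java (hypothetical names; this is not Beam's or Flink's actual API): the user writes one transform, and the framework picks an eager batch path or a record-at-a-time streaming path depending on whether the source is bounded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class UnifiedRunnerSketch {
    // Hypothetical unified pipeline: one user-facing transform,
    // two execution paths chosen by the boundedness of the input.
    static <T, R> List<R> run(List<T> input, boolean bounded, Function<T, R> transform) {
        if (bounded) {
            // "Batch" path: eager, whole-data-set processing.
            return input.stream().map(transform).toList();
        }
        // "Streaming" path: record-at-a-time processing (simplified here
        // to an incremental loop over the same data).
        List<R> out = new ArrayList<>();
        for (T record : input) {
            out.add(transform.apply(record));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> doubled = run(List.of(1, 2, 3), true, x -> x * 2);
        System.out.println(doubled); // prints [2, 4, 6]
    }
}
```

Both paths produce the same result; a real runtime would differ in scheduling, fault tolerance, and operator implementations rather than in semantics.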
To make Flink’s experience on bounded data (batch) state-of-the-art, a few enhancements are required.
Currently, the bounded and unbounded operators have different network and threading models that do not mix and match. Continuous streaming operators are the foundation in a unified stack. When operating on bounded data without latency constraints, the API or the query optimizer can select from a larger set of operators.
When input data is bounded, it is possible to completely buffer data during shuffles and to replay that data after a failure. This makes recovery more fine-grained and much more efficient.
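The recovery benefit can be sketched as follows (a simplified illustration, not Flink's actual recovery mechanism): because the shuffle output of a bounded input is fully buffered, a failed downstream attempt can replay the buffer instead of re-running the upstream stage.

```java
import java.util.ArrayList;
import java.util.List;

public class BufferedShuffleSketch {
    static int upstreamRuns = 0;

    // Upstream stage: expensive to recompute after a failure.
    static List<Integer> upstream() {
        upstreamRuns++;
        return List.of(1, 2, 3, 4);
    }

    // Run the downstream stage twice (simulating one failed attempt and
    // one retry) against a completely buffered shuffle.
    static int runWithRetry() {
        List<Integer> shuffleBuffer = new ArrayList<>(upstream());
        int sum = 0;
        for (int attempt = 1; attempt <= 2; attempt++) {
            sum = 0;
            for (int v : shuffleBuffer) {
                sum += v; // the retry replays data from the buffer
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(runWithRetry()); // prints 10
        System.out.println(upstreamRuns);   // prints 1: upstream ran only once
    }
}
```

With an unbounded input, no complete buffer can exist, so recovery has to fall back on coarser mechanisms such as restarting from a checkpoint.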
A continuous unbounded streaming application needs all of its operators running at the same time. An application on bounded data can schedule operations based on how the operators consume data, which increases resource efficiency.
Currently, only the Table API activates the optimizations while working on bounded data.
In order to be competitive with the best batch engines, Flink needs more coverage and performance for SQL query execution. Since Flink's core data plane is already high performance, the speed of SQL execution depends on optimizer rules, a rich set of operators, and features like code generation.
Blink’s code is currently available as a branch in the Apache Flink repository, and merging such a large amount of changes while keeping the merge process as non-disruptive as possible is a challenge.
The merge plan focuses first on the bounded/batch processing features and is designed to ensure a smooth integration.
To know more, check out Apache’s official post.