Yesterday, at the Spark + AI summit, the team at Apache Spark announced .NET for Apache Spark, a popular open source distributed processing engine used for analytics over large data sets. It can also be used for processing real-time streams, batches of data, machine learning, and ad-hoc query.
.NET fo Apache Spark for developers
.NET for Apache Spark aims at making Apache Spark accessible to .NET developers across all Spark APIs. The team at Apache Spark aims to develop .NET for Apache Spark in the open (as a .NET Foundation member project) along with the Spark and .NET community for the developers.
.NET for Apache Spark comes with high-performance APIs for using Spark from C# and F#. With .NET APIs, users can now access all aspects of Apache Spark including streaming, Spark SQL, DataFrames, MLLib, etc. It lets the developers reuse all the skills, code, knowledge, and libraries. The C#/ F# language that binds to Spark will be written on a new Spark interop layer that will offer easier extensibility. .NET for Apache Spark can be used on Linux, macOS, and Windows and is compliant with .NET Standard 2.0.
.NET for Apache Spark performance
The first preview version of .NET for Apache Spark performs well on the popular TPC-H benchmark. This benchmark consists of a suite of business-oriented queries. .NET for Apache Spark has a better performance against Python and Scala. It is also 2 times faster than Python.
What more features can be expected?
In the future, the team aims to simplify the documentation and samples and work towards native integration with developer tools such as Visual Studio, Visual Studio Code, Jupyter notebooks. Developers can also expect .NET support for user-defined aggregate functions and .NET idiomatic APIs for C# and F# (e.g., using LINQ for writing queries). The team is also working towards adding support for Azure Databricks, Kubernetes, etc. and making .NET for Apache Spark part of Spark Core.
Few users are excited about this news and are expecting some major improvement with .NET for Spark. A user commented on HackerNews, “I’ve seen the announcement about .NET interior support in Apache Spark some time ago. The benchmarks are interesting and tell the story – in few cases it is faster than Python, but slower than native (for Spark) Scala/JVM. Maybe with Arrow interchange Python’s performance would increase (and for other interpose that would use Array – i.e. for .Net).”
Few others are confused about the transition, as they have to get their teams shifted to the new setup. Another user commented, “Indeed the real sad part is you can’t lead teams there early (premature optimization). Everybody seems to make the same rough transition on their own.”
To know more about this news, check out the post by Apache Spark.