Earlier this week, Google released a beta version of SparkR jobs on Cloud Dataproc, a cloud service that lets you run Apache Spark and Apache Hadoop in a cost-effective manner.
SparkR jobs extend R support on GCP. SparkR is a package that provides a lightweight front end for using Apache Spark from R. It supports distributed machine learning using MLlib, and it can be used to process large Cloud Storage datasets and to run computationally intensive work. It also lets developers use “dplyr-like operations” on datasets stored in Cloud Storage; dplyr is a powerful R package that transforms and summarizes tabular data with rows and columns.
The R programming language is well suited to building data analysis tools and statistical apps. With cloud computing all the rage, new opportunities have opened up for developers working with R.
GCP’s Cloud Dataproc Jobs API makes it easier to submit SparkR jobs to a cluster, with no need to open firewalls to reach web-based IDEs or to SSH into the master node. The API also makes it easy to automate the repeatable R statistics that users want to run on their datasets.
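As a rough illustration of what submitting a SparkR job through the Jobs API might look like, the Python sketch below only builds the request URL and JSON body for a `jobs:submit` call; it does not send anything. The `sparkRJob` and `mainRFileUri` field names follow the Dataproc REST job resource. The project, region, cluster name, bucket path, and the `v1` endpoint version are placeholder assumptions (a beta feature may be served from a different API version), and the two helper functions are hypothetical names introduced here for illustration.

```python
import json


def submit_url(project_id, region):
    """Return the jobs:submit endpoint (the v1 path is an assumption)."""
    return (
        "https://dataproc.googleapis.com/v1/"
        f"projects/{project_id}/regions/{region}/jobs:submit"
    )


def build_sparkr_job_request(cluster_name, main_r_file_uri, args=None):
    """Build the JSON body that asks a cluster to run one SparkR script."""
    spark_r_job = {"mainRFileUri": main_r_file_uri}
    if args:
        # Arguments are passed through to the R script as strings.
        spark_r_job["args"] = [str(a) for a in args]
    return {
        "job": {
            "placement": {"clusterName": cluster_name},
            "sparkRJob": spark_r_job,
        }
    }


if __name__ == "__main__":
    # All values below are placeholders, not real resources.
    body = build_sparkr_job_request(
        cluster_name="example-cluster",
        main_r_file_uri="gs://example-bucket/scripts/summary_stats.R",
        args=["gs://example-bucket/data/input.csv"],
    )
    print(submit_url("example-project", "us-central1"))
    print(json.dumps(body, indent=2))
```

Because the body is plain JSON, the same request can be generated on a schedule or from a pipeline and sent with any authenticated HTTP client, which is what makes repeatable R jobs straightforward to automate.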
Additionally, running R on GCP helps avoid the infrastructure barriers that limit how well you can understand your data, such as having to sample datasets because of compute or data size limits. GCP also allows you to build large-scale models for analyzing datasets of sizes that would previously have required big investments in high-performance computing infrastructure.
For more information, check out the official Google Cloud blog post.