In this article by Omar Khedher, author of OpenStack Sahara Essentials, we will use Sahara to create and launch a Spark cluster. Sahara provides several plugins to provision Hadoop clusters on top of OpenStack. We will be using the Spark plugin to provision Apache Spark clusters using Horizon.
The following diagram illustrates our Spark cluster topology, which includes one Spark master node and three Spark slave nodes:
The following link provides several Sahara images available for download for different plugins:
http://sahara-files.mirantis.com/images/upstream/liberty.
Note that the upstream Sahara image files target the OpenStack Liberty release. From Horizon, click on Compute and select Images, then click on Create Image and add the new image, as shown here:
We will need to upload the downloaded image to Glance so that it can be registered in the Sahara image registry catalog. Make sure that the new image is active. Then click on the Data Processing tab and select Image Registry. Click on Register Image to register the newly uploaded Glance image with Sahara, as shown here:
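The same image preparation can be sketched from the command line as an alternative to Horizon. This is a minimal sketch assuming the Liberty-era glance and sahara CLI clients; the image file name and IDs are placeholders, and flag names may differ between client versions:

```shell
# Download an upstream Spark image for Liberty from the link above
# (the exact file name below is a placeholder)
wget http://sahara-files.mirantis.com/images/upstream/liberty/<spark-image>.qcow2

# Upload the image to Glance
glance image-create --name sahara-spark-1.3.1 \
    --disk-format qcow2 --container-format bare \
    --file <spark-image>.qcow2

# Register the Glance image in the Sahara image registry, then tag it
# so the Spark plugin can find it (ubuntu is the image login user)
sahara image-register --id <IMAGE_ID> --username ubuntu
sahara image-add-tag --id <IMAGE_ID> --tag spark
sahara image-add-tag --id <IMAGE_ID> --tag 1.3.1
```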
Click on Done and the new Spark image is ready to start launching the Spark cluster.
Node group templates in Sahara facilitate the configuration of a set of instances that share the same properties, such as RAM and CPU. We will start by creating the first node group template for the Spark master. From the Data Processing tab, select Node Group Templates and click on Create Template. Our first node group template will be based on Apache Spark version 1.3.1, as shown here:
The next step in the wizard will guide you through specifying the name of the template, the instance flavor, the storage location, and which floating IP pool will be assigned to the cluster instances:
The next tab in the same wizard lets you select which processes the nodes in the cluster will run. In our case, the Spark master node group template will include the Spark master and HDFS namenode processes, as shown here:
The next tab in the wizard exposes more choices regarding the security groups that will be applied for the template cluster nodes:
The last tab in the wizard exposes more specific HDFS configuration options that depend on the available resources of the cluster, such as disk space, CPU, and memory:
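The master node group template built through the wizard above can also be sketched with the sahara CLI, which accepts a JSON definition. This is only a sketch: the flavor and external network values are placeholders, and the exact command syntax may vary between saharaclient releases. The process names (master, namenode) match those selected in the wizard:

```shell
# Write the master node group template definition to a file
cat > spark_master_ngt.json <<'EOF'
{
  "name": "spark-master",
  "plugin_name": "spark",
  "hadoop_version": "1.3.1",
  "flavor_id": "<FLAVOR_ID>",
  "floating_ip_pool": "<EXTERNAL_NETWORK_ID>",
  "auto_security_group": true,
  "node_processes": ["master", "namenode"]
}
EOF

# Create the node group template from the JSON definition
sahara node-group-template-create --json spark_master_ngt.json
```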
Creating the Spark slave node group template is performed in the same way as the Spark master node group template, except for the assignment of the node processes.
The Spark slave nodes will be running Spark slave and HDFS datanode processes, as shown here:
Security groups and HDFS parameters can be configured in the same way as for the Spark master node group template.
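A matching CLI sketch for the slave node group template follows the same pattern; again, the flavor and network values are placeholders, and only the node processes (slave, datanode) differ from the master template:

```shell
# Write the slave node group template definition to a file
cat > spark_slave_ngt.json <<'EOF'
{
  "name": "spark-slave",
  "plugin_name": "spark",
  "hadoop_version": "1.3.1",
  "flavor_id": "<FLAVOR_ID>",
  "floating_ip_pool": "<EXTERNAL_NETWORK_ID>",
  "auto_security_group": true,
  "node_processes": ["slave", "datanode"]
}
EOF

# Create the node group template from the JSON definition
sahara node-group-template-create --json spark_slave_ngt.json
```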
Now that we have defined the basic templates for the Spark cluster, we will need to compile both entities into one cluster template. In the Sahara dashboard, select Cluster Templates and click on Create Template. Select Apache Spark as the Plugin name, with version 1.3.1, as follows:
Give the cluster template a name and a short description. It is also possible to specify which processes in the Spark cluster should run on different compute nodes for high-availability purposes. This is only valid when you have more than one compute node in the OpenStack environment.
The next tab in the same wizard allows you to add the necessary number of Spark instances based on the node group templates created previously. In our case, we will use one master Spark instance and three slave Spark instances, as shown here:
The next tab, General Parameters, provides more advanced cluster configuration, including the following:
The Spark Parameters tab allows you to specify the following:
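The cluster template assembled in the wizard can be sketched as a JSON definition combining the two node group templates, with one master instance and three slave instances as chosen above. The template IDs are placeholders you would obtain from the previously created node group templates:

```shell
# Write the cluster template definition: one master, three slaves
cat > spark_cluster_template.json <<'EOF'
{
  "name": "spark-cluster-template",
  "description": "One Spark master and three Spark slaves",
  "plugin_name": "spark",
  "hadoop_version": "1.3.1",
  "node_groups": [
    {"name": "master", "node_group_template_id": "<MASTER_NGT_ID>", "count": 1},
    {"name": "slaves", "node_group_template_id": "<SLAVE_NGT_ID>", "count": 3}
  ]
}
EOF

# Create the cluster template from the JSON definition
sahara cluster-template-create --json spark_cluster_template.json
```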
Based on the cluster template, the last step requires you only to click the Launch Cluster button from the Clusters tab in the Sahara dashboard. You will need to select the plugin name, Apache Spark, with version 1.3.1. Next, name the new cluster, select the cluster template created previously, and choose the base image registered in Sahara. Additionally, if you intend to access the cluster instances via SSH, select an existing SSH keypair. It is also possible to select from which network segment you will be able to manage the cluster instances; in our case, an existing private network, Private_Net10, will be used for this purpose.
Launch the cluster; this will take a while to finish spawning four instances forming the Spark cluster.
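The launch step can likewise be sketched from the CLI. All IDs below are placeholders for the resources created earlier, and the exact command names should be checked against your saharaclient version:

```shell
# Write the cluster launch definition referencing the template,
# the registered image, a keypair, and the management network
cat > spark_cluster.json <<'EOF'
{
  "name": "spark-cluster",
  "plugin_name": "spark",
  "hadoop_version": "1.3.1",
  "cluster_template_id": "<CLUSTER_TEMPLATE_ID>",
  "default_image_id": "<IMAGE_ID>",
  "user_keypair_id": "<KEYPAIR_NAME>",
  "neutron_management_network": "<PRIVATE_NET10_ID>"
}
EOF

# Launch the cluster
sahara cluster-create --json spark_cluster.json

# Check provisioning progress; the cluster is ready when its
# status reaches Active
sahara cluster-list
```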
The Spark cluster instances can be listed in the Compute Instances tab, as shown here:
In this article, we created a Spark cluster using Sahara in OpenStack by means of the Apache Spark plugin. The provisioned cluster includes one Spark master node and three Spark slave nodes. When the cluster status changes to the Active state, it is possible to start executing jobs.