In this article by Omar Khedher, author of OpenStack Sahara Essentials, we will use Sahara to create and launch a Spark cluster. Sahara provides several plugins to provision Hadoop clusters on top of OpenStack. We will be using the Spark plugin to provision Apache Spark clusters using Horizon.
The following diagram illustrates our Spark cluster topology, which includes one Spark master node and three Spark slave nodes:
The following link provides several Sahara images available for download for different plugins:
http://sahara-files.mirantis.com/images/upstream/liberty.
Note that the upstream Sahara image files target the OpenStack Liberty release. From Horizon, click on Compute and select Images, then click on Create Image and add the new image, as shown here:
We will need to upload the downloaded image to Glance so that it can be registered in the Sahara image registry catalog. Make sure that the new image is active. Then click on the Data Processing tab and select Image Registry. Click on Register Image to register the newly uploaded Glance image with Sahara, as shown here:
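The same image preparation can be sketched from the command line as an alternative to Horizon. This is a minimal sketch assuming the Liberty-era glance and sahara CLI clients; the image file name and IDs are placeholders, and flag names may differ between client versions:

```shell
# Download an upstream Spark image for Liberty from the link above
# (the exact file name below is a placeholder)
wget http://sahara-files.mirantis.com/images/upstream/liberty/<spark-image>.qcow2

# Upload the image to Glance
glance image-create --name sahara-spark-1.3.1 \
    --disk-format qcow2 --container-format bare \
    --file <spark-image>.qcow2

# Register the Glance image in the Sahara image registry, then tag it
# so the Spark plugin can find it (ubuntu is the image login user)
sahara image-register --id <IMAGE_ID> --username ubuntu
sahara image-add-tag --id <IMAGE_ID> --tag spark
sahara image-add-tag --id <IMAGE_ID> --tag 1.3.1
```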
Click on Done and the new Spark image is ready to start launching the Spark cluster.
Node group templates in Sahara facilitate the configuration of a set of instances that share the same properties, such as RAM and CPU. We will start by creating the first node group template for the Spark master. From the Data Processing tab, select Node Group Templates and click on Create Template. Our first node group template will be based on Apache Spark version 1.3.1, as shown here:
The next step in the wizard will guide you through specifying the name of the template, the instance flavor, the storage location, and which floating IP pool will be assigned to the cluster instances:
The next tab in the same wizard lets you select which processes the nodes in the cluster will run. In our case, the Spark master node group template will include the Spark master and HDFS namenode processes, as shown here:
The next tab in the wizard exposes more choices regarding the security groups that will be applied for the template cluster nodes:
The last tab in the wizard exposes more specific HDFS configuration options that depend on the available resources of the cluster, such as disk space, CPU, and memory:
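The master node group template built through the wizard above can also be sketched with the sahara CLI, which accepts a JSON definition. This is only a sketch: the flavor and external network values are placeholders, and the exact command syntax may vary between saharaclient releases. The process names (master, namenode) match those selected in the wizard:

```shell
# Write the master node group template definition to a file
cat > spark_master_ngt.json <<'EOF'
{
  "name": "spark-master",
  "plugin_name": "spark",
  "hadoop_version": "1.3.1",
  "flavor_id": "<FLAVOR_ID>",
  "floating_ip_pool": "<EXTERNAL_NETWORK_ID>",
  "auto_security_group": true,
  "node_processes": ["master", "namenode"]
}
EOF

# Create the node group template from the JSON definition
sahara node-group-template-create --json spark_master_ngt.json
```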
Creating the Spark slave node group template is performed in the same way as the Spark master node group template, except for the assignment of the node processes.
The Spark slave nodes will be running Spark slave and HDFS datanode processes, as shown here:
Security groups and HDFS parameters can be configured in the same way as for the Spark master node group template.
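A matching CLI sketch for the slave node group template follows the same pattern; again, the flavor and network values are placeholders, and only the node processes (slave, datanode) differ from the master template:

```shell
# Write the slave node group template definition to a file
cat > spark_slave_ngt.json <<'EOF'
{
  "name": "spark-slave",
  "plugin_name": "spark",
  "hadoop_version": "1.3.1",
  "flavor_id": "<FLAVOR_ID>",
  "floating_ip_pool": "<EXTERNAL_NETWORK_ID>",
  "auto_security_group": true,
  "node_processes": ["slave", "datanode"]
}
EOF

# Create the node group template from the JSON definition
sahara node-group-template-create --json spark_slave_ngt.json
```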
Now that we have defined the basic templates for the Spark cluster, we will need to compile both entities into one cluster template. In the Sahara dashboard, select Cluster Templates and click on Create Template. Select Apache Spark as the Plugin name, with version 1.3.1, as follows:
Give the cluster template a name and a short description. It is also possible to specify which processes in the Spark cluster should run on different compute nodes for high-availability purposes. This is only valid when you have more than one compute node in the OpenStack environment.
The next tab in the same wizard allows you to add the necessary number of Spark instances based on the node group templates created previously. In our case, we will use one master Spark instance and three slave Spark instances, as shown here:
The next tab, General Parameters, provides more advanced cluster configuration, including the following:
The Spark Parameters tab allows you to specify the following:
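The cluster template assembled in the wizard can be sketched as a JSON definition combining the two node group templates, with one master instance and three slave instances as chosen above. The template IDs are placeholders you would obtain from the previously created node group templates:

```shell
# Write the cluster template definition: one master, three slaves
cat > spark_cluster_template.json <<'EOF'
{
  "name": "spark-cluster-template",
  "description": "One Spark master and three Spark slaves",
  "plugin_name": "spark",
  "hadoop_version": "1.3.1",
  "node_groups": [
    {"name": "master", "node_group_template_id": "<MASTER_NGT_ID>", "count": 1},
    {"name": "slaves", "node_group_template_id": "<SLAVE_NGT_ID>", "count": 3}
  ]
}
EOF

# Create the cluster template from the JSON definition
sahara cluster-template-create --json spark_cluster_template.json
```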
Based on the cluster template, the last step requires you only to click the Launch Cluster button from the Clusters tab in the Sahara dashboard. You will need to select the plugin name, Apache Spark, with version 1.3.1. Next, name the new cluster, select the cluster template created previously, and choose the base image registered in Sahara. Additionally, if you intend to access the cluster instances via SSH, select an existing SSH keypair. It is also possible to select from which network segment you will be able to manage the cluster instances; in our case, an existing private network, Private_Net10, will be used for this purpose.
Launch the cluster; this will take a while to finish spawning four instances forming the Spark cluster.
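The launch step can likewise be sketched from the CLI. All IDs below are placeholders for the resources created earlier, and the exact command names should be checked against your saharaclient version:

```shell
# Write the cluster launch definition referencing the template,
# the registered image, a keypair, and the management network
cat > spark_cluster.json <<'EOF'
{
  "name": "spark-cluster",
  "plugin_name": "spark",
  "hadoop_version": "1.3.1",
  "cluster_template_id": "<CLUSTER_TEMPLATE_ID>",
  "default_image_id": "<IMAGE_ID>",
  "user_keypair_id": "<KEYPAIR_NAME>",
  "neutron_management_network": "<PRIVATE_NET10_ID>"
}
EOF

# Launch the cluster
sahara cluster-create --json spark_cluster.json

# Check provisioning progress; the cluster is ready when its
# status reaches Active
sahara cluster-list
```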
The Spark cluster instances can be listed in the Compute Instances tab, as shown here:
In this article, we created a Spark cluster using Sahara in OpenStack by means of the Apache Spark plugin. The provisioned cluster includes one Spark master node and three Spark slave nodes. When the cluster status changes to the Active state, it is possible to start executing jobs.