This article is an excerpt from a book written by Muhammad Asif Abbasi titled Learning Apache Spark 2. In this book, you will learn how to perform big data analytics using Spark Streaming, machine learning techniques, and more.
In the article below, you will learn how to run Spark with the Mesos cluster manager.
What is Mesos?
Mesos is an open source cluster manager that started as a UC Berkeley research project in 2008 and is widely used by a number of organizations. Spark supports Mesos, and Matei Zaharia gave a keynote at MesosCon in June 2016. A video of the keynote is available on YouTube.
Before you start
If you haven’t installed Mesos previously, the getting started page on the Apache website gives a good walkthrough of installing Mesos on Windows, macOS, and Linux: https://mesos.apache.org/getting-started/.
- Once Mesos is installed, you need to start it up on your cluster.
- Start the Mesos master: ./bin/mesos-master.sh --ip=[MasterIP] --work_dir=/var/lib/mesos
- Start Mesos agents on all your worker nodes: ./bin/mesos-agent.sh --master=[MasterIP]:5050 --work_dir=/var/lib/mesos
- Make sure Mesos is up and running with all your relevant worker nodes configured: http://[MasterIP]:5050
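The start-up steps above can be collected into a small dry-run script. This is only a sketch: the address 10.0.0.1 is a hypothetical placeholder, the helper function names are invented for illustration, and the script prints the commands rather than executing them. The `--ip`, `--master`, and `--work_dir` flags follow the Mesos getting-started guide.

```shell
#!/bin/sh
# Dry-run sketch: print the Mesos start-up commands instead of running them.
# MASTER_IP is a hypothetical placeholder; replace it with your master's address.
MASTER_IP="${MASTER_IP:-10.0.0.1}"
WORK_DIR="${WORK_DIR:-/var/lib/mesos}"

# Command to run on the master node.
master_cmd() {
  printf './bin/mesos-master.sh --ip=%s --work_dir=%s\n' "$MASTER_IP" "$WORK_DIR"
}

# Command to run on every worker node.
agent_cmd() {
  printf './bin/mesos-agent.sh --master=%s:5050 --work_dir=%s\n' "$MASTER_IP" "$WORK_DIR"
}

# URL of the Mesos web UI, useful for checking that all agents have registered.
web_ui_url() {
  printf 'http://%s:5050\n' "$MASTER_IP"
}

master_cmd
agent_cmd
web_ui_url
```

Printing the commands first makes it easy to review them before running them by hand on each node.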
Make sure that the Spark binary packages are available and accessible by Mesos. They can be placed at any Hadoop-accessible URI, for example:
- HTTP via http://
- S3 via s3n://
- HDFS via hdfs://
You can also install Spark in the same location on all the Mesos agents, and configure spark.mesos.executor.home to point to that location.
Running in Mesos
Mesos can have a single master or multiple masters, which means the master URL differs when submitting an application from Spark to Mesos:
- Single master: a URL of the form mesos://[MasterIP]:5050
- Multiple masters (using ZooKeeper): a URL of the form mesos://zk://master1:2181,master2:2181/mesos
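The two master URL forms can be sketched as a pair of small helper functions. The function names are hypothetical, and the host names and ports are illustrative values only.

```shell
#!/bin/sh
# Sketch: build the Spark master URL for a Mesos cluster.
# Host names and ports are hypothetical examples.

# Single master: mesos://host:port (port defaults to 5050).
single_master_url() {
  printf 'mesos://%s:%s\n' "$1" "${2:-5050}"
}

# Multiple masters: mesos://zk://host1:port,host2:port/path
# The first argument is the ZooKeeper path; the rest are host:port pairs.
zk_master_url() {
  zk_path="$1"; shift
  hosts=""
  for h in "$@"; do
    # Join the host:port pairs with commas and no spaces.
    hosts="${hosts:+$hosts,}$h"
  done
  printf 'mesos://zk://%s%s\n' "$hosts" "$zk_path"
}

single_master_url mesosmaster
zk_master_url /mesos master1:2181 master2:2181
```

Note that the ZooKeeper host list is comma-separated without spaces, and the scheme is written in lowercase.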
Modes of operation in Mesos
Spark on Mesos supports both the client and cluster modes of operation.
Before running in client mode, you need to perform a couple of configuration steps:
- export MESOS_NATIVE_JAVA_LIBRARY=<Path to libmesos.so [Linux]> or <Path to libmesos.dylib [macOS]>
- export SPARK_EXECUTOR_URI=<URI of Spark zipped file uploaded to an accessible location, e.g. HTTP, HDFS, S3>
- Set spark.executor.uri to the URI of the Spark zipped file uploaded to an accessible location, e.g. HTTP, HDFS, S3
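As a rough illustration, the two environment variables might be set in conf/spark-env.sh as follows. Both values are assumptions made up for this sketch: the library path depends on where Mesos was installed, and the HDFS URI stands in for wherever you uploaded the Spark archive.

```shell
# Sketch of conf/spark-env.sh entries for client mode.
# Both values are hypothetical; use the paths that apply to your installation.

# Path to the Mesos native library (libmesos.so on Linux, libmesos.dylib on macOS).
export MESOS_NATIVE_JAVA_LIBRARY="/usr/local/lib/libmesos.so"

# URI of the zipped Spark distribution that the executors will download.
export SPARK_EXECUTOR_URI="hdfs://namenode:8020/spark/spark-2.x.tgz"
```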
For batch applications, you need to pass the Mesos URL as the master in your application program when creating your Spark context. As an example:
import org.apache.spark.{SparkConf, SparkContext}
val sparkConf = new SparkConf()
  .setMaster("mesos://mesosmaster:5050")
  .setAppName("My app")
  .set("spark.executor.uri", "<Location of Spark binaries (HTTP, S3, or HDFS)>")
val sc = new SparkContext(sparkConf)
If you are using spark-submit, you can configure the URI in the conf/spark-defaults.conf file using spark.executor.uri.
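One way to record that setting is sketched below. The helper `set_spark_default` is hypothetical, the URI is a placeholder, and the demonstration writes to a temporary file rather than a real Spark installation. It relies on the properties file format of one key and one value per line, separated by whitespace.

```shell
#!/bin/sh
# Sketch: record a property in a spark-defaults.conf style file.
# set_spark_default is a hypothetical helper, not part of Spark.
set_spark_default() {
  conf_file="$1"; key="$2"; value="$3"
  touch "$conf_file"
  # Drop any previous line for this key, then append the new setting.
  awk -v k="$key" '$1 != k' "$conf_file" > "$conf_file.tmp"
  mv "$conf_file.tmp" "$conf_file"
  printf '%s %s\n' "$key" "$value" >> "$conf_file"
}

# Demonstration against a temporary file; the URI is a placeholder.
CONF_FILE="$(mktemp)"
set_spark_default "$CONF_FILE" spark.executor.uri "hdfs://namenode:8020/spark/spark-2.x.tgz"
cat "$CONF_FILE"
rm -f "$CONF_FILE"
```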
When you are running one of the provided Spark shells for interactive querying, you can pass the master argument, e.g.:
./bin/spark-shell --master mesos://mesosmaster:5050
Just as with YARN, you can run Spark on Mesos in cluster mode, which means the driver is launched inside the cluster. The client can disconnect after submitting the application and get the results from the Mesos web UI.
Steps to use the cluster mode
- Start the MesosClusterDispatcher in your cluster: ./sbin/start-mesos-dispatcher.sh --master mesos://mesosmaster:5050. This will generally start the dispatcher at port 7077.
- From the client, submit a job to the Mesos cluster using spark-submit, specifying the dispatcher URL as the master (for example, mesos://dispatcher:7077).
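The two cluster-mode steps can be sketched as a dry-run script that prints the commands instead of executing them. The host names, the class name com.example.MyApp, and the jar URL are hypothetical placeholders. In cluster mode the application jar generally needs to be at a URI that the Mesos agents can reach, which is why the sketch uses an HTTP location.

```shell
#!/bin/sh
# Dry-run sketch of the two cluster-mode steps; commands are printed, not run.
# Host names, the class name, and the jar URL are hypothetical placeholders.
MESOS_MASTER="mesos://mesosmaster:5050"
DISPATCHER="mesos://dispatcher:7077"

# Step 1: start the dispatcher on a node inside the cluster.
dispatcher_cmd() {
  printf './sbin/start-mesos-dispatcher.sh --master %s\n' "$MESOS_MASTER"
}

# Step 2: submit from the client, pointing --master at the dispatcher.
# Arguments: main class, application jar URI.
submit_cmd() {
  printf './bin/spark-submit --class %s --master %s --deploy-mode cluster %s\n' \
    "$1" "$DISPATCHER" "$2"
}

dispatcher_cmd
submit_cmd com.example.MyApp http://repo.example.com/my-app.jar
```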
As with Spark itself, Spark on Mesos has many properties that can be set to optimize processing. Refer to the Spark Configuration page (http://spark.apache.org/docs/latest/configuration.html) for more information.
Mesos run modes
Spark can run on Mesos in two modes:
- Coarse-grained (default mode): Spark acquires a long-running Mesos task on each machine. This offers a much lower startup cost, but the resources remain allocated to Spark for the complete duration of the application.
- Fine-grained (deprecated): In this mode, a separate Mesos task is created for each Spark task. The benefit is that each application receives cores according to its requirements, but the bootstrapping overhead of each task might act as a deterrent for interactive applications.
Key Spark on Mesos configuration properties
While Spark has a number of properties that can be configured to optimize processing, some of these properties are specific to Mesos. We’ll look at a few of those key properties here.
|Property Name|Meaning/Default Value|
|spark.mesos.coarse|Setting it to true (the default value) runs Spark on Mesos in coarse-grained mode. Setting it to false runs it in fine-grained mode.|
|spark.mesos.extra.cores|This is more of an advertisement than an allocation, intended to improve parallelism. An executor will pretend that it has extra cores, resulting in the driver sending it more work.|
|spark.mesos.mesosExecutor.cores|Only works in fine-grained mode. This specifies how many cores should be given to each Mesos executor.|
|spark.mesos.executor.home|Identifies the directory of the Spark installation for the executors in Mesos. As discussed, executors can instead obtain Spark from spark.executor.uri; if you have not set that, specify the installation directory using this property.|
|spark.mesos.executor.memoryOverhead|The amount of additional memory (in MB) to be allocated per executor.|
|spark.mesos.uris|A comma-separated list of URIs to be downloaded when the driver or executor is launched by Mesos.|
|spark.mesos.principal|The name of the principal used by Spark to authenticate itself with Mesos.|
You can find other configuration properties at the Spark documentation page (http://spark.apache.org/docs/latest/running-on-mesos.html#spark-properties).
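Mesos-specific properties like these can be passed to spark-submit as `--conf key=value` flags. The sketch below only prints the flags: `conf_flags` is a hypothetical helper, and the values shown (including the principal name spark-user) are illustrative, not recommendations.

```shell
#!/bin/sh
# Sketch: turn key=value pairs into spark-submit --conf flags (printed only).
# conf_flags is a hypothetical helper; the values are illustrative.
conf_flags() {
  flags=""
  for kv in "$@"; do
    # Separate successive flags with a single space.
    flags="${flags:+$flags }--conf $kv"
  done
  printf '%s\n' "$flags"
}

conf_flags \
  spark.mesos.coarse=true \
  spark.mesos.executor.memoryOverhead=512 \
  spark.mesos.principal=spark-user
```

The same key and value pairs could instead be placed in conf/spark-defaults.conf if they should apply to every submission.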
To summarize, this article covered what you need to get started with running Spark on Mesos.
To learn more about Spark SQL, Spark Streaming, and machine learning with Spark, you can refer to the book Learning Apache Spark 2.