How to maintain Apache Mesos

[box type="note" align="" class="" width=""]This article is an excerpt from a book written by David Blomquist and Tomasz Janiszewski, titled Apache Mesos Cookbook. Throughout the course of the book, you will get to know tips and tricks along with best practices to follow when working with Mesos.[/box]

In this article, we will learn about configuring logging options, setting up a monitoring ecosystem, and upgrading a Mesos cluster.

Logging and debugging

Here we will configure logging options that will allow us to debug the state of Mesos.

Getting ready

We will assume Mesos is available on localhost port 5050. The steps provided here will work for either master or agents.

How to do it...

When Mesos is installed from pre-built packages, the logs are stored in /var/log/mesos/ by default. When it is installed from a source build, storing logs is disabled by default. To change the log location, edit /etc/default/mesos and set the LOGS variable to the desired destination. Note that mesos-init-wrapper does not pass the contents of /etc/mesos/log_dir to the --log_dir flag, which is why the log destination has to be set through the environment variable.

Remember that only Mesos logs will be stored there. Logs from third-party applications (for example, ZooKeeper) will still be sent to STDERR.

Changing the default logging level can be done in one of two ways: by specifying the --logging_level flag or by sending a request that changes the logging level at runtime for a specific period of time.

For example, to change the logging level to INFO, write it to the /etc/mesos/logging_level file:

echo INFO > /etc/mesos/logging_level

The possible levels are INFO, WARNING, and ERROR.

For example, to change the logging level to the most verbose for 15 minutes for debug purposes, we need to send the following request to the logging/toggle endpoint:

curl -v -X POST 'localhost:5050/logging/toggle?level=3&duration=15mins'
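The same request can also be sent from a script. The following is a minimal Python sketch that uses only the standard library and assumes Mesos is reachable on localhost port 5050, as stated earlier. The helper names toggle_url and toggle_logging are hypothetical and are not part of Mesos.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def toggle_url(host, port, level, duration):
    """Build the URL for the logging/toggle endpoint."""
    query = urlencode({"level": level, "duration": duration})
    return "http://{}:{}/logging/toggle?{}".format(host, port, query)


def toggle_logging(host="localhost", port=5050, level=3, duration="15mins"):
    """Send the POST request and return the HTTP status code."""
    request = Request(toggle_url(host, port, level, duration),
                      data=b"", method="POST")
    with urlopen(request, timeout=10) as response:
        return response.status


# Example (needs a running master or agent):
#   toggle_logging(level=3, duration="15mins")
```

Building the query string with urlencode avoids the shell quoting problem that an unquoted & causes on the command line.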

How it works...

Mesos uses the Google glog library for logging, but third-party dependencies such as ZooKeeper have their own logging solutions. All the configuration options described here are backed by glog and apply only to Mesos core code.

Monitoring

Now, we will set up monitoring for Mesos.

Getting ready

We must have a running monitoring ecosystem. Metrics storage could be a simple time-series database such as Graphite, InfluxDB, or Prometheus. In the following example, we are using Graphite, and our metrics are published with Diamond (http://diamond.readthedocs.io/en/latest/).

How to do it...

Monitoring is enabled by default. Mesos does not provide any way to automatically push metrics to the registry. However, it exposes them as JSON that can be periodically pulled and saved into the metrics registry:

Install Diamond using the following command: pip install diamond

If additional packages are required, install them by running: sudo apt-get install python-pip python-dev build-essential

pip (Pip Installs Packages) is a Python package manager used to install software written in Python.

Configure the metrics handler and interval. Open /etc/diamond/diamond.conf and ensure that there is a section for graphite configuration:

[handler_graphite]
class = handlers.GraphiteHandler
host = <graphite.host>
port = <graphite.port>

Remember to replace graphite.host and graphite.port with the real Graphite details.

Enable the default Mesos Collector. Create the configuration files by running: diamond-setup -C MesosCollector

Check whether the configuration has proper values and edit them if needed. The configuration can be found in /etc/diamond/collectors/MesosCollector.conf. On a master, this file should look like this:

enabled = True
host = localhost
port = 5050

On an agent, the port could be different (5051), as follows:

enabled = True
host = localhost
port = 5051

How it works...

Mesos exposes metrics via the HTTP API. Diamond is a small process that periodically pulls the metrics, parses them, and sends them to the metrics registry, in this case Graphite. The default implementation of the Mesos Collector does not store all the available metrics, so it's recommended to write a custom collector that gathers all the information you are interested in.
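To illustrate the pull-and-forward cycle, here is a minimal Python sketch that is independent of Diamond. It assumes the metrics are read from the metrics/snapshot endpoint, which returns a flat JSON object of metric names and values, and that the output should be in Graphite's plaintext format ("path value timestamp"). The function names and the mesos prefix are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_snapshot(host="localhost", port=5050):
    """Pull the flat JSON metrics object from metrics/snapshot."""
    url = "http://{}:{}/metrics/snapshot".format(host, port)
    with urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def to_graphite_lines(snapshot, prefix, timestamp):
    """Convert metrics such as {"master/cpus_total": 4.0} to Graphite lines."""
    lines = []
    for name in sorted(snapshot):
        # Graphite separates path components with dots, Mesos with slashes.
        path = "{}.{}".format(prefix, name.replace("/", "."))
        lines.append("{} {} {}".format(path, snapshot[name], int(timestamp)))
    return lines


# Example (needs a running master):
#   import time
#   for line in to_graphite_lines(fetch_snapshot(), "mesos", time.time()):
#       print(line)
```

A custom Diamond collector would do the equivalent of fetch_snapshot inside its collect method and publish each value, but the sketch above keeps to the standard library so that it can be run on its own.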

Upgrading Mesos

In this recipe, you will learn how to upgrade your Mesos cluster.

How to do it...

Mesos has a release cadence of at least one release per quarter. Minor releases are backward compatible, although there could be some small incompatibilities or deprecated methods being dropped. The recommended method of upgrading is to apply all intermediate versions in order. For example, to upgrade from 0.27.2 to 1.0.0, we should apply 0.28.0, 0.28.1, 0.28.2, and finally 1.0.0.
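The list of intermediate versions can be worked out mechanically. The following Python sketch assumes that the caller supplies the list of available releases; the helper names are hypothetical, and the sample release list in the usage comment is illustrative rather than a complete list of Mesos releases.

```python
def parse_version(version):
    """Turn "0.28.1" into (0, 28, 1) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def upgrade_path(current, target, releases):
    """Return the releases newer than current, up to and including target."""
    low, high = parse_version(current), parse_version(target)
    steps = [r for r in releases if low < parse_version(r) <= high]
    return sorted(steps, key=parse_version)


# Example with an illustrative release list:
#   releases = ["0.27.1", "0.27.2", "0.28.0", "0.28.1", "0.28.2", "1.0.0", "1.0.1"]
#   upgrade_path("0.27.2", "1.0.0", releases)
```

Comparing versions as tuples of integers rather than as strings matters once a component reaches two digits, since "0.9.0" sorts after "0.28.0" as text.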

If the agent's configuration changes, clearing the metadata directory is required. You can do this with the following command:

rm -rv {MESOS_DIR}/metadata

Here, {MESOS_DIR} should be replaced with the configured Mesos directory. A rolling upgrade is the preferred method of upgrading clusters, starting with the masters and then moving on to the agents.

To minimize the impact on running tasks, if an agent's configuration changes and it becomes inaccessible, then it should be switched to maintenance mode.

How it works...

Configuration changes may require clearing the metadata because the changes may not be backward compatible. For example, when an agent is restarted with a different set of isolators, it shouldn't attach to processes that were started without those isolators. The Mesos architecture guarantees that executors that were not reattached to the Mesos agent will terminate themselves after a configurable amount of time (--executor_registration_timeout).

Maintenance mode allows you to declare the time window during which the agent will be inaccessible. When this occurs, Mesos sends an inverse offer to all the frameworks, asking them to drain that particular agent. The frameworks are responsible for shutting down their tasks and spawning them on another agent. Maintenance mode is applied even if a framework does not implement the HTTP API or explicitly declines the inverse offer. Using maintenance mode can prevent tasks from being restarted multiple times.
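A maintenance window is declared by posting a schedule to the master. The following Python sketch only builds the request body; it assumes the JSON layout described in the Mesos maintenance documentation (a list of windows, each with machine_ids and an unavailability given in nanoseconds), so check it against the documentation for your Mesos version. The function name and the sample host are hypothetical.

```python
NANOS_PER_SECOND = 1000000000


def maintenance_schedule(machines, start_seconds, duration_seconds):
    """Build a one-window schedule for a list of (hostname, ip) pairs."""
    return {
        "windows": [
            {
                "machine_ids": [
                    {"hostname": hostname, "ip": ip}
                    for hostname, ip in machines
                ],
                "unavailability": {
                    # Start is a Unix timestamp, converted to nanoseconds.
                    "start": {"nanoseconds": start_seconds * NANOS_PER_SECOND},
                    "duration": {
                        "nanoseconds": duration_seconds * NANOS_PER_SECOND
                    },
                },
            }
        ]
    }


# Example: one hypothetical agent, unavailable for an hour.
#   maintenance_schedule([("agent1.example.com", "10.0.0.1")], 1443830400, 3600)
```

The resulting dictionary can be serialized with json.dumps and posted to the master's maintenance/schedule endpoint.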

Consider the following example with five agents and one task, X. We schedule the rolling upgrade of all the agents. Task X is deployed on agent 1. When it goes down, it's moved to 2, then to 3, and so on. This approach is extremely inefficient because the task is restarted five times, but it only needs to be restarted twice. Maintenance mode enables the framework to optimally schedule the task to run on agent 5 when 1 goes down, and then return to 1 when 5 goes down:

[Figure: worst-case scenario of a rolling upgrade without maintenance mode, compared with the optimal solution of a rolling upgrade with maintenance mode.]
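The restart counts in this example can be checked with a small simulation. The Python sketch below is a simplified model and not Mesos code: agents are upgraded one at a time in a fixed order, there are at least two agents, and the two placement policies (worst_case and maintenance_aware) are hypothetical stand-ins for a framework without and with knowledge of the maintenance schedule.

```python
def count_restarts(schedule, start, choose):
    """Count how often the task restarts while agents go down in order."""
    location = start
    restarts = 0
    for index, agent in enumerate(schedule):
        if agent != location:
            continue  # the task is not on the agent being upgraded
        pending = schedule[index + 1:]  # agents still waiting for the upgrade
        location = choose(schedule, agent, pending)
        restarts += 1
    return restarts


def worst_case(schedule, down, pending):
    """Move the task to the agent that happens to be upgraded next."""
    if pending:
        return pending[0]
    return next(a for a in schedule if a != down)


def maintenance_aware(schedule, down, pending):
    """Prefer an upgraded agent, otherwise the one upgraded last."""
    done = [a for a in schedule if a != down and a not in pending]
    if done:
        return done[0]
    return pending[-1]


agents = [1, 2, 3, 4, 5]
print(count_restarts(agents, 1, worst_case))         # prints 5
print(count_restarts(agents, 1, maintenance_aware))  # prints 2
```

With the maintenance-aware policy the task moves from agent 1 to agent 5, and then back to agent 1 once agent 5 is upgraded, which matches the two restarts described above.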

We have learned about running and maintaining Mesos. To learn more about managing containers and understanding the scheduler API, check out the book Apache Mesos Cookbook.