
In this article, Francesco Pierfederici, author of the book Distributed Computing with Python, notes that “distributed systems, both large and small, can be extremely challenging to test and debug, as they are spread over a network, run on computers that can be quite different from each other, and might even be physically located in different continents altogether”.

Moreover, the computers we use could have different user accounts, different disks with different software packages, different hardware resources, and very uneven performance. Some can even be in a different time zone. Developers of distributed systems need to consider all these pieces of information when trying to foresee failure conditions. Operators have to work around all of these challenges when debugging errors.


The big picture

Testing and debugging monolithic applications is not simple, as every developer knows. However, there are a number of tools that make the task dramatically easier, including the pdb debugger, various profilers (notable mentions include cProfile and line_profiler), linters, static code analysis tools, and a host of test frameworks, a number of which are included in the standard library of Python 3.3 and higher.

The challenge with distributed applications is that most of the tools and packages that we can use for single-process applications lose much of their power when dealing with multiple processes, especially when these processes run on different computers.

Debugging and profiling distributed applications written in C, C++, and Fortran can be done with tools such as Intel VTune, Allinea MAP, and DDT. Unfortunately, Python developers are left with very few or no options for the time being.

Writing small- or medium-sized distributed systems is not terribly hard, as we have seen so far. The main difference between writing monolithic programs and distributed applications is the large number of interdependent components running on remote hardware. This is what makes monitoring and debugging distributed code harder and less convenient.

Fortunately, we can still use all the familiar debuggers and code analysis tools on our Python distributed applications. Unfortunately, these tools only take us so far; beyond a certain point, we have to rely on old-fashioned logging and print statements to get the full picture of what went wrong.

Common problems – clocks and time

Time is a handy quantity to work with. For instance, timestamps are a natural choice when we want to join different streams of data, sort database records, and, in general, reconstruct the timeline for a series of events, which we often observe out of order. In addition, some tools (for example, GNU make) rely solely on file modification times and are easily confused by clock skew between machines.

For these reasons, clock synchronization among all the computers and systems we use is very important. If our computers are in different time zones, we might want to not only synchronize their clocks but also set them to Coordinated Universal Time (UTC) for simplicity. In all cases where changing clocks to UTC is not possible, a good piece of advice is to always process time in UTC within our code and to convert to local time only for display purposes.
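
As a minimal illustration of this advice, the sketch below keeps timestamps timezone-aware and in UTC internally and converts to local time only when displaying a value (the event and record names are purely illustrative):

from datetime import datetime, timezone

# Store and compare timestamps in UTC throughout the application.
event_time = datetime.now(timezone.utc)

# Serialize in an unambiguous, sortable format (ISO 8601 with offset).
record = {'event': 'job_started', 'timestamp': event_time.isoformat()}

# Convert to the local time zone only when showing the value to a user.
print('Job started at', event_time.astimezone())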

In general, clock synchronization in distributed systems is a fascinating and complex topic, and it is out of the scope of this article. Most readers, luckily, are likely to be well served by the Network Time Protocol (NTP), which is a perfectly fine synchronization solution for most situations. Most modern operating systems, including Windows, Mac OS X, and Linux, have great support for NTP.

Another thing to consider when talking about time is the timing of periodic actions, such as polling loops or cronjobs. Many applications need to spawn processes or perform actions (for example, sending a confirmation e-mail or checking whether new data is available) at regular intervals.

A common pattern is to set up timers (either in our code or via the tools provided by the OS) and have all these timers go off at some time, usually at a specific hour and at regular intervals after that. The risk of this approach is that we can overload the system the very moment all these processes wake up and start their work.

A surprisingly common example would be starting a significant number of processes that all need to read some configuration or data file from a shared disk. In these cases, everything goes fine until the number of processes becomes so large that the shared disk cannot handle the data transfer, thus slowing our application to a crawl.

A common solution is the staggering of these timers in order to spread them out over a longer time interval. In general, since we do not always control all code that we use, it is good practice to start our timers at some random number of minutes past the hour, just to be safe.
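
As a hedged sketch of this staggering idea (the task, interval, and maximum offset below are all made up for illustration), we can simply add a random delay before the first run of a periodic task:

import random
import time

# Hypothetical periodic task that many processes run at the same interval.
def poll_for_new_data():
    print('polling for new data...')

INTERVAL = 3600  # run once an hour

# Sleep for a random number of seconds (here, up to 15 minutes) before the
# first run so that all the processes do not hit shared resources at once.
time.sleep(random.uniform(0, 900))

while True:
    poll_for_new_data()
    time.sleep(INTERVAL)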

Another example of this situation would be an image-processing service that periodically polls a set of directories looking for new data. When new images are found, they are copied to a staging area, renamed, scaled, and potentially converted to a common format before being archived for later use. If we’re not careful, it would be easy to overload the system if many images were to be uploaded at once.

A better approach would be to throttle our application (maybe using a queue-based architecture) so that it would only start an appropriately small number of image processors so as to not flood the system.
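
A minimal sketch of such a throttled design, assuming a hypothetical process_image function and using a small, fixed-size process pool so that only a few images are handled at any one time, might look like this:

import multiprocessing

# Hypothetical function that stages, renames, scales, and archives one image.
def process_image(path):
    print('processing', path)
    return path

if __name__ == '__main__':
    # Paths discovered by the polling loop; in a real system, these would
    # come from a queue that is fed as new images arrive.
    new_images = ['img001.png', 'img002.png', 'img003.png']

    # Limit concurrency to a small number of workers so that a burst of
    # uploads does not overwhelm the shared disk or the network.
    with multiprocessing.Pool(processes=2) as pool:
        for result in pool.imap_unordered(process_image, new_images):
            print('done with', result)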

Common problems – software environments

Another common challenge is making sure that the software installed on all the various machines we are ever going to use is consistent and consistently upgraded.

Unfortunately, it is frustratingly common to spend hours debugging a distributed application only to discover that for some unknown and seemingly impossible reason, some computers had an old version of the code and/or its dependencies. Sometimes, we might even find the code to have disappeared completely.

The reasons for these discrepancies can be many: from a mount point that failed to a bug in our deployment procedures to a simple human mistake.

A common approach, especially in the HPC world, is to always create a self-contained environment for our code before launching the application itself. Some projects go as far as preferring static linking of all dependencies to avoid having the runtime pick up the wrong version of a dynamic library.

This approach works well if the application runtime is long compared to the time it takes to build its full environment, all of its software dependencies, and the application itself. It is not that practical otherwise.

Python, fortunately, has the ability to create self-contained virtual environments. There are two related tools that we can use: pyvenv (available as part of the Python 3.5 standard library) and virtualenv (available on PyPI). Additionally, pip, the Python package management system, allows us to specify the exact version of each package we want to install in a requirements file. These tools, when used together, permit reasonable control over the execution environment.
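
As a sketch of how these tools fit together (the environment directory and the requirements.txt file here are hypothetical), we can create a virtual environment programmatically and install pinned dependencies into it:

import subprocess
import venv
from pathlib import Path

# Hypothetical location for our self-contained environment.
env_dir = Path('myapp-env')

# Create the virtual environment, making sure pip is available inside it.
venv.EnvBuilder(with_pip=True).create(str(env_dir))

# Install the exact versions pinned in requirements.txt (for example,
# "celery==3.1.20") using the environment's own pip. On Windows, the
# pip executable lives under Scripts\ rather than bin/.
pip = env_dir / 'bin' / 'pip'
subprocess.check_call([str(pip), 'install', '-r', 'requirements.txt'])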

However, the devil, as it is often said, is in the details, and different computer nodes might use the exact same Python virtual environment but incompatible versions of some external library.

In this respect, container technologies such as Docker (https://www.docker.com) and, in general, version-controlled virtual machines are promising ways out of the software runtime environment maelstrom in those environments where they can be used.

In all other cases (HPC clusters come to mind), the best approach will probably be to not rely on the system software and to manage our own environments and the full software stack.

Common problems – permissions and environments

Different computers might run our code under different user accounts, and our application might expect to be able to read a file or write data into a specific directory, only to hit an unexpected permission error. Even in cases where the user accounts used by our code are all the same (down to the same user ID and group ID), their environments may differ across hosts. Therefore, an environment variable we assumed to be defined might not be or, even worse, might be set to an incompatible value.

These problems are common when our code runs as a special, unprivileged user such as nobody. Defensive coding, especially when accessing the environment, is often necessary: we should always fall back to sensible defaults when variables are undefined (that is, value = os.environ.get('SOME_VAR', fallback_value) instead of simply value = os.environ['SOME_VAR']).
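
A short sketch of this defensive pattern, with a hypothetical MYAPP_DATA_DIR variable and a logged warning when the fallback is used, could look like this:

import logging
import os

logger = logging.getLogger(__name__)

# Hypothetical variable our code needs; fall back to a sensible default and
# log the fact, rather than crashing with a KeyError.
data_dir = os.environ.get('MYAPP_DATA_DIR')
if data_dir is None:
    data_dir = '/tmp/myapp'
    logger.warning('MYAPP_DATA_DIR not set; falling back to %s', data_dir)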

A common approach, when this is possible, is to only run our applications under a specific user account that we control and specify the full set of environment variables our code needs in the deployment and application startup scripts (which will have to be version controlled as well).

Some systems, however, not only execute jobs under extremely limited user accounts, but they also restrict code execution to temporary sandboxes. In many cases, access to the outside network is blocked as well. In these situations, one might have no choice but to set up the full environment locally and copy it to a shared disk partition. Other data can be served from custom-built servers running as ancillary jobs just for this purpose.

In general, permission problems and user environment mismatches are very similar to problems with the software environment and should be tackled in concert. Often times, developers find themselves wanting to isolate their code from the system as much as possible and create a small, but self-contained environment with all the code and all the environment variables they need.

Common problems – the availability of hardware resources

The hardware resources that our application needs might or might not be available at any given point in time. Moreover, even if some resources were available at some point, nothing guarantees that they will stay available for much longer. Problems we can face in this area include network glitches, which are quite common in many environments (especially for mobile apps) and which, for most practical purposes, are indistinguishable from machine or application crashes.

Applications using a distributed computing framework or job scheduler can often rely on the framework itself to handle at least some common failure scenarios. Some job schedulers will even resubmit our jobs in case of errors or sudden machine unavailability.

Complex applications, however, might need to implement their own strategies to deal with hardware failures. In some cases, the best strategy is to simply restart the application when the necessary resources are available again.

Other times, restarting from scratch would be cost prohibitive. In these cases, a common approach is to implement application checkpointing. What this means is that the application both writes its state to a disk periodically and is able to bootstrap itself from a previously saved state.

When implementing a checkpointing strategy, we need to balance the convenience of being able to restart an application midway against the performance hit of periodically writing its state to disk. Another consideration is the increase in code complexity, especially when many processes or threads are involved in reading and writing state information.

A good rule of thumb is that data or results that can be recreated easily and quickly do not warrant implementation of application checkpointing. If, on the other hand, some processing requires a significant amount of time and one cannot afford to waste it, then application checkpointing might be in order.

For example, climate simulations can easily run for several weeks or months at a time. In these cases, it is important to checkpoint them every hour or so, as restarting from the beginning after a crash would be expensive. On the other hand, a process that takes an uploaded image and creates a thumbnail for, say, a web gallery runs quickly and is not normally worth checkpointing.

To be safe, state should always be written and updated atomically (for example, by writing to a temporary file and replacing the original only after the write completes successfully). The last thing we want is to restart from a corrupted state!
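
A minimal sketch of such an atomic update, using a temporary file in the same directory and os.replace (the function name and file layout are the only assumptions here), might look like this:

import json
import os
import tempfile

def save_state_atomically(state, state_path):
    """Write state to a temporary file and only then replace the original."""
    dir_name = os.path.dirname(os.path.abspath(state_path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic, so readers see either the old or the new
        # state file, never a partially written one.
        os.replace(tmp_path, state_path)
    except BaseException:
        os.unlink(tmp_path)
        raise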

A situation that is very familiar to HPC users, as well as to users of AWS spot instances, is one where some or all of our application's processes are evicted from the machines they are running on. When this happens, a warning is typically sent to our processes (usually, a SIGQUIT signal), and after a few seconds, they are unceremoniously killed (via a SIGKILL signal). For AWS spot instances, the time of termination is available through a web service in the instance metadata. In either case, our applications are given some time to save their state and quit in an orderly fashion.

Python has powerful facilities to catch and handle signals (refer to the signal module). For example, the following simple script shows how we can implement a bare-bones checkpointing strategy in our application:

#!/usr/bin/env python3.5
"""
Simple example showing how to catch signals in Python
"""
import json
import os
import signal
import sys


# Path to the file we use to store state. Note that we assume
# $HOME to be defined, which is far from being an obvious
# assumption!
STATE_FILE = os.path.join(os.environ['HOME'],
                          '.checkpoint.json')


class Checkpointer:
    def __init__(self, state_path=STATE_FILE):
        """
        Read the state file, if present, and initialize from that.
        """
        self.state = {}
        self.state_path = state_path
        if os.path.exists(self.state_path):
            with open(self.state_path) as f:
                self.state.update(json.load(f))
        return

    def save(self):
        print('Saving state: {}'.format(self.state))
        with open(self.state_path, 'w') as f:
            json.dump(self.state, f)
        return

    def eviction_handler(self, signum, frame):
        """
        This is the function that gets called when a signal is trapped.
        """
        self.save()

        # Of course, using sys.exit is a bit brutal. We can do better.
        print('Quitting')
        sys.exit(0)
        return


if __name__ == '__main__':
    import time

    print('This is process {}'.format(os.getpid()))

    ckp = Checkpointer()
    print('Initial state: {}'.format(ckp.state))

    # Catch SIGQUIT
    signal.signal(signal.SIGQUIT, ckp.eviction_handler)

    # Get a value from the state.
    i = ckp.state.get('i', 0)
    try:
        while True:
            i += 1
            ckp.state['i'] = i
            print('Updated in-memory state: {}'.format(ckp.state))
            time.sleep(1)
    except KeyboardInterrupt:
        ckp.save()

If we run the preceding script in a terminal window and then, from another terminal window, send it a SIGQUIT signal (for example, via kill -s SIGQUIT <process id>), we see the checkpointing in action: the script saves its state to the checkpoint file and quits, and the next run picks up the counter from where the previous one left off.

A common situation in distributed applications is that of being forced to run code in potentially heterogeneous environments: machines (real or virtual) of different performances, with different hardware resources (for example, with or without GPUs), and potentially different software environments (as we mentioned already).

Even in the presence of a job scheduler that helps us choose the right software and hardware environment, we should always log the full environment as well as the performance of each execution machine. In advanced architectures, these performance metrics can be used to improve the efficiency of job scheduling.

PBS Pro, for instance, takes into consideration the historical performance figures of each job being submitted to decide where to execute it next. HTCondor continuously benchmarks each machine and makes those figures available for node selection and ranking.

Perhaps the most frustrating cases are those where, either due to the network itself or due to servers being overloaded, network requests take so long that our code hits its internal timeouts. This might lead us to believe that the counterpart service is not available. These bugs, especially when transient, can be quite hard to debug.

Challenges – the development environment

Another common challenge in distributed systems is the setup of a representative development and testing environment, especially for individuals or small teams. Ideally, in fact, the development environment should be identical to the worst-case deployment environment. It should allow developers to test common failure scenarios, such as a disk filling up, varying network latencies, intermittent network connections, hardware and software failures, and so on—all things that are bound to happen in real life, sooner or later.

Large teams have the resources to set up development and test clusters, and they almost always have dedicated software quality teams stress testing our code.

Small teams, unfortunately, often find themselves forced to write code on their laptops and use a very simplified (and best-case scenario!) environment made up of two or three virtual machines running on the laptops themselves to emulate the real system.

This pragmatic solution works and is definitely better than nothing. However, we should remember that virtual machines running on the same host exhibit unrealistically high availability and low network latencies. In addition, nobody will accidentally upgrade them without us knowing or image them with the wrong operating system. The environment is simply too controlled and stable to be realistic.

A step closer to a realistic setup would be to create a small development cluster on, say, AWS using the same VM images, with the same software stack and user accounts that we are going to use in production.

All things said, there is simply no replacement for the real thing. For cloud-based applications, it is worth our while to at least test our code on a smaller version of the deployment setup. For HPC applications, we should be using either a test cluster, a partition of the operational cluster, or a test queue for development and testing.

Ideally, we would develop on an exact clone of the operational system. Cost considerations and ease of development will constantly push us toward the multiple-VMs-on-a-laptop solution; it is simple, essentially free, and it works without an Internet connection, which is an important point.

We should, however, keep in mind that distributed applications are not impossibly hard to write; they just have more failure modes than their monolithic counterparts do. Some of these failure modes (especially those related to data access patterns) typically require a careful choice of architecture.

Correcting architectural choices dictated by false assumptions later on in the development stage can be costly. Convincing managers to give us the hardware resources that we need early on is usually difficult. In the end, this is a delicate balancing act.

A useful strategy – logging everything

Oftentimes, logging is like taking backups or eating vegetables—we all know we should do it, but most of us forget. In distributed applications, we simply have no other choice—logging is essential. Not only that, logging everything is essential.

With many different processes running on potentially ephemeral remote resources at difficult-to-predict times, the only way to understand what happens is to have logging information readily available and in an easily searchable format or system.

At the bare minimum, we should log process startup and exit time, exit code and exceptions (if any), all input arguments, all outputs, the full execution environment, the name and IP of the execution host, the current working directory, the user account as well as the full application configuration, and all software versions.
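
A hedged sketch of what logging that kind of context might look like, using only the standard library (the logger name and log file path are arbitrary choices), is shown below:

import getpass
import logging
import os
import socket
import sys

logging.basicConfig(filename='myapp.log', level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(name)s %(message)s')
logger = logging.getLogger('myapp')

# Record who, where, and with what the process is running.
logger.info('host=%s pid=%d user=%s cwd=%s',
            socket.gethostname(), os.getpid(), getpass.getuser(), os.getcwd())
logger.info('python=%s executable=%s argv=%s',
            sys.version.split()[0], sys.executable, sys.argv)
logger.info('environment=%s', dict(os.environ))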

The idea is that if something goes wrong, we should be able to use this information to log onto the same machine (if still available), go to the same directory, and reproduce exactly what our code was doing. Of course, being able to exactly reproduce the execution environment might simply not be possible (often times, because it requires administrator privileges).

However, we should always aim to be able to recreate a good approximation of that environment. This is where job schedulers really shine; they allow us to choose a specific machine and specify the full job environment, which makes replicating failures easier.

Logging software versions (not only the version of the Python interpreter, but also the version of all the packages used) helps diagnose outdated software stacks on remote machines. The Python package manager, pip, makes getting the list of installed packages easy: import pip; pip.main(['list']). Similarly, import sys; print(sys.executable, sys.version_info) displays the location and version of the interpreter.

It is also useful to create a system whereby all our classes and function calls emit logging messages with the same level of detail and at the same points in the object life cycle. Common approaches involve the use of decorators and, maybe a bit too esoteric for some, metaclasses. This is exactly what the autologging Python module (available on PyPI) does for us.
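
The decorator approach can be sketched in a few lines; the example below logs entry, exit, and exceptions for any function it wraps (exactly what gets logged is, of course, up to us, and the scale_image function is just a stand-in):

import functools
import logging

logger = logging.getLogger(__name__)

def logged(fn):
    """Log entry, exit, and exceptions for the wrapped callable."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        logger.debug('calling %s args=%r kwargs=%r', fn.__name__, args, kwargs)
        try:
            result = fn(*args, **kwargs)
        except Exception:
            logger.exception('%s raised', fn.__name__)
            raise
        logger.debug('%s returned %r', fn.__name__, result)
        return result
    return wrapper

@logged
def scale_image(path, factor=0.5):
    # Hypothetical body of a real worker function.
    return (path, factor)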

Once logging is in place, we face the question of where to store all these log messages, whose volume could be substantial at high verbosity levels in large applications. Simple installations will probably want to write log messages to text files on disk. More complex applications might want to store these messages in a database (which can be done by creating a custom handler for the Python logging module) or in specialized log aggregators such as Sentry (https://getsentry.com).
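
Creating such a custom handler is a matter of subclassing logging.Handler and overriding its emit method; a minimal sketch that stores records in a local SQLite database (a stand-in for whatever database we actually use, and ignoring concerns such as thread safety) could be:

import logging
import sqlite3

class SQLiteHandler(logging.Handler):
    """Store log records in a SQLite table (one row per record)."""
    def __init__(self, db_path='logs.db'):
        super().__init__()
        self.conn = sqlite3.connect(db_path)
        self.conn.execute('CREATE TABLE IF NOT EXISTS logs '
                          '(created REAL, level TEXT, logger TEXT, message TEXT)')

    def emit(self, record):
        self.conn.execute('INSERT INTO logs VALUES (?, ?, ?, ?)',
                          (record.created, record.levelname,
                           record.name, self.format(record)))
        self.conn.commit()

logging.getLogger().addHandler(SQLiteHandler())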

Closely related to logging is the issue of monitoring. Distributed applications can have many moving parts, and it is often essential to know which machines are up, which are busy, as well as which processes or jobs are currently running, waiting, or in an error state. Knowing which processes are taking longer than usual to complete their work is often an important warning sign that something might be wrong.

Several monitoring solutions for Python (often times, integrated with our logging system) exist. The Celery project, for instance, recommends flower (http://flower.readthedocs.org) as a monitoring and control web application. HPC job schedulers, on the other hand, tend to lack common, general-purpose, monitoring solutions that go beyond simple command-line clients.

Monitoring comes in handy in discovering potential problems before they become serious. It is in fact useful to monitor resources such as available disk space and trigger actions or even simple warning e-mails when they fall under a given threshold. Many centers monitor hardware performance and hard drive SMART data to detect early signs of potential problems.

These issues are more likely to be of interest to operations personnel rather than developers, but they are useful to keep in mind. They can also be integrated in our applications to implement strategies in order to handle performance degradations gracefully.

A useful strategy – simulating components

A good, although possibly expensive in terms of time and effort, test strategy is to simulate some or all of the components of our system. The reasons are multiple; on one hand, simulating or mocking software components allows us to test our interfaces to them more directly. In this respect, mock testing libraries, such as unittest.mock (part of the Python 3.5 standard library), are truly useful.

Another reason to simulate software components is to make them fail or misbehave on demand and see how our application responds. For instance, we could increase the response time of services such as REST APIs or databases to worst-case scenario levels and see what happens. Sometimes, we might exceed timeout values in some network calls, leading our application to incorrectly assume that the server has crashed.
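
With unittest.mock, for instance, we could replace a hypothetical fetch_data function with a deliberately slow fake and then exercise the calling code to see whether it handles the delay gracefully; a self-contained sketch of that idea:

import time
from unittest import mock

def fetch_data(url):
    """Stand-in for a function that calls a remote REST API."""
    raise NotImplementedError

def slow_fetch(url):
    # Simulate a badly overloaded server that only answers after 5 seconds.
    time.sleep(5)
    return {'status': 'ok'}

# Replace fetch_data in this module with the slow fake, then exercise the
# calling code to see whether it times out or handles the delay gracefully.
with mock.patch(__name__ + '.fetch_data', side_effect=slow_fetch) as fake:
    result = fetch_data('https://example.com/api/items')
    print(result, fake.call_count)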

Especially early on in the design and development of a complex distributed application, one can make overly optimistic assumptions about things such as network availability and performance or response time of services such as databases or web servers. For this reason, having the ability to either completely bring a service offline or, more subtly, modify its behavior can tell us a lot about which of the assumptions in our code might be overly optimistic.

The Netflix Chaos Monkey (https://github.com/Netflix/SimianArmy) approach of disabling random components of our system to see how our application copes with failures can be quite useful.

Summary

Writing and running small- or medium-sized distributed applications in Python is not hard. There are many high-quality frameworks that we can leverage, among them Celery, Pyro, various job schedulers, Twisted, MPI bindings, and the multiprocessing module in the standard library.

The real difficulty, however, lies in monitoring and debugging our applications, especially because a large fraction of our code runs concurrently on many different, often remote, computers.

The most insidious bugs are those that end up producing incorrect results (for example, because of data becoming corrupted along the way) rather than raising an exception, which most frameworks are able to catch and bubble up.

The monitoring and debugging tools that we can use with Python code are, sadly, not as sophisticated as the frameworks and libraries we use to develop that same code. The consequence is that large teams end up developing their own, often very specialized, distributed debugging systems from scratch, while small teams mostly rely on log messages and print statements.

More work is needed in the area of debuggers for distributed applications in general and for dynamic languages such as Python in particular.
