Coined in a 2004 research paper from Google, the term “MapReduce” was defined as “a programming model and an associated implementation for processing and generating large data sets”. While Google’s proprietary implementation of this model is known simply as “MapReduce”, the term has since become overloaded to refer to any software that follows the same general computation model.
The philosophy behind the MapReduce computation model is a divide-and-conquer approach: the input dataset is divided into small pieces and distributed among a pool of computation nodes for processing. The advantage is that many nodes can work on a problem in parallel, which can be much quicker than a single machine, provided that the cost of communication (loading input data, relaying intermediate results between compute nodes, and writing final results to persistent storage) does not outweigh the computation time. Indeed, the cost of moving data between storage and computation nodes can erase the advantage of distributed parallel computation if task payloads are too granular. On the other hand, tasks must be granular enough to be distributed evenly across the pool (although achieving perfect distribution is nearly impossible if the exact computational cost of each task cannot be known in advance).
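The two phases of the model can be illustrated with a minimal, single-machine word-count sketch in Python. The function names here are illustrative; in a real MapReduce framework, the map and reduce calls would be distributed across many nodes, with the intermediate pairs shuffled between them:

```python
from collections import defaultdict

def map_phase(chunk):
    # Emit an intermediate (key, value) pair for each word in this piece
    # of the input. Each piece could be mapped on a different node.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Merge all values emitted for the same key into a final count.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# The input dataset, divided into small pieces.
chunks = ["the quick brown fox", "the lazy dog"]

intermediate = []
for chunk in chunks:
    intermediate.extend(map_phase(chunk))

result = reduce_phase(intermediate)
print(result["the"])  # 2
```

The shuffle step (grouping intermediate pairs by key and routing them to reducers) is where most of the communication cost described above is paid in a distributed setting.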
A distributed MapReduce solution can be effective for processing large data sets, provided that each task is sufficiently long-running to offset communication costs. If the need for data transfer between compute and storage nodes could somehow be reduced or eliminated, the task-size limitations would all but disappear, enabling a wider range of use cases. One way to achieve this is to perform the computation closer to the data, that is, on the same physical machine. Stored procedures in relational database systems, for example, are one way to achieve data-local computation. Simpler still, computation could be run directly on the system where the data is stored in files. In both cases, the same problems arise: changes to the code require direct access to the system, so access needs to be carefully controlled (through group management, read/write permissions, and so on). Scaling in these scenarios is also problematic: vertical scaling (beefing up a single server) is the only feasible way to gain performance while maintaining the efficiency of data-local computation. Achieving horizontal scalability by adding more machines is not feasible unless the storage system inherently supports it.
Enter Swift and ZeroVM
OpenStack Swift is one such horizontally scalable data store: it can hold petabytes of data in millions of objects across thousands of machines. Swift’s storage operations can also be extended with custom middleware applications, which makes it a good foundation for building a “smart” storage system capable of running computations directly on the data.
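Swift middleware is ordinary WSGI middleware wired into the proxy server’s pipeline. The sketch below shows the general shape; the class name and the X-Run-Job header are hypothetical examples, not part of Swift’s API:

```python
class ComputeMiddleware(object):
    """Intercept requests carrying a hypothetical X-Run-Job header and
    handle them directly, instead of treating them as plain object
    GET/PUT operations."""

    def __init__(self, app, conf):
        self.app = app    # the next WSGI app in Swift's pipeline
        self.conf = conf

    def __call__(self, environ, start_response):
        if environ.get('HTTP_X_RUN_JOB'):
            # A real middleware would parse the job and dispatch it here.
            start_response('200 OK', [('Content-Type', 'text/plain')])
            return [b'job accepted\n']
        # Anything else passes through to Swift untouched.
        return self.app(environ, start_response)

def filter_factory(global_conf, **local_conf):
    # Entry point referenced from the proxy server's pipeline config.
    def factory(app):
        return ComputeMiddleware(app, local_conf)
    return factory
```

Because middleware sits in front of every storage request, it can add behavior (like job execution) without modifying Swift itself.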
ZeroCloud is a middleware application for Swift that does just that. It embeds ZeroVM inside Swift and exposes a new API method for executing jobs. ZeroVM provides strong program isolation and security guarantees, so arbitrary code can be run safely without compromising the storage system. Access to data is governed by Swift’s inherent access control, so there is no need to further lock down the storage system, even in a multi-tenant environment.
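A job is described to ZeroCloud as a JSON manifest naming the executable to run and the Swift objects to attach as devices. The sketch below follows the general shape documented for ZeroCloud, but the container names, paths, and object names are illustrative assumptions:

```json
[
    {
        "name": "wordcount",
        "exec": {"path": "swift://~/container/wordcount.nexe"},
        "file_list": [
            {"device": "stdin",  "path": "swift://~/container/input.txt"},
            {"device": "stdout"}
        ]
    }
]
```

ZeroCloud locates the referenced objects within the cluster, runs the executable in a ZeroVM instance near the data, and returns the program’s output (here, whatever is written to the stdout device) in the HTTP response.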
The combination of Swift and ZeroVM results in something like stored procedures, but much more powerful. First and foremost, user application code can be updated at any time without the need to worry about malicious or otherwise destructive side effects. (The worst that can happen is that users crash their own instances and delete their own data, but never another user’s.) Second, ZeroVM instances start very quickly: in about five milliseconds. This means that to process one thousand objects, ZeroCloud can spawn one thousand ZeroVM instances (one for each object), run the job, and destroy the instances, all with very little overhead. It also means that any job, big or small, can feasibly run on the system. Finally, the networking component of ZeroVM allows instances to communicate with one another, enabling arbitrary job pipelines and multi-stage computations.
This converged compute and storage solution makes for a good alternative to popular MapReduce frameworks like Hadoop because little effort is required to set up and tear down computation nodes for a job. Also, once a job is complete, there is no need to extract result data, pipe it across the network, and then save it in a separate persistent data store (which can be an expensive operation if there is a large quantity of data to save). With ZeroCloud, the results can simply be saved back into Swift. The same principle applies to the setup of a job. There is no need to move or copy data to the computation cluster; the data is already where it needs to be.
Currently, ZeroCloud uses a fairly naïve scheduling algorithm. To perform a data-local computation on an object, ZeroCloud determines which machines in the cluster contain a replicated copy of the object and then randomly chooses one to execute a ZeroVM process. While this works well as a proof of concept for data-local computing, it is quite possible for jobs to be spread unevenly across a cluster, resulting in inefficient use of resources.
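The naive strategy described above amounts to a single uniform-random choice among an object’s replica nodes. A minimal sketch (the function and node names are illustrative, not ZeroCloud’s actual code):

```python
import random

def choose_execution_node(replica_nodes):
    """Pick one of the nodes holding a replica of the object, uniformly
    at random, with no regard for each node's current load -- the naive
    strategy."""
    if not replica_nodes:
        raise ValueError("object has no replicas")
    return random.choice(replica_nodes)

# With 3-way replication, each replica node is equally likely to be
# chosen, so a heavily loaded node is picked as often as an idle one.
nodes = ["storage-node-1", "storage-node-2", "storage-node-3"]
print(choose_execution_node(nodes))
```

Because the choice ignores load, many concurrent jobs can pile up on the same machine by chance, which is exactly the unevenness the scheduling research aims to address.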
A research group at the University of Texas at San Antonio (UTSA), sponsored by Rackspace, is currently developing better algorithms for scheduling workloads on ZeroVM-based infrastructure, and has published a short video outlining the research proposal.
Rackspace has launched a ZeroCloud playground service called Zebra for developers to try out the platform. At the time of writing, Zebra was still in a private beta and invitations were limited, but it is also possible to install and run your own copy of ZeroCloud for testing and development; basic installation instructions are available at https://github.com/zerovm/zerocloud/blob/icehouse/doc/Hacking.md. There are also tutorials (including sample applications) for creating, packaging, deploying, and executing applications on ZeroCloud: http://zerovm.readthedocs.org/en/latest/zebra/tutorial.html. The tutorials are intended for use with the Zebra service, but can be run on any deployment of ZeroCloud.
About the author
Lars Butler is a software developer at Rackspace, the open cloud company. He has worked as a software developer in avionics, seismic hazard research, and most recently, on ZeroVM. He can be reached @larsbutler.