In this article by Dipa Dubhasi and Akhil Das authors of the book Mastering Mesos, delves into understanding the importance of Mesos.
Apache Mesos is an open source, distributed cluster management software that came out of AMPLab, UC Berkeley in 2011. It abstracts CPU, memory, storage, and other computer resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively. It is referred to as a metascheduler (scheduler of schedulers) and a “distributed systems kernel/distributed datacenter OS”.
It improves resource utilization, simplifies system administration, and supports a wide variety of distributed applications that can be deployed by leveraging its pluggable architecture. It is scalable and efficient and provides a host of features, such as resource isolation and high availability, which, along with a strong and vibrant open source community, makes this one of the most exciting projects.
(For more resources related to this topic, see here.)
Introduction to the datacenter OS and architecture of Mesos
Over the past decade, datacenters have graduated from packing multiple applications into a single server box to having large datacenters that aggregate thousands of servers to serve as a massively distributed computing infrastructure. With the advent of virtualization, microservices, cluster computing, and hyper-scale infrastructure, the need of the hour is the creation of an application-centric enterprise that follows a software-defined datacenter strategy.
Currently, server clusters are predominantly managed individually, which can be likened to having multiple operating systems on the PC, one each for processor, disk drive, and so on. With an abstraction model that treats these machines as individual entities being managed in isolation, the ability of the datacenter to effectively build and run distributed applications is greatly reduced.
Another way of looking at the situation is comparing running applications in a datacenter to running them on a laptop. One major difference is that while launching a text editor or web browser, we are not required to check which memory modules are free and choose ones that suit our need. Herein lies the significance of a platform that acts like a host operating system and allows multiple users to run multiple applications simultaneously by utilizing a shared set of resources.
Datacenters now run varied distributed application workloads, such as Spark, Hadoop, and so on, and need the capability to intelligently match resources and applications. The datacenter ecosystem today has to be equipped to manage and monitor resources and efficiently distribute workloads across a unified pool of resources with the agility and ease to cater to a diverse user base (noninfrastructure teams included). A datacenter OS brings to the table a comprehensive and sustainable approach to resource management and monitoring. This not only reduces the cost of ownership but also allows a flexible handling of resource requirements in a manner that isolated datacenter infrastructure cannot support.
The idea behind a datacenter OS is that of an intelligent software that sits above all the hardware in a datacenter and ensures efficient and dynamic resource sharing. Added to this is the capability to constantly monitor resource usage and improve workload and infrastructure management in a seamless way that is not tied to specific application requirements. In its absence, we have a scenario with silos in a datacenter that force developers to build software catering to machine-specific characteristics and make the moving and resizing of applications a highly cumbersome procedure.
The datacenter OS acts as a software layer that aggregates all servers in a datacenter into one giant supercomputer to deliver the benefits of multilatency, isolation, and resource control across all microservice applications. Another major advantage is the elimination of human-induced error during the continual assigning and reassigning of virtual resources.
From a developer’s perspective, this will allow them to easily and safely build distributed applications without restricting them to a bunch of specialized tools, each catering to a specific set of requirements. For instance, let’s consider the case of Data Science teams who develop analytic applications that are highly resource intensive. An operating system that can simplify how the resources are accessed, shared, and distributed successfully alleviates their concern about reallocating hardware every time the workloads change.
Of key importance is the relevance of the datacenter OS to DevOps, primarily a software development approach that emphasizes automation, integration, collaboration, and communication between traditional software developers and other IT professionals. With a datacenter OS that effectively transforms individual servers into a pool of resources, DevOps teams can focus on accelerating development and not continuously worry about infrastructure issues.
In a world where distributed computing becomes the norm, the datacenter OS is a boon in disguise. With freedom from manually configuring and maintaining individual machines and applications, system engineers need not configure specific machines for specific applications as all applications would be capable of running on any available resources from any machine, even if there are other applications already running on them. Using a datacenter OS results in centralized control and smart utilization of resources that eliminate hardware and software silos to ensure greater accessibility and usability even for noninfrastructural professionals.
Examples of some organizations administering their hyperscale datacenters via the datacenter OS are Google with the Borg (and next geneneration Omega) systems. The merits of the datacenter OS are undeniable, with benefits ranging from the scalability of computing resources and flexibility to support data sharing across applications to saving team effort, time, and money while launching and managing interoperable cluster applications.
It is this vision of transforming the datacenter into a single supercomputer that Apache Mesos seeks to achieve. Born out of a Berkeley AMPLab research paper in 2011, it has since come a long way with a number of leading companies, such as Apple, Twitter, Netflix, and AirBnB among others, using it in production. Mesosphere is a start-up that is developing a distributed OS product with Mesos at its core.
The architecture of Mesos
Mesos is an open-source platform for sharing clusters of commodity servers between different distributed applications (or frameworks), such as Hadoop, Spark, and Kafka among others. The idea is to act as a centralized cluster manager by pooling together all the physical resources of the cluster and making it available as a single reservoir of highly available resources for all the different frameworks to utilize. For example, if an organization has one 10-node cluster (16 CPUs and 64 GB RAM) and another 5-node cluster (4 CPUs and 16 GB RAM), then Mesos can be leveraged to pool them into one virtual cluster of 720 GB RAM and 180 CPUs, where multiple distributed applications can be run. Sharing resources in this fashion greatly improves cluster utilization and eliminates the need for an expensive data replication process per-framework.
Some of the important features of Mesos are:
- Scalability: It can elastically scale to over 50,000 nodes
- Resource isolation: This is achieved through Linux/Docker containers
- Efficiency: This is achieved through CPU and memory-aware resource scheduling across multiple frameworks
- High availability: This is through Apache ZooKeeper
- Interface: A web UI for monitoring the cluster state
Mesos is based on the same principles as the Linux kernel and aims to provide a highly available, scalable, and fault-tolerant base for enabling various frameworks to share cluster resources effectively and in isolation. Distributed applications are varied and continuously evolving, a fact that leads Mesos’ design philosophy towards a thin interface that allows an efficient resource allocation between different frameworks and delegates the task of scheduling and job execution to the frameworks themselves. The two advantages of doing so are:
Different frame data replication works can independently devise methods to address their data locality, fault-tolerance, and other such needs.
- It simplifies the Mesos codebase and allows it to be scalable, flexible, robust, and agile
Mesos’ architecture hands over the responsibility of scheduling tasks to the respective frameworks by employing a resource offer abstraction that packages a set of resources and makes offers to each framework. The Mesos master node decides the quantity of resources to offer each framework, while each framework decides which resource offers to accept and which tasks to execute on these accepted resources. This method of resource allocation is shown to achieve good degree of data locality for each framework sharing the same cluster.
An alternative architecture would implement a global scheduler that took framework requirements, organizational priorities, and resource availability as inputs and provided a task schedule breakdown by framework and resource as output, essentially acting as a matchmaker for jobs and resources with priorities acting as constraints. The challenges with this architecture, such as developing a robust API that could capture all the varied requirements of different frameworks, anticipating new frameworks, and solving a complex scheduling problem for millions of jobs, made the former approach a much more attractive option for the creators.
Thus in this article, we introduced Mesos, and then dived deep into its architecture to understand importance of Mesos.
Resources for Article:
- Understanding Mesos Internals [article]
- Leveraging Python in the World of Big Data [article]
- Self-service Business Intelligence, Creating Value from Data [article]