4 min read

Big data was once one of the biggest technology hypes, where tons of presentations and posts talked about how the new systems and tools allows large and complex data to be processed that traditional tools wasn’t able to. While Big data was at the peak of its hype, most companies were still getting familiar with the new data processing frameworks such as Hadoop, and new databases such as HBase and Cassandra. Fast foward to now where Big data is still a popular topic, and lots of companies has already jumped into the Big data bandwagon and are already moving past the first generation Hadoop to evaluate newer tools such as Spark and newer databases such as Firebase, NuoDB or Memsql. But most companies also learn from running all of these tools, that deploying, operating and planning capacity for these tools is very hard and complicated. Although over time lots of these tools have become more mature, they are still usually running in their own independent clusters. It’s also not rare to find multiple clusters of Hadoop in the same company since multi-tenant isn’t built in to many of these tools, and you run the risk of overloading the cluster by a few non-critical big data jobs.

Problems running indepdent Big data clusters

There are a lot of problems when you run a lot of these independent clusters. One of them is monitoring and visibility, where all of these clusters have their own management tools and to integrate the company’s shared monitoring and management tools is a huge challenge especially when onboarding yet another framework with another cluster. Another problem is multi-tenancy. Although having independent clusters solves the problem, another org’s job can overtake the whole cluster. It still doesn’t solve the problem when a bug in the Hadoop application just uses all the available resources and the pain of debugging this is horrific. A another problem is utilization, where a cluster is usually not 100% being utilized and all of these instances running in Amazon or in your datacenter are just racking up bills for doing no work. There are more major pain points that I don’t have time to get into.

Hadoop v2

The Hadoop developers and operators saw this problem, and in the 2nd generation of Hadoop they developed a separate resource management tool called YARN to have a single management framework that manages all of the resources in the cluster from Hadoop, enforce the resource limitations of the jobs, integrate security in the workload, and even optimize the workload by placing jobs closer to the data automatically. This solves a huge problem when operating a Hadoop cluster, and also consolidates all of the Hadoop clusters into one cluster since it allows a finer grain control over the workload and saves effiency of the cluster.

Beyond Hadoop

Now with the vast amount of Big data technologies that are growing in the ecosystem, there is a need to integrate a common resource management layer among all of the tools since without a single resource management system across all the frameworks we run back into the same problems as we mentioned before. Also when all these frameworks are running under one resource management platform, a lot of options for optimizations and resource scheduling are now possible.

Here are some examples what could be possible with one resource management platform: With one resource management platform the platform can understand all of the cluster workload and available resources and can auto resize and scale up and down based on worklaods across all these tools. It can also resize jobs according to priority.

The cluster is able to detect under utilization from other jobs and offer the slack resources to Spark batch jobs while not impacting your very important workloads from other frameworks, and maintain the same business deadlines and save a lot more cost.

In the next post I’ll continue to cover Mesos, which is one such resource management system and how the upcoming features in Mesos allows optimizations I mentioned to be possible.

For more Big Data tutorials and analysis, visit our dedicated Hadoop and Spark pages.

About the author

Timothy Chen is a distributed systems engineer and entrepreneur. He works at Mesosphere and can be found on Github @tnachen.

LEAVE A REPLY

Please enter your comment!
Please enter your name here