
This post has two parts. In the first part I introduced Ryba, its goals and how to install and start using it. Ryba bootstraps and manages a fully secured Hadoop cluster with one command. In this second part we learn why Ryba was written, how it addresses multi-tenancy, and whom it targets.

Hadoop

At its heart, Hadoop delivers a file system, HDFS, and a resource management solution called YARN. In some ways, we might consider Hadoop as an operating system running at the cluster level.

Users and applications running inside the system must securely access the data and expect a high quality of service. Over time, more and more projects join the party and usages get more complex.

Ryba

Ryba started in January 2013 for the needs of EDF, one of the largest utility companies worldwide. The IT department wanted to deploy a Hadoop cluster shared by multiple users inside the company, a so-called multi-tenant architecture. Multi-tenancy refers to a system where each user and application receives a dedicated share of the available resources. The key aspects of such systems are:

  • Security: since resources are shared, authentication, fine-grained authorization and data encryption are crucial.
  • Resource management: users expect a minimum guarantee of resource availability based on memory, CPU, disk and network usage.
  • Delivery of services: most if not all of the applications must be highly available; the system and its applications must be instrumented with metric reporting and alert notification.
  • Diversity: usages are diverse, including long-term and secure archiving, disk-intensive batch analytics and CPU-intensive real-time processing. At the same time, the constraints imposed on users shall be minimal.
  • Governance: procedures and policies emerge from usages, requirements and experience, including data lifecycle, user management, deployment of fine-grained authorization, and migration between versions.

Early 2013 was a time when Cloudera Manager was in its infancy and Ambari did not yet exist. It was also a time when Kerberos and High Availability were not yet supported. In some respects, targeting multi-tenancy with Hadoop was a bet on the near future.

To achieve such a goal, a pragmatic approach was required. We started by integrating a limited selection of projects. The decision was mainly influenced by our familiarity with each project of the ecosystem, their maturity and the early usages of the cluster. The focus was on the Hadoop core including ZooKeeper, HDFS and MapReduce, as well as traditional and often batch-oriented projects such as Hive, Pig, and Mahout. Then came the integration with Oozie, Flume and HBase.

From a multi-tenancy perspective, the year 2013 marked a major shift for Hadoop, with the integration of Kerberos to secure the cluster as well as the arrival of YARN to centralize resource management.

We strongly believed that security should be a common commodity for any Hadoop cluster.

Kerberos is enabled by default. To say the least, putting Kerberos at every level of a Hadoop cluster was not an easy task, and certainly not the kind of thing you want to be among the first to do. Before Hadoop could use Kerberos, we spent time learning how Kerberos, LDAP and the RedHat operating system with SSSD would work together. Ryba can take care of generating, managing and deploying SSL certificates. The web servers are started in HTTPS mode and on-the-wire data encryption can be enabled. Firewall rules are also documented, and they are activated if iptables is started.

Ryba: More Than Hadoop

Ryba is a deployment tool not limited to Hadoop. It also configures the OS, installs an OpenLDAP and Kerberos server, and interfaces with your enterprise environment such as an Active Directory. It runs from any host with an SSH connection to the cluster. This could be a developer laptop, a node inside the cluster, or a dedicated server. It uses the SSH transport protocol to connect to a freshly installed cluster and bootstraps it into an operating cluster.

The overall system is quite simple to get started with. From an early stage, Ryba embraced idempotence by design. If you are not familiar with the term, it refers to operations that produce the same result whether they are run once or several times. An operator may safely use the command ryba install to configure, start and check the system. A developer may change a configuration setting or write an extension in JavaScript and directly run the same command.

In this sense, Ryba is a DevOps tool. It is flexible enough to be enriched with your own customizations and new features, and quick enough to deploy fixes and enhancements within minutes.
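
To make the notion of idempotence mentioned above more concrete, here is a minimal sketch. It is not Ryba code: Ryba extensions are written in JavaScript, and Python is used here only for brevity. The function name ensure_property, the key=value file format and the key example.setting are all hypothetical and chosen purely for illustration. The idea is that an action describes a desired state, changes the system only when that state is not yet reached, and reports whether it changed anything.

```python
import os
import tempfile


def ensure_property(path: str, key: str, value: str) -> bool:
    """Ensure the file at `path` contains exactly one `key=value` line.

    Hypothetical helper, for illustration only. Returns True if the file
    had to be modified, False if it was already in the desired state.
    """
    lines = []
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            lines = handle.read().splitlines()

    desired = f"{key}={value}"
    updated = []
    found = False
    for line in lines:
        if line.split("=", 1)[0].strip() == key:
            if not found:
                updated.append(desired)  # rewrite the first definition
                found = True
            # later duplicates of the same key are dropped
        else:
            updated.append(line)
    if not found:
        updated.append(desired)

    if updated == lines:
        return False  # already in the desired state: do nothing
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(updated) + "\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        conf = os.path.join(workdir, "example.properties")
        # Running the same action twice: only the first run changes the file.
        print(ensure_property(conf, "example.setting", "3"))
        print(ensure_property(conf, "example.setting", "3"))
```

Because the second call finds the file already in the desired state, it leaves it untouched. Under this assumption, a whole installation built from such actions can be re-run after a configuration change and only the affected parts are modified.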

About this author

The author, David Worms, is the owner of Adaltas, a French company based in Paris and specializing in the deployment of secure Hadoop clusters.
