This post has two parts. In the first part, I introduced Ryba, its goals, and how to install and start using it. Ryba bootstraps and manages a fully secured Hadoop cluster with one command. In this second part, we learn why Ryba was written, how it addresses multi-tenancy, and how it targets its users.
At its heart, Hadoop delivers a file system, HDFS, and a resource management solution called YARN. In some ways, we might consider Hadoop as an operating system running at the cluster level.
Users and applications running inside the system must securely access the data and expect a high quality of service. Over time, more and more projects join the party and usages get more complex.
Ryba started in January 2013 to meet the needs of EDF, one of the largest utility companies worldwide. The IT department wanted to deploy a Hadoop cluster shared by multiple users inside the company, a so-called multi-tenant architecture. Multi-tenancy refers to a system where each user and application receives a dedicated share of the available resources. The key aspects of those systems are:
Early 2013 was a time when Cloudera Manager was in its infancy and Ambari did not yet exist. It was also a time when Kerberos and High Availability were not yet supported. In some respects, targeting multi-tenancy with Hadoop was a bet on the near future.
To achieve such a goal, a pragmatic approach was required. Only a limited selection of projects was integrated at first. The decision was mainly influenced by our familiarity with each project of the ecosystem, their maturity and the early usages of the cluster. The focus was on the Hadoop core including ZooKeeper, HDFS and MapReduce, as well as traditional and often batch-oriented projects such as Hive, Pig, and Mahout. Then came the integration with Oozie, Flume and HBase.
For multi-tenancy in Hadoop, the year 2013 marked a major shift with the integration of Kerberos to secure the cluster as well as the arrival of YARN to centralize resource management.
We strongly believed that security should be a common commodity for any Hadoop cluster.
Kerberos is enabled by default. To say the least, putting Kerberos at every level of a Hadoop cluster was not an easy task, and certainly not the kind of thing you want to be among the first to do. Before Hadoop could use Kerberos, we spent time learning how Kerberos, LDAP and the Red Hat operating system with SSSD work together. Ryba can also take care of generating, managing and deploying SSL certificates. The web servers are started in HTTPS mode and on-the-wire data encryption can be enabled. Firewall rules are also documented and activated if iptables is started.
Ryba is a deployment tool not limited to Hadoop. It also configures the OS, installs an OpenLDAP and Kerberos server, and interfaces with your enterprise environment such as an Active Directory. It runs from any host with an SSH connection to the cluster. This could be a developer laptop, a node inside the cluster, or a dedicated server. It uses the SSH transport protocol to connect to a freshly installed cluster and bootstraps it into an operating cluster.
The overall system is quite simple to get started with. From an early stage, Ryba has embraced idempotence by design. If you are not familiar with the term, it refers to operations that produce the same result whether they are run once or many times. An operator may safely use the command ryba install to configure, start and check the system. A developer may change a configuration setting or write an extension in JavaScript and directly run the same command.
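To make the idea of idempotence concrete, here is a minimal sketch in JavaScript. It is not Ryba's actual API: the helpers ensureProperty and applyDesiredState are hypothetical names introduced for illustration, and the configuration is a plain in-memory object rather than a real Hadoop file. The point is only that each step first checks the current state and reports a change only when it had to act, so a second run does nothing.

```javascript
// Hypothetical helper (not part of Ryba): make sure `key` holds `value`
// in `config` and report whether a change was needed.
function ensureProperty(config, key, value) {
  if (config[key] === value) {
    return false; // already in the desired state, nothing to do
  }
  config[key] = value;
  return true;
}

// Hypothetical helper: apply every desired property and
// return the number of properties that actually changed.
function applyDesiredState(config, desired) {
  let changes = 0;
  for (const [key, value] of Object.entries(desired)) {
    if (ensureProperty(config, key, value)) {
      changes += 1;
    }
  }
  return changes;
}

// Current state of an imaginary node, for illustration only.
const config = { 'dfs.replication': '2' };

// Desired state, using two well-known Hadoop property names as examples.
const desired = {
  'dfs.replication': '3',
  'hadoop.security.authentication': 'kerberos',
};

const firstRun = applyDesiredState(config, desired);  // modifies the config
const secondRun = applyDesiredState(config, desired); // finds nothing to change

console.log(`first run: ${firstRun} change(s), second run: ${secondRun} change(s)`);
```

Running the same function twice leaves the configuration in the same final state, which is the property that makes re-running an install command safe after a partial failure or a small configuration tweak.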
In this sense, Ryba is a DevOps tool. It is flexible enough to be enriched with your own customizations and new features, and quick enough to deploy fixes and enhancements within minutes.
The author, David Worms, is the owner of Adaltas, a French company based in Paris and specialized in the deployment of secure Hadoop clusters.