Cloud computing has transformed the way individuals and organizations access and manage their servers and applications on the internet. Before Cloud computing, everyone managed their servers and applications on their own premises or in dedicated data centers. The increase in raw computing power (CPUs and GPUs with multiple cores on a single chip) and in storage capacity (HDD and SSD) presents challenges in efficiently utilizing the available computing resources.
In today’s tutorial, we will learn different ways of building a Hadoop cluster on the Cloud and ways to store and access data on the Cloud.
This article is an excerpt from a book written by Naresh Kumar and Prashant Shindgikar titled Modern Big Data Processing with Hadoop.
Building a Hadoop cluster in the Cloud
The Cloud offers a flexible and easy way to rent resources such as servers, storage, and networking. The pay-as-you-go model makes this very easy for consumers, but much of the complexity of the Cloud is hidden from us by the providers.
In order to better understand whether Hadoop is well suited to being on the Cloud, let’s try to dig further and see how the Cloud is organized internally.
At the core of the Cloud are the following mechanisms:
- A very large number of servers with a variety of hardware configurations
- Servers connected and made available over IP networks
- Large data centers to host these devices
- Data centers spanning geographies with evolved network and data center designs
If we pay close attention, we are talking about the following:
- A very large number of different CPU architectures
- A large number of storage devices with a variety of speeds and performance
- Networks with varying speed and interconnectivity
Let’s look at a simple design of such a data center on the Cloud:
We have the following devices in the preceding diagram:
- S1, S2: Rack switches
- U1-U6: Rack servers
- R1: Router
- Storage area network
- Network attached storage
As we can see, Cloud providers have a very large number of such architectures to make them scalable and flexible.
You would have rightly guessed that when the number of such servers increases and when we request a new server, the provider can allocate the server anywhere in the region.
This makes it harder to keep compute and storage close together, but it also provides elasticity.
In order to address this co-location problem, some Cloud providers give customers the option of creating a virtual network, reserving dedicated servers, and then allocating all of their virtual nodes on these servers. This is somewhat closer to a traditional data center design, but flexible enough to return resources when they are not needed.
Let’s get back to Hadoop and remind ourselves that in order to get the best from the Hadoop system, we should have the CPU power close to the storage. This means that the physical distance between the CPU and the storage should be small, so that the bus speeds can keep up with the processing requirements.
The slower the I/O speed between the CPU and the storage (for example, iSCSI, storage area network, network attached storage, and so on) the poorer the performance we get from the Hadoop system, as the data is being fetched over the network, kept in memory, and then fed to the CPU for further processing.
This is one of the important things to keep in mind when designing Hadoop systems on the Cloud.
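To get a feel for why this matters, the following back-of-the-envelope sketch compares how long it takes to stream the same dataset from locally attached storage and from storage reached over the network. The throughput figures are hypothetical values chosen only for illustration; real numbers depend on the provider, the disk type, and the network.

```python
# Hypothetical sustained throughputs in MB/s; these are illustrative
# assumptions, not measurements of any real Cloud offering.
LOCAL_DISK_MB_PER_S = 512.0
NETWORK_STORAGE_MB_PER_S = 128.0


def read_seconds(data_gb: float, throughput_mb_per_s: float) -> float:
    """Seconds needed to stream data_gb gigabytes (1 GB = 1,024 MB)."""
    return data_gb * 1024 / throughput_mb_per_s


def network_slowdown(data_gb: float) -> float:
    """How many times longer the network read takes than the local read."""
    local = read_seconds(data_gb, LOCAL_DISK_MB_PER_S)
    remote = read_seconds(data_gb, NETWORK_STORAGE_MB_PER_S)
    return remote / local


if __name__ == "__main__":
    for size_gb in (10, 100, 1024):
        print(
            size_gb, "GB:",
            read_seconds(size_gb, LOCAL_DISK_MB_PER_S), "s local,",
            read_seconds(size_gb, NETWORK_STORAGE_MB_PER_S), "s over the network",
        )
```

With these assumed figures, every scan of the data takes four times longer over the network, and a Hadoop job that scans its input repeatedly pays that penalty on each pass.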
Apart from performance reasons, there are other things to consider:
- Scaling Hadoop
- Managing Hadoop
- Securing Hadoop
Now, let’s try to understand how we can take care of these in the Cloud environment.
When we want to deploy Hadoop on the Cloud, we can use the following methods:
- Custom shell scripts
- Cloud automation tools (Chef, Ansible, and so on)
- Apache Ambari
- Cloud vendor provided methods
  - Google Cloud Dataproc
  - Amazon EMR
  - Microsoft HDInsight
- Third-party managed Hadoop
- Cloud agnostic deployment
  - Apache Whirr
Google Cloud Dataproc
In this section, we will learn how to use Google Cloud Dataproc to set up a single node Hadoop cluster.
The steps can be broken down into the following:
- Getting a Google Cloud account.
- Activating Google Cloud Dataproc service.
- Creating a new Hadoop cluster.
- Logging in to the Hadoop cluster.
- Deleting the Hadoop cluster.
Getting a Google Cloud account
This section assumes that you already have a Google Cloud account.
Activating the Google Cloud Dataproc service
Once you log in to the Google Cloud console, you need to visit the Cloud Dataproc service. The activation screen looks something like this:
Creating a new Hadoop cluster
Once Dataproc is enabled in the project, we can click on Create to create a new Hadoop cluster.
After this, we see another screen where we need to configure the cluster parameters:
I have left most of the settings at their default values. We can then click on the Create button, which creates a new cluster for us.
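The same cluster can also be created from a terminal rather than the console. The sketch below only builds the command line for the gcloud CLI; it assumes the Google Cloud SDK is installed and authenticated, and the cluster name and region are placeholder values that are not part of the original walkthrough.

```python
import shlex


def dataproc_create_command(cluster_name: str, region: str) -> list:
    """Build the gcloud argv that creates a single node Dataproc cluster."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        "--single-node",  # one VM acts as both master and worker
    ]


# "demo-cluster" and "us-central1" are placeholder values.
command = dataproc_create_command("demo-cluster", "us-central1")
print(shlex.join(command))

# To actually create the cluster, pass the list to subprocess:
#   import subprocess
#   subprocess.run(command, check=True)
```

Keeping the command as a list of arguments, rather than one string, avoids shell quoting problems when it is later handed to subprocess.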
Logging in to the cluster
After the cluster has been created successfully, we are automatically taken to the cluster list page. From there, we can launch an SSH window to log in to the single node cluster we have created.
The SSH window looks something like this:
As you can see, the Hadoop command is readily available for us and we can run any of the standard Hadoop commands to interact with the system.
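If you prefer not to use the browser SSH window, a Hadoop command can also be run on the cluster from your own machine. The following sketch builds a gcloud command for that purpose. It assumes that the master VM is named after the cluster with an "-m" suffix, and the cluster name and zone are placeholders.

```python
import shlex


def remote_hadoop_command(cluster_name: str, zone: str, hadoop_args: list) -> list:
    """Build a gcloud argv that runs `hadoop <args>` on the cluster's master VM."""
    master_vm = f"{cluster_name}-m"  # assumed Dataproc naming for the master node
    remote = shlex.join(["hadoop", *hadoop_args])
    return [
        "gcloud", "compute", "ssh", master_vm,
        f"--zone={zone}",
        f"--command={remote}",
    ]


# Placeholder cluster name and zone; lists the root of the cluster's file system.
print(remote_hadoop_command("demo-cluster", "us-central1-a", ["fs", "-ls", "/"]))
```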
Deleting the cluster
In order to delete the cluster, click on the DELETE button. A confirmation window is displayed, as shown in the following screenshot; once we confirm, the cluster is deleted:
Looks so simple, right? Yes. Cloud providers have made it very simple for users to use the Cloud and pay only for the usage.
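Since we pay for as long as the cluster exists, it is worth scripting the clean-up as well. The sketch below builds the gcloud command that deletes a cluster. As before, it assumes an installed and authenticated Google Cloud SDK, and the names in the usage line are placeholders.

```python
def dataproc_delete_command(cluster_name: str, region: str, confirm: bool = True) -> list:
    """Build the gcloud argv that deletes a Dataproc cluster."""
    command = [
        "gcloud", "dataproc", "clusters", "delete", cluster_name,
        f"--region={region}",
    ]
    if not confirm:
        command.append("--quiet")  # skip the interactive confirmation prompt
    return command


# Placeholder values; keeps the confirmation prompt, like the console does.
print(dataproc_delete_command("demo-cluster", "us-central1"))
```

Leaving the confirmation prompt on by default mirrors the console behavior; the prompt is skipped only when a script explicitly asks for it.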
Data access in the Cloud
The Cloud has become an important destination for storing both personal data and business data. Depending upon the importance and the secrecy requirements of the data, organizations have started using the Cloud to store their vital datasets.
The following diagram tries to summarize the various access patterns of typical enterprises and how they leverage the Cloud to store their data:
Cloud providers offer different varieties of storage. Let’s take a look at what these types are:
- Block storage
- File-based storage
- Encrypted storage
- Offline storage
Block storage
This type of storage is primarily useful when we want to use it along with our compute servers and manage the storage via the host operating system.
To understand this better, think of this type of storage as equivalent to the hard disk or SSD that comes with a laptop when we purchase it. In the case of laptop storage, if we decide to increase the capacity, we need to replace the existing disk with another one.
When it comes to the Cloud, if we want to add more capacity, we can just purchase another, larger-capacity volume and attach it to our server. This is one of the reasons why the Cloud has become popular: it has made it very easy to grow or shrink the storage that we need.
It’s good to remember that, since our applications have many different access patterns, Cloud vendors also offer block storage at different performance levels, each with its own capacity and IOPS characteristics.
Let’s take an example of this capacity upgrade requirement and see what we do to utilize this block storage on the Cloud.
In order to understand this, let’s look at the example in this diagram:
Imagine a server created by the administrator called DB1 with an original capacity of 100 GB. Later, due to unexpected demand from customers, an application started consuming all the 100 GB of storage, so the administrator has decided to increase the capacity to 1 TB (1,024 GB).
This is what the workflow looks like in this scenario:
- Create a new 1 TB disk on the Cloud
- Attach the disk to the server and mount it
- Take a backup of the database
- Copy the data from the existing disk to the new disk
- Start the database
- Verify the database
- Destroy the data on the old disk and return the disk
This workflow is simplified; in production, it might take some time, depending upon the type of maintenance being performed by the administrator. From the Cloud perspective, however, acquiring new block storage is very quick.
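The copy and verify steps of the workflow above can be sketched as follows. This is a minimal illustration, not a production migration tool: it assumes that the database has been stopped, that both disks are already mounted, and that the new disk is empty. The function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def tree_digest(root: Path) -> str:
    """Hash the relative path and contents of every file under root."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(path.read_bytes())  # fine for a sketch; stream large files
    return digest.hexdigest()


def migrate(old_mount: Path, new_mount: Path) -> bool:
    """Copy everything from old_mount to new_mount, then verify the copy.

    Returns True only when both trees hash identically, so the new disk
    must not contain other files before the copy starts.
    """
    shutil.copytree(old_mount, new_mount, dirs_exist_ok=True)
    return tree_digest(old_mount) == tree_digest(new_mount)
```

Only once migrate returns True would it be reasonable to start the database on the new disk and destroy the data on the old one.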
File-based storage
Files are the basics of computing. If you are familiar with UNIX/Linux environments, you already know that everything is a file in the Unix world. But don’t get confused by that, as every operating system has its own way of dealing with hardware resources. In this case, we are not concerned with how the operating system deals with hardware resources; we are talking about the important documents that users store as part of their day-to-day business.
These files can be:
- Movie/conference recordings
- Excel sheets
- Word documents
Even though they are simple-looking files on our computers, they can have significant business importance and should be handled carefully when we think of storing them on the Cloud.
Most Cloud providers offer an easy way to store these simple files on the Cloud and also offer flexibility in terms of security.
A typical workflow for acquiring storage of this form looks like this:
- Create a new storage bucket that’s uniquely identified
- Add private/public visibility to this bucket
- Add multi-geography replication requirement to the data that is stored in this bucket
Some Cloud providers bill their customers based on the number of features they select as part of their bucket creation.
Please choose a hard-to-discover name for buckets that contain confidential data, and also make them private.
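The bucket workflow and the naming advice above can be captured in a small, vendor-neutral sketch. The BucketRequest class and the private_bucket helper are hypothetical names used only for illustration; a real deployment would pass the same three choices (name, visibility, replication) to the provider's own API.

```python
import secrets
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BucketRequest:
    name: str                              # must be unique
    public: bool = False                   # private unless stated otherwise
    replica_regions: Tuple[str, ...] = ()  # multi-geography replication targets


def private_bucket(prefix: str, replica_regions: Tuple[str, ...] = ()) -> BucketRequest:
    """Describe a private bucket whose name ends in a random, hard-to-guess suffix."""
    suffix = secrets.token_hex(8)  # 16 random hexadecimal characters
    return BucketRequest(
        name=f"{prefix}-{suffix}",
        public=False,
        replica_regions=replica_regions,
    )


# "reports", "us", and "eu" are placeholder values.
print(private_bucket("reports", ("us", "eu")))
```

A random suffix only makes the name harder to discover; keeping the bucket private is what actually protects the data.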
Encrypted storage
Encryption is a very important requirement for business-critical data, as we do not want the information to be leaked outside the organization. Cloud providers offer an encryption-at-rest facility for this. Some vendors do this automatically, and some also let us choose the encryption keys and the methodology for encrypting and decrypting the data that we own. Depending upon the organization’s policy, we should follow best practices in dealing with this on the Cloud.
With the increase in the performance of storage devices, encryption does not add significant overhead while decrypting/encrypting files. This is depicted in the following image:
Continuing the same example as before, when we choose to encrypt the underlying block storage of 1 TB, we can leverage the Cloud-offered encryption, where the provider automatically encrypts and decrypts the data for us. So, we do not have to employ special software on the host operating system to do the encryption and decryption.
Remember that encryption can be a feature that’s available in both the block storage and file-based storage offerings from the vendor.
Offline storage
This storage is very useful for storing important backups in the Cloud that are rarely accessed. Since we are dealing with a special type of data here, we should also be aware that the Cloud vendor might charge significantly more for data access from this storage, as it’s meant to be written once and forgotten (until it’s needed). The advantage of this mechanism is that we pay much less to store even petabytes of data.
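The trade-off between cheap storage and expensive access for offline storage can be made concrete with a little arithmetic. The prices below are hypothetical and exist only to show the shape of the calculation; actual pricing varies by vendor and changes over time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    store_cents_per_gb: float     # monthly storage price
    retrieve_cents_per_gb: float  # price for reading data back


# Hypothetical prices for illustration only.
STANDARD = Tier(store_cents_per_gb=2.0, retrieve_cents_per_gb=0.0)
OFFLINE = Tier(store_cents_per_gb=0.5, retrieve_cents_per_gb=5.0)


def monthly_cents(tier: Tier, stored_gb: float, retrieved_gb: float) -> float:
    """Monthly bill in cents for a given storage and retrieval volume."""
    return (stored_gb * tier.store_cents_per_gb
            + retrieved_gb * tier.retrieve_cents_per_gb)


def break_even_retrieval_gb(stored_gb: float) -> float:
    """Monthly retrieval volume at which both tiers cost the same."""
    saving = (STANDARD.store_cents_per_gb - OFFLINE.store_cents_per_gb) * stored_gb
    extra = OFFLINE.retrieve_cents_per_gb - STANDARD.retrieve_cents_per_gb
    return saving / extra
```

Under these assumed prices, the offline tier is cheaper for 1,000 GB of backups as long as less than 300 GB is read back per month, which fits the write-once, rarely-read pattern this storage is meant for.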
We looked at the different steps involved in building our own Hadoop cluster on the Cloud. And we saw different ways of storing and accessing our data on the Cloud.
To learn more about how to build expert Big Data systems, check out the book Modern Big Data Processing with Hadoop.