How to Run Hadoop on Google Cloud

Setting up and working with Hadoop can be difficult. Furthermore, most people with limited resources develop against Hadoop instances running in local virtual machines or on minimal hardware. The problem is that Hadoop is designed to run across many machines, and it only shows its full capabilities on a real cluster.

In this two-part series of posts (read part 1 here), we will show you how you can quickly get started with Hadoop in the cloud with Google services. In the last part of this series, we set up our Google developer account. Now it is time to install the Google Cloud SDK.

Installing the Google Cloud SDK

To work with the Google Cloud SDK, we need the 32-bit version of Cygwin. Get it here, even if you have a 64-bit processor. The reason is that the 64-bit version of Python for Windows has issues that make it incompatible with many common Python tools, so you should stick with the 32-bit version.

Next, when you install Cygwin, make sure you select Python, openssh, and curl (note that if you do not install the Cygwin version of Python, your installation will fail). To do this, when you get to the package screen, type openssh in the search bar at the top, expand the "net" category, and select the check box under "Bin" for openssh. Do the same for curl. You should see something like what is shown in Figures 1 and 2 respectively.

Figure 1: Adding openssh

Figure 2: Adding curl to Cygwin

Now go ahead and start Cygwin by going to Start -> All Programs -> Cygwin -> Cygwin Terminal.
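Before installing the SDK, it can help to confirm that the packages selected during the Cygwin install are actually on the PATH. The following is a minimal sketch, assuming the packages provide commands named python, ssh, and curl; the missing_tools helper is a hypothetical name introduced here for illustration.

```shell
# Print each of the given command names that cannot be found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the three tools the Cygwin setup was supposed to install.
missing="$(missing_tools python ssh curl)"
if [ -n "$missing" ]; then
  echo "Missing from PATH; re-run the Cygwin installer and add:"
  echo "$missing"
else
  echo "python, ssh, and curl are all available"
fi
```

If anything is reported missing, re-running the Cygwin installer and adding the package is usually easier than debugging a failed SDK install later.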

Now use curl to install the Google Cloud SDK by typing the command "curl https://sdk.cloud.google.com | bash", which downloads the Google Cloud SDK installer from the Internet and runs it.

Follow the prompts to complete the setup. When asked whether you would like to update your system path, answer "y". When the installation is complete, restart Cygwin.

After you restart Cygwin, you need to authenticate with the Google Cloud SDK. To do this, type "gcloud auth login --no-launch-browser" as in Figure 3.

Figure 3: Authenticating with Google Cloud SDK tools

The Cloud SDK will then give you a URL that you should copy and paste into your browser. You will then be asked to log in with your Google account and accept the permissions requested by the SDK, as in Figure 4.

Figure 4: Google Cloud authorization via OAuth

Google will provide you with a verification code that you can copy and paste into the command line. If everything works, you should be logged in.

Next, set your project ID for this session by using the command "gcloud config set project YOURPROJECTID" as in Figure 5.

Figure 5: Setting your project ID

Now you need to download the set of scripts that will help you set up Hadoop in Google Cloud Storage.[1] Do not close this command-line window, because we are going to use it again. Download the Big Data utilities scripts to set up Hadoop in the Cloud here. Once you have downloaded the zip file, unpack it and place it in any directory you want.

Now, in the command line, type "gsutil mb -p YOURPROJECTID gs://SOMEBUCKETNAME".

If everything goes well, you should see something like Figure 6.

Figure 6: Creating your Google Cloud Storage bucket

YOURPROJECTID is the project ID you created or were assigned earlier, and SOMEBUCKETNAME is whatever you want your bucket to be called. Bucket names must be unique across all of Google Cloud Storage (read more here), so using something like your company domain name plus some other unique identifier might be a good idea. If you do not pick a unique name, you will get an error.
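A quick local check can catch badly formed names before gsutil rejects them. The sketch below is a simplified version of the Cloud Storage naming rules (3 to 63 characters, lowercase letters, digits, dots, dashes, and underscores, starting and ending with a letter or digit); it cannot tell you whether a name is already taken. The helper name and the example bucket name are hypothetical.

```shell
# Simplified format check only: it does not cover every naming rule and
# cannot detect whether somebody else already owns the name.
is_plausible_bucket_name() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'
}

BUCKET="example-com-hadoop-01"   # hypothetical: a domain name plus a suffix

if is_plausible_bucket_name "$BUCKET"; then
  # Only prints the command; run it yourself once the name looks right.
  echo "Looks plausible; create it with: gsutil mb -p YOURPROJECTID gs://$BUCKET"
else
  echo "Not a plausible bucket name: $BUCKET" >&2
fi
```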

Now go to the directory where you stored your Big Data Utility Scripts and open bdutil_env.sh in a text editor as in Figure 7.

Figure 7: Editing the bdutil_env.sh file

Now set the CONFIGBUCKET value in the file to your bucket name and the PROJECT value to your project ID, as in Figure 8, then save the file.

Figure 8: Editing the bdutil_env.sh file
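For reference, the two edited lines in bdutil_env.sh might look like the fragment below. The values are the placeholders used throughout this post, not real names, and the sketch assumes the script expects the bare bucket name without the gs:// prefix.

```shell
# Fragment of bdutil_env.sh after editing (placeholder values from this post).
CONFIGBUCKET="SOMEBUCKETNAME"   # bucket created with gsutil mb; assumed: no gs:// prefix
PROJECT="YOURPROJECTID"         # same ID used with gcloud config set project
```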

Once you have saved the bdutil_env.sh file, you need to test that you can reach your compute instances via gcutil and ssh. Let’s walk through that now so it is set up for the future.

In Cygwin, create a test instance to play with and set up gcutil by typing the command "gcutil addinstance mytest" and pressing Enter. You will be asked to select a zone (I selected 6), a number of processors, and the like. Select whatever you want, since we will delete this instance once we have created it and connected to it. After you walk through the setup steps, Google will create your instance. During the creation, you will be asked for a passphrase. Make sure you use a passphrase you can remember.

Now, in the command line, type "gcutil ssh mytest". This will try to connect to your "mytest" instance via SSH, and if it’s the first time you have done this, you will be asked to type in a passphrase. Do not type a passphrase; just leave it blank and press Enter. This will create a public and private SSH key pair. If everything works, you should connect to the instance, and you will know gcutil ssh is working correctly.

Go ahead and type "exit" and then "gcutil deleteinstance mytest", and answer "y" to all questions. This will trigger Google Cloud to destroy your test instance.
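The round trip above can be collected into a small script. This is only a sketch: it defaults to a dry run that prints each command, the run and smoke_test helpers and the DRY_RUN variable are names made up for this example, and the gcutil commands still prompt interactively when executed for real.

```shell
INSTANCE="mytest"   # instance name used in this post

# Print the command in dry-run mode (the default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

smoke_test() {
  # Create the instance; stop here if creation fails.
  run gcutil addinstance "$INSTANCE" || return 1
  # Open an SSH session; type "exit" to leave it.
  run gcutil ssh "$INSTANCE"
  # Always try to clean up, even if the SSH step failed.
  run gcutil deleteinstance "$INSTANCE"
}

smoke_test
```

Running the script as is only lists the three commands; setting DRY_RUN=0 before calling it executes them in order.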

Now in Cygwin, navigate to where you placed the bdutil download. If you are not familiar with Cygwin, you can reach any directory on the C drive by starting the path with "/cygdrive/c" and then appending the Unix-style path to your directory. So, for example, on my computer it would look like Figure 9.

Figure 9: Navigating to the bdutil folder in Cygwin
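If you are unsure how a Windows path maps onto /cygdrive, the conversion can be sketched in a few lines of shell. Cygwin also ships the cygpath utility, whose -u option performs this conversion for you; the function below is a hypothetical illustration of the rule, and the folder name is made up.

```shell
# Convert a Windows path such as C:\Users\me\bdutil to its /cygdrive form.
# The body runs in a subshell, so its variables do not leak into the session.
to_cygdrive() (
  win_path="$1"
  # Drive letter before the colon, lowercased: "C" becomes "c".
  drive="$(printf '%s' "${win_path%%:*}" | tr '[:upper:]' '[:lower:]')"
  # Everything after the colon, with backslashes turned into forward slashes.
  rest="$(printf '%s' "${win_path#*:}" | tr '\\' '/')"
  printf '/cygdrive/%s%s\n' "$drive" "$rest"
)

to_cygdrive 'C:\Users\me\bdutil'   # prints /cygdrive/c/Users/me/bdutil
```

You can then pass the result to cd, for example cd "$(to_cygdrive 'C:\Users\me\bdutil')".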

Now we can attempt a deployment of Hadoop by typing "./bdutil deploy" as in Figure 10.

Figure 10: Deploying Hadoop

The system will now try to deploy your Hadoop instance to the Cloud. You might be prompted to create a staging directory while the script is running. Go ahead and type "y" to accept. It might take several minutes for the deployment to finish, so be patient. When it does, you should see a message saying "Deployment complete." Then check whether your cluster is up by typing "gcutil listinstances", where you will see something like what is shown in Figure 11.

Figure 11: A list of Hadoop instances running

From here, you need to test your deployment, which you do via the command "gcutil ssh --project=YOURPROJECTID hs-ghfs-nn < hadoop-validate-setup.sh" as in Figure 12.

Figure 12: Validating Hadoop deployment

If the script runs successfully, you should see output like "teragen, terasort, teravalidate passed." From there, go ahead and delete the cluster by typing "./bdutil delete". This will delete the deployed virtual machines (VMs) and associated artifacts. When it’s done, you should see the message "Done deleting VMs!"
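If you save the validation output to a file, a one-line helper can check for the success message before you tear the cluster down. This is a sketch that assumes the success line is worded exactly as quoted above; the validation_passed helper and the validate.log file name are hypothetical.

```shell
# Succeeds if the text on stdin contains the success line quoted in this post.
validation_passed() {
  grep -q 'teragen, terasort, teravalidate passed'
}

# Typical use (not executed here): capture the output, then check it.
#   gcutil ssh --project=YOURPROJECTID hs-ghfs-nn < hadoop-validate-setup.sh | tee validate.log
#   validation_passed < validate.log && echo "Validation passed; safe to run ./bdutil delete"
```

If the check fails, leaving the cluster up a little longer lets you inspect the instances before deleting them.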

Summary

In this two-part blog post series, you learned how to use the Google Cloud SDK to set up Hadoop from Windows with Cygwin. You now have Cygwin set up and configured to connect to Google Cloud, create instances, and deploy Hadoop.

About the author

Robi Sen, CSO at Department 13, is an experienced inventor, serial entrepreneur, and futurist whose dynamic twenty-plus year career in technology, engineering, and research has led him to work on cutting edge projects for DARPA, TSWG, SOCOM, RRTO, NASA, DOE, and the DOD. Robi also has extensive experience in the commercial space, including the co-creation of several successful start-up companies. He has worked with companies such as Under Armour, Sony, CISCO, IBM, and many others to help build out new products and services. Robi specializes in bringing his unique vision and thought process to difficult and complex problems, allowing companies and organizations to find innovative solutions that they can rapidly operationalize or go to market with.