AWS machine learning: Learning AWS CLI to execute a simple Amazon ML workflow [Tutorial]

Using the AWS web interface to manage and run your projects is time-consuming. We will, therefore, start running our projects via the command line with the AWS Command Line Interface (AWS CLI). With just one tool to download and configure, multiple AWS services can be controlled from the command line and they can be automated through scripts.

The code files for this article are available on Github.

This article is an excerpt from a book written by Alexis Perrier titled Effective Amazon Machine Learning.

Getting started and setting up

Creating a performing predictive model from raw data requires many trials and errors, much back and forth. Creating new features, cleaning up data, and trying out new parameters for the model are needed to ensure the robustness of the model. There is a constant back and forth between the data, the models, and the evaluations. Scripting this workflow either via the AWS CLI will give us the ability to speed up the create, test, select loop.

Installing AWS CLI

In order to set up your CLI credentials, you need your access key ID and your secret access key. You can simply create them from the IAM console (https://console.aws.amazon.com/iam).

Navigate to Users, select your IAM user name and click on the Security credentials tab. Choose Create Access Key and download the CSV file. Store the keys in a secure location. We will need the key in a few minutes to set up AWS CLI. But first, we need to install AWS CLI.

Docker environment – This tutorial will help you use the AWS CLI within a docker container: https://blog.flowlog-stats.com/2016/05/03/aws-cli-in-a-docker-container/. A docker image for running the AWS CLI is available at https://hub.docker.com/r/fstab/aws-cli/.

There is no need to rewrite the AWS documentation on how to install the AWS CLI. It is complete and up to date, and available at http://docs.aws.amazon.com/cli/latest/userguide/installing.html. In a nutshell, installing the CLI requires you to have Python and pip already installed.

Then, run the following:

$ pip install --upgrade --user awscli

Add AWS to your $PATH:

$ export PATH=~/.local/bin:$PATH

Reload the bash configuration file (this is for OSX):

$ source ~/.bash_profile

Check that everything works with the following command:

$ aws --version

You should see something similar to the following output:

$ aws-cli/1.11.47 Python/3.5.2 Darwin/15.6.0 botocore/1.5.10

Once installed, we need to configure the AWS CLI type:

$ aws configure

Now input the access keys you just created:

$ aws configure

AWS Access Key ID [None]: ABCDEF_THISISANEXAMPLE
AWS Secret Access Key [None]: abcdefghijk_THISISANEXAMPLE
Default region name [None]: us-west-2
Default output format [None]: json

Choose the region that is closest to you and the format you prefer (JSON, text, or table). JSON is the default format.

The AWS configure command creates two files: a config file and a credential file. On OSX, the files are ~/.aws/config and ~/.aws/credentials. You can directly edit these files to change your access or configuration. You will need to create different profiles if you need to access multiple AWS accounts. You can do so via the AWS configure command:

$ aws configure --profile user2

You can also do so directly in the config and credential files:

~/.aws/config
[default]
output = json
region = us-east-1

[profile user2]
output = text
region = us-west-2

You can edit Credential file as follows:

~/.aws/credentials
[default]
aws_secret_access_key = ABCDEF_THISISANEXAMPLE
aws_access_key_id = abcdefghijk_THISISANEXAMPLE

[user2]
aws_access_key_id = ABCDEF_ANOTHERKEY
aws_secret_access_key = abcdefghijk_ANOTHERKEY

Refer to the AWS CLI setup page for more in-depth information:
http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html

Picking up CLI syntax

The overall format of any AWS CLI command is as follows:

$ aws <service> [options] <command> <subcommand> [parameters]

Here the terms are stated as:

<service>: Is the name of the service you are managing: S3, machine learning, and EC2

[options] : Allows you to set the region, the profile, and the output of the command

<command> <subcommand>: Is the actual command you want to execute

[parameters] : Are the parameters for these commands

A simple example will help you understand the syntax better. To list the content of an S3 bucket named aml.packt, the command is as follows:

$ aws s3 ls aml.packt

Here, s3 is the service, ls is the command, and aml.packt is the parameter. The aws help command will output a list of all available services.

There are many more examples and explanations on the AWS documentation available at
http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-using.html.

Passing parameters using JSON files

For some services and commands, the list of parameters can become long and difficult to check and maintain.

For instance, in order to create an Amazon ML model via the CLI, you need to specify at least seven different elements: the Model ID, name, type, the model's parameters, the ID of the training data source, and the recipe name and URI (aws machinelearning create-ml-model help ).

When possible, we will use the CLI ability to read parameters from a JSON file instead of specifying them in the command line. AWS CLI also offers a way to generate a JSON template, which you can then use with the right parameters. To generate that JSON parameter file model (the JSON skeleton), simply add --generate-cli-skeleton after the command name. For instance, to generate the JSON skeleton for the create model command of the machine learning service, write the following:

$ aws machinelearning create-ml-model --generate-cli-skeleton

This will give the following output:

{
   "MLModelId": "",
   "MLModelName": "",
   "MLModelType": "",
   "Parameters": {
       "KeyName": ""
   },
   "TrainingDataSourceId": "",
   "Recipe": "",
   "RecipeUri": ""
}

You can then configure this to your liking.

To have the skeleton command generate a JSON file and not simply output the skeleton in the terminal, add > filename.json:

$ aws machinelearning create-ml-model --generate-cli-skeleton > filename.json

This will create a filename.json file with the JSON template. Once all the required parameters are specified, you create the model with the command (assuming the filename.json is in the current folder):

$ aws machinelearning create-ml-model file://filename.json

Before we dive further into the machine learning workflow via the CLI, we need to introduce the dataset we will be using in this chapter.

Introducing the Ames Housing dataset

We will use the Ames Housing dataset that was compiled by Dean De Cock for use in data science education. It is a great alternative to the popular but older Boston Housing dataset. The Ames Housing dataset is used in the Advanced Regression Techniques challenge on the Kaggle website: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/. The original version of the dataset is available: http://www.amstat.org/publications/jse/v19n3/decock/AmesHousing.xls and in the GitHub repository for this chapter.

For more information on the genesis of this dataset and an in-depth explanation of the different variables, read the paper by Dean De Cock available in PDF at https://ww2.amstat.org/publications/jse/v19n3/decock.pdf.

We will start by splitting the dataset into a train and a validate set and build a model on the train set. Both train and validate sets are available in the GitHub repository as ames_housing_training.csv and ames_housing_validate.csv. The entire dataset is in the ames_housing.csv file.

Splitting the dataset with shell commands

We will use shell commands to shuffle, split, and create training and validation subsets of the Ames Housing dataset:

First, extract the first line into a separate file, ames_housing_header.csv and remove it from the original file:

        $ head -n 1 ames_housing.csv > ames_housing_header.csv

We just tail all the lines after the first one into the same file:

        $ tail -n +2 ames_housing.csv > ames_housing_nohead.csv

Then randomly sort the rows into a temporary file. (gshuf is the OSX equivalent of the Linux shuf shell command. It can be installed via brew install coreutils):

        $ gshuf ames_housing_nohead.csv -o ames_housing_nohead.csv

Extract the first 2,050 rows as the training file and the last 880 rows as the validation file:

        $ head -n 2050 ames_housing_nohead.csv > ames_housing_training.csv
    
        $ tail -n 880 ames_housing_nohead.csv > ames_housing_validate.csv

Finally, add back the header into both training and validation files:

        $ cat ames_housing_header.csv ames_housing_training.csv > tmp.csv 
        $ mv tmp.csv ames_housing_training.csv

$ cat ames_housing_header.csv ames_housing_validate.csv > tmp.csv
$ mv tmp.csv ames_housing_validate.csv

A simple project using AWS CLI

We are now ready to execute a simple Amazon ML workflow using the CLI. This includes the following:

Uploading files on S3

Creating a datasource and the recipe

Creating a model

Creating an evaluation

Prediction batch and real time

Let's start by uploading the training and validation files to S3. In the following lines, replace the bucket name aml.packt with your own bucket name.

To upload the files to the S3 location s3://aml.packt/data/ch8/, run the following command lines:

$ aws s3 cp ./ames_housing_training.csv s3://aml.packt/data/ch8/
upload: ./ames_housing_training.csv to s3://aml.packt/data/ch8/ames_housing_training.csv

$ aws s3 cp ./ames_housing_validate.csv s3://aml.packt/data/ch8/
upload: ./ames_housing_validate.csv to s3://aml.packt/data/ch8/ames_housing_validate.csv

An overview of Amazon ML CLI commands

That's it for the S3 part. Now let's explore the CLI for Amazon's machine learning service.
All Amazon ML CLI commands are available at http://docs.aws.amazon.com/cli/latest/reference/machinelearning/. There are 30 commands, which can be grouped by object and action.

You can perform the following:

create : creates the object

describe: searches objects given some parameters (location, dates, names, and so on)

get: given an object ID, returns information

update: given an object ID, updates the object

delete: deletes an object

These can be performed on the following elements:

datasource
- create-data-source-from-rds
- create-data-source-from-redshift
- create-data-source-from-s3
- describe-data-sources
- delete-data-source
- get-data-source
- update-data-source

ml-model
- create-ml-model
- describe-ml-models
- get-ml-model
- delete-ml-model
- update-ml-model

evaluation
- create-evaluation
- describe-evaluations
- get-evaluation
- delete-evaluation
- update-evaluation

batch prediction
- create-batch-prediction
- describe-batch-predictions
- get-batch-prediction
- delete-batch-prediction
- update-batch-prediction

real-time end point
- create-realtime-endpoint
- delete-realtime-endpoint
- predict

You can also handle tags and set waiting times.

Note that the AWS CLI gives you the ability to create datasources from S3, Redshift, and RDS, while the web interface only allowed datasources from S3 and Redshift.

Creating the datasource

We will start by creating the datasource. Let's first see what parameters are needed by generating the following skeleton:

$ aws machinelearning create-data-source-from-s3 --generate-cli-skeleton

This generates the following JSON object:

{
   "DataSourceId": "",
   "DataSourceName": "",
   "DataSpec": {
       "DataLocationS3": "",
       "DataRearrangement": "",
       "DataSchema": "",
       "DataSchemaLocationS3": ""
   },
   "ComputeStatistics": true
}

The different parameters are mostly self-explanatory and further information can be found on the AWS documentation at http://docs.aws.amazon.com/cli/latest/reference/machinelearning/create-data-source-from-s3.html.

A word on the schema: when creating a datasource from the web interface, you have the possibility to use a wizard, to be guided through the creation of the schema. The wizard facilitates the process by guessing the type of the variables, thus making available a default schema that you can modify.

There is no default schema available via the AWS CLI. You have to define the entire schema yourself, either in a JSON format in the DataSchema field or by uploading a schema file to S3 and specifying its location, in the DataSchemaLocationS3 field.

Since our dataset has many variables (79), we cheated and used the wizard to create a default schema that we uploaded to S3. Throughout the rest of the chapter, we will specify the schema location not its JSON definition.

In this example, we will create the following datasource parameter file, dsrc_ames_housing_001.json:

{
   "DataSourceId": "ch8_ames_housing_001",
   "DataSourceName": "[DS] Ames Housing 001",
   "DataSpec": {
       "DataLocationS3": 
         "s3://aml.packt/data/ch8/ames_housing_training.csv",
       "DataSchemaLocationS3": 
         "s3://aml.packt/data/ch8/ames_housing.csv.schema"        
   },
   "ComputeStatistics": true
}

For the validation subset (save to dsrc_ames_housing_002.json):

{
   "DataSourceId": "ch8_ames_housing_002",
   "DataSourceName": "[DS] Ames Housing 002",
   "DataSpec": {
       "DataLocationS3": 
         "s3://aml.packt/data/ch8/ames_housing_validate.csv",
       "DataSchemaLocationS3": 
         "s3://aml.packt/data/ch8/ames_housing.csv.schema"        
   },
   "ComputeStatistics": true
}

Since we have already split our data into a training and a validation set, there's no need to specify the data DataRearrangement field.

Alternatively, we could also have avoided splitting our dataset and specified the following DataRearrangement on the original dataset, assuming it had been already shuffled: (save to dsrc_ames_housing_003.json):

{
   "DataSourceId": "ch8_ames_housing_003",
   "DataSourceName": "[DS] Ames Housing training 003",
   "DataSpec": {
       "DataLocationS3": 
         "s3://aml.packt/data/ch8/ames_housing_shuffled.csv",
       "DataRearrangement": 
         "{"splitting":{"percentBegin":0,"percentEnd":70}}",
       "DataSchemaLocationS3":
         "s3://aml.packt/data/ch8/ames_housing.csv.schema"        
   },
   "ComputeStatistics": true
}

For the validation set (save to dsrc_ames_housing_004.json):

{
   "DataSourceId": "ch8_ames_housing_004",
   "DataSourceName": "[DS] Ames Housing validation 004",
   "DataSpec": {
       "DataLocationS3":
         "s3://aml.packt/data/ch8/ames_housing_shuffled.csv",
       "DataRearrangement": 
         "{"splitting":{"percentBegin":70,"percentEnd":100}}",
   },
   "ComputeStatistics": true
}

Here, the ames_housing.csv file has previously been shuffled using the gshuf command line and uploaded to S3:

$ gshuf ames_housing_nohead.csv -o ames_housing_nohead.csv
$ cat ames_housing_header.csv ames_housing_nohead.csv > tmp.csv
$ mv tmp.csv ames_housing_shuffled.csv
$ aws s3 cp ./ames_housing_shuffled.csv s3://aml.packt/data/ch8/

Note that we don't need to create these four datasources; these are just examples of alternative ways to create datasources.

We then create these datasources by running the following:

$ aws machinelearning create-data-source-from-s3 --cli-input-json file://dsrc_ames_housing_001.json

We can check whether the datasource creation is pending:

In return, we get the datasoure ID we had specified:

{
   "DataSourceId": "ch8_ames_housing_001"
}

We can then obtain information on that datasource with the following:

$ aws machinelearning  get-data-source --data-source-id ch8_ames_housing_001

This returns the following:

{
   "Status": "COMPLETED",
   "NumberOfFiles": 1,
   "CreatedByIamUser": "arn:aws:iam::178277xxxxxxx:user/alexperrier",
   "LastUpdatedAt": 1486834110.483,
   "DataLocationS3": "s3://aml.packt/data/ch8/ames_housing_training.csv",
   "ComputeStatistics": true,
   "StartedAt": 1486833867.707,
   "LogUri": "https://eml-prod-emr.s3.amazonaws.com/178277513911-ds-ch8_ames_housing_001/.....",
   "DataSourceId": "ch8_ames_housing_001",
   "CreatedAt": 1486030865.965,
   "ComputeTime": 880000,
   "DataSizeInBytes": 648150,
   "FinishedAt": 1486834110.483,
   "Name": "[DS] Ames Housing 001"
}

Note that we have access to the operation log URI, which could be useful to analyze the model training later on.

Creating the model

Creating the model with the create-ml-model command follows the same steps:

Generate the skeleton with the following:

        $ aws machinelearning create-ml-model --generate-cli-skeleton > 
        mdl_ames_housing_001.json

Write the configuration file:

        {
            "MLModelId": "ch8_ames_housing_001",
            "MLModelName": "[MDL] Ames Housing 001",
            "MLModelType": "REGRESSION",
            "Parameters": {
                "sgd.shuffleType": "auto",
                "sgd.l2RegularizationAmount": "1.0E-06",
                "sgd.maxPasses": "100"
            },
            "TrainingDataSourceId": "ch8_ames_housing_001",
            "RecipeUri": "s3://aml.packt/data/ch8
              /recipe_ames_housing_001.json"
        }

Note the parameters of the algorithm. Here, we used mild L2 regularization and 100 passes.

Launch the model creation with the following:

        $ aws machinelearning create-ml-model --cli-input-json 
        file://mdl_ames_housing_001.json

The model ID is returned:

        {
            "MLModelId": "ch8_ames_housing_001"
        }

This get-ml-model command gives you a status update on the operation as well as the URL to the log.

        $ aws machinelearning get-ml-model --ml-model-id 
        ch8_ames_housing_001

The watch command allows you to repeat a shell command every n seconds. To get the status of the model creation every 10s, just write the following:

        $ watch -n 10 aws machinelearning get-ml-model --ml-model-id 
        ch8_ames_housing_001

The output of the get-ml-model will be refreshed every 10s until you kill it.

It is not possible to create the default recipe via the AWS CLI commands. You can always define a blank recipe that would not carry out any transformation on the data. However, the default recipe has been shown to be positively impacting the model performance. To obtain this default recipe, we created it via the web interface, copied it into a file that we uploaded to S3. The resulting file recipe_ames_housing_001.json is available in our GitHub repository. Its content is quite long as the dataset has 79 variables and is not reproduced here for brevity purposes.

Evaluating our model with create-evaluation

Our model is now trained and we would like to evaluate it on the evaluation subset. For that, we will use the create-evaluation CLI command:

Generate the skeleton:

        $ aws machinelearning create-evaluation --generate-cli-skeleton >  
        eval_ames_housing_001.json

Configure the parameter file:

        {
            "EvaluationId": "ch8_ames_housing_001",
            "EvaluationName": "[EVL] Ames Housing 001",
            "MLModelId": "ch8_ames_housing_001",
            "EvaluationDataSourceId": "ch8_ames_housing_002"
        }

Launch the evaluation creation:

        $ aws machinelearning create-evaluation --cli-input-json 
        file://eval_ames_housing_001.json

Get the evaluation information:

        $ aws machinelearning get-evaluation --evaluation-id 
        ch8_ames_housing_001

From that output, we get the performance of the model in the form of the RMSE:

        "PerformanceMetrics": {
            "Properties": {
                 "RegressionRMSE": "29853.250469108018"
            }
        }

The value may seem big, but it is relative to the range of the salePrice variable for the houses, which has a mean of 181300.0 and std of 79886.7. So an RMSE of 29853.2 is a decent score.

You don't have to wait for the datasource creation to be completed in order to launch the model training. Amazon ML will simply wait for the parent operation to conclude before launching the dependent one. This makes chaining operations possible.

At this point, we have a trained and evaluated model.

In this tutorial, we have successfully seen the detailed steps on how to get started with CLI and we have also implemented a simple project to get comfortable with the same.

To understand how to leverage Amazon's powerful platform for your predictive analytics needs, check out this book Effective Amazon Machine Learning

Part1. Learning AWS CLI

Part2. ChatOps with Slack and AWS CLI

Automate tasks using Azure PowerShell and Azure CLI [Tutorial]