Getting started with Amazon Machine Learning workflow [Tutorial]

Amazon Machine Learning is useful for building ML models and generating predictions. It also enables the development of robust and scalable smart applications. The process of building ML models with Amazon Machine Learning consists of three operations:

data analysis

model training

evaluation.

The code files for this article are available on GitHub.

This tutorial is an excerpt from a book written by Alexis Perrier titled Effective Amazon Machine Learning.

The Amazon Machine Learning service is available at https://console.aws.amazon.com/machinelearning/. The Amazon ML workflow closely follows a standard Data Science workflow with the following steps:

Extract the data and clean it up. Make it available to the algorithm.

Split the data into a training and validation set, typically a 70/30 split with equal distribution of the predictors in each part.

Select the best model by training several models on the training dataset and comparing their performances on the validation dataset.

Use the best model for predictions on new data.

As shown in the following Amazon ML menu, the service is built around four objects:

getting-started-with-amazon-machine-learning-workflow-tutorial-img-0

Datasource

ML model

Evaluation

Prediction

The Datasource and Model can also be configured and set up in the same flow by creating a new Datasource and ML model. Let us take a closer look at each one of these steps.

Understanding the dataset used

We will use the simple Predicting Weight by Height and Age dataset (from Lewis and Taylor (1967)), which contains 237 samples of children's age, weight, height, and gender and is available at https://v8doc.sas.com/sashtml/stat/chap55/sect51.htm.

This dataset is composed of 237 rows. Each row has the following predictors: sex (F, M), age (in months), and height (in inches). The target we are trying to predict is the weight (in lbs) of these children. There are no missing values and no outliers. The variables are close enough in range that normalization is not required, so we do not need to carry out any preprocessing or cleaning on the original dataset. Age, height, and weight are numerical (real-valued) variables, and sex is a categorical variable.

We will randomly select 20% of the rows as the held-out subset to use for prediction on previously unseen data and keep the other 80% as training and evaluation data. This data split can be done in Excel or any other spreadsheet editor:

By creating a new column with randomly generated numbers

Sorting the spreadsheet by that column

Selecting 190 rows for training and 47 rows for prediction (roughly an 80/20 split)

Let us name the training set LT67_training.csv and the held-out set that we will use for prediction LT67_heldout.csv, where LT67 stands for Lewis and Taylor, the creators of this dataset in 1967.
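The same split can be scripted instead of done in a spreadsheet. The following is a minimal sketch, assuming the full dataset has been saved locally as a CSV file with a header row; the source file name LT67.csv, the function names, and the random seed are illustrative choices of ours, not part of the original material.

```python
import csv
import random


def split_rows(rows, n_train, seed=42):
    """Shuffle a copy of the rows and return (training, heldout) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def split_csv(source_path, train_path, heldout_path, n_train=190, seed=42):
    """Write shuffled training and held-out CSV files, each with the header row."""
    with open(source_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    train, heldout = split_rows(rows, n_train, seed)
    for path, subset in ((train_path, train), (heldout_path, heldout)):
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(subset)
    return len(train), len(heldout)


# Usage (LT67.csv is a hypothetical name for the full dataset):
# split_csv("LT67.csv", "LT67_training.csv", "LT67_heldout.csv")
```

Shuffling before cutting the list plays the same role as sorting the spreadsheet by a column of random numbers.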

As with all datasets, scripts, and resources mentioned in this book, the training and holdout files are available in the GitHub repository at https://github.com/alexperrier/packt-aml.

It is important for the distribution in age, sex, height, and weight to be similar in both subsets. We want the data on which we will make predictions to show patterns that are similar to the data on which we will train and optimize our model.
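One simple way to check this is to compare summary statistics of the two files. The sketch below assumes both CSV files have a header row with columns named sex, age, height, and weight; these column names and the helper names are our assumptions.

```python
import csv
import statistics


def load(path):
    """Read a CSV file with a header row into a list of dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def summarize(rows, numeric_columns=("age", "height", "weight")):
    """Return {column: (mean, standard deviation)} for the numeric columns."""
    summary = {}
    for col in numeric_columns:
        values = [float(r[col]) for r in rows]
        summary[col] = (statistics.mean(values), statistics.stdev(values))
    return summary


def category_share(rows, column="sex"):
    """Return the fraction of rows that fall in each category."""
    counts = {}
    for r in rows:
        counts[r[column]] = counts.get(r[column], 0) + 1
    return {k: v / len(rows) for k, v in counts.items()}


# Usage:
# for name in ("LT67_training.csv", "LT67_heldout.csv"):
#     rows = load(name)
#     print(name, summarize(rows), category_share(rows))
```

If the means, standard deviations, and category shares differ markedly between the two files, reshuffling and splitting again is a reasonable remedy.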

Loading the data on S3

Follow these steps to load the training and held-out datasets on S3:

Go to your S3 console at https://console.aws.amazon.com/s3.

Create a bucket if you haven't done so already. Buckets are basically folders that are uniquely named across all S3. We created a bucket named aml.packt. Since that name has now been taken, you will have to choose another bucket name if you are following along with this demonstration.

Click on the bucket name you created and upload both the LT67_training.csv and LT67_heldout.csv files by selecting Upload from the Actions drop-down menu:

getting-started-with-amazon-machine-learning-workflow-tutorial-img-1

Both files are small, only a few KB, and hosting costs should remain negligible for this exercise.
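The upload can also be scripted. The following is a minimal sketch that relies on the upload_file method of the boto3 S3 client; it assumes boto3 is installed and AWS credentials are configured, and the bucket name shown in the usage comment is a placeholder to replace with your own.

```python
import os


def upload_files(s3_client, bucket, paths, prefix=""):
    """Upload each local file to the bucket and return the S3 keys used."""
    keys = []
    for path in paths:
        key = prefix + os.path.basename(path)
        # upload_file(Filename, Bucket, Key) is provided by the boto3 S3 client
        s3_client.upload_file(path, bucket, key)
        keys.append(key)
    return keys


# Usage (requires boto3 and configured AWS credentials):
# import boto3
# upload_files(boto3.client("s3"), "your-bucket-name",
#              ["LT67_training.csv", "LT67_heldout.csv"])
```

Passing the client in as an argument keeps the helper easy to try out with a stand-in object before touching a real bucket.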

Note that for each file, by selecting the Properties tab on the right, you can specify how your files are accessed, what user, role, group or AWS service may download, read, write, and delete the files, and whether or not they should be accessible from the Open Web. When creating the datasource in Amazon ML, you will be prompted to grant Amazon ML access to your input data. You can specify the access rules to these files now in S3 or simply grant access later on.

Our data is now in the cloud in an S3 bucket. We need to tell Amazon ML where to find that input data by creating a datasource. We will first create the datasource for the training file LT67_training.csv.

Declaring a datasource

Go to the Amazon ML dashboard, and click on Create new... | Datasource and ML model. We will use the faster flow available by default:

getting-started-with-amazon-machine-learning-workflow-tutorial-img-2

As shown in the following screenshot, you are asked to specify the path to the LT67_training.csv file {S3://bucket}{path}{file}. Note that the S3 location field automatically populates with the bucket names and file names that are available to your user:

getting-started-with-amazon-machine-learning-workflow-tutorial-img-3

Specifying a Datasource name is useful to organize your Amazon ML assets. When you click on Verify, Amazon ML makes sure that it has the proper rights to access the file. In case it needs to be granted access to the file, you will be prompted to do so as shown in the following screenshot:

getting-started-with-amazon-machine-learning-workflow-tutorial-img-4

Just click on Yes to grant access. At this point, Amazon ML will validate the datasource and analyze its contents.

Creating the datasource

An Amazon ML datasource is composed of the following:

The location of the data file: The data file is not duplicated or cloned in Amazon ML but accessed from S3

The schema that contains information on the type of the variables contained in the CSV file:
- Categorical
- Text
- Numeric (real-valued)
- Binary

It is possible to supply Amazon ML with your own schema or modify the one created by Amazon ML.

At this point, Amazon ML has a pretty good idea of the type of data in your training dataset. It has identified the different types of variables and knows how many rows it has:

getting-started-with-amazon-machine-learning-workflow-tutorial-img-5

Move on to the next step by clicking on Continue, and see what schema Amazon ML has inferred from the dataset as shown in the next screenshot:

getting-started-with-amazon-machine-learning-workflow-tutorial-img-6

At this point, Amazon ML needs to know which variable you are trying to predict. Be sure to tell Amazon ML the following:

The first line in the CSV file contains the column names

The target is the weight

We see here that Amazon ML has correctly inferred the following:

sex is categorical

age, height, and weight are numeric (continuous real values)
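For readers who prefer to supply their own schema, the inferred types can be written out as a schema document. The sketch below builds one in Python; the field names follow the Amazon ML schema file format as we understand it and should be verified against the service documentation, and the column names sex, age, height, and weight are assumed to match the CSV header.

```python
import json


def build_schema(target="weight"):
    """Return a schema dict describing the four columns of the training file."""
    attributes = [
        {"attributeName": "sex", "attributeType": "CATEGORICAL"},
        {"attributeName": "age", "attributeType": "NUMERIC"},
        {"attributeName": "height", "attributeType": "NUMERIC"},
        {"attributeName": "weight", "attributeType": "NUMERIC"},
    ]
    return {
        "version": "1.0",
        "targetAttributeName": target,
        "dataFormat": "CSV",
        "dataFileContainsHeader": True,  # the first line holds the column names
        "attributes": attributes,
    }


# Serialized form, ready to be saved next to the data file
schema_json = json.dumps(build_schema(), indent=2)
```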

Since we chose a numeric variable as the target, Amazon ML will use Linear Regression as the predictive model. For a binary or categorical target, it would have used Logistic Regression. This means that Amazon ML will try to find the best a, b, and c coefficients so that the weight predicted by the following equation is as close as possible to the observed real weight present in the data:

predicted weight = a * age + b * height + c * sex

Amazon ML will then ask you if your data contains a row identifier. In our present case, it does not. Row identifiers are useful when you want to understand the prediction obtained for each row or add an extra column to your dataset later on in your project. Row identifiers are for reference purposes only and are not used by the service to build the model.

You will be asked to review the datasource. You can go back to each one of the previous steps and edit the parameters for the schema, the target and the input data. Now that the data is known to Amazon ML, the next step is to set up the parameters of the algorithm that will train the model.

Understanding the model

We select the default parameters for the training and evaluation settings. Amazon ML will do the following:

Create a recipe for data transformation based on the statistical properties it has inferred from the dataset

Split the dataset (LT67_training.csv) into a training part and a validation part, with a 70/30 split. The split strategy assumes the data has already been shuffled and can be split sequentially.

The recipe will be used to transform the data in a similar way for the training and the validation datasets. The only transformation suggested by Amazon ML is to transform the categorical variable sex into a binary variable, where m = 0 and f = 1 for instance. No other transformation is needed.

The default advanced settings for the model are shown in the following screenshot:

getting-started-with-amazon-machine-learning-workflow-tutorial-img-7

We see that Amazon ML will pass over the data 10 times, shuffling the data each time. It will use an L2 regularization strategy based on the sum of the squares of the regression coefficients to prevent overfitting. We will evaluate the predictive power of the model using our LT67_heldout.csv dataset later on.

Regularization comes in three levels: mild ($10^{-6}$), medium ($10^{-4}$), or aggressive ($10^{-2}$), each stronger than the previous one. The default setting is mild, the lowest, with a regularization constant of 0.000001 ($10^{-6}$), implying that Amazon ML does not anticipate much overfitting on this dataset. This makes sense when the number of predictors, three in our case, is much smaller than the number of samples (190 for the training set).
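To make the effect of the regularization constant concrete, here is a minimal sketch of an L2-penalized loss. The exact objective that Amazon ML optimizes is not spelled out here, so the form below (mean squared error plus the constant times the sum of squared coefficients) is an assumption used for illustration only.

```python
# The three regularization levels described above
LEVELS = {"mild": 1e-6, "medium": 1e-4, "aggressive": 1e-2}


def l2_penalty(coefficients, regularization=LEVELS["mild"]):
    """Regularization constant times the sum of the squared coefficients."""
    return regularization * sum(w * w for w in coefficients)


def regularized_loss(errors, coefficients, regularization=LEVELS["mild"]):
    """Mean squared error plus the L2 penalty (assumed form of the objective)."""
    mse = sum(e * e for e in errors) / len(errors)
    return mse + l2_penalty(coefficients, regularization)
```

With the mild setting, the penalty is tiny compared with the squared error unless the coefficients become very large, which is why it barely constrains a small model like ours.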

Clicking on the Create ML model button will launch the model creation. This takes a few minutes to resolve, depending on the size and complexity of your dataset. You can check its status by refreshing the model page. In the meantime, the model status remains pending.

At that point, Amazon ML will split our training dataset into two subsets: a training and a validation set. It will use the training portion of the data to train several settings of the algorithm and select the best one based on its performance on the training data. It will then apply the associated model to the validation set and return an evaluation score for that model. By default, Amazon ML will sequentially take the first 70% of the samples for training and the remaining 30% for validation.

It's worth noting that Amazon ML will not create two extra files and store them on S3, but instead create two new datasources out of the initial datasource we have previously defined. Each new datasource is obtained from the original one via a Data rearrangement JSON recipe such as the following:

{
  "splitting": {
    "percentBegin": 0,
    "percentEnd": 70
  }
}
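The validation datasource would use the complementary range, from 70 to 100. The sketch below builds both recipes and simulates their effect on a list of rows; how the service rounds the boundaries internally is not stated here, so the integer arithmetic is an assumption.

```python
import json


def split_recipe(percent_begin, percent_end):
    """Build a data rearrangement recipe for a sequential split."""
    return {"splitting": {"percentBegin": percent_begin, "percentEnd": percent_end}}


def apply_split(rows, recipe):
    """Simulate the recipe locally by slicing the rows sequentially."""
    bounds = recipe["splitting"]
    start = len(rows) * bounds["percentBegin"] // 100
    end = len(rows) * bounds["percentEnd"] // 100
    return rows[start:end]


training_recipe = split_recipe(0, 70)
validation_recipe = split_recipe(70, 100)

# JSON text of the validation recipe
validation_json = json.dumps(validation_recipe, indent=2)
```

Applied to the 190 rows of the training file, this local simulation gives 133 rows for training and 57 for validation.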

You can see these two new datasources in the Datasource dashboard. Three datasources are now available where there was initially only one, as shown by the following screenshot:

While the model is being trained, Amazon ML runs the Stochastic Gradient Descent (SGD) algorithm several times on the training data with different parameters:

Varying the learning rate in increments of powers of 10: 0.01, 0.1, 1, 10, and 100.

Making several passes over the training data while shuffling the samples before each pass.

At each pass, calculating the prediction error, the Root Mean Squared Error (RMSE), to estimate how much of an improvement over the last pass was obtained. If the decrease in RMSE is not significant, the algorithm is considered to have converged, and no further passes are made.

At the end of the passes, the setting that ends up with the lowest RMSE wins, and the associated model (the weights of the regression) is selected as the best version.
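The procedure described above can be sketched in plain Python. This is an illustration of the idea, not Amazon ML's implementation: the update rule, the intercept term, the convergence tolerance, and the function names are all our assumptions, and the features are expected to be scaled to a small range so that at least one learning rate converges.

```python
import math
import random


def predict(weights, bias, x):
    """Linear prediction: bias plus the weighted sum of the features."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def rmse(weights, bias, X, y):
    """Root mean squared error of the model on the samples (X, y)."""
    errors = [predict(weights, bias, x) - t for x, t in zip(X, y)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def sgd(X, y, learning_rate, max_passes=10, l2=1e-6, tolerance=1e-4, seed=0):
    """Run SGD for up to max_passes; stop early when RMSE stops improving."""
    rng = random.Random(seed)
    weights, bias = [0.0] * len(X[0]), 0.0
    order = list(range(len(y)))
    previous = rmse(weights, bias, X, y)
    for _ in range(max_passes):
        rng.shuffle(order)  # shuffle the samples before each pass
        for i in order:
            error = predict(weights, bias, X[i]) - y[i]
            weights = [w - learning_rate * (error * xi + l2 * w)
                       for w, xi in zip(weights, X[i])]
            bias -= learning_rate * error
        current = rmse(weights, bias, X, y)
        if not math.isfinite(current):
            return weights, bias, math.inf  # this learning rate diverged
        improvement = previous - current
        previous = current
        if improvement < tolerance:
            break  # considered converged
    return weights, bias, previous


def select_best(X, y, learning_rates=(0.01, 0.1, 1, 10, 100)):
    """Train once per learning rate and keep the run with the lowest RMSE."""
    best = None
    for lr in learning_rates:
        weights, bias, score = sgd(X, y, lr)
        if best is None or score < best[2]:
            best = (weights, bias, score, lr)
    return best
```

Large learning rates typically diverge in this sketch and are reported with an infinite RMSE, so they are never selected.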

Once the model has finished training, Amazon ML evaluates its performance on the validation datasource. Once the evaluation itself is also ready, you have access to the model's evaluation.

Evaluating the model

Amazon ML uses the standard RMSE metric for linear regression. RMSE is defined as the square root of the mean of the squared differences between the predicted values and the real values:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$$

Here, ŷ denotes the predicted values, y the real values we want to predict (the weight of the children in our case), and n the number of samples. The closer the predictions are to the real values, the lower the RMSE is. A lower RMSE means a better, more accurate prediction.
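As a small worked example, the metric can be computed in a few lines of Python. The function name and the sample numbers are ours and are not taken from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean squared difference between two equal-length lists."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Two made-up children: real weights 100 and 120 lbs, predictions 103 and 116 lbs.
# Squared errors are 9 and 16, their mean is 12.5, and the RMSE is sqrt(12.5).
example = rmse([100, 120], [103, 116])
```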

Making batch predictions

We now have a model that has been properly trained and selected among other models. We can use it to make predictions on new data.

A batch prediction consists of applying a model to a datasource in order to make predictions on that datasource. We need to tell Amazon ML which model we want to apply on which data.

Batch predictions are different from streaming predictions. With batch predictions, all the data is already made available as a datasource, while for streaming predictions, the data will be fed to the model as it becomes available. The dataset is not available beforehand in its entirety.

In the main menu, select Batch Predictions to access the predictions dashboard, and click on Create a New Prediction:

getting-started-with-amazon-machine-learning-workflow-tutorial-img-10

The first step is to select one of the models available in your model dashboard. You should choose the one that has the lowest RMSE:

getting-started-with-amazon-machine-learning-workflow-tutorial-img-11

The next step is to associate a datasource to the model you just selected. We had uploaded the held-out dataset to S3 at the beginning of this chapter (under the Loading the data on S3 section) but had not used it to create a datasource.

We will do so now. When asked for a datasource in the next screen, make sure to check My data is in S3, and I need to create a datasource, and then select the held-out dataset that should already be present in your S3 bucket:

getting-started-with-amazon-machine-learning-workflow-tutorial-img-12

Don't forget to tell Amazon ML that the first line of the file contains the column names.

In our current project, our held-out dataset also contains the true values for the weight of the children. This would not be the case for "real" data in a real-world project where the real values are truly unknown. However, in our case, this will allow us to calculate the RMSE score of our predictions and assess the quality of these predictions.

The final step is to click on the Verify button and wait for a few minutes:

Amazon ML will run the model on the new datasource and will generate predictions in the form of a CSV file.

Contrary to the evaluation and model-building phase, we now have real predictions. We are also no longer given a score associated with these predictions.

After a few minutes, you will notice a new batch-prediction folder in your S3 bucket. This folder contains a manifest file and a results folder. The manifest file is a JSON file with the path to the initial datasource and the path to the results file. The results folder contains a gzipped CSV file:

getting-started-with-amazon-machine-learning-workflow-tutorial-img-13

Uncompressed, the CSV file contains two columns: trueLabel, the initial target from the held-out set, and score, which corresponds to the predicted values. We can easily calculate the RMSE for those results directly in a spreadsheet through the following steps:

Creating a new column that holds the square of the difference of the two columns.

Averaging that column over all the rows.

Taking the square root of the result.
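The same calculation can be done without a spreadsheet by reading the gzipped results file directly. The sketch below assumes the file has a header row with the trueLabel and score column names; the file path in the usage comment is hypothetical. It also returns the mean residual, which indicates whether the model tends to overestimate or underestimate.

```python
import csv
import gzip
import math


def read_predictions(path):
    """Return (true_values, predicted_values) from a gzipped prediction CSV."""
    with gzip.open(path, "rt", newline="") as f:
        rows = list(csv.DictReader(f))
    true_values = [float(r["trueLabel"]) for r in rows]
    predicted = [float(r["score"]) for r in rows]
    return true_values, predicted


def rmse_and_bias(true_values, predicted):
    """Return (RMSE, mean residual), with residual = predicted - true."""
    residuals = [p - t for t, p in zip(true_values, predicted)]
    squared = [r * r for r in residuals]  # the squared-difference column
    rmse = math.sqrt(sum(squared) / len(squared))
    return rmse, sum(residuals) / len(residuals)


# Usage (hypothetical local copy of the results file):
# true_values, predicted = read_predictions("predictions.csv.gz")
# print(rmse_and_bias(true_values, predicted))
```

A positive mean residual means the predictions are, on average, above the true weights.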

The following illustration shows how we create a third column C, as the squared difference between the trueLabel column A and the score (or predicted value) column B:

getting-started-with-amazon-machine-learning-workflow-tutorial-img-14

As shown in the following screenshot, averaging column C and taking the square root gives an RMSE of 11.96, which is significantly better than the RMSE we obtained during the evaluation phase (RMSE 14.4):

getting-started-with-amazon-machine-learning-workflow-tutorial-img-15

The fact that the RMSE on the held-out set is better than the RMSE on the validation set suggests that our model did not overfit the training data, since it performed even better on new data than expected. Our model is robust.

The left side of the following graph shows the True (Triangle) and Predicted (Circle) Weight values for all the samples in the held-out set. The right side shows the histogram of the residuals. Similar to the histogram of residuals we had observed on the validation set, we observe that the residuals are not centered on 0. Our model has a tendency to overestimate the weight of the children: