This article presents an overview of the workflow of a simple Amazon Machine Learning (Amazon ML) project. Amazon Machine Learning is an online service by Amazon Web Services (AWS) that does supervised learning for predictive analytics.

Launched in April 2015 at the AWS Summit, Amazon ML joins a growing list of cloud-based machine learning services, such as Microsoft Azure Machine Learning, the Google Prediction API, IBM Watson, PredictionIO, BigML, and many others. These online machine learning services form an offering commonly referred to as Machine Learning as a Service (MLaaS), following the naming pattern of other cloud-based services such as SaaS, PaaS, and IaaS (Software, Platform, and Infrastructure as a Service, respectively).

The Amazon ML workflow closely follows a standard data science workflow with the following steps:

  1. Extract the data and clean it up. Make it available to the algorithm.
  2. Split the data into a training set and a validation set, typically a 70/30 split, with a similar distribution of the predictors in each part.
  3. Select the best model by training several models on the training dataset and comparing their performances on the validation dataset.
  4. Use the best model for predictions on new data.
This article is an excerpt taken from the book ‘Effective Amazon Machine Learning’ written by Alexis Perrier.

As shown in the following Amazon ML menu, the service is built around four objects:

  • Datasource
  • ML model
  • Evaluation
  • Prediction

The Datasource and ML model can also be configured and set up in the same flow by creating a new Datasource and ML model together. We will take a closer look at both the Datasource and the ML model.

Amazon ML dataset

For the rest of the article, we will use the simple Predicting Weight by Height and Age dataset (from Lewis and Taylor, 1967) with 237 samples of children’s age, weight, height, and gender, which is available at https://v8doc.sas.com/sashtml/stat/chap55/sect51.htm.

Each of the 237 rows contains the following predictors: sex (F, M), age (in months), and height (in inches); we are trying to predict the weight (in lbs) of these children. There are no missing values and no outliers. The variables are close enough in range that normalization is not required. In short, we do not need to carry out any preprocessing or cleaning on the original dataset. Age, height, and weight are numerical (real-valued) variables, and sex is a categorical variable.

We will randomly select 20% of the rows as the held-out subset to use for the prediction of previously unseen data and keep the other 80% as training and evaluation data. This data split can be done in Excel or any other spreadsheet editor:

  • By creating a new column with randomly generated numbers
  • Sorting the spreadsheet by that column
  • Selecting 190 rows for training and 47 rows for prediction (roughly an 80/20 split)

Let us name the training set LT67_training.csv and the held-out set that we will use for prediction LT67_heldout.csv, where LT67 stands for Lewis and Taylor, who created this dataset in 1967.

Note that it is important for the distribution in age, sex, height, and weight to be similar in both subsets. We want the data on which we will make predictions to show patterns that are similar to the data on which we will train and optimize our model.
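
If you prefer to script this split rather than do it in a spreadsheet, the following is a minimal pandas sketch. It assumes the full dataset has been saved locally as LT67.csv (a hypothetical filename) with the columns sex, age, height, and weight; shuffling before the split is what makes similar distributions in both subsets likely:

import pandas as pd

# Assumption: the full Lewis and Taylor dataset was saved locally as LT67.csv
# with the columns sex, age, height, and weight.
df = pd.read_csv("LT67.csv")

# Shuffle the rows with a fixed seed so the split is reproducible,
# then keep 190 rows for training and the remaining 47 rows as held-out data.
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled.iloc[:190].to_csv("LT67_training.csv", index=False)
shuffled.iloc[190:].to_csv("LT67_heldout.csv", index=False)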

Loading the data on Amazon S3

Follow these steps to load the training and held-out datasets on S3:

  1. Go to your S3 console at https://console.aws.amazon.com/s3.
  2. Create a bucket if you haven’t done so already. Buckets are basically folders that are uniquely named across all S3. We created a bucket named aml.packt. Since that name has now been taken, you will have to choose another bucket name if you are following along with this demonstration.
  3. Click on the bucket name you created and upload both the LT67_training.csv and LT67_heldout.csv files by selecting Upload from the Actions drop-down menu:

Both files are small, only a few KB, and hosting costs should remain negligible for that exercise.
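
The same upload can also be scripted with boto3 instead of the console. Here is a minimal sketch; the bucket name below is a placeholder for the one you created:

import boto3

# Upload both CSV files to an existing S3 bucket.
# "my-aml-bucket" is a placeholder: bucket names are globally unique,
# so use the name of the bucket you created in step 2.
s3 = boto3.client("s3")
for filename in ("LT67_training.csv", "LT67_heldout.csv"):
    s3.upload_file(Filename=filename, Bucket="my-aml-bucket", Key=filename)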

Note that for each file, by selecting the Properties tab on the right, you can specify how your files are accessed: which users, roles, groups, or AWS services may download, read, write, and delete the files, and whether or not they should be accessible from the open web. When creating the datasource in Amazon ML, you will be prompted to grant Amazon ML access to your input data. You can specify the access rules to these files now in S3 or simply grant access later on.

Our data is now in the cloud in an S3 bucket. We need to tell Amazon ML where to find that input data by creating a datasource. We will first create the datasource for the training file LT67_training.csv.

Declaring a datasource

Go to the Amazon ML dashboard, and click on Create new… | Datasource and ML model. We will use the faster flow available by default:

As shown in the following screenshot, you are asked to specify the path to the LT67_training.csv file (s3://{bucket}/{path}/{file}). Note that the S3 location field automatically populates with the bucket names and file names that are available to your user:

The Datasource name is simply used to organize your Amazon ML assets. By clicking on Verify, Amazon ML will make sure that it has the proper rights to access the file. If it needs to be granted access to the file, you will be prompted to do so as shown in the following screenshot:

Just click on Yes to grant access. At this point, Amazon ML will validate the datasource and analyze its contents.

Creating the datasource

An Amazon ML datasource is composed of the following:

  • The location of the data file: The data file is not duplicated or cloned in Amazon ML but accessed from S3
  • The schema that contains information on the type of the variables contained in the CSV file:
    • Categorical
    • Text
    • Numeric (real-valued)
    • Binary

It is possible to supply Amazon ML with your own schema or modify the one created by Amazon ML.
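
To make this concrete, here is a sketch of what a hand-written schema for our training file could look like. The field names below follow the Amazon ML schema format as I understand it, so verify them against the Amazon ML documentation before relying on them:

import json

# Assumed Amazon ML schema format: version, targetAttributeName, dataFormat,
# dataFileContainsHeader, and a list of attributes with their types
# (BINARY, CATEGORICAL, NUMERIC, or TEXT).
schema = {
    "version": "1.0",
    "targetAttributeName": "weight",
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "attributes": [
        {"attributeName": "sex",    "attributeType": "CATEGORICAL"},
        {"attributeName": "age",    "attributeType": "NUMERIC"},
        {"attributeName": "height", "attributeType": "NUMERIC"},
        {"attributeName": "weight", "attributeType": "NUMERIC"},
    ],
}

# Save it next to the data so it can be uploaded to S3 alongside the CSV file.
with open("LT67_training.csv.schema", "w") as f:
    json.dump(schema, f, indent=2)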

At this point, Amazon ML has a pretty good idea of the type of data in your training dataset. It has identified the different types of variables and knows how many rows it has:

Move on to the next step by clicking on Continue, and see what schema Amazon ML has inferred from the dataset as shown in the next screenshot:

At this point, Amazon ML needs to know which variable you are trying to predict. Be sure to tell Amazon ML the following:

  • The first line in the CSV file contains the column names
  • The target is the weight

We see here that Amazon ML has correctly inferred the following:

  • sex is categorical
  • age, height, and weight are numeric (continuous real values)

Since we chose a numeric variable as the target, Amazon ML will use Linear Regression as the predictive model. For a binary or categorical target, it would have used Logistic Regression. This means that Amazon ML will try to find the best a, b, and c coefficients so that the weight predicted by the following equation is as close as possible to the observed real weight present in the data:

predicted weight = a * age + b * height + c * sex
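
To make that idea concrete, here is what fitting such coefficients looks like locally with scikit-learn. This is only an illustrative sketch, not what Amazon ML runs internally: Amazon ML trains with stochastic gradient descent, whereas LinearRegression uses a direct least-squares solver; it also fits an intercept term, which the equation above omits:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative only: fit the same kind of linear model on the training file
# (column names assumed to match the lower-case headers used in this article).
df = pd.read_csv("LT67_training.csv")
X = pd.DataFrame({
    "age": df["age"],
    "height": df["height"],
    "sex": (df["sex"].str.upper() == "F").astype(int),  # binarize the categorical variable
})
y = df["weight"]

model = LinearRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_)), model.intercept_)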

Amazon ML will then ask you if your data contains a row identifier. In our present case, it does not. Row identifiers are used when you want to understand the prediction obtained for each row or add an extra column to your dataset later on in your project. Row identifiers are for reference purposes only and are not used by the service to build the model.

You will be asked to review the datasource. You can go back to each one of the previous steps and edit the parameters for the schema, the target, and the input data. Now that the data is known to Amazon ML, the next step is to set up the parameters of the algorithm that will train the model.

The machine learning model

We select the default parameters for the training and evaluation settings. Amazon ML will do the following:

  • Create a recipe for data transformation based on the statistical properties it has inferred from the dataset
  • Split the dataset (LT67_training.csv) into a training part and a validation part, with a 70/30 split. The split strategy assumes the data has already been shuffled and can be split sequentially.

The recipe will be used to transform the data in a similar way for the training and the validation datasets. The only transformation suggested by Amazon ML is to turn the categorical variable sex into a binary variable, where, for instance, M = 0 and F = 1. No other transformation is needed.

The default advanced settings for the model are shown in the following screenshot:

We see that Amazon ML will pass over the data 10 times, shuffling the data before each pass. It will use an L2 regularization strategy, based on the sum of the squares of the regression coefficients, to prevent overfitting. We will evaluate the predictive power of the model using our LT67_heldout.csv dataset later on.

Regularization comes in three levels: mild (10^-6), medium (10^-4), and aggressive (10^-2), each value stronger than the previous one. The default setting is mild, the lowest, with a regularization constant of 0.000001 (10^-6), implying that Amazon ML does not anticipate much overfitting on this dataset. This makes sense when the number of predictors, three in our case, is much smaller than the number of samples (190 in the training set).
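
As a rough local analogue of these settings, scikit-learn's SGDRegressor exposes the same knobs: the number of passes, shuffling, and an L2 penalty whose strength is set by alpha. Equating alpha with Amazon ML's regularization constant is an assumption on my part, not a documented equivalence:

from sklearn.linear_model import SGDRegressor

# Linear regression trained by SGD with an L2 penalty and 10 passes over the
# data, mirroring the defaults described above. alpha plays the role of the
# regularization constant (mapping it to Amazon ML's 10^-6 setting is an
# assumption). Calling model.fit(X, y) would train it on the binarized
# training data from the earlier sketch.
model = SGDRegressor(penalty="l2", alpha=1e-6, max_iter=10, shuffle=True)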

Clicking on the Create ML model button will launch the model creation. This takes a few minutes to resolve, depending on the size and complexity of your dataset. You can check its status by refreshing the model page. In the meantime, the model status remains pending.
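
The same step can also be triggered through the API. Here is a hedged sketch with the boto3 machinelearning client; the identifiers are hypothetical and the parameter names follow the client as I understand it, so check the boto3 documentation before using them:

import boto3

# Create the model from the training datasource, then poll its status.
# "lt67-model" and "lt67-training-ds" are hypothetical identifiers;
# us-east-1 is one of the regions where Amazon ML is available.
ml = boto3.client("machinelearning", region_name="us-east-1")
ml.create_ml_model(
    MLModelId="lt67-model",
    MLModelName="LT67 weight regression",
    MLModelType="REGRESSION",          # a numeric target means linear regression
    TrainingDataSourceId="lt67-training-ds",
)

status = ml.get_ml_model(MLModelId="lt67-model")["Status"]
print(status)  # PENDING, then INPROGRESS, then COMPLETED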

At that point, Amazon ML will split our training dataset into two subsets: a training and a validation set. It will use the training portion of the data to train several settings of the algorithm and select the best one based on its performance on the training data. It will then apply the associated model to the validation set and return an evaluation score for that model. By default, Amazon ML will sequentially take the first 70% of the samples for training and the remaining 30% for validation.

It’s worth noting that Amazon ML will not create two extra files and store them on S3; instead, it creates two new datasources out of the initial datasource we previously defined. Each new datasource is obtained from the original one via a data rearrangement JSON recipe such as the following:

{
  "splitting": {
    "percentBegin": 0,
    "percentEnd": 70
  }
}
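
The validation datasource simply takes the complementary range, from 70 to 100 percent. As a hedged sketch, this is roughly how such a rearrangement could be attached to a datasource through the boto3 machinelearning client; the identifiers are hypothetical, the schema file is assumed to have been uploaded next to the CSV, and the parameter names should be checked against the boto3 documentation:

import json
import boto3

# Hypothetical identifiers and bucket name; the DataRearrangement recipe is
# the JSON document shown above, with the complementary 70-100 percent range.
ml = boto3.client("machinelearning", region_name="us-east-1")
ml.create_data_source_from_s3(
    DataSourceId="lt67-validation-ds",
    DataSourceName="LT67 validation split",
    DataSpec={
        "DataLocationS3": "s3://my-aml-bucket/LT67_training.csv",
        "DataSchemaLocationS3": "s3://my-aml-bucket/LT67_training.csv.schema",
        "DataRearrangement": json.dumps(
            {"splitting": {"percentBegin": 70, "percentEnd": 100}}
        ),
    },
    ComputeStatistics=True,
)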

You can see these two new datasources in the Datasource dashboard. Three datasources are now available where there was initially only one, as shown by the following screenshot:

While the model is being trained, Amazon ML runs the Stochastic Gradient Descent (SGD) algorithm several times on the training data with different parameters (a simplified sketch follows the list below):

  • Varying the learning rate in increments of powers of 10: 0.01, 0.1, 1, 10, and 100.
  • Making several passes over the training data while shuffling the samples before each pass.
  • At each pass, calculating the prediction error, the Root Mean Squared Error (RMSE), to estimate how much improvement was obtained over the previous pass. If the decrease in RMSE is not significant, the algorithm is considered to have converged, and no further passes are made.
  • At the end of the passes, the setting that ends up with the lowest RMSE wins, and the associated model (the weights of the regression) is selected as the best version.
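
The following toy re-implementation is for intuition only and is not Amazon ML's actual code: it trains a linear model with SGD at several learning rates, stops a run once the RMSE stops improving, and keeps the weights with the lowest RMSE:

import numpy as np

def rmse(X, y, w):
    # Root Mean Squared Error of the linear model with weights w
    return np.sqrt(np.mean((X @ w - y) ** 2))

def train_sgd(X, y, lr, n_passes=10, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    previous = np.inf
    with np.errstate(over="ignore", invalid="ignore"):  # large learning rates may diverge
        for _ in range(n_passes):
            for i in rng.permutation(len(y)):           # shuffle the samples before each pass
                w -= lr * (X[i] @ w - y[i]) * X[i]      # gradient step on one sample
            if not np.all(np.isfinite(w)):              # diverged: discard this setting
                return w, np.inf
            current = rmse(X, y, w)
            if previous - current < tol:                # no significant improvement: converged
                break
            previous = current
    return w, rmse(X, y, w)

# Stand-in data with the same shape as our problem (190 rows, 3 predictors).
rng = np.random.default_rng(1)
X = rng.random((190, 3))
y = X @ np.array([1.0, 2.0, 0.5])

# Sweep the learning rate in powers of 10 and keep the setting with the lowest RMSE.
best_w, best_rmse = min((train_sgd(X, y, lr) for lr in (0.01, 0.1, 1, 10, 100)),
                        key=lambda result: result[1])
print("best RMSE:", best_rmse)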

Once the model has finished training, Amazon ML evaluates its performance on the validation datasource. Once the evaluation itself is also ready, you have access to the model’s evaluation. The Amazon ML flow is smooth and facilitates the inherent data science loop: data, model, evaluation, and prediction.
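
Through the API, the same evaluation step looks roughly like the following. This is a hedged sketch: the identifiers are hypothetical, the validation datasource ID is assumed to be the one created from the split, and the field names should be checked against the boto3 documentation:

import boto3

# Trigger the evaluation of the model on the validation datasource,
# then read back the performance metrics (for a regression model, the RMSE)
# once the evaluation has completed.
ml = boto3.client("machinelearning", region_name="us-east-1")
ml.create_evaluation(
    EvaluationId="lt67-eval",
    EvaluationName="LT67 validation evaluation",
    MLModelId="lt67-model",
    EvaluationDataSourceId="lt67-validation-ds",
)
print(ml.get_evaluation(EvaluationId="lt67-eval")["PerformanceMetrics"])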

We looked at an overview of the workflow of a simple Amazon Machine Learning (Amazon ML) project. We discussed two objects of the Amazon ML menu: Datasource and ML model.

If you found this post useful, be sure to check out the book ‘Effective Amazon Machine Learning’ to learn about evaluation and prediction in Amazon ML along with other AWS ML concepts.
