In this article by Raghav Bali and Dipanjan Sarkar, author of the book R Machine Learning By Example, we will discuss about, collaborative filtering is a simple yet very effective approach for predicting and recommending items to users. If we look closely, the algorithms work on input data, which is nothing but a matrix representation of the user ratings for different products.

Bringing in a mathematical perspective into the picture, matrix factorization is a technique to manipulate matrices and identify latent or hidden features from the data represented in the matrix. Building upon the same concept, let us use matrix factorization as the basis for predicting ratings for items which the user has not yet rated.

(For more resources related to this topic, see here.)

Matrix factorization

Matrix factorization refers to identification of two or more matrices such that when these matrices are multiplied we get the original matrix. Matrix factorization, as mentioned earlier, can be used to discover latent features between two different kinds of entities. We will understand and use the concepts of matrix factorization as we go along preparing our recommender engine for our e-commerce platform.

As our aim for the current project is to personalize the shopping experience and recommend product ratings for an e-commerce platform, our input data contains user ratings for various products on the website. We process the input data and transform it into matrix representation for analyzing it using matrix factorization. The input data looks like this:

building-product-recommendation-system-img-0
^{User ratings matrix}

As you can see, the input data is a matrix with each row representing a particular user's rating for different items represented in the columns. For the current case, the columns representing items are different mobile phones such as iPhone 4, iPhone 5s, Nexus 5, and so on. Each row contains ratings for each of these mobile phones as given by eight different users. The ratings range from 1 to 5 with 1 being the lowest and 5 being the highest. A rating of 0 represents unrated items or missing rating.

The task of our recommender engine will be to predict the correct rating for the missing ones in the input matrix. We could then use the predicted ratings to recommend items most desired by the user.

The premise here is that two users would rate a product similarly if they like similar features of the product or item. Since our current data is related to user ratings for different mobile phones, people might rate the phones based on their hardware configuration, price, OS, and so on. Hence, matrix factorization tries to identify these latent features to predict ratings for a certain user and a certain product.

While trying to identify these latent features, we proceed with the basic assumption that the number of such features is less than the total number of items in consideration. This assumption makes sense because if this was the case, then each user would have a specific feature associated with him/her (and similarly for the product). This would in turn make recommendations futile as none of the users would be interested in items rated by the other users (which is not the case usually).

Now let us get into the mathematical details of matrix factorization and our recommender engine.

Since we are dealing with user ratings for different products, let us assume U to be a matrix representing user preferences and similarly, a matrix P representing the products for which we have the ratings. Then the ratings matrix R will be defined as R = U x PT (we take transpose of P as PT for matrix multiplication) where, |R| = |U| x |P|.

Assuming the process helps us identify K latent features, our aim is to find two matrices X and Y such that their product (matrix multiplication) approximates R.

X = |U| x K matrix
Y = |P| x K matrix

Here, X is a user related matrix which represents the associations between the users and the latent features. Y, on the other hand, is the product related matrix which represents the associations between the products and the latent features.

The task of predicting the rating of a product pj by a user ui is done by calculating the dot product of the vectors corresponding to pj (vector Y, that is the user) and ui (vector X, that is the product).

building-product-recommendation-system-img-2

Now, to find the matrices X and Y, we utilize a technique called gradient descent. Gradient descent, in simple terms, tries to find the local minimum of a function; it is an optimization technique. We use gradient descent in the current context to iteratively minimize the difference between the predicted ratings and the actual ratings. To begin with, we randomly initialize the matrices X and Y and then calculate how different their product is from the actual ratings matrix R.

The difference between the predicted and the actual values is what is termed as the error. For our problem, we will consider the squared error, which is calculated as:

building-product-recommendation-system-img-3

Here, rij is the actual rating by user i for product j and is the predicted value of the same.

To minimize the error, we need to find the correct direction or gradient to change our values to. To obtain the gradient for each of the variables x and y, we differentiate them separately as:

building-product-recommendation-system-img-5

Hence, the equations to find xik and ykj can be given as:

building-product-recommendation-system-img-6

Here α is the constant to denote the rate of descent or the rate of approaching the minima (also known as the learning rate). The value of α defines the size of steps we take in either direction to reach the minima. Large values may lead to oscillations as we may overshoot the minima every time. Usual practice is to select very small values for α, of the order 10-4. and are the updated values of xik and ykj after each iteration of gradient descent.

To avoid overfitting, along with controlling extreme or large values in the matrices X and Y, we introduce the concept of regularization. Formally, regularization refers to the process of introducing additional information in order to prevent overfitting. Regularization penalizes models with extreme values.

To prevent overfitting in our case, we introduce the regularization constant called β. With introduction of β, the equations are updated as follows:

building-product-recommendation-system-img-9

Also,

building-product-recommendation-system-img-10

building-product-recommendation-system-img-11

As we already have the ratings matrix R and we use it determine how far are our predicted values from the actual, matrix factorization turns into a supervised learning problem. We use some of the rows as our training samples. Let S be our training set with elements being tuples of the form (ui, pj, rij). Thus, our task is to minimise the error (eij) for every tuple (ui, pj, rij) ϵ training set S.

The overall error (say E) can be calculated as:

E = ∑(ui, pj, rij) ϵS eij
  = ∑(ui, pj, rij) ϵS (rij - ∑(k=1 to K) xikykj)2

Implementation

Now that we have looked into the mathematics of matrix factorization, let us convert the algorithm into code and prepare a recommender engine for the mobile phone ratings input data set discussed earlier.

As shown in the Matrix factorization section, the input dataset is a matrix with each row representing a user's rating for the products mentioned as columns. The ratings range from 1 to 5 with 0 representing the missing values.

To transform our algorithm into working code, we need to compute and complete the following tasks:

Load the input data and transform it into ratings matrix representation
Prepare a matrix factorization based recommendation model
Predict and recommend products to the users
Interpret and evaluate the model

Loading and transforming input data into matrix representation is simple. As seen earlier, R provides us with easy to use utility functions for the same.

# load raw ratings from csv
raw_ratings <- read.csv(<file_name>)
 
# convert columnar data to sparse ratings matrix
ratings_matrix <- data.matrix(raw_ratings)

Now that we have our data loaded into an R matrix, we proceed and prepare the user-latent features matrix X and item-latent features matrix Y. We initialize both from uniform distributions using the runif function.

# number of rows in ratings
rows <- nrow(ratings_matrix)
 
# number of columns in ratings matrix
columns <- ncol(ratings_matrix)
 
# latent features
K <- 2
 
# User-Feature Matrix
X <- matrix(runif(rows*K), nrow=rows, byrow=TRUE)
 
# Item-Feature Matrix
Y <- matrix(runif(columns*K), nrow=columns, byrow=TRUE)

The major component is the matrix factorization function itself. Let us split the task into two, calculation of the gradient and subsequently the overall error.

The calculation of the gradient involves the ratings matrix R and the two factor matrices X and Y, along with the constants α and β. Since we are dealing with matrix manipulations (specifically, multiplication), we transpose Y before we begin with any further calculations. The following lines of code convert the algorithm discussed previously into R syntax. All variables follow naming convention similar to the algorithm for ease of understanding.

for (i in seq(nrow(ratings_matrix))){
 
      for (j in seq(length(ratings_matrix[i, ]))){
 
        if (ratings_matrix[i, j] > 0){
 
          # error
          eij = ratings_matrix[i, j] - as.numeric(X[i, ] %*% Y[, j])
 
       # gradient calculation
 
          for (k in seq(K)){
            X[i, k] = X[i, k] + alpha * (2 * eij * Y[k, j]/
            - beta * X[i, k])
 
            Y[k, j] = Y[k, j] + alpha * (2 * eij * X[i, k]/
            - beta * Y[k, j])
          }
        }
      }
    }

The next part of the algorithm is to calculate the overall error; we again use similar variable names for consistency:

# Overall Squared Error Calculation
 
e = 0
 
for (i in seq(nrow(ratings_matrix))){
 
   for (j in seq(length(ratings_matrix[i, ]))){
 
     if (ratings_matrix[i, j] > 0){
 
       e = e + (ratings_matrix[i, j] - /
           as.numeric(X[i, ] %*% Y[, j]))^2
 
       for (k in seq(K)){
          e = e + (beta/2) * (X[i, k]^2 + Y[k, j]^2)
        }
      }
    }
}

As a final piece, we iterate over these calculations multiple times to mitigate the risks of cold start and sparsity. We term the variable controlling multiple starts as epoch. We also terminate the calculations once the overall error drops below a certain threshold.

Moreover, as we had initialized X and Y from uniform distributions, the predicted values would be real numbers. We round the final output before returning the predicted matrix.

Note that this is a very simplistic implementation and a lot of complexity has been kept out for ease of understanding. Hence, this may result in the predicted matrix to contain values greater than 5. For the current scenario, it is safe to assume the values above the max scale of 5 as equivalent to 5 (and similarly for values lesser than 0). We encourage the reader to fine tune the code to handle such cases.

Setting α to 0.0002, β to 0.02, K (that is, latent features) to 2, and epoch to 1000, let us see a sample run of our code with overall error threshold set to 0.001:

# load raw ratings from csv
raw_ratings <- read.csv("product_ratings.csv")
 
# convert columnar data to sparse ratings matrix
ratings_matrix <- data.matrix(raw_ratings)
 
 
# number of rows in ratings
rows <- nrow(ratings_matrix)
 
# number of columns in ratings matrix
columns <- ncol(ratings_matrix)
 
# latent features
K <- 2
 
# User-Feature Matrix
X <- matrix(runif(rows*K), nrow=rows, byrow=TRUE)
 
# Item-Feature Matrix
Y <- matrix(runif(columns*K), nrow=columns, byrow=TRUE)
 
# iterations
epoch <- 10000
 
# rate of descent
alpha <- 0.0002
 
# regularization constant
beta <- 0.02
 
 
pred.matrix <- mf_based_ucf(ratings_matrix, X, Y, K, epoch = epoch)
 
# setting column names
colnames(pred.matrix)<-
c("iPhone.4","iPhone.5s","Nexus.5","Moto.X","Moto.G","Nexus.6",/
"One.Plus.One")

The preceding lines of code utilize the functions explained earlier to prepare the recommendation model. The predicted ratings or the output matrix looks like the following:

building-product-recommendation-system-img-12
_{Predicted ratings matrix}

Result interpretation

Let us do a quick visual inspection to see how good or bad our predictions have been. Consider users 1 and 3 as our training samples. From the input dataset, we can clearly see that user 1 has given high ratings to iPhones while user 3 has done the same for Android based phones. The following side by side comparison shows that our algorithm has predicted values close enough to the actual values:

building-product-recommendation-system-img-13
_{Ratings by user 1}

Let us see the ratings of user 3 in the following screenshot:

building-product-recommendation-system-img-14
_{Ratings by user 3}

Now that we have our ratings matrix with updated values, we are ready to recommend products to users. It is common sense to show only the products which the user hasn't rated yet. The right set of recommendations will also enable the seller to pitch the products which have high probability of being purchased by the user.

The usual practice is to return a list of top N items from the unrated list of products for each user. The user in consideration is usually termed as the active-user. Let us consider user 6 as our active-user. This user has only rated Nexus 6, One Plus One, Nexus 5, and iPhone4 in that order of rating, that is Nexus 6 was highly rated and iPhone4 was rated the least. Getting a list of Top 2 recommended phones for such a customer using our algorithm would result in Moto X and Moto G (very rightly indeed, do you see why?).

Thus, we built a recommender engine smart enough to recommend the right mobile phones to an Android Fanboy and saved the world from yet another catastrophe!

Data to the rescue!

This simple implementation of recommender engine using matrix factorization gave us a flavor of how such a system actually works. Next, let us get into some real world action using recommender engines.

Production ready recommender engines

In this article so far, we have learnt about recommender engines in detail and even developed one from scratch (using matrix factorization). Through all this, it is clearly evident how widespread is the application of such systems.

E-commerce websites (or for that fact, any popular technology platform) out there today have tonnes of content to offer. Not only that, but the number of users is also huge. In such a scenario, where thousands of users are browsing/buying stuff simultaneously across the globe, providing recommendations to them is a task in itself. To complicate things even further, a good user experience (response times, for example) can create a big difference between two competitors. These are live examples of production systems handling millions of customers day in and day out.

Fun Fact

Amazon.com is one of the biggest names in the e-commerce space with 244 million active customers. Imagine the amount of data being processed to provide recommendations to such a huge customer base browsing through millions of products!

Source: http://www.amazon.com/b?ie=UTF8&node=8445211011

In order to provide a seamless capability for use in such platforms, we need highly optimized libraries and hardware. For a recommender engine to handle thousands of users simultaneously every second, R has a robust and reliable framework called the recommenderlab.

Recommenderlab is a widely used R extension designed to provide a robust foundation for recommender engines. The focus of this library is to provide efficient handling of data, availability of standard algorithms and evaluation capabilities. In this section, we will be using recommenderlab to handle a considerably large data set for recommending items to users. We will also use the evaluation functions from recommenderlab to see how good or bad our recommendation system is. These capabilities will help us build a production ready recommender system similar (or at least closer) to what many online applications such as Amazon or Netflix use.

The dataset used in this section contains ratings for 100 items as rated by 5000 users. The data has been anonymised and the product names have been replaced by product IDs. The rating scale used is 0 to 5 with 1 being the worst, 5 being the best, and 0 representing unrated items or missing ratings.

To build a recommender engine using recommenderlab for a production ready system, the following steps are to be performed:

Extract, transform, and analyze the data.
Prepare a recommendation model and generate recommendations.
Evaluate the recommendation model.

We will look at all these steps in the following subsections.

Extract, transform, and analyze

As in case of any data intensive (particularly machine learning) application, the first and foremost step is to get the data, understand/explore it, and then transform it into the format required by the algorithm deemed fit for the current application. For our recommender engine using recommenderlab package, we will first load the data from a csv file described in the previous section and then explore it using various R functions.

# Load recommenderlab library
library("recommenderlab")
 
# Read dataset from csv file
raw_data <- read.csv("product_ratings_data.csv")
 
# Create rating matrix from data
ratings_matrix<- as(raw_data, "realRatingMatrix")
 
#view transformed data
image(ratings_matrix[1:6,1:10])

The preceding section of code loads the recommenderlab package and then uses the standard utility function to read the product_ratings_data.csv file. For exploratory as well as further steps, we need the data to be transformed into user-item ratings matrix format (as described in the Core concepts and definitions section).

The as(<data>,<type>) utility converts csv into the required ratings matrix format.

The csv file contains data in the format shown in the following screenshot. Each row contains a user's rating for a specific product. The column headers are self explanatory.

building-product-recommendation-system-img-15 _{Product ratings data}

The realRatingMatrix conversion transforms the data into a matrix as shown in the following image. The users are depicted as rows while the columns represent the products. Ratings are represented using a gradient scale where white represents missing/unrated rating while black denotes a rating of 5/best.

building-product-recommendation-system-img-16 _{Ratings matrix representation of our data}

Now that we have the data in our environment, let us explore some of its characteristics and see if we can decipher some key patterns.

First of all, we extract a representative sample from our main data set (refer to the screenshot Product ratings data) and analyse it for:

Average rating score for our user population
Spread/distribution of item ratings across the user population
Number of items rated per user

The following lines of code help us explore our data set sample and analyse the points mentioned previously:

# Extract a sample from ratings matrix
sample_ratings <-sample(ratings_matrix,1000)
 
# Get the mean product ratings as given by first user
rowMeans(sample_ratings[1,])
 
 
# Get distribution of item ratings
hist(getRatings(sample_ratings), breaks=100,/
     xlab = "Product Ratings",main = " Histogram of Product Ratings")
 
# Get distribution of normalized item ratings
hist(getRatings(normalize(sample_ratings)),breaks=100,/
            xlab = "Normalized Product Ratings",main = /
                " Histogram of Normalized Product Ratings")
 
# Number of items rated per user
hist(rowCounts(sample_ratings),breaks=50,/
     xlab = "Number of Products",main =/
     " Histogram of Product Count Distribution")

We extract a sample of 1,000 users from our dataset for exploration purposes. The mean of product ratings as given by the first row in our user-rating sample is 2.055. This tells us that this user either hasn't seen/rated many products or he usually rates the products pretty low. To get a better idea of how the users rate products, we generate a histogram of item rating distribution. This distribution peaks around the middle, that is, 3. The histogram is shown next:

building-product-recommendation-system-img-17 _{Histogram for ratings distribution}

The histogram shows that the ratings are normally distributed around the mean with low counts for products with very high or very low ratings.

Finally, we check the spread of the number of products rated by the users. We prepare a histogram which shows this spread:

building-product-recommendation-system-img-18 _{Histogram of number of rated products}

The preceding histogram shows that there are many users who have rated 70 or more products, as well as there are many users who have rated all the 100 products.

The exploration step helps us get an idea of how our data is. We also get an idea about the way the users generally rate the products and how many products are being rated.

Model preparation and prediction

We have the data in our R environment which has been transformed into the ratings matrix format. In this section, we are interested in preparing a recommender engine based upon user-based collaborative filtering. We will be using similar terminology as described in the previous sections. Recommenderlab provides straight forward utilities to learn and prepare a model for building recommender engines.

We prepare our model based upon a sample of just 1,000 users. This way, we can use this model to predict the missing ratings for rest of the users in our ratings matrix. The following lines of code utilize the first thousand rows for learning the model:

# Create 'User Based collaborative filtering' model
ubcf_recommender <- Recommender(ratings_matrix[1:1000],"UBCF")

"UBCF" in the preceding code signifies user-based collaborative filtering. Recommenderlab also provides other algorithms, such as IBCF or Item-Based Collaborative Filtering, PCA or Principal Component Analysis, and others as well.

After preparing the model, we use it to predict the ratings for our 1,010th and 1,011th users in the system. Recommenderlab also requires us to mention the number of items to be recommended to the users (in the order of preference of course). For the current case, we mention 5 as the number of items to be recommended.

# Predict list of product which can be recommended to given users
recommendations <- predict(ubcf_recommender,/
                  ratings_matrix[1010:1011], n=5)
 
# show recommendation in form of the list
as(recommendations, "list")

The preceding lines of code generate two lists, one for each of the users. Each element in these lists is a product for recommendation. The model predicted that for user 1,010, product prod_93 should be recommended as the top most product followed by prod_79, and so on.

# output generated by the model
[[1]]
[1] "prod_93" "prod_79" "prod_80" "prod_83" "prod_89"
 
[[2]]
[1] "prod_80" "prod_85" "prod_87" "prod_75" "prod_79"

Recommenderlab is a robust platform which is optimized to handle large datasets. With a few lines of code, we were able to load the data, learn a model, and even recommend products to the users in virtually no time. Compare this with the basic recommender engine we developed using matrix factorization which involved a lot many lines of code (when compared to recommenderlab) apart from the obvious difference in performance.

Model evaluation

We have successfully prepared a model and used it for predicting and recommending products to the users in our system. But what do we know about the accuracy of our model? To evaluate the prepared model, recommenderlab has handy and easy to use utilities. Since we need to evaluate our model, we need to split it into training and test data sets. Also, recommenderlab requires us to mention the number of items to be used for testing (it uses the rest for computing the error).

For the current case, we will use 500 users to prepare an evaluation model. The model will be based upon 90-10 training-testing dataset split with 15 items used for test sets.

# Evaluation scheme
eval_scheme <- evaluationScheme(ratings_matrix[1:500],/
                      method="split",train=0.9,given=15)
 
# View the evaluation scheme
eval_scheme
 
# Training model
training_recommender <- Recommender(getData(eval_scheme,/
                       "train"), "UBCF")
 
# Preditions on the test dataset
test_rating <- predict(training_recommender,/
               getData(eval_scheme, "known"), type="ratings")
 
#Error
error <- calcPredictionAccuracy(test_rating,/
                   getData(eval_scheme, "unknown"))
 
error

We use the evaluation scheme to train our model based upon UBCF algorithm. The prepared model from the training dataset is used to predict ratings for the given items. We finally use the method calcPredictionAccuracy to calculate the error in predicting the ratings between known and unknown components of the test set. For our case, we get an output as follows:

The generated output mentions the values for RMSE or root mean squared error, MSE or mean squared error, and MAE or mean absolute error. For RMSE in particular, the values deviate from the correct values by 1.162 (note that the values might deviate slightly across runs due to various factors such as sampling, iterations, and so on). This evaluation will make more sense when the outcomes are compared from different CF algorithms.

For evaluating UBCF, we use IBCF as comparator. The following few lines of code help us prepare an IBCF based model and test the ratings, which can then be compared using the calcPredictionAccuracy utility:

# Training model using IBCF
training_recommender_2 <- Recommender(getData(eval_scheme,/
                                     "train"), "IBCF")
 
# Preditions on the test dataset
test_rating_2 <- predict(training_recommender_2,/
                  getData(eval_scheme, "known"),/
                type="ratings")
 
error_compare <- rbind(calcPredictionAccuracy(test_rating,/
                getData(eval_scheme, "unknown")),/
                       calcPredictionAccuracy(test_rating_2,/
                getData(eval_scheme, "unknown")))
 
rownames(error_compare) <- c("User Based CF","Item Based CF")

The comparative output shows that UBCF outperforms IBCF with lower values of RMSE, MSE, and MAE.

Similarly, we can use the other algorithms available in recommenderlab to test/evaluate our models. We encourage the user to try out a few more and see which algorithm has the least error in predicted ratings.

Summary

In this arcticle, we continued our pursuit of using machine learning in the field of e-commerce to enhance sales and overall user experience. In this article, we accounted for the human factor and looked into the recommendation engines based upon user behavior.

We started off by understanding what are recommendation systems and their classifications into user-based, content-based, and hybrid recommender systems. We touched upon the problems associated with recommender engines in general. Then we dived deep into the specifics of Collaborative Filters and discussed the math around prediction and similarity measures. After getting our basics straight, we moved onto building a recommender engine of our own from scratch. We utilized matrix factorization to build a recommender engine step by step using a small dummy dataset. We then moved onto building a production ready recommender engine using R's popular library called recommenderlab. We used user-based CF as our core algorithm to build a recommendation model upon a bigger dataset containing ratings for 100 products by 5,000 users. We closed our discussion by evaluating our recommendation model using recommenderlab's utility methods.