How to build a cold-start friendly content-based recommender using Apache Spark SQL

[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Big Data Analytics with Java written by Rajat Mehta. In this book, you will learn how to perform analytics on big data using Java, machine learning and other big data tools. [/box]

From the below given article, you will learn how to create the content-based recommendation system using movielens dataset.

Getting started with content-based recommendation systems

In content-based recommendations, the recommendation systems check for similarity between the items based on their attributes or content and then propose those items to the end users. For example, if there is a movie and the recommendation system has to show similar movies to the users, then it might check for the attributes of the movie such as the director name, the actors in the movie, the genre of the movie, and so on or if there is a news website and the recommendation system has to show similar news then it might check for the presence of certain words within the news articles to build the similarity criteria. As such the recommendations are based on actual content whether in the form of tags, metadata, or content from the item itself (as in the case of news articles).

Dataset

For this article, we are using the very popular movielens dataset, which was collected by the GroupLens Research Project at the University of Minnesota. This dataset contains a list of movies that are rated by their customers. It can be downloaded from the site https:/grouplens.org/datasets/movielens. There are few files in this dataset, they are:

u.item:This contains the information about the movies in the dataset.

The attributes in this file are:

Movie Id	This is the ID of the movie
Movie title	This is the title of the movie
Video release date	This is the release date of the movie
Genre (isAction, isComedy, isHorror)	This is the genre of the movie. This is represented by multiple flats such as isAction, isComedy, isHorror, and so on and as such it is just a Boolean flag with 1 representing users like this genre and 0 representing user do not like this genre.

u.data: This contains the data about the users rating for the movies that they have watched. As such there are three main attributes in this file and they are:

Userid	This is the ID of the user rating the movie
Item id	This is the ID of the movie
Rating	This is the rating given to the movie by the user

u.user: This contains the demographic information about the users. The main attributes from this file are:

User id	This is the ID of the user rating the movie
Age	This is the age of the user
Gender	This is the rating given to the movie by the user
ZipCode	This is the zip code of the place of the user

As the data is pretty clean and we are only dealing with a few parameters, we are not doing any extensive data exploration in this article for this dataset. However, we urge the users to try and practice data exploration on this dataset and try creating bar charts using the ratings given, and find patterns such as top-rated movies or top-rated action movies, or top comedy movies, and so on.

Content-based recommender on MovieLens dataset

We will build a simple content-based recommender using Apache Spark SQL and the MovieLens dataset. For figuring out the similarity between movies, we will use the Euclidean Distance. This Euclidean Distance would be run using the genre properties, that is, action, comedy, horror, and so on of the movie items and also run on the average rating for each movie.

First is the usual boiler plate code to create the SparkSession object. For this we first build the Spark configuration object and provide the master which in our case is local since we are running this locally in our computer. Next using this Spark configuration object we build the SparkSession object and also provide the name of the application here and that is ContentBasedRecommender.

SparkConf sc = new SparkConf().setMaster("local");

SparkSession spark = SparkSession

     .builder()

     .config(sc)

     .appName("ContentBasedRecommender")

     .getOrCreate();

Once the SparkSession is created, load the rating data from the u.data file. We are using the Spark RDD API for this, the user can feel free to change this code and use the dataset API if they prefer to use that. On this Java RDD of the data, we invoke a map function and within the map function code we extract the data from the dataset and populate the data into a Java POJO called RatingVO:

JavaRDD<RatingVO> ratingsRDD =

spark.read().textFile("data/movie/u.data").javaRDD()

.map(row -> {

RatingVO rvo = RatingVO.parseRating(row);

return rvo;

});

As you can see, the data from each row of the u.data dataset is passed to the RatingVO.parseRating method in the lambda function. This parseRating method is declared in the POJO RatingVO. Let's look at the RatingVO POJO first. For maintaining brevity of the code, we are just showing the first few lines of the POJO here:

public class RatingVO implements Serializable {

  private int userId;

private int movieId;

private float rating;

private long timestamp;

  private int like;

As you can see in the preceding code we are storing movieId, rating given by the user and the userId attribute in this POJO. This class RatingVO contains one useful method parseRating
as shown next. In this method, we parse the actual data row (that is tab separated) and extract the values of rating, movieId, and so on from it. In the method, do note the if...else block, it is here that we check whether the rating given is greater than three. If it is, then we take it as a 1 or like and populate it in the like attribute of RatingVO, else we mark it as 0 or dislike for the user:

public static RatingVO parseRating(String str) {

     String[] fields = str.split("t");

     ...

     int userId = Integer.parseInt(fields[0]);

     int movieId = Integer.parseInt(fields[1]);

     float rating = Float.parseFloat(fields[2]);

if(rating > 3) return new RatingVO(userId, movieId, rating,

timestamp,1);

     return new RatingVO(userId, movieId, rating, timestamp,0);

}

Once we have the RDD object built with the POJO's RatingVO contained per row, we are now ready to build our beautiful dataset object. We will use the createDataFrame method on the spark session and provide the RDD containing data and the corresponding POJO class, that is, RatingVO. We also register this dataset as a temporary view called ratings so that we can fire SQL queries on it:

Dataset<Row> ratings = spark.createDataFrame(ratingsRDD, RatingVO.class);

ratings.createOrReplaceTempView("ratings");

ratings.show();

The show method in the preceding code would print the first few rows of this dataset as shown next:

how-to-build-a-cold-start-friendly-content-based-recommender-using-apache-spark-sql-img-0

Next we will fire a Spark SQL query on this view (ratings) and in this SQL query we will find the average rating in this dataset for each movie. For this, we will group by on the movieId in this query and invoke an average function on the rating
column as shown here:

Dataset<Row> moviesLikeCntDS = spark.sql("select movieId,avg(rating) likesCount from ratings group by movieId");

The moviesLikeCntDS dataset now contains the results of our group by query. Next we load the data for the movies from the u.item data file. As we did for users, we store this data for movies in a
MovieVO POJO. This MovieVO POJO contains the data for the movies such as the MovieId, MovieTitle and it also stores the information about the movie genre such as action, comedy, animation, and so on. The genre information is stored as 1 or 0. For maintaining the brevity of the code, we are not showing the full code of the lambda function here:

JavaRDD<MovieVO> movieRdd = spark.read().textFile("data/movie/u.item").javaRDD()

.map(row -> {

 String[] strs = row.split("|");

 MovieVO mvo = new MovieVO();

 mvo.setMovieId(strs[0]);   

    mvo.setMovieTitle(strs[1]);

 ...

 mvo.setAction(strs[6]);

mvo.setAdventure(strs[7]);

    ...

 return mvo;

});

As you can see in the preceding code we split the data row and from the split result, which is an array of strings, we extract our individual values and store in the MovieVO object. The results of this operation are stored in the movieRdd object, which is a Spark RDD. Next, we convert this RDD into a Spark dataset. To do so, we invoke the Spark createDataFrame function and provide our movieRdd
to it and also supply the MovieVO POJO here. After creating the dataset for the movie RDD, we perform an important step here of combining our movie dataset with the moviesLikeCntDS dataset we created earlier (recall that moviesLikeCntDS dataset contains our movie ID and the average rating for that review in this movie dataset). We also register this new dataset as a temporary view so that we can fire Spark SQL queries on it:

Dataset<Row> movieDS = spark.createDataFrame(movieRdd.rdd(),

MovieVO.class).join(moviesLikeCntDS, "movieId");

movieDS.createOrReplaceTempView("movies");

Before we move further on this program, we will print the results of this new dataset. We will invoke the show method on this new dataset:

movieDS.show();

This would print the results as (for brevity we are not showing the full columns in the screenshot):

how-to-build-a-cold-start-friendly-content-based-recommender-using-apache-spark-sql-img-1

Now comes the turn for the meat of this content recommender program. Here we see the power of Spark SQL, we will now do a self join within a Spark SQL query to the temporary view movie. Here, we will make a combination of every movie to every other movie in the dataset except to itself. So if you have say three movies in the set as (movie1, movie2, movie3), this would result in the combinations (movie1, movie2), (movie1, movie3), (movie2, movie3), (movie2, movie1), (movie3, movie1), (movie3, movie2). You must have noticed by now that this query would produce duplicates as (movie1, movie2) is same as (movie2, movie1), so we will have to write separate code to remove those duplicates. But for now the code for fetching these combinations is shown as follows, for maintaining brevity we have not shown the full code:

Dataset<Row> movieDataDS =

spark.sql("select m.movieId movieId1,m.movieTitle movieTitle1,m.action action1,m.adventure adventure1, ... "

+ "m2.movieId movieId2,m2.movieTitle movieTitle2,m2.action action2,m2.adventure adventure2, ..."

If you invoke show on this dataset and print the result, it would print a lot of columns and their first few values. We will show some of the values next (note that we show values for movie1 on the left-hand side and the corresponding movie2 on the right-hand side):

how-to-build-a-cold-start-friendly-content-based-recommender-using-apache-spark-sql-img-2

Did you realize by now that on a big data dataset this operation is massive and requires extensive computing resources. Thankfully on Spark you can distribute this operation to run on multiple computer notes. For this, you can tweak the spark-submit job with the following parameters so as to extract the last bit of juice from all the clustered nodes:

spark-submit executor-cores <Number of Cores> --num-executor <Number of Executors>

The next step is the most important one in this content management program. We will run over the results of the previous query and from each row we will pull out the data for movie1 and movie2 and cross compare the attributes of both the movies and find the Euclidean Distance between them. This would show how similar both the movies are; the greater the distance, the less similar the movies are. As expected, we convert our dataset to an RDD and invoke a map function and within the lambda function for that map we go over all the dataset rows and from each row we extract the data and find the Euclidean Distance. For maintaining conciseness, we depict only a portion of the following code. Also note that we pull the information for both movies and store the calculated Euclidean Distance in a EuclidVO Java POJO object:

JavaRDD<EuclidVO> euclidRdd = movieDataDS.javaRDD().map( row -> {

   EuclidVO evo = new EuclidVO();

evo.setMovieId1(row.getString(0));

evo.setMovieTitle1(row.getString(1));

evo.setMovieTitle2(row.getString(22));

evo.setMovieId2(row.getString(21));

   int action = Math.abs(Integer.parseInt(row.getString(2)) –

Integer.parseInt(row.getString(23)) );

...

double likesCnt = Math.abs(row.getDouble(20) - row.getDouble(41));

double euclid = Math.sqrt(action * action + ... + likesCnt * likesCnt);

evo.setEuclidDist(euclid);

return evo;

});

As you can see in the bold text in the preceding code, we are calculating the Euclid Distance and storing it in a variable. Next, we convert our RDD to a dataset object and again register it in a temporary view movieEuclids and now are ready to fire queries for our predictions:

Dataset<Row> results = spark.createDataFrame(euclidRdd.rdd(), EuclidVO.class);

results.createOrReplaceTempView("movieEuclids");

Finally, we are ready to make our predictions using this dataset. Let's see our first prediction; let's find the top 10 movies that are closer to Toy Story, and this movie has movieId of 1. We will fire a simple query on the view movieEuclids to find this:

spark.sql("select * from movieEuclids where movieId1 = 1 order by euclidDist

asc").show(20);

As you can see in the preceding query, we order by Euclidean Distance in ascending order as lesser distance means more similarity. This would print the first 20 rows of the result as follows:

how-to-build-a-cold-start-friendly-content-based-recommender-using-apache-spark-sql-img-3

As you can see in the preceding results, they are not bad for content recommender system with little to no historical data. As you can see the first few movies returned are also famous animation movies like Toy Story and they are Aladdin, Winnie the Pooh, Pinocchio, and so on. So our little content management system we saw earlier is relatively okay. You can do many more things to make it even better by trying out more properties to compare with and using a different similarity coefficient such as Pearson Coefficient, Jaccard Distance, and so on.

From the article, we learned how content recommenders can be built on zero to no historical data. These recommenders are based on the attributes present on the item itself using which we figure out the similarity with other items and recommend them.

To know more about data analysis, data visualization and machine learning techniques you can refer to the book Big Data Analytics with Java.