How to Build a Recommender by Running Mahout on Spark


Mahout on Spark: Recommenders

There are big changes happening in Apache Mahout. For several years it was the go-to machine learning library for Hadoop. It contained most of the best-in-class algorithms for scalable machine learning, which means clustering, classification, and recommendations. But it was written for Hadoop and MapReduce. Today a number of new parallel execution engines show great promise in speeding calculations by 10-100x (Spark, H2O, Flink). That means that instead of buying 10 computers for a cluster, one will do. That should get your manager’s attention.

After releasing Mahout 0.9, the team decided to begin an aggressive retool using Spark, building in the flexibility to support other engines, and both H2O and Flink have shown active interest.

This post is about moving the heart of Mahout’s item-based collaborative filtering recommender to Spark.

Where we are

Mahout is currently at the 1.0 snapshot version, meaning we are working on what will be released as 1.0. For the past year or so, some of the team has been working on a Scala-based DSL (Domain Specific Language), which looks like Scala with R-like algebraic expressions. Since Scala supports both operator overloading and functional programming, it is a natural choice for building distributed code with rich linear algebra expressions. Currently we have an interactive shell that runs Scala with all of the R-like expression support. Think of it as R, but supporting truly huge data in a completely distributed way.


Many algorithms—the ones that can be expressed as simple linear algebra equations—are implemented with relative ease (SSVD, PCA). Scala also has lazy evaluation, which allows Mahout to slide a modern optimizer underneath the DSL. When the end product of a calculation is needed, the optimizer figures out the best path to follow and spins off the most efficient Spark jobs to accomplish the whole computation.
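The idea behind the DSL can be sketched in plain Scala. This is a toy illustration, not Mahout's actual optimizer or API: operator overloading builds an expression tree lazily, and no work happens until the result is demanded—which is exactly the hook that lets an optimizer rewrite the plan first.

```scala
// Toy sketch of a lazily evaluated algebra DSL (not Mahout's real classes).
// Operators only build an expression tree; eval() runs once at the end,
// so an optimizer could rewrite the whole tree before any computation.
sealed trait Expr {
  def t: Expr = Transpose(this)                  // R-like transpose
  def %*%(other: Expr): Expr = Multiply(this, other) // R-like matrix product
  def compute(): Array[Array[Double]] = Expr.eval(this)
}
case class Lit(m: Array[Array[Double]]) extends Expr
case class Transpose(e: Expr) extends Expr
case class Multiply(a: Expr, b: Expr) extends Expr

object Expr {
  def eval(e: Expr): Array[Array[Double]] = e match {
    case Lit(m)       => m
    case Transpose(x) => eval(x).transpose
    case Multiply(l, r) =>
      val left   = eval(l)
      val rightT = eval(r).transpose
      left.map(row => rightT.map(col =>
        row.zip(col).map { case (a, b) => a * b }.sum))
  }
}

val a = Lit(Array(Array(1.0, 1.0), Array(0.0, 1.0)))
val ata = (a.t %*% a).compute() // nothing is evaluated until this call
```

In the real DSL the same trick defers `A.t %*% A` until a result is needed, at which point the optimizer picks the cheapest physical plan (for A'A, never actually materializing the transpose).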

Recommenders

One of the first things we want to implement is the popular item-based recommender. But here, too, we’ll introduce many innovations. It still starts from some linear algebra. Let’s take the case of recommending purchases on an e-commerce site. The problem can be defined with the following terms:

r_p = recommendations for purchases for a given user. This is a vector of item-ids and strengths of recommendation.

h_p = history of purchases for a given user

A = the matrix of all purchases by all users. Rows are users, columns items, for now we will just flag a purchase so the matrix is all ones and zeros.

r_p = [A’A]h_p

A’A is the matrix A transposed then multiplied by A. This is the core cooccurrence or indicator matrix used in this style of recommender. Using the Mahout Scala DSL we could write the recommender as:

val recs = (A.t %*% A) * userHist
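To make the algebra concrete, here is a plain-Scala sketch (not using Mahout) of r_p = [A'A]h_p on a tiny purchase matrix, with raw cooccurrence counts standing in for A'A:

```scala
// Tiny worked example of r_p = [A'A] h_p using raw cooccurrence counts.
// Rows of A are users, columns are items; 1 flags a purchase.
val A = Array(
  Array(1.0, 1.0, 0.0), // user bought items 0 and 1
  Array(0.0, 1.0, 1.0), // user bought items 1 and 2
  Array(1.0, 0.0, 0.0)  // user bought item 0 only
)
val n = A(0).length

// A'A: entry (i, j) counts users who bought both item i and item j.
val AtA = Array.tabulate(n, n)((i, j) => A.map(row => row(i) * row(j)).sum)

// h_p: the current user's purchase history (bought item 0 only).
val h = Array(1.0, 0.0, 0.0)

// r_p = [A'A] h_p: items scored by cooccurrence with the user's history.
val r = AtA.map(row => row.zip(h).map { case (x, y) => x * y }.sum)
// r = [2.0, 1.0, 0.0]: item 1 cooccurs with item 0, item 2 does not.
```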

This would produce a reasonable recommendation, but from experience we know that A’A is better calculated using a method called the Log Likelihood Ratio, which is a probabilistic measure of the importance of a cooccurrence (http://en.wikipedia.org/wiki/Likelihood-ratio_test). In general, when you see something like A’A, it can be replaced with a similarity comparison for each row with every other row. This will produce a matrix whose rows are items and whose columns are the same items. The magnitude of the value in the matrix determines the strength of similarity of row item to the column item. In recommenders, the more similar the items, the more they were purchased by similar people.
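The LLR itself is computed from a 2x2 contingency table of counts for a pair of items: how many users interacted with both, with only one, or with neither. A self-contained Scala sketch of the standard (Dunning) formulation:

```scala
// Dunning's log-likelihood ratio for a 2x2 cooccurrence contingency table.
// k11: users with both items, k12: first item only,
// k21: second item only,   k22: neither item.
def xLogX(x: Double): Double = if (x == 0.0) 0.0 else x * math.log(x)

// "Unnormalized entropy" of a set of counts.
def entropy(counts: Double*): Double =
  xLogX(counts.sum) - counts.map(xLogX).sum

def llr(k11: Double, k12: Double, k21: Double, k22: Double): Double = {
  val rowEntropy = entropy(k11 + k12, k21 + k22)
  val colEntropy = entropy(k11 + k21, k12 + k22)
  val matEntropy = entropy(k11, k12, k21, k22)
  2.0 * (rowEntropy + colEntropy - matEntropy)
}

// iPad vs. iPhone from the sample data later in this post: of 4 users,
// 1 bought both, 1 bought only the iPhone, 0 only the iPad, 2 neither.
println(llr(1, 1, 0, 2)) // ≈ 1.7260924347106847
```

That value is exactly the strength you will see in the indicator-matrix output below.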

The earlier DSL line is replaced by the following:

val recs = CooccurrenceAnalysis.cooccurence(A) * userHist

However, this would take time to execute for each user as they visit the e-commerce site, so we’ll handle that outside of Mahout. First let’s talk about data preparation.

Item Similarity Driver

Creating the indicator matrix (A’A) is the core of this type of recommender. We have a quick, flexible way to create it from text log files, producing output that is easy to digest. The job of data prep is greatly streamlined in the Mahout 1.0 snapshot. In the past, users had to do all the data prep themselves: translating their own user and item IDs into Mahout IDs, putting the data into text tuple files, and feeding them to the recommender. Out the other end you’d get a Hadoop binary file called a sequence file, and you’d have to translate the Mahout IDs back into something your application could understand. This is no longer required.

To make this process much simpler we created the spark-itemsimilarity command line tool. After installing Mahout, Hadoop, and Spark, and assuming you have logged user purchases in some directories in HDFS, we can probably read them in, calculate the indicator matrix, and write it out with no other prep required. The spark-itemsimilarity command line takes in text-delimited files, extracts the user ID and item ID, runs the cooccurrence analysis, and outputs a text file with your application’s user and item IDs restored.
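An invocation might look like the sketch below. The flag names follow the Mahout 1.0 snapshot documentation and the column positions match the sample log format shown next (column 1 = user ID, column 2 = action, column 3 = item ID, counting from 0); verify them against `mahout spark-itemsimilarity --help` for your version, and substitute your own HDFS paths.

```shell
# Sketch: compute the indicator matrix from comma-delimited purchase logs.
mahout spark-itemsimilarity \
  --input /path/in/hdfs/to/purchase-logs \
  --output /path/in/hdfs/to/indicators \
  --filter1 purchase \
  --inDelim "," \
  --rowIDColumn 1 \
  --filterColumn 2 \
  --itemIDColumn 3
```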

Here is the sample input file. We’ve specified a simple comma-delimited format whose fields hold a timestamp, the user ID, a filter (here, “purchase”), and the item ID:

Thu Jul 10 10:52:10.996,u1,purchase,iphone
Fri Jul 11 13:52:51.018,u1,purchase,ipad
Fri Jul 11 21:42:26.232,u2,purchase,nexus
Sat Jul 12 09:43:12.434,u2,purchase,galaxy
Sat Jul 12 19:58:09.975,u3,purchase,surface
Sat Jul 12 23:08:03.457,u4,purchase,iphone
Sun Jul 13 14:43:38.363,u4,purchase,galaxy

spark-itemsimilarity will create a Spark resilient distributed dataset (RDD) to back the Mahout DRM (distributed row matrix) that holds this data:

User/item   iPhone   iPad   Nexus   Galaxy   Surface
u1          1        1      0       0        0
u2          0        0      1       1        0
u3          0        0      0       0        1
u4          1        0      0       1        0

The output of the job is the LLR computed “indicator matrix” and will contain this data:

Item/item   iPhone        iPad          Nexus         Galaxy        Surface
iPhone      0             1.726092435   0             0             0
iPad        1.726092435   0             0             0             0
Nexus       0             0             0             1.726092435   0
Galaxy      0             0             1.726092435   0             0
Surface     0             0             0             0             0

Reading this, we see that self-similarities have been removed so the diagonal is all zeros. The iPhone is similar to the iPad and the Galaxy is similar to the Nexus. The output of the spark-itemsimilarity job can be formatted in various ways but by default it looks like this:

galaxy<tab>nexus:1.7260924347106847
ipad<tab>iphone:1.7260924347106847
nexus<tab>galaxy:1.7260924347106847
iphone<tab>ipad:1.7260924347106847
surface

On the e-commerce site, the page displaying the Nexus can now show that the Galaxy was purchased by the same people. Notice that application-specific IDs are preserved, and the text-delimited file is very easy to parse. The numeric values are the strengths of similarity; where there are many similar products, you can sort on that value to show only the highest-weighted recommendations. A line with only an item ID, like “surface,” means no similar items were found for it.
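Parsing that default format is a few lines of Scala. This sketch assumes the defaults shown above: a tab between the item and its similarity list, space-separated `item:strength` pairs, and a bare item ID when nothing was similar.

```scala
// Sketch: parse spark-itemsimilarity's default text output into a map
// from item ID to its similar items, sorted by descending LLR strength.
val lines = Seq(
  "galaxy\tnexus:1.7260924347106847",
  "ipad\tiphone:1.7260924347106847",
  "surface" // an item with no similar items has no second field
)

val indicators: Map[String, Seq[(String, Double)]] = lines.map { line =>
  val parts = line.split("\t")
  val item  = parts(0)
  val sims =
    if (parts.length < 2 || parts(1).isEmpty) Seq.empty[(String, Double)]
    else parts(1).split(" ").toSeq.map { field =>
      val idx = field.lastIndexOf(':')
      (field.substring(0, idx), field.substring(idx + 1).toDouble)
    }.sortBy(-_._2) // strongest similarities first
  item -> sims
}.toMap
```

On a live site you would load this map into a store keyed by item ID and look up the similarity list for whatever page the user is viewing.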

Still, this only gets us partway to an individual recommender. We have done the [A’A] part; now we need the [A’A]h_p part. Using the current user’s purchase history will personalize the recommendations.

The next post in this series will talk about using a search engine to take this last step.

About the author

Pat is a serial entrepreneur, consultant, and Apache Mahout committer working on the next generation of Spark-based recommenders. He lives in Seattle and can be contacted through his site https://finderbots.com or @occam on Twitter.

