8 min read

In the last post, Mahout on Spark: Recommenders we talked about creating a co-occurrence indicator matrix for a recommender using Mahout. The goals for a recommender are many, but first they must be fast and must make “good” personalized recommendations. We’ll start with the basics and improve on the “good” part as we go. As we saw last time, co-occurrence or item-based recommenders are described by:

rp = hp[AtA]

Calculating [AtA]

We needed some more interesting data first so I captured video preferences by mining the web. The target demo would be a Guide to Online Video. Input was collected for many users by simply logging their preferences to CSV text files:



The technique for actually using multiple actions hasn’t been described yet so for now we’ll use the dislikes in the application to filter out recommendations and use only the likes to calculate recommendations. That means we need to use only the “like” preferences. The Mahout 1.0 version of spark-itemsimilarity can read these files directly and filter out all but the lines with “like” in them:

mahout spark-itemsimilarity –i root-input-dir 
-o output-dir 
--filter1 like -fc 1 –ic 2  
#filter out all but likes 
#indicate columns for filter and items 

This will give us output like this:

the_hunger_games_catching_fire<tab>holiday_inn 2_guns superman_man_of_steel five_card_stud district_13 blue_jasmine this_is_the_end riddick ...

law_abiding_citizen<tab>stone ong_bak_2 the_grey american centurion edge_of_darkness orphan hausu buried ...

munich<tab>the_devil_wears_prada brothers marie_antoinette singer brothers_grimm apocalypto ghost_dog_the_way_of_the_samurai ...

private_parts<tab>puccini_for_beginners finding_forrester anger_management small_soldiers ice_age_2 karate_kid magicians ...

world-war-z<tab>the_wolverine the_hunger_games_catching_fire ghost_rider_spirit_of_vengeance holiday_inn the_hangover_part_iii ...

This is a tab delimited file with a video itemID followed by a string containing a list of similar videos that is space delimited. The similar video list may contain some surprises because here “similarity” means “liked by similar people”. It doesn’t mean the videos were similar in content or genre, so don’t worry if they look odd. We’ll use another technique to make “on subject” recommendations later.

Anyone familiar with the Mahout first generation recommender will notice right away that we are using IDs that have meaning to the application, whereas before Mahout required its own integer IDs.

A fast, scalable similarity engine

In Mahout’s first generation recommenders, all recs were calculated for all users. This meant that new users would have to wait for a long running batch job to happen before they saw recommendations. In the Guide demo app, we want to make good recommendations to new users and use new preferences in real time. We have already calculated [AtA] indicating item similarities so we need a real-time method for the final part of the equation rp = hp[AtA]. Capturing hp is the first task and in the demo we log all actions to a database in real time. This may have scaling issues but is fine for a demo.

Now we will make use of the “multiply as similarity” idea we introduced in the first post. Multiplying hp[AtA] can be done with a fast similarity engine—otherwise known as a search engine. At their core, search engines are primarily similarity engines that index textual data (AKA a matrix of token vectors) and take text as the query (a token vector). Another way to look at this is that search engines find by example—they are optimized to find a collection of items by similarity to the query. We will use the search engine to find the most similar indicator vectors in [AtA] to our query hp, thereby producing rp. Using this method, rp will be the list of items returned from the search—row IDs in [AtA].

Spark-itemsimilarity is designed to create output that can be directly indexed by search engines. In the Guide demo we chose to create a catalog of items in a database and to use Solr to index columns in the database. Both Solr and Elasticsearch have highly scalable fast engines that can perform searches on database columns or a variety of text formats so you don’t have to use a database to store the indicators.

We loaded the indicators along with some metadata about the items into the database like this:

itemID foreign-key      genres        indicators
123    world-war-z      sci-fi action the_wolverine …
456    captain_phillips action drama  pi when_com…

So, the foreign-key is our video item ID from the indicator output and the indicator is the space-delimited list of similar video item IDs.

We must now set the search engine to index the indicators. This integration is usually pretty simple and depends on what database you use or if you are storing the entire catalog in files (leaving the database out of the loop). Once you’ve triggered the indexing of your indicators, we are ready for the query. The query will be a preference history vector consisting of the same tokens/video item IDs you see in the indicators. For a known user these should be logged and available, perhaps in the database, but for a new user we’ll have to find a way to encourage preference feedback.

New users

The demo site asks a new user to create an account and run through a trainer that collects important preferences from the user. We can probably leave the details of how to ask for “important” preferences for later. Suffice to say, we clustered items and took popular ones from each cluster so that the users were more likely to have seen them.

Mohout film recommendation

From this we see that the user liked:

argo django iron_man_3 pi looper …

Whether you are responding to a new user or just accounting for the most recent preferences of returning users, recentness is very important. Using the previous IDs as a query on the indicator field of the database returns recommendations, even though the new user’s data was not used to train the recommender. Here’s what we get:

Mohout film recommendation

The first line shows the result of the search engine query for the new user. The trainer on the demo site has several pages of examples to rate and the more you rate the better the recommendations become, as one would expect but these look pretty good given only 9 ratings. I can make a value judgment because they were rated by me. In a small sampling of 20 people using the site and after having them complete the entire 20 pages of training examples, we asked them to tell us how many of the recommendations on the first line were things they liked or would like to see. We got 53-90% right. Only a few people participated and your data will vary greatly but this was at least some validation.

The second line of recommendations and several more below it are calculated using a genre and this begins to show the power of the search engine method. In the trainer I picked movies where the number 1 genre was “drama”. If you have the search engine index both indicators as well as genres you can combine indicator and genre preferences in the query.

To produce line 1 the query was:

indicator field: “argo django iron_man_3 pi looper …”

To produce line 2 the query was:

indicator field: “argo django iron_man_3 pi looper …”
genre field: “drama”; boost: 5

The boost is used to skew results towards a field. In practice this will give you mostly matching genres but is not the same as a filter, which can also be used if you want a guarantee that the results will be from “drama”.


Combining a search engine with Mahout created a recommender that is extremely fast and scalable but also seamlessly blends results using collaborative filtering data and metadata. Using metadata boosting in the query allows us to skew results in a direction that makes sense. 

Using multiple fields in a single query gives us even more power than this example shows. It allows us to mix in different actions. Remember the “dislike” action that we discarded? One simple and reasonable way to use that is to filter results by things the user disliked, and the demo site does just that. But we can go even further; we can use dislikes in a cross-action recommender. Certain of the user’s dislikes might even predict what they will like, but that requires us to go back to the original equation so we’ll leave it for another post. 

About the author

Pat is a serial entrepreneur, consultant, and Apache Mahout committer working on the next generation of Spark-based recommenders. He lives in Seattle and can be contacted through his site at https://finderbots.com or @occam on Twitter.


Please enter your comment!
Please enter your name here