
In this article, Prabhanjan Tattar, author of the book Practical Data Science Cookbook – Second Edition, explains that Python is an interpreted language (sometimes referred to as a scripting language), much like R. It requires no special IDE or software compilation tools and is therefore as fast as R to develop with and prototype in. Like R, it also makes use of C shared objects to improve computational performance. Additionally, Python is a default system tool on Linux, Unix, and Mac OS X machines and is available on Windows. Python comes with batteries included, which means that the standard library includes a wide range of modules, from multiprocessing to compression toolsets. Python is a flexible computing powerhouse that can tackle almost any problem domain. If you find yourself in need of libraries outside of the standard library, Python also comes with a package manager (like R) that allows the download and installation of other code bases.


Python's computational flexibility means that some analytical tasks take more lines of code than their counterparts in R. However, Python does have the tools to perform the same statistical computations. This leads to an obvious question: when do we use R over Python, and vice versa? This article attempts to answer this question by taking an application-oriented approach to statistical analyses.

From books to movies to people to follow on Twitter, recommender systems carve the deluge of information on the Internet into a more personalized flow, thus improving the performance of e-commerce, web, and social applications. It is no great surprise, given the success of Amazon's monetization of recommendations and the Netflix Prize, that any discussion of personalization or data-theoretic prediction would involve a recommender. What is surprising is how simple recommenders are to implement, and yet how susceptible they are to the vagaries of sparse data and overfitting.

Consider a non-algorithmic approach to eliciting recommendations: one of the easiest ways to garner a recommendation is to look at the preferences of someone we trust. We are implicitly comparing our preferences to theirs, and the more similarities we share, the more likely we are to discover novel, shared preferences. However, everyone is unique, and our preferences exist across a variety of categories and domains. What if you could leverage the preferences of a great number of people, not just those you trust? In the aggregate, you would be able to see patterns, not just of people like you, but also anti-recommendations: things to stay away from, cautioned by the people not like you. You would, hopefully, also see subtle delineations across the shared preference space of groups of people who share parts of your own unique experience.

Understanding the data

Understanding your data is critical to all data-related work. In this recipe, we acquire and take a first look at the data that we will be using to build our recommendation engine.

Getting ready

To prepare for this recipe, and the rest of the article, download the MovieLens data from the GroupLens website of the University of Minnesota. You can find the data at http://grouplens.org/datasets/movielens/.

In this recipe, we will use the smaller MovieLens 100k dataset (4.7 MB in size) so that we can easily load the entire model into memory.

How to do it…

Perform the following steps to better understand the data that we will be working with throughout:

  1. Download the data from http://grouplens.org/datasets/movielens/. The 100k dataset is the one that you want (ml-100k.zip).

  2. Unzip the downloaded data into the directory of your choice.
  3. The two files that we are mainly concerned with are u.data, which contains the user movie ratings, and u.item, which contains movie information and details. To get a sense of each file, use the head command in the terminal on Mac and Linux or the more command at the command prompt on Windows:
    head -n 5 u.item

    Note that if you are working on a computer running the Microsoft Windows operating system and not using a virtual machine (not recommended), you do not have access to the head command; instead, use the following command:

    more u.item
  4. The preceding command gives you the following output:
    1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
    2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
    3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
    4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
    5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
    
  5. Next, take a look at the user ratings file with the following command:
    head -n 5 u.data
  6. For Windows, you can use the following command instead:
    more u.data

    Either command produces the following output:
    196  242  3  881250949
    186  302  3  891717742
    22  377  1  878887116
    244  51  2  880606923
    166  346  1  886397596

How it works…

The two main files that we will be using are as follows:

  • u.data: This contains the user movie ratings
  • u.item: This contains the movie information and other details

Both are character-delimited files; u.data, which is the main file, is tab delimited, and u.item is pipe delimited.

For u.data, the first column is the user ID, the second column is the movie ID, the third is the star rating, and the last is the timestamp. The u.item file contains much more information, including the ID, title, release date, and even a URL to IMDB. Interestingly, this file also has a Boolean array indicating the genre(s) of each movie, including (in order) action, adventure, animation, children, comedy, crime, documentary, drama, fantasy, film-noir, horror, musical, mystery, romance, sci-fi, thriller, war, and western.
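
If you prefer to confirm this layout in code rather than at the command line, a minimal sketch such as the following parses the first line of each file with the csv module. The file names assume that you run it from inside the unzipped ml-100k directory:

import csv

# First rating in u.data: tab-delimited user ID, movie ID, rating, and timestamp
with open('u.data') as f:
    print(next(csv.reader(f, delimiter='\t')))

# First movie in u.item: pipe-delimited ID, title, release date, video date, URL, and 19 genre flags
with open('u.item') as f:
    fields = next(csv.reader(f, delimiter='|'))
    print(fields[:5])
    print(fields[5:])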

There’s more…

Free, web-scale datasets that are appropriate for building recommendation engines are few and far between. As a result, the MovieLens dataset is a very popular choice for such a task, but there are others as well. The well-known Netflix Prize dataset has been pulled down by Netflix. However, there is a dump of all user-contributed content from the Stack Exchange network (including Stack Overflow) available via the Internet Archive (https://archive.org/details/stackexchange). Additionally, there is a Book-Crossing dataset that contains over a million ratings of about a quarter of a million different books (http://www2.informatik.uni-freiburg.de/~cziegler/BX/).

Ingesting the movie review data

Recommendation engines require large amounts of training data in order to do a good job, which is why they’re often relegated to big data projects. However, to build a recommendation engine, we must first get the required data into memory and, due to the size of the data, must do so in a memory-safe and efficient way. Luckily, Python has all of the tools to get the job done, and this recipe shows you how.

Getting ready

You will need to have the appropriate MovieLens dataset downloaded, as specified in the preceding recipe. If you skipped that setup, you will need to go back and ensure that you have NumPy correctly installed.

How to do it…

The following steps guide you through the creation of the functions that we will need in order to load the datasets into the memory:

  1. Open your favorite Python editor or IDE. There is a lot of code, so it should be far simpler to enter it directly into a text file than into the Read-Eval-Print Loop (REPL).
  2. We create a function to import the movie reviews:
    In [1]: import csv
       ...: import datetime
    In [2]: def load_reviews(path, **kwargs):
       ...:     """
       ...:     Loads MovieLens reviews
       ...:     """
       ...:     options = {
       ...:         'fieldnames': ('userid', 'movieid', 'rating', 'timestamp'),
       ...:         'delimiter': '\t',
       ...:     }
       ...:     options.update(kwargs)
       ...:
       ...:     parse_date = lambda r, k: datetime.datetime.fromtimestamp(float(r[k]))
       ...:     parse_int = lambda r, k: int(r[k])
       ...:
       ...:     with open(path, 'rb') as reviews:
       ...:         reader = csv.DictReader(reviews, **options)
       ...:         for row in reader:
       ...:             row['movieid'] = parse_int(row, 'movieid')
       ...:             row['userid'] = parse_int(row, 'userid')
       ...:             row['rating'] = parse_int(row, 'rating')
       ...:             row['timestamp'] = parse_date(row, 'timestamp')
       ...:             yield row
  3. We create a helper function to help import the data:
    In  [3]: import os
        ...: def relative_path(path):
        ...:     """
        ...:     Returns a path relative to this code file
        ...:     """
        ...:     dirname = os.path.dirname(os.path.realpath('__file__'))
        ...:     path = os.path.join(dirname, path)
        ...:     return os.path.normpath(path)
  4. We create another function to load the movie information:
    In  [4]: def load_movies(path, **kwargs):
        ...:
        ...:     options = {
        ...:         'fieldnames': ('movieid', 'title', 'release', 'video', 'url'),
        ...:         'delimiter': '|',
        ...:         'restkey': 'genre',
        ...:     }
        ...:     options.update(kwargs)
        ...:
        ...:     parse_int = lambda r, k: int(r[k])
        ...:     parse_date = lambda r, k: datetime.datetime.strptime(r[k], '%d-%b-%Y') if r[k] else None
        ...:
        ...:     with open(path, 'rb') as movies:
        ...:         reader = csv.DictReader(movies, **options)
        ...:         for row in reader:
        ...:             row['movieid'] = parse_int(row, 'movieid')
        ...:             row['release'] = parse_date(row, 'release')
        ...:             row['video'] = parse_date(row, 'video')
        ...:             yield row
  5. Finally, we start creating a MovieLens class that will be augmented later:

    In  [5]: from collections import defaultdict

    In  [6]: class MovieLens(object):
        ...:     """
        ...:     Data structure to build our recommender model on.
        ...:     """
        ...:
        ...:     def __init__(self, udata, uitem):
        ...:         """
        ...:         Instantiate with a path to u.data and u.item
        ...:         """
        ...:         self.udata = udata
        ...:         self.uitem = uitem
        ...:         self.movies = {}
        ...:         self.reviews = defaultdict(dict)
        ...:         self.load_dataset()
        ...:
        ...:     def load_dataset(self):
        ...:         """
        ...:         Loads the two datasets into memory, indexed on the ID.
        ...:         """
        ...:         for movie in load_movies(self.uitem):
        ...:             self.movies[movie['movieid']] = movie
        ...:
        ...:         for review in load_reviews(self.udata):
        ...:             self.reviews[review['userid']][review['movieid']] = review
    
  6. Ensure that the functions have been imported into your REPL or the IPython workspace, and type the following, making sure that the path to the data files is appropriate for your system:

    In  [7]: data = relative_path('../data/ml-100k/u.data')
        ...: item = relative_path('../data/ml-100k/u.item')
        ...: model = MovieLens(data, item)
    

How it works…

The methodology that we use for the two data-loading functions (load_reviews and load_movies) is simple, but it takes care of the details of parsing the data from the disk. We created a function that takes a path to the dataset and any optional keywords. We know that we have specific ways in which we need to interact with the csv module, so we create default options, passing in the field names of the rows along with the delimiter, which is a tab ('\t'). The options.update(kwargs) line means that we'll accept whatever users pass to this function.

We then created internal parsing functions using lambda functions in Python. These simple parsers take a row and a key as input and return the converted value. This is an example of using lambdas as internal, reusable code blocks and is a common technique in Python. Finally, we open our file and create a csv.DictReader with our options. Iterating through the rows in the reader, we parse the fields that we want to be int and datetime, respectively, and then yield the row.

Note that as we are unsure about the actual size of the input file, we are doing this in a memory-safe manner using Python generators. Using yield instead of return ensures that Python creates a generator under the hood and does not load the entire dataset into the memory.
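
For example, because load_reviews is a generator, you can stream over all 100,000 ratings and compute summary statistics without ever holding the full list in memory. The following is a small usage sketch; the path is an assumption about where you unzipped the data:

total, count = 0, 0
for row in load_reviews('../data/ml-100k/u.data'):
    total += row['rating']
    count += 1
print('%d reviews, mean rating %0.3f' % (count, total / float(count)))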

We’ll use each of these methodologies to load the datasets at various times through our computation that uses this dataset. We’ll need to know where these files are at all times, which can be a pain, especially in larger code bases; in the There’s more… section, we’ll discuss a Python pro-tip to alleviate this concern.

Finally, we created a data structure, the MovieLens class, with which we can hold our review data. This structure takes the udata and uitem paths, and then loads the movies and reviews into two Python dictionaries, indexed by movieid and userid, respectively. To instantiate this object, you will execute something like the following:

In  [7]: data = relative_path('../data/ml-100k/u.data')
    ...: item = relative_path('../data/ml-100k/u.item')
    ...: model = MovieLens(data, item)

Note that the preceding commands assume that you have your data in a folder called data. We can now load the whole dataset into memory, indexed on the various IDs specified in the dataset.
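
With the model loaded, lookups are plain dictionary accesses. Using the rows we peeked at earlier in this article:

print(model.movies[1]['title'])           # 'Toy Story (1995)', the first line of u.item
print(model.reviews[196][242]['rating'])  # 3, the first rating we saw in u.data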

Did you notice the use of the relative_path function? When dealing with fixtures such as these to build models, the data is often included with the code. When you specify a path in Python, such as data/ml-100k/u.data, it looks it up relative to the current working directory where you ran the script. To help ease this trouble, you can specify the paths that are relative to the code itself:

import os

def relative_path(path):
    """
    Returns a path relative to this code file
    """
    dirname = os.path.dirname(os.path.realpath('__file__'))
    path = os.path.join(dirname, path)
    return os.path.normpath(path)

Keep in mind that this holds the entire data structure in memory; in the case of the 100k dataset, this will require 54.1 MB, which isn’t too bad for modern machines. However, we should also keep in mind that we’ll generally build recommenders using far more than just 100,000 reviews. This is why we have configured the data structure the way we have—very similar to a database. To grow the system, you will replace the reviews and movies properties with database access functions or properties, which will yield data types expected by our methods.
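
As an illustration only (this is not code from the article), a database-backed variant might expose lookups with the same shape of data while querying on demand. The table and column names here are hypothetical, and SQLite is used simply because it ships with Python:

import sqlite3

class MovieLensDB(object):
    """Hypothetical MovieLens variant that queries SQLite instead of in-memory dicts."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)

    def movie(self, movieid):
        # Database lookup standing in for self.movies[movieid]
        row = self.conn.execute(
            "SELECT movieid, title, release FROM movies WHERE movieid = ?", (movieid,)
        ).fetchone()
        return {'movieid': row[0], 'title': row[1], 'release': row[2]}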

Finding the highest-scoring movies

If you're looking for a good movie, you'll often want to see the most popular or best-rated movies overall. Initially, we'll take a naïve approach to computing a movie's aggregate rating by averaging the user reviews for each movie. This technique will also demonstrate how to access the data in our MovieLens class.

Getting ready

These recipes are sequential in nature. Thus, you should have completed the previous recipes in the article before starting with this one.

How to do it…

Follow these steps to output numeric scores for all movies in the dataset and compute a top-10 list:

  1. Augment the MovieLens class with a new method to get all reviews for a particular movie:
    In  [8]: class MovieLens(object):
        ...:
        ...:     # ... methods from the previous recipe ...
        ...:
        ...:     def reviews_for_movie(self, movieid):
        ...:         """
        ...:         Yields the reviews for a given movie
        ...:         """
        ...:         for review in self.reviews.values():
        ...:             if movieid in review:
        ...:                 yield review[movieid]
        ...:
    
  2. Then, add an additional method to compute the top 10 movies reviewed by users:
    In [9]: import heapq
        ...: from operator import itemgetter
        ...: class MovieLens(object):
        ...:
        ...:     # ... methods from the previous steps ...
        ...:
        ...:     def average_reviews(self):
        ...:         """
        ...:         Averages the star rating for all movies. Yields a tuple of movieid,
        ...:         the average rating, and the number of reviews.
        ...:         """
        ...:         for movieid in self.movies:
        ...:             reviews = list(r['rating'] for r in self.reviews_for_movie(movieid))
        ...:             average = sum(reviews) / float(len(reviews))
        ...:             yield (movieid, average, len(reviews))
        ...:
        ...:     def top_rated(self, n=10):
        ...:         """
        ...:         Yields the n top rated movies
        ...:         """
        ...:         return heapq.nlargest(n, self.average_reviews(), key=itemgetter(1))
        ...:

    Note that the class MovieLens(object): line here is just notation to show where the new code belongs; the average_reviews and top_rated methods should be added to the existing MovieLens class from the previous recipe.

  3. Now, let’s print the top-rated results:
    In [10]: for mid, avg, num in model.top_rated(10):
        ...:     title = model.movies[mid]['title']
        ...:     print "[%0.3f average rating (%i reviews)] %s" % (avg, num, title)
  4. Executing the preceding commands in your REPL should produce the following output:
    [5.000 average rating (1 reviews)] Entertaining Angels: The Dorothy Day Story (1996)
    [5.000 average rating (2 reviews)] Santa with Muscles (1996)
    [5.000 average rating (1 reviews)] Great Day in Harlem, A (1994)
    [5.000 average rating (1 reviews)] They Made Me a Criminal (1939)
    [5.000 average rating (1 reviews)] Aiqingwansui (1994)
    [5.000 average rating (1 reviews)] Someone Else's America (1995)
    [5.000 average rating (2 reviews)] Saint of Fort Washington, The (1993)
    [5.000 average rating (3 reviews)] Prefontaine (1997)
    [5.000 average rating (3 reviews)] Star Kid (1997)
    [5.000 average rating (1 reviews)] Marlene Dietrich: Shadow and Light (1996)

How it works…

The new reviews_for_movie() method that is added to the MovieLens class iterates through our review dictionary values (which are indexed by the userid parameter), checks whether the movieid value has been reviewed by that user, and then yields that review dictionary. We will need such functionality for the next method.

With the average_reviews() method, we have created another generator function that goes through all of our movies and all of their reviews and yields the movie ID, the average rating, and the number of reviews. The top_rated function uses the heapq module to quickly sort the reviews based on the average.

The heapq module, which implements the heap queue (priority queue) algorithm, is the Python implementation of an abstract data structure with interesting and useful properties. Heaps are binary trees built so that every parent node has a value that is less than or equal to that of any of its child nodes. Thus, the smallest element is always the root of the tree and can be accessed in constant time, which is a very desirable property. With heapq, Python developers have an efficient means to insert new values into an ordered data structure and to return sorted values.
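
To see what heapq.nlargest does with the (movieid, average, count) tuples that average_reviews yields, here is a tiny standalone illustration with made-up numbers:

import heapq
from operator import itemgetter

# (movieid, average rating, number of reviews) -- illustrative values, not taken from the dataset
scores = [(1, 3.878, 452), (2, 3.206, 131), (50, 4.358, 583), (64, 4.446, 283)]

# The two tuples with the largest value at index 1 (the average rating)
print(heapq.nlargest(2, scores, key=itemgetter(1)))
# [(64, 4.446, 283), (50, 4.358, 583)]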

There’s more…

Here, we run into our first problem: some of the top-rated movies have only one review (and, conversely, so do the worst-rated movies). How do you compare Casablanca, which has a 4.457 average rating (243 reviews), with Santa with Muscles, which has a 5.000 average rating (2 reviews)? We are sure that those two reviewers really liked Santa with Muscles, but the high rating for Casablanca is probably more meaningful because more people liked it. Most recommenders with star ratings will simply output the average rating along with the number of reviewers, allowing the user to judge its quality; however, as data scientists, we can do better in the next recipe.
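
One common remedy, and a preview of the direction the next recipe takes, is to damp each movie's average toward the global mean in proportion to how few reviews it has. The following is only a sketch; the damping constant m and the global mean used below are illustrative assumptions, not values from the article:

def damped_average(ratings, global_mean, m=50):
    """Shrink a movie's mean rating toward the global mean when it has few reviews."""
    n = len(ratings)
    return (sum(ratings) + m * global_mean) / float(n + m)

# Two perfect reviews barely move the damped score away from an assumed global mean of 3.53,
# while a few hundred good reviews dominate it.
print('%0.3f' % damped_average([5, 5], 3.53))      # ~3.587
print('%0.3f' % damped_average([4] * 243, 3.53))   # ~3.920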

See also

We have thus pointed out that companies such as Amazon track purchases and page views to make recommendations, Goodreads and Yelp use 5-star ratings and text reviews, and sites such as Reddit or Stack Overflow use simple up/down voting. You can see that preference can be expressed in the data in different ways, from Boolean flags to votes to ratings. However it is expressed, the core assumption that collaborative filtering leverages is that we can find groups of people with similar preference expressions.

More formally, suppose that two people, Bob and Alice, share a preference for a specific item, say, a widget. If Alice also has a preference for a different item, say, a sprocket, then Bob has a better-than-random chance of also sharing a preference for the sprocket. We believe that Bob and Alice's taste similarities can be expressed in aggregate via a large number of preferences, and by leveraging the collaborative nature of groups, we can filter the world of products.
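
As a toy illustration of that assumption (with made-up preference sets rather than the MovieLens data), counting the items two people both like is the crudest possible similarity signal:

# Hypothetical sets of items each person likes
prefs = {
    'alice': {'widget', 'sprocket', 'gadget'},
    'bob':   {'widget', 'gizmo'},
    'carol': {'doohickey'},
}

def shared(a, b):
    """Number of items both people like."""
    return len(prefs[a] & prefs[b])

# Bob overlaps with Alice but not with Carol, so Alice's other likes
# (such as the sprocket) are better candidates to recommend to Bob.
print('%d %d' % (shared('bob', 'alice'), shared('bob', 'carol')))   # 1 0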

Summary

In these recipes, we learned various ways of understanding our data and finding the highest-scoring movies using IPython.
