Surviving the Titanic

This post is about applying scikit-learn's DecisionTreeClassifier to the Titanic passenger dataset. My original motivation was to find a simple ruleset for the first challenge of Udacity's Intro to Data Science course: Given a list of properties of a passenger, predict whether they survived the Titanic catastrophe or not, and reach a prediction accuracy of 80% or higher.

If you want to follow along, you need to have the SciPy stack installed (I’m on OS X Mavericks using the Anaconda distribution). We’ll also use graphviz to visually inspect our decision trees.

The data set we’ll work with is the one provided by kaggle (the train set).

Preparing the data

With everything set up, let's start by building our first decision tree. On the command line, start ipython. Then use pandas to load the Titanic data into a DataFrame.

import pandas as pd
data = pd.read_csv("titanic_data.csv")

Next is cleaning up, meaning removing columns that we don’t want the classifier to consider and getting rid of NaNs (which pandas uses to indicate missing values):

> len(data)
> data = data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis = 1)
> data = data.dropna()
> len(data)
> data.columns
Index([u'Survived', u'Pclass', u'Sex', u'Age', u'SibSp', u'Parch', u'Fare', u'Embarked'], dtype=object)

With data.dropna() we did a brute force cleaning; all rows containing at least one NaN entry were removed from the DataFrame. This results in us throwing away about 20% of our data. We don’t care much about that here, but in real life, you should handle missing data a bit smarter (e.g., you could try to predict missing values).
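
As an aside, here is a minimal sketch of a slightly smarter approach that imputes missing ages with the median instead of dropping those rows (just for illustration; the rest of this post sticks with the brute-force dropna() above):

> data_imputed = pd.read_csv("titanic_data.csv")
> data_imputed = data_imputed.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis = 1)
> data_imputed['Age'] = data_imputed['Age'].fillna(data_imputed['Age'].median())
> data_imputed = data_imputed.dropna()  # now only rows with other missing values (e.g. 'Embarked') are dropped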

The final thing we need to do before we can learn our first decision tree is to transform the categorical values of ‘Sex’ and ‘Embarked’ into corresponding integer values.

> data['Sex'] = pd.Categorical.from_array(data['Sex']).labels
> data['Embarked'] = pd.Categorical.from_array(data['Embarked']).labels
> data.head()

Fitting a DecisionTreeClassifier

To learn the decision tree, we split our frame into a target-value vector y ('Survived') and a feature matrix X (all other columns) and fit an instance of DecisionTreeClassifier:

> from sklearn import tree
> clf = tree.DecisionTreeClassifier()
> y = data['Survived']
> X = data.drop('Survived', axis = 1)
> clf =, y)

clf is now a decision tree fitted to our training data. scikit-learn can export the learned tree into the dot format used by graphviz for visual inspection:

> with open("", "w") as f:
      tree.export_graphviz(clf, out_file=f, feature_names=X.columns)

On the command line, convert the dot file into a pdf and open it in your favorite pdf viewer:

dot -Tpdf -o titanic_1.pdf
open titanic_1.pdf

You’ll notice that we ended up with quite a big tree.


Here’s how you read the visualization:



Evaluating the model

Let’s see how the tree performs on predicting the survival values on the training data:

> clf.score(X,y)

So we got 99% accuracy. That's nice, but it's rather unlikely that we're going to see this level of accuracy on unseen data. Looking at the depth of the tree and the fine-grained decisions on the lower levels, it's almost certain that our tree is overfitted. Let's run a few tests to determine whether we're right.

First we do a cross-validation to get an idea of what accuracy to expect on unseen data given our default DecisionTreeClassifier as the predictor of choice:

> from sklearn.cross_validation import cross_val_score
> cross_val_score(tree.DecisionTreeClassifier(), X, y, cv=10).mean()

That’s quite a bit lower and even worse than simple classification heuristics such as “assume all females survived and all males died”. The modeling approach we’ve chosen is less than optimal.
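
As a quick sanity check, you can compute the accuracy of that gender heuristic directly on the training data. This sketch assumes the categorical encoding above mapped 'female' to 0, which is what the alphabetical ordering gives you:

> baseline = (X['Sex'] == 0).astype(int)  # predict "survived" for females, "died" for males
> (baseline == y).mean()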

Is the reason overfitting? Or is it a bias inherent in the model? Let’s look at the learning curve to see what’s happening:


We see that the training error stays low when we add more training examples, while the test error remains quite a bit higher – the typical pattern of an overfitted model.
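
If you want to reproduce a learning curve like this yourself, here is a minimal sketch using a simple hold-out split (the split ratio and step size are arbitrary choices for illustration):

> from sklearn.cross_validation import train_test_split
> import matplotlib.pyplot as plt
> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
> sizes = range(50, len(X_train), 50)
> train_err, test_err = [], []
> for n in sizes:
      clf = tree.DecisionTreeClassifier().fit(X_train[:n], y_train[:n])
      train_err.append(1 - clf.score(X_train[:n], y_train[:n]))
      test_err.append(1 - clf.score(X_test, y_test))
> plt.plot(sizes, train_err, label='training error')
> plt.plot(sizes, test_err, label='test error')
> plt.legend()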

Pruning the tree – the manual way

Approaches to mitigate overfitting include removing features from the data and limiting model complexity. Here, we'll go with the latter: we've seen that the current model is quite complex, so there's lots of room for improvement.

scikit-learn provides several options for limiting the maximum complexity of a fitted DecisionTreeClassifier:

  • Restrict the maximum depth of the tree.
  • Set the minimum number of samples required to be a leaf node.
  • Set the minimum number of samples required to split an internal node.
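
These limits map directly onto constructor parameters of DecisionTreeClassifier (max_depth, min_samples_leaf, min_samples_split); the concrete values below are just placeholders for illustration:

> clf = tree.DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, min_samples_split=20)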

We’ll explore limiting the maximum depth with the goal of finding the shallowest model that’s still good enough (expected accuracy >= 80%). Here’s a plot of the classification error with increasing tree depth:
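
You can reproduce this plot with a simple depth sweep; the depth range and cv=10 are illustrative choices, and cross_val_score was imported above:

> import matplotlib.pyplot as plt
> depths = range(1, 11)
> errors = [1 - cross_val_score(tree.DecisionTreeClassifier(max_depth=d), X, y, cv=10).mean()
            for d in depths]
> plt.plot(depths, errors)
> plt.xlabel('max depth')
> plt.ylabel('classification error')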


Looks like a depth between 3 and 5 should turn out to be a reasonable compromise between complexity and accuracy. Let’s do the cross validations:

> cross_val_score(tree.DecisionTreeClassifier(max_depth=2), X, y, cv=10).mean()
> cross_val_score(tree.DecisionTreeClassifier(max_depth=3), X, y, cv=10).mean()
> cross_val_score(tree.DecisionTreeClassifier(max_depth=4), X, y, cv=10).mean()
> cross_val_score(tree.DecisionTreeClassifier(max_depth=5), X, y, cv=10).mean()

The tree of depth 3 looks like this:


That’s simple enough for our purposes. Rewritten as an if-then-else statement, the tree becomes:

if passenger['Sex'] == "female":
    p = passenger['Pclass'] <= 2 or passenger['Fare'] <= 20.8
else:
    p = passenger['Age'] <= 6.5 and passenger['SibSp'] <= 2.5
predictions[passenger['PassengerId']] = p

This simple ruleset reaches an accuracy of 82.6% in the Udacity challenge — far above the required minimum of 80%.


Tutorial: Exploring Recommendation Algorithms

Wow, it took me quite some time to sit down and actually write the tutorial I had in mind almost a year ago. Kind of embarrassing, but hey, I eventually got around to doing it. So without further ado, let's get into the tutorial and have some fun with recommendation algorithms.

What we will do

We'll write a simple movie recommendation application that lets you experience the behavior of the algorithms first hand. Granted, recommending movies isn't that exciting since everybody does it, but there are suitable datasets available online to train and evaluate our algorithms, and everyone following along should be able to subjectively evaluate the results of our application.

We’ll implement, evaluate, and use several different recommendation algorithms:

  1. user-based collaborative filtering (the original recommendation algorithm)
  2. item-based collaborative filtering (introduced and used in the late 1990s by Amazon)
  3. matrix factorization (really successful in the Netflix challenge)

In this first part of the tutorial, we'll get everything set up and implement a first version of our application using a straightforward user-based collaborative filtering recommender.

Introducing the Movie Buddy

Let’s talk a little bit about the application we’re going to write: the Movie Buddy. Movie Buddy is there to help you pick a movie you’ll hopefully enjoy for the evening, given your previous movie preferences in the form of a set of movies you like.

The UI will allow you to create and maintain a list of loved movies, and see a list of corresponding recommended movies. Here’s a sketch:



I’m using Python 2.7 throughout this tutorial. Python is available for most environments, code is usually compact and easy to read, turn-around times are short, and you don’t need a heavy-weight IDE to get stuff done (vi works fine!). In short, I find Python great for exploring ideas.

Collaborative approaches rely on users’ historical preferences to generate recommendations. Movie Buddy will use the popular movie-lens dataset. I’ll use the 1 million ratings dataset in my examples.

Make sure you’ve got Python installed and a copy of the movie-lens dataset downloaded if you want to follow along.

You can download all sources of this tutorial from bitbucket. I’m not allowed to redistribute the movie-lens dataset, so you’ve got to fetch it from their web-site separately.

Project Structure

I decided to split the movie recommender into three components:

  • A module that provides the UI independent of any recommendation algorithm
  • A module that provides the implementations of the recommendation algorithms
  • The driver that brings the UI and the recommendation algorithms together

The reasoning behind this is that we’re going to explore different recommendation algorithms in the movie recommendation context, so it should be easy to swap one algorithm with another. That’s why the algorithms are separated from the UI, and both are stitched together by the driver.

Preparing the data

The movie-lens dataset contains ratings on a scale from 1 to 5. Movie Buddy supports only unary ratings: Either the movie is in the set of loved movies, or not. In practice, it's more common to have only unary ratings (likes) available, so I'll start by exploring this scenario.

This means we’ve got to transform the 1-to-5 ratings into unary ratings or lists of votes: For each user, we want to generate a list of movies he likes. We do this by first computing the user’s average movie rating, and then adding all movies rated above this average to his list.

I’m using the average rating here instead of a global threshold because the distribution of ratings tends to be different for each user. Some only use the 3-5 part of the scale, rating movies they didn’t like with a 3 and great ones with a 5. Others utilize the entire scale.

Here’s a Python script that creates the lists of votes for each user:

"""Transforms movielense integer ratings into boolean ratings."""

import argparse
import csv

# relevant columns in the movielens ratings file: fields are separated by
# "::", so parsing with a ":" delimiter leaves an empty field in between
USERID, MOVIE, RATING = 0, 2, 4

def parse_args():
    parser = argparse.ArgumentParser(description=
        "transforms movielens integer ratings into boolean ratings")
    parser.add_argument("input", help="name of the input data file")
    parser.add_argument("output", help="name of the output data file")
    return parser.parse_args()

def read_ratings_from(csvfile):
    result = dict()
    data_reader = csv.reader(csvfile,delimiter=":")
    for row in data_reader:
        result.setdefault(int(row[USERID]), list()).append(
            (int(row[MOVIE]), int(row[RATING])))
    return result

def transform_ratings(int_ratings):
    result = dict()
    for user,ratings in int_ratings.items():
        avg_rating = sum([r for m,r in ratings]) / float(len(ratings))
        pos_movies = [m for m,r in ratings if r > avg_rating]
        if len(pos_movies) > 0:
            result[user] = pos_movies
    return result

def write_ratings_to(bool_ratings, csvfile):
    writer = csv.writer(csvfile)
    for u,r in bool_ratings.items():
        writer.writerow([u] + r)

def main():
    args = parse_args()
    int_ratings = read_ratings_from(open(args.input, "rb"))
    bool_ratings = transform_ratings(int_ratings)
    write_ratings_to(bool_ratings,open(args.output, "wb"))

if __name__ == "__main__":
    main()

Apply this script to the ratings.dat file in the movie-lens dataset like so:
python ratings.dat votes.csv

This will store the lists of votes in the file votes.csv.

The User Interface

I’ve written the user interface using tkinter, as it’s part of the default Python distribution and should thus be available in whatever environment you’re using – no need to install any additional stuff.

The recui module provides the following public methods:

  • init_ui: Builds up the UI.
  • set_choices: Sets the entries of the choice list. The argument can be of any sequence type.
  • set_recs: Sets the entries of the recommendation list. The argument can be of any sequence type.
  • mainloop: Starts the UI event loop. Won’t return until the UI window is closed.

Obviously, if you call mainloop before setting up the UI, you won’t see anything.

""" A simple UI for the movie recommender. """

# Author: Eric Schwarzkopf <>
# License: BSD Style.

from Tkinter import *

# data

_select_callback = None
_root = Tk()
_choices = []
_choicesVar = StringVar()
_selected = []
_selectedVar = StringVar()
_recommendedVar = StringVar()
_choicelist = None
_selectedlist = None
_recommendedlist = None

# event callbacks

def _choice_selected(event):
    """ Called when an item in the choice list has been selected. """
    idxs = event.widget.curselection()
    if len(idxs) == 1:
        idx = int(idxs[0])
        c = _choices[idx]
        if not c in _selected:
            _choicelist.itemconfigure(idx, fg='grey')
            _selected.append(c)
            _selectedVar.set(tuple(_selected))
            _select_callback(_selected)

def _root_keypressed(event):
    """ Called when a key has been pressed. """
    c = event.char
    for i in range(0, len(_choices)):
        if str(_choices[i]).lower().startswith(c):
            _choicelist.see(i)
            break

def _selected_selected(event):
    """ Called when an item in the selected list has been selected. """
    idxs = event.widget.curselection()
    if len(idxs) == 1:
        c = _selected[int(idxs[0])]
        _choicelist.itemconfigure(_choices.index(c), fg='black')
        _selected.remove(c)
        _selectedVar.set(tuple(_selected))
        _select_callback(_selected)

# public functions

def set_choices(values):
    """ Sets the entries of the choice list. """
    global _choices
    _choices = values
    _choicesVar.set(tuple(values))

def set_recs(values):
    """ Sets the entries of the recommended list. """
    _recommendedVar.set(tuple(values))

def init_ui(title, select_callback):
    """ Initializes the UI.

    title : string
        The title of the UI window.

    select_callback : callable
        Called with the list of choices currently selected if
        this list changes.
    """
    global _select_callback, _choicelist, _selectedlist, _recommendedlist

    _select_callback = select_callback

    _root.columnconfigure(0, weight=1)
    _root.rowconfigure(0, weight=1)
    _root.bind('<Key>', _root_keypressed)

    mainframe = Frame(_root, padx=3, pady=12)
    mainframe.grid(column=0, row=0, sticky=(N,W,E,S))
    mainframe.columnconfigure(0, weight=1)
    mainframe.rowconfigure(0, weight=1)
    mainframe.columnconfigure(1, weight=1)
    mainframe.rowconfigure(0, weight=1)
    mainframe.columnconfigure(2, weight=1)
    mainframe.rowconfigure(0, weight=1)

    # the list of all movies
    choiceframe = Frame(mainframe, padx=10, pady=10)
    choiceframe.grid(column=0, row=0, sticky=(N,W,E,S))
    choiceframe.columnconfigure(0, weight=1)
    choiceframe.rowconfigure(1, weight=1)
    choicelabel = Label(choiceframe, text='All Movies')
    choicelabel.grid(column=0, row=0, sticky=(W,E))
    choicelabel['anchor'] = 'center'
    _choicelist = Listbox(choiceframe, listvariable=_choicesVar)
    _choicelist.grid(column=0, row=1, sticky=(N,W,E,S), padx=5)
    _choicelist.bind('<Double-1>', _choice_selected)
    s = Scrollbar(choiceframe, orient = VERTICAL, command = _choicelist.yview)
    s.grid(column = 1, row = 1, sticky = (N, S))
    _choicelist['yscrollcommand'] = s.set

    # the list of selected movies
    selectedframe = Frame(mainframe, padx=10, pady=10)
    selectedframe.grid(column=1, row=0, sticky=(N,W,E,S))
    selectedframe.columnconfigure(0, weight=1)
    selectedframe.rowconfigure(1, weight=1)
    selectedlabel = Label(selectedframe, text='Loved Movies')
    selectedlabel.grid(column=0, row=0, sticky=(W,E))
    selectedlabel['anchor'] = 'center'
    _selectedlist = Listbox(selectedframe, listvariable=_selectedVar)
    _selectedlist.grid(column=0, row=1, sticky=(N,W,E,S), padx=5)
    _selectedlist.bind('<Double-1>', _selected_selected)
    s = Scrollbar(selectedframe, orient = VERTICAL,
    command = _selectedlist.yview)
    s.grid(column = 1, row = 1, sticky = (N, S))
    _selectedlist['yscrollcommand'] = s.set

    # list of recommendations
    recommendedframe = Frame(mainframe, padx=10, pady=10)
    recommendedframe.grid(column=2, row=0, sticky=(N,W,E,S))
    recommendedframe.columnconfigure(0, weight=1)
    recommendedframe.rowconfigure(1, weight=1)
    recommendedlabel = Label(recommendedframe, text='Recommendations')
    recommendedlabel.grid(column=0, row=0, sticky=(W,E))
    recommendedlabel['anchor'] = 'center'
    _recommendedlist = Listbox(recommendedframe, listvariable=_recommendedVar)
    _recommendedlist.grid(column=0, row=1, sticky=(N,W,E,S), padx=5)
    s = Scrollbar(recommendedframe, orient = VERTICAL,
    command = _recommendedlist.yview)
    s.grid(column = 1, row = 1, sticky = (N, S))
    _recommendedlist['yscrollcommand'] = s.set

def mainloop():
    """Starts the UI event loop. Won't return."""
    _root.mainloop()

The first recommender: user-based collaborative filtering

To make Movie Buddy do something useful from the get go, we’ll implement a straightforward user-based collaborative filter. Collaborative filtering is a k-nearest-neighbor (knn) algorithm: For the current user, find the k most similar users in the dataset, collect their votes, and recommend the movies with the highest number of votes that the current user hasn’t seen yet.

The underlying assumption is that if two users behaved similarly in the past, they will continue to behave similarly in the future. Or, in the movie domain, if two people had the same taste in movies in the past, they will have the same taste in the future. This has proven to be a reasonable assumption in a number of domains.

No article about recommendation algorithms is complete without a user-item matrix, so here it is – used to explain what’s going on:


Assume this is our dataset and we want to recommend a single movie to user U1. Which one should we choose?

Intuitively, U2 seems to be the most similar to U1, and U2 loves M5, so M5 might be a good choice.

How do we quantify the similarity between users? There are different options, and you should evaluate several similarity measures to make sure you use a good one for your domain. I’ll do the evaluation in a later post. Here, I’m going with the Jaccard index.

Given two sets A and B, the Jaccard index is defined as the size of their intersection divided by the size of their union:

J(A, B) = |A ∩ B| / |A ∪ B|

For example, if one user loves the movies {M1, M2, M3} and another loves {M2, M3, M5}, the intersection contains two movies and the union four, so their Jaccard similarity is 0.5.

The nice thing with Jaccard is that it takes the size of the sets into account. This ensures that a user who loves a huge number of movies isn’t considered similar to everyone else, as would be the case if we simply counted the number of loved movies two users have got in common.

Then again, as I said before, without an evaluation we cannot be really sure that Jaccard is a reasonably good choice.

The following implementation of user-based collaborative filtering is, as they say, naive in that it actually runs through the list of all known users to determine the most similar neighbors. You don't want to do that in a production system unless the number of your users is, and will remain, small (the running time is in O(n lg n), where n is the number of users, assuming each user rates at most a constant number of items).

""" Collection of recommendation algorithms. """

# Author: Eric Schwarzkopf <>
# License: BSD Style.

import collections

# factory

# maps algorithm id to implementing classes. Used to
# create algorithm instances dynamically
_rec_registry = {
    'ub': 'UserBasedKNN'
}

def create_recommender(id):
    """ Factory that constructs recommender algorithms.

    id: string
        The id of the algorithm to instantiate.

    r: instance
        Instance of the requested algorithm, or None.
    """
    if (id in _rec_registry):
        return globals()[_rec_registry[id]]()
    else:
        return None

# similarity measures

def jaccard(x, y):
    """ Computes the Jaccard similarity between two sets.

    x,y : set
        Sets to be compared.

    s : float
        Jaccard coefficient.
    """
    assert isinstance(x, set)
    assert isinstance(y, set)
    return len(x & y) / float(len(x | y))

# algorithms

class UserBasedKNN(object):
    """ A user-based knn recommender. """

    def fit(self, user_votes, similarity, num_neighbors = 10):
        """ Fits the recommender to the supplied data.

        user_votes : array-like, shape = [n_users]
            List of liked items for each user.

        similarity : callable
            The similarity measure to use to determine
            nearest neighbors among users.

        num_neighbors : int, optional (default = 10)
            maximum size of a neighborhood
        """
        self.user_votes = [set(v) for v in user_votes]
        self.similarity = similarity
        self.num_neighbors = num_neighbors

    def _findknn(self, cv):
        """ Finds the k nearest neighbors given a list of voted-for items."""
        nn = [(v,self.similarity(cv, v)) for v in self.user_votes]
        nn.sort(cmp = lambda x,y: cmp(x[1], y[1]), reverse = True)
        return nn[0:self.num_neighbors]

    def _topvotes(self, cv, knn, num):
        """ Determines the top-voted items in a neighborhood. """
        votes = [v for vs,w in knn if w > 0 for v in vs]
        counted_votes = collections.Counter([v for v in votes if v not in cv])
        return counted_votes.most_common(num)

    def recommend(self, votes, num):
        """ Determines item recommendations for a user with the supplied
        voting history.

        votes : array-like
            Items voted-for by the current user.

        num : int
            Maximum number of recommendations.

        C : array-like, shape = [n_recommendations]
            Returns the sorted list of recommendations, where a
            recommendation is represented by a tuple (item_id, num_votes).
            Recommendations are sorted in non-increasing number of votes.
        """
        cv = set(votes)
        knn = self._findknn(cv)
        return self._topvotes(cv, knn, num)
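
Before wiring the recommender into the UI, here is a tiny, self-contained sketch of how it behaves on toy data (the vote sets are made up, not taken from movie-lens):

import recalg

# each entry is the set of item ids one user voted for (toy data)
toy_votes = [{1, 2, 3}, {2, 3, 5}, {2, 3, 5}, {4, 6}]

rec = recalg.create_recommender('ub'), recalg.jaccard, num_neighbors=3)

# recommend up to 3 items for someone who likes items 2 and 3
print rec.recommend([2, 3], 3)   # -> [(5, 2), (1, 1)]: item 5 got two neighbor votes, item 1 got one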

Bringing it all together is the driver script, which stitches the UI and the algorithm together and initializes them with data. It expects the name of the algorithm to use (ub for user-based, for now), the location of the movie-lens movies.dat file, and the location of the transformed votes file as arguments. For example:

python ub ml-1m/movies.dat votes.csv

Here it is:

""" Movie Buddy's driver. Stitches everything together. """

# Author: Eric Schwarzkopf <>
# License: BSD Style

import recui
import recalg

import argparse
import sys

def read_movies_from(file):
   """ Reads a movies.dat file and extracts movie names and ids. """
   title_id = {}
   id_title = {}
   with open(file, 'r') as csvfile:
       for line in csvfile:
           tokens = line.split("::")
           title_id[str(tokens[1])] = int(tokens[0])
           id_title[int(tokens[0])] = str(tokens[1])
   return (title_id, id_title)

def read_votes_from(file):
   """ Reads a votes file as generated by translate_ratings. """
   votes = []
   with open(file, 'r') as csvfile:
       for line in csvfile:
           tokens = line.split(',')
           votes.append(set([int(v) for v in tokens[1:]]))
   return votes

def create_callback(rec, title_id, id_title):
   """ Creates a callback function to be installed into the UI that computes
   def callback(selected):
       recs = rec.recommend([title_id[m] for m in selected], 5)
       recui.set_recs([id_title[r] for r,s in recs])
   return callback

def parse_args():
   parser = argparse.ArgumentParser(description="a movie recommender")
   help="name of the recommendation algorithm to use")
   help="name of the file containing the list of movies")
   help="name of the file containing the users' votes")
   return parser.parse_args()

def main():
   args = parse_args()
   # instantiate the requested recommender
   rec = recalg.create_recommender(args.rec)
   if not rec:
       print "unknown recommendation algorithm requested"
       sys.exit(1)
   # read votes and train the recommender
   votes = read_votes_from(args.votes), recalg.jaccard, 10)
   # read the movie data
   title_id, id_title = read_movies_from(args.movies)
   sorted_titles = sorted(title_id.keys())
   # init the UI and enter the event loop
   recui.init_ui("Movie Recommender", create_callback(rec, title_id, id_title))
   recui.set_choices(sorted_titles)
   recui.mainloop()

if __name__ == "__main__":
   main()

Once Movie Buddy has started up, you’ll see something like this:

[screenshot: the Movie Buddy window right after startup]

You can select a movie by double-clicking on it. Selecting a movie in the leftmost list adds it to your Loved Movies. Selecting it in the middle list removes it. If you press any character key, the list of all movies jumps to the first title that starts with that character.

[screenshot: Movie Buddy after selecting a few loved movies]

That’s it for the first part of the tutorial. Have fun with your new Movie Buddy.


A/B Testing Reading List

I’m currently setting up an A/B testing environment at work. It’s about time, as we’re getting bogged down in arguments about UX design options based on personal opinions alone. Here’s an annotated list of resources I found helpful in formulating requirements and discovering best practices:

  1. Controlled experiments on the web: survey and practical guide (2009) by Ron Kohavi et al. is a great entry point to understanding on-line experiments. Kohavi was the lead of Microsoft's Experimentation Platform, and the paper contains lots of real-world examples and tidbits of knowledge gained during that time. It doesn't go too much into statistical details, though. You might want to look at the following two resources to brush up on those, in particular since it seems to me that some of the formulas mentioned in the paper (t-test, required sample sizes) contain simplifications that aren't generally valid and of which you should be aware.
  2. Wikipedia's entry on Statistical Hypothesis Testing. Great for refreshing your memory about test statistics. In A/B testing, you'll use the two-sample unpooled t-test.
  3. Gerald van Belle's Statistical Rules of Thumb (2002), Chapter 2, Sample Size, will provide you with a tool to estimate the number of samples (users) you need to observe per treatment group. In particular, look at 2.2 Calculating Sample Size Using the Coefficient of Variation. Note the differences to the simplified formula in Kohavi et al.
  4. Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained (2012) by Kohavi et al. More insights and best practices from Microsoft. I like the idea of using a Null-Test to ensure that your assignment of users to treatment groups is actually random.
  5. Seven Pitfalls to Avoid when Running Controlled Experiments on the Web (2009) by Cook et al. Even more best practices from Microsoft’s Experimentation Platform team.






User-Based and Item-Based Collaborative Filtering

When starting to design a recommendation system, you’re most likely to look at collaborative filtering approaches first, as those have proven themselves in a wide variety of domains.

The original approach to collaborative filtering is user-based [Resnick et al., 1994]: To determine recommendations for the active user, identify similar users and compute a weighted average of their ratings of items not yet seen by the active user. Similarity is computed based on the users' historical rating behavior. In pseudocode:

    u = currentUser()
    S_u = usersSimilarTo(u)
    for i in itemsRatedBy(S_u) and i not in itemsRatedBy(u):
        s = score(i, u, S_u)
        recommendations += (i, s)
    return recommendations
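
To make the abstract score() a bit more concrete, here is a rough sketch of one common choice, a similarity-weighted average of the neighbors' ratings; all names here are illustrative, not part of any particular library:

def score(item, user, neighbors, ratings, similarity):
    """Similarity-weighted average of the neighbors' ratings for `item`.

    ratings[u] is assumed to map item -> rating for user u, and
    similarity(a, b) to return a non-negative weight.
    """
    raters = [n for n in neighbors if item in ratings[n]]
    total_weight = sum(similarity(user, n) for n in raters)
    if total_weight == 0:
        return 0.0
    return sum(similarity(user, n) * ratings[n][item] for n in raters) / float(total_weight)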

That’s a pretty good algorithm if you can find for each user a reasonably large set of reasonably similar users who rated the item.

What "reasonably" means depends on your dataset: You might need an overlap of 10 or more rated items between users to be able to pick peers with similar interests. Then again, a single item may be enough if it is a good indicator of interest. In general, the less your items say about the interests of a user, the more data you need.

For many domains, this means that you need a lot of data to be able to determine good recommendations for a large percentage of your users. In particular, the algorithm will struggle to recommend relevant things to new users. That’s the new user ramp-up problem.

Assuming you've got lots of users rating lots of items in your system, you face another problem: efficiently identifying the neighborhood of a given user. With a few hundred thousand users you can't do this per recommendation request, so you have to fall back to either reducing the number of similarities to compute (e.g., via clustering), which reduces accuracy, or precomputing recommendations off-line, which reduces your responsiveness.

Enter item-based collaborative filtering. Originally developed by Greg Linden et al. at Amazon, it is specifically designed to mitigate the new-user ramp-up problem and to remain responsive even for huge numbers of users.

Item-based collaborative filtering determines recommendations by first looking up the items the active user rated. Then, for each item similar to the rated items, it computes a score based on these similarities and ratings, with rated items that are more similar to the candidate item having a stronger influence on its final score. In pseudocode:

    u = currentUser()
    I_u = itemsRatedBy(u)
    for i in itemsCloseTo(I_u):
        s = score(i, u, I_u)
        recommendations += (i, s)
    return recommendations

The similarity between two items is determined by comparing their respective user-rating vectors. Thus, if a user gives a similar rating to two items, it is interpreted as indicating some sort of similarity between the items. The more similar the co-ratings, the more similar the items.
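
As an illustration of that idea, here is a minimal sketch of an item-item similarity computed from co-ratings, using cosine similarity (the ratings data structure is an assumption made for the example):

from math import sqrt

def item_similarity(item_a, item_b, ratings):
    """Cosine similarity of two items' rating vectors over co-rating users.

    ratings[u] is assumed to map item -> rating for user u; only users who
    rated both items (the co-ratings) contribute to the similarity.
    """
    co_raters = [u for u in ratings if item_a in ratings[u] and item_b in ratings[u]]
    if not co_raters:
        return 0.0
    dot = sum(ratings[u][item_a] * ratings[u][item_b] for u in co_raters)
    norm_a = sqrt(sum(ratings[u][item_a] ** 2 for u in co_raters))
    norm_b = sqrt(sum(ratings[u][item_b] ** 2 for u in co_raters))
    return dot / (norm_a * norm_b)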

Why does the item-based approach tend to suffer less from the new user problem? Assume a new user rates his first item. A user-based approach would have to determine the similarity between this and other users based on a single rating, which won’t be very accurate in general. The item-based approach would look up items similar to the item rated by the new user. The similarity between items is determined by the set of co-ratings of these items by all users, and thus tends to contain more data points leading to a better estimate of their true similarity. Linden et al. report relevant recommendations for users having rated as few as 2 items.

Why can an item-based collaborative filter be implemented in a way that is more responsive to changing user preferences? At first, it seems that we have to do even more work: For each item rated by a user, compute the similarity to all other items. That's true, but we don't need to compute these similarities on-line, as they are the same for each user. Furthermore, if a new co-rating is observed, it won't affect the similarities between items all that much (assuming we've already collected a bunch of co-ratings). Thus it is usually sufficient to update the item similarities every couple of hours without risking poor recommendations. With the similarity graph between items precomputed, what is left to do at run-time is looking up the required similarities and combining them to form the predicted rating. And this doesn't take longer than a couple of milliseconds on current hardware.

A disadvantage of item-based collaborative filtering is that it is more prone to suffer from the new-item ramp-up problem: An item won't be recommended until there is a subset of users who co-rated the item with other items – the new item must first connect to other items already part of the item-similarity graph. One or two co-ratings typically won't do. With user-based collaborative filtering, as soon as a single user rates the new item, it can potentially be recommended to other, similar users.

Nevertheless, due to its ability to scale to large numbers of users and its predictive accuracy, item-based collaborative filtering is used by the likes of Amazon, Google News, YouTube, and Netflix. And it’s a good place to start if you want to provide recommendations as part of your site or app.

We’ll look at actual implementations of both algorithms in my next post. If you can’t wait and want to experiment with collaborative filtering in the meantime, take a look at Apache’s Mahout project. It provides, among a bunch of other things, implementations of user- and item-based collaborative filtering as well as a collection of pluggable similarity measures.



This blog has been dead for a couple of years now, but as I find myself looking for a place to publish some thoughts every now and then that go beyond a tweet, I decided to reactivate it.

This is a complete reboot that doesn’t contain my former posts. My old blog is still available at, should you for whatever reason feel the urge to take a look at it.

New content is coming soon.

Thanks for reading!