This post is about applying scikit-learn’s DecisionTreeClassifier to the Titanic passenger dataset. My original motivation was to find a simple ruleset for the first challenge of Udacity’s Intro to Data Science course: Given a list of properties of a passenger, predict whether they survived the Titanic catastrophe or not, reaching a prediction accuracy of 80% or higher.
The dataset we’ll work with is the training set provided by Kaggle.
Preparing the data
With everything set up, let’s start by building our first decision tree. On the command line, start ipython. Then, use pandas to load the Titanic data into a DataFrame.
import pandas as pd
data = pd.read_csv("titanic_data.csv")
data.head()
Next is cleaning up, meaning removing columns that we don’t want the classifier to consider and getting rid of NaNs (which pandas uses to indicate missing values):
> len(data)
891
> data = data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
> data = data.dropna()
> len(data)
712
> data.columns
Index([u'Survived', u'Pclass', u'Sex', u'Age', u'SibSp', u'Parch', u'Fare', u'Embarked'], dtype=object)
With data.dropna() we did a brute force cleaning; all rows containing at least one NaN entry were removed from the DataFrame. This results in us throwing away about 20% of our data. We don’t care much about that here, but in real life, you should handle missing data a bit smarter (e.g., you could try to predict missing values).
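As a minimal sketch of such a smarter approach, here is median imputation on a hypothetical mini-sample (the column names follow the Kaggle CSV, but the values are made up for illustration): instead of dropping rows with missing ages, we fill them with the median age, so no labels are lost.

```python
import pandas as pd

# Hypothetical mini-sample standing in for the Titanic DataFrame;
# two of the five ages are missing.
df = pd.DataFrame({
    "Age": [22.0, None, 38.0, None, 26.0],
    "Survived": [0, 1, 1, 0, 1],
})

# Fill missing ages with the median age instead of calling dropna(),
# so all rows (and their labels) are kept.
df["Age"] = df["Age"].fillna(df["Age"].median())

print(len(df))                  # all 5 rows kept
print(df["Age"].isna().sum())   # no missing values remain
```

Median imputation is just one option; fancier approaches predict the missing value from the other columns.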
The final thing we need to do before we can learn our first decision tree is to transform the categorical values of ‘Sex’ and ‘Embarked’ into corresponding integer values.
> data['Sex'] = pd.Categorical(data['Sex']).codes
> data['Embarked'] = pd.Categorical(data['Embarked']).codes
> data.head()
Fitting a DecisionTreeClassifier
To learn the decision tree, we split our frame into a target-value vector y (‘Survived’) and a feature matrix X (all other columns) and fit an instance of DecisionTreeClassifier:
> from sklearn import tree
> clf = tree.DecisionTreeClassifier()
> y = data['Survived']
> X = data.drop('Survived', axis=1)
> clf = clf.fit(X, y)
clf is now a decision tree fitted to our training data. scikit-learn can export the learned tree into the dot format used by graphviz for visual inspection:
> with open("titanic_1.dot", "w") as f:
...     tree.export_graphviz(clf, out_file=f, feature_names=X.columns)
On the command line, convert the dot file into a pdf and open it in your favorite pdf viewer:
dot -Tpdf titanic_1.dot -o titanic_1.pdf
open titanic_1.pdf
You’ll notice that we ended up with quite a big tree.
Here’s how you read the visualization: each internal node shows its split condition (a feature compared against a threshold), its Gini impurity, the number of training samples reaching it, and the class counts (value). Samples satisfying the condition go down the left branch, the rest down the right; each leaf predicts its majority class.
Evaluating the model
Let’s see how the tree performs on predicting the survival values on the training data:
> clf.score(X, y)
0.9860
So we got 99% accuracy. That’s nice, but it’s rather unlikely that we’re going to see this level of accuracy on unseen data. Looking at the depth of the tree and the fine-grained decisions on the lower levels, it’s almost certain that our tree is overfitted. Let’s run some tests to determine whether we’re right.
First we do a cross-validation to get an idea of what accuracy to expect on unseen data given our default DecisionTreeClassifier as the predictor of choice:
> from sklearn.model_selection import cross_val_score
> cross_val_score(tree.DecisionTreeClassifier(), X, y, cv=10).mean()
0.7738
That’s quite a bit lower and even worse than simple classification heuristics such as “assume all females survived and all males died”. The modeling approach we’ve chosen is less than optimal.
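That gender heuristic takes one line to implement. A sketch on a hypothetical mini-sample (the column names match the data, the values are made up for illustration):

```python
import pandas as pd

# Hypothetical mini-sample standing in for the cleaned DataFrame;
# the heuristic itself is what matters: predict survival iff Sex == "female".
data = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "male"],
    "Survived": [1, 0, 0, 0, 1],
})

baseline = (data["Sex"] == "female").astype(int)
accuracy = (baseline == data["Survived"]).mean()
print(accuracy)
```

Any model worth keeping should beat this one-rule baseline.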
Is the reason overfitting? Or is it a bias inherent in the model? Let’s look at the learning curve to see what’s happening:
We see that the training error stays low when we add more training examples, while the test error remains quite a bit higher – the typical pattern of an overfitted model.
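A learning curve like the one above can be produced with scikit-learn’s learning_curve helper. A self-contained sketch (the synthetic X and y here are stand-ins for the Titanic features and labels from before):

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features X and labels y.
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] + 0.1 * rng.randn(200) > 0.5).astype(int)

sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)

# An overfitted model keeps a high training accuracy while the
# cross-validation accuracy stays noticeably lower.
print(train_scores.mean(axis=1))
print(test_scores.mean(axis=1))
```

Plotting the two mean curves against sizes (e.g. with matplotlib) reproduces the train/test gap described above.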
Pruning the tree – the manual way
Approaches to mitigate overfitting include removing features from the data and limiting model complexity. Here, we’ll go with the latter as we’ve seen that the current model is quite complex. So there’s lots of room for improvement.
scikit-learn provides several options for limiting the maximum complexity of a fitted DecisionTreeClassifier:
- Restrict the maximum depth of the tree (max_depth).
- Set the minimum number of samples required to be a leaf node (min_samples_leaf).
- Set the minimum number of samples required to split an internal node (min_samples_split).
We’ll explore limiting the maximum depth with the goal of finding the shallowest model that’s still good enough (expected accuracy >= 80%). Here’s a plot of the classification error with increasing tree depth:
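The depth sweep behind such a plot can be sketched as follows (again with a synthetic stand-in for X and y so the snippet is self-contained; the error plotted is 1 minus the mean cross-validated accuracy):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic X and y from above.
rng = np.random.RandomState(0)
X = rng.rand(300, 4)
y = (X[:, 0] > 0.5).astype(int)

# Mean cross-validated accuracy for each maximum depth.
scores = {
    depth: cross_val_score(
        DecisionTreeClassifier(max_depth=depth, random_state=0),
        X, y, cv=10,
    ).mean()
    for depth in range(1, 11)
}
for depth, score in scores.items():
    print(depth, round(1 - score, 4))
```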
Looks like a depth between 3 and 5 should turn out to be a reasonable compromise between complexity and accuracy. Let’s do the cross validations:
> cross_val_score(tree.DecisionTreeClassifier(max_depth=2), X, y, cv=10).mean()
0.7880
> cross_val_score(tree.DecisionTreeClassifier(max_depth=3), X, y, cv=10).mean()
0.8048
> cross_val_score(tree.DecisionTreeClassifier(max_depth=4), X, y, cv=10).mean()
0.7978
> cross_val_score(tree.DecisionTreeClassifier(max_depth=5), X, y, cv=10).mean()
0.8146
The tree of depth 3 looks like this:
That’s simple enough for our purposes. Rewritten as an if-then-else statement, the tree becomes:
if passenger['Sex'] == "female":
    p = passenger['Pclass'] <= 2 or passenger['Fare'] <= 20.8
else:
    p = passenger['Age'] <= 6.5 and passenger['SibSp'] <= 2.5
predictions[passenger['PassengerId']] = p
This simple ruleset reaches an accuracy of 82.6% in the Udacity challenge — far above the required minimum of 80%.
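For completeness, the ruleset can be wrapped in a function and applied row-wise to a DataFrame. A sketch over a hypothetical mini-sample (column names match the Kaggle data; the rows are invented for illustration):

```python
import pandas as pd

def predict_survival(passenger):
    # The depth-3 ruleset read off the pruned tree above.
    if passenger["Sex"] == "female":
        return int(passenger["Pclass"] <= 2 or passenger["Fare"] <= 20.8)
    return int(passenger["Age"] <= 6.5 and passenger["SibSp"] <= 2.5)

# Hypothetical mini-sample; a real evaluation would run over the full data.
sample = pd.DataFrame({
    "Sex":    ["female", "male", "male"],
    "Pclass": [1, 3, 3],
    "Fare":   [71.28, 7.25, 8.05],
    "Age":    [38.0, 22.0, 4.0],
    "SibSp":  [1, 1, 1],
})

predictions = sample.apply(predict_survival, axis=1)
print(predictions.tolist())  # [1, 0, 1]
```

Comparing such predictions against the Survived column gives the accuracy figure reported above.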