Predicting Sports Winners with Decision Trees and pandas


In this article by Robert Craig Layton, author of Learning Data Mining with Python, we will look at predicting the winner of games of the National Basketball Association (NBA) using a different type of classification algorithm—decision trees.

Collecting the data

The data we will be using is the match history data for the NBA, for the 2013-2014 season. The website contains a significant number of resources and statistics collected from the NBA and other leagues. Perform the following steps to download the dataset:

  1. Navigate to in your web browser.
  2. Click on the Export button next to the Regular Season heading.
  3. Download the file to your data folder (and make a note of the path).

This will download a CSV file containing the results of 1,230 games in the regular season of the NBA.

We will load the file with the pandas library, which is an incredibly useful library for manipulating data. Python also contains a built-in library called csv that supports reading and writing CSV files. We will use pandas instead as it provides more powerful functions to work with datasets.

For this article, you will need to install pandas. The easiest way to do that is to use pip3, which you may previously have used to install scikit-learn:

$ pip3 install pandas

Using pandas to load the dataset

We can load the dataset using the read_csv function in pandas as follows:

import pandas as pd
dataset = pd.read_csv(data_filename)

The result of this is a data frame, the data structure used by pandas. The pandas.read_csv function has parameters to fix some of the problems in the data, such as parsing the dates, and we can set the missing column headings after loading the file:

dataset = pd.read_csv(data_filename, parse_dates=["Date"])
dataset.columns = ["Date", "Score Type", "Visitor Team", "VisitorPts",
                   "Home Team", "HomePts", "OT?", "Notes"]

We can now view a sample of the data frame:
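The sample appeared as a screenshot in the original article. As a minimal, self-contained sketch of the same inspection, here is a toy frame using the article's column names (the real frame is loaded from the downloaded CSV):

```python
import pandas as pd

# Toy stand-in for the match-history frame; the real one comes
# from pd.read_csv on the downloaded file.
dataset = pd.DataFrame({
    "Visitor Team": ["Orlando Magic", "Chicago Bulls"],
    "VisitorPts": [87, 95],
    "Home Team": ["Indiana Pacers", "Miami Heat"],
    "HomePts": [97, 107],
})

sample = dataset.head()  # first five rows (here, both of them)
print(sample)
```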


Extracting new features

We extract our classes: 1 for a home win and 0 for a visitor win.

We can compute this with the following code, which extracts those wins into a NumPy array:

dataset["HomeWin"] = dataset["VisitorPts"] < dataset["HomePts"]
y_true = dataset["HomeWin"].values

The first two features we want to create indicate whether each of the two teams won its previous game. This roughly approximates which teams are currently playing well.

We will compute this feature by iterating through the rows in order and recording which team won. When we reach a new row, we look up whether each team won the last time we saw it:

from collections import defaultdict
won_last = defaultdict(int)

We can then iterate over all the rows and update the current row with the team’s last result (win or loss):

dataset["HomeLastWin"] = 0
dataset["VisitorLastWin"] = 0
for index, row in dataset.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    row["HomeLastWin"] = won_last[home_team]
    row["VisitorLastWin"] = won_last[visitor_team]
    dataset.loc[index] = row

We then set our dictionary with each team’s result (from this row) for the next time we see these teams:

    won_last[home_team] = row["HomeWin"]
    won_last[visitor_team] = not row["HomeWin"]
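Putting the two loop snippets above together, a self-contained sketch of the last-win features might look like this. It builds plain lists and assigns the columns once at the end, which avoids writing rows back into the frame one at a time (the team names and results here are made up):

```python
from collections import defaultdict

import pandas as pd

# Toy stand-in for the match-history frame, in date order.
dataset = pd.DataFrame({
    "Home Team": ["Miami Heat", "Miami Heat", "Chicago Bulls"],
    "Visitor Team": ["Chicago Bulls", "Chicago Bulls", "Miami Heat"],
    "HomeWin": [True, False, True],
})

won_last = defaultdict(int)  # unseen teams default to 0 (a loss)
home_last, visitor_last = [], []
for _, row in dataset.iterrows():
    home_team, visitor_team = row["Home Team"], row["Visitor Team"]
    home_last.append(won_last[home_team])
    visitor_last.append(won_last[visitor_team])
    # Record this game's result for the next time each team appears.
    won_last[home_team] = int(row["HomeWin"])
    won_last[visitor_team] = int(not row["HomeWin"])

dataset["HomeLastWin"] = home_last
dataset["VisitorLastWin"] = visitor_last
```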

Decision trees

Decision trees are a class of classification algorithm that work like a flow chart: they consist of a sequence of nodes, where the values of a sample are used to decide which node to move to next.

We can use the DecisionTreeClassifier class to create a decision tree:

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=14)
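As a quick illustration of how the classifier is used, here is a toy example on made-up data (two binary features per sample, standing in for the last-win features we build from the real dataset):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy training data: in this made-up set, feature 0 perfectly
# predicts the class, so the tree learns a single split.
X = [[1, 0], [0, 1], [1, 1], [0, 0]]
y = [1, 0, 1, 0]

clf = DecisionTreeClassifier(random_state=14)
clf.fit(X, y)                 # learn the tree from the samples
print(clf.predict([[1, 0]]))  # predicts class 1
```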

We now need to extract the dataset from our pandas data frame in order to use it with our scikit-learn classifier. We do this by specifying the columns we wish to use and accessing the values attribute of that view of the data frame:

X_previouswins = dataset[["HomeLastWin", "VisitorLastWin"]].values

Decision trees are estimators and therefore have fit and predict methods. We can also use the cross_val_score function as before to get the average score:

import numpy as np
from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X_previouswins, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

This scores 56.1%: we are better than choosing randomly!
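For context on that 56.1%, a useful baseline is simply predicting a home win every time; its accuracy is just the mean of the class labels. A sketch with made-up labels:

```python
import numpy as np

# Toy class labels: 1 for a home win, 0 for a visitor win.
y_true = np.array([1, 0, 1, 1, 0, 1])

# Accuracy of always predicting a home win -- the baseline a model
# should beat before its extra features are worth anything.
home_win_rate = y_true.mean()
print("Home wins: {0:.1f}%".format(home_win_rate * 100))
```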

Predicting sports outcomes

We now have a way to test how accurate our models are, using the cross_val_score function, which allows us to try new features.

Next, we will create a feature that tells us whether the home team is generally better than the visitors, by checking whether it ranked higher in the previous season.

To obtain the data, perform the following steps:

  1. Head to
  2. Scroll down to Expanded Standings. This gives us a single list for the entire league.
  3. Click on the Export link to the right of this heading.
  4. Save the download in your data folder.

In your IPython Notebook, enter the following into a new cell. You’ll need to ensure that the file was saved into the location pointed to by the data_folder variable:

import os

standings_filename = os.path.join(data_folder,
                                  "standings.csv")  # substitute the filename you saved in step 4
standings = pd.read_csv(standings_filename, skiprows=[0, 1])

We then iterate over the rows and compare the team’s standings:

dataset["HomeTeamRanksHigher"] = 0
for index, row in dataset.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]

Between 2013 and 2014, a team was renamed as follows:

    if home_team == "New Orleans Pelicans":
        home_team = "New Orleans Hornets"
    elif visitor_team == "New Orleans Pelicans":
        visitor_team = "New Orleans Hornets"

Now, we can get the rankings for each team. We then compare them and update the feature in the row:

    home_rank = standings[standings["Team"] == home_team]["Rk"].values[0]
    visitor_rank = standings[standings["Team"] == visitor_team]["Rk"].values[0]
    row["HomeTeamRanksHigher"] = int(home_rank > visitor_rank)
    dataset.loc[index] = row
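As an alternative to the row-by-row loop, the same feature can be computed in one vectorized step by mapping each team name to its rank. This sketch uses toy standings, with the "Team" and "Rk" column names taken from the article, and keeps the same comparison as the loop above:

```python
import pandas as pd

# Toy frames standing in for the real dataset and standings table.
dataset = pd.DataFrame({
    "Home Team": ["Miami Heat", "Orlando Magic"],
    "Visitor Team": ["Chicago Bulls", "Indiana Pacers"],
})
standings = pd.DataFrame({
    "Team": ["Miami Heat", "Chicago Bulls", "Indiana Pacers", "Orlando Magic"],
    "Rk": [1, 2, 3, 4],
})

rank = standings.set_index("Team")["Rk"]  # team name -> previous-season rank
home_rank = dataset["Home Team"].map(rank)
visitor_rank = dataset["Visitor Team"].map(rank)
# Same comparison as the loop version in the article.
dataset["HomeTeamRanksHigher"] = (home_rank > visitor_rank).astype(int)
```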

Next, we use the cross_val_score function to test the result. First, we extract the dataset as before:

X_homehigher = dataset[["HomeLastWin", "VisitorLastWin", "HomeTeamRanksHigher"]].values

Then, we create a new DecisionTreeClassifier class and run the evaluation:

clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_homehigher, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

This now scores 60.3%, which is even better than our previous result.

Unleash the full power of Python machine learning with our 'Learning Data Mining with Python' book.

