Predicting Second Down

The purpose of this notebook is to apply some machine learning algorithms to try to predict whether a team will run or pass on second down, using the 2015 NFL Play-By-Play dataset.

In [1]:
# First, let's import the necessary packages
import pandas as pd # Package used for working with dataframes
import numpy as np # Package used for working with arrays
import matplotlib.pyplot as plt #Package used for visualizing our data
%matplotlib inline 
#This command causes plots to show up in the notebook

#sklearn is the machine learning package
#First we load the tools for preprocessing, feature selection, and splitting our data
from sklearn import preprocessing
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn import cross_validation

#Next we load the various algorithms we will be using:
#logistic regression, k-nearest neighbors, support vector machines, decision trees, and random forests
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

#Finally, GridSearchCV will allow us to fine-tune our algorithms' parameters
from sklearn.grid_search import GridSearchCV
In [2]:
#Now we will read in the data and pick out the second-down plays which resulted in a run or a pass

file_loc = "/home/matt/Downloads/nfl.csv"
data = pd.read_csv(file_loc, low_memory = False) #low_memory=False avoids a mixed-dtype warning

#.copy() gives an independent DataFrame, so later column assignments don't raise SettingWithCopyWarning
second = data[(data['down'] == 2) & ((data['PlayType'] == 'Pass') | (data['PlayType'] == 'Run'))].copy()

Before running any algorithms, let's do some naive predictions. A very basic prediction is to always choose the most frequent play type.

In [3]:
breakdown = second['PlayType'].value_counts()
breakdown['Pass']/ (breakdown['Pass'] + breakdown['Run'])
Out[3]:
0.58024926267719534

This tells us that we can guess correctly 58% of the time just by always choosing pass. On the other hand, if it is second and short, a run is more likely than a pass, as the following plots illustrate.

In [4]:
passing = second[second['PlayType'] == 'Pass']
running = second[second['PlayType'] == 'Run']

plt.figure(figsize = (10, 5))
for i in range(1, 11):
    plt.subplot(2, 5, i)
    #Count passes and runs at this distance and draw them as a pie chart
    counts = second[second['ydstogo'] == i]['PlayType'].value_counts()
    plt.pie([counts['Pass'], counts['Run']], colors = ['Blue', 'Red'])
    plt.title('Second and %i' %i)
plt.tight_layout()
plt.legend(['Pass','Run'], bbox_to_anchor = (2, 1.5))
Out[4]:
<matplotlib.legend.Legend at 0x7f465a48f2e8>

As we can see, it is reasonable to predict run if it is second down with five or fewer yards to go, and pass otherwise. Let's see how good this simple strategy is.

In [5]:
second['Prediction'] = second['ydstogo'] <= 5
second['Ran'] = second['PlayType'] == 'Run'

check = second['Prediction'] == second['Ran']
check.mean()
Out[5]:
0.62553515364855861

So we have a 62.6% chance of predicting correctly if we choose run when it is second down with five or fewer yards to go, and choose pass otherwise. Let's see if we can beat this by applying some common classification algorithms. First let's single out some columns which we believe will make the best predictors, and separate our data into training and testing subsets.

In [6]:
#TimeInHalf will contain seconds left in the half
second['TimeInHalf'] = second['TimeSecs']%(60*30)

pre_X = second[['ydstogo', 'yrdline100', 'TimeInHalf', 'ScoreDiff']]
y = second['PlayType']

#Separate 20% of the data for testing
pre_X_train, pre_X_test, y_train, y_test = cross_validation.train_test_split(pre_X, y, test_size = .2, random_state = 0)

#Standardize each feature, then L2-normalize each row (see the note after this cell for an alternative)
X_train = preprocessing.scale(pre_X_train)
X_test =  preprocessing.scale(pre_X_test)
X_train = preprocessing.normalize(X_train, norm = 'l2')
X_test = preprocessing.normalize(X_test, norm = 'l2')
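
As an aside, the scaling above is fit separately on the training and testing sets. A common alternative (not used here) is to fit a StandardScaler on the training data only and apply the same transformation to the test data, so the test set's statistics never influence the preprocessing. A minimal sketch (the _alt names are just illustrative):

In [ ]:
#Sketch of an alternative (not used below): fit the scaler on the training data only,
#then apply the same transformation to the test data
scaler = preprocessing.StandardScaler().fit(pre_X_train)
X_train_alt = scaler.transform(pre_X_train)
X_test_alt = scaler.transform(pre_X_test)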

It will be easier to visualize things with two predictors, so we can plot them on a plane. Fortunately, sklearn's SelectKBest can pick out the two best predictors for us. It turns out that these are the number of yards to go and the score differential.

In [7]:
#Choose the 2 best features
selector = SelectKBest(f_classif, k = 2)
selector.fit(X_train, y_train)
scores = -np.log10(selector.pvalues_)

plt.bar([1, 2, 3, 4],scores, align = 'center')
plt.title("Feature Selection")
plt.xlabel("Features")
plt.ylabel("Score")
plt.xticks([1,2,3,4],['Yards to Go','Yard Line', 'Time in Half', 'Score Differential'])
Out[7]:
([<matplotlib.axis.XTick at 0x7f4659885208>,
  <matplotlib.axis.XTick at 0x7f465989eeb8>,
  <matplotlib.axis.XTick at 0x7f46598de8d0>,
  <matplotlib.axis.XTick at 0x7f46597ec748>],
 <a list of 4 Text xticklabel objects>)
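
Rather than reading the winning features off the bar chart, the fitted selector can also report the chosen column indices directly. The following quick check should agree with the hardcoded columns used in the next cell.

In [ ]:
#The fitted selector reports which columns it kept (SelectKBest.get_support);
#this should agree with the hardcoded [0, 3] in the next cell
selector.get_support(indices = True)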
In [8]:
#Pick the first and fourth columns, which contain the best predictors
X_train = X_train[:, [0, 3]]
X_test = X_test[:, [0, 3]]
In [9]:
plt.figure(figsize = (10, 5))
plt.subplot(121)

color = []
for value in y_test:
    if value == 'Run':
        color.append('Red')
    else:
        color.append('Blue')    
plt.scatter(pre_X_test['ydstogo'], pre_X_test['ScoreDiff'], color = color)
plt.xlabel('Yards To Go')
plt.ylabel('Score Differential')
plt.title("Actual Decisions from Testing Data")

color = []
for value in pre_X_test['ydstogo']:
    if value <= 5:
        color.append('Red')
    else:
        color.append('Blue')
plt.subplot(122)
plt.scatter(pre_X_test['ydstogo'], pre_X_test['ScoreDiff'], color = color)
plt.xlabel('Yards To Go')
plt.ylabel('Score Differential')
plt.title("Naive Prediction on Testing Data")
plt.tight_layout()
In [10]:
plt.figure(figsize = (10, 5))

color = []
for value in y_test:
    if value == 'Run':
        color.append('Red')
    else:
        color.append('Blue')    

plt.subplot(121)
plt.scatter(X_test[:,0], X_test[:, 1], color = color)
plt.xlabel('Yards To Go')
plt.ylabel('Score Differential')
plt.title("Actual Decisions, Normalized")

color = []
for value in pre_X_test['ydstogo']:
    if value <= 5:
        color.append('Red')
    else:
        color.append('Blue')
plt.subplot(122)
plt.scatter(X_test[:, 0], X_test[:, 1], color = color)
plt.xlabel('Yards To Go')
plt.ylabel('Score Differential')
plt.title("Naive Prediction, Normalized")
plt.tight_layout()

Now we will apply a few classification algorithms: logistic regression, k-nearest neighbors, support vector machine, decision tree, and random forest.

In [60]:
def graphPrediction(prediction, pred_name):
    """Plot the actual play calls next to a model's predictions and print the model's accuracy."""

    plt.figure(figsize = (10, 5))
    color = []
    for value in y_test:
        if value == 'Run':
            color.append('Red')
        else:
            color.append('Blue')    
    plt.subplot(121)
    plt.scatter(pre_X_test['ydstogo'], pre_X_test['ScoreDiff'], color = color)
    plt.xlabel('Yards To Go')
    plt.ylabel('Score Differential')
    plt.title("Actual Decisions from Testing Data")

    color = []
    for value in prediction:
        if value == 'Run':
            color.append('Red')
        else:
            color.append('Blue')
    plt.subplot(122)
    plt.scatter(pre_X_test['ydstogo'], pre_X_test['ScoreDiff'], color = color)
    plt.xlabel('Yards To Go')
    plt.ylabel('Score Differential')
    plt.title(pred_name + " Prediction")
    plt.tight_layout()
    
    #Print the accuracy of the prediction model
    check = prediction == y_test
    accuracy = check.mean()
    print("%s Accuracy: %f" %(pred_name, accuracy))
    
In [11]:
#Run logistic regression on our training data

logistic = LogisticRegression()
logistic.fit(X_train, y_train)
Out[11]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
In [61]:
graphPrediction(logistic.predict(X_test), "Logistic Regression")
Logistic Regression Accuracy: 0.636234
In [14]:
# Use the k-nearest neighbors algorithm. I chose 61 through trial and error, but sklearn's
# GridSearchCV can optimize parameters automatically (see the sketch at the end of the notebook).

neighbors = KNeighborsClassifier(n_neighbors = 61)
neighbors.fit(X_train,y_train)
Out[14]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=61, p=2,
           weights='uniform')
In [62]:
graphPrediction(neighbors.predict(X_test), "61-Nearest Neighbors")
61-Nearest Neighbors Accuracy: 0.647646
In [17]:
# Run support vector machine with a linear kernel. Again, it may be
# possible to do better by tweaking some parameters.
support = svm.SVC(kernel = "linear")
support.fit(X_train, y_train)
Out[17]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
In [63]:
graphPrediction(support.predict(X_test), "Support Vector Machine")
Support Vector Machine Accuracy: 0.639563
In [20]:
dectree = tree.DecisionTreeClassifier()
dectree.fit(X_train, y_train)
Out[20]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
In [64]:
graphPrediction(dectree.predict(X_test), 'Decision Tree')
Decision Tree Accuracy: 0.561103
In [23]:
Rfor = RandomForestClassifier(n_estimators = 50)
Rfor.fit(X_train, y_train)
Out[23]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
In [65]:
graphPrediction(Rfor.predict(X_test), "Random Forest")
Random Forest Accuracy: 0.588207

In conclusion, we were able to do somewhat better than the 62.6% naive prediction using logistic regression (63.6%), 61-nearest neighbors (64.8%), and the support vector machine (64.0%), while the decision tree (56.1%) and random forest (58.8%) actually did worse. It should be possible to do even better by using grid-searching techniques to optimize our parameters.
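
For example, the GridSearchCV class imported at the top could tune the number of neighbors instead of picking 61 by hand. A rough sketch, with an illustrative (not carefully chosen) parameter grid:

In [ ]:
#Sketch: search a small grid of neighbor counts with 5-fold cross-validation.
#The grid below is an illustrative guess, not a tuned range.
param_grid = {'n_neighbors': list(range(1, 102, 10))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv = 5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)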

In [ ]: