Multi-classification predictions and explanations for MLB play-by-play data
The purpose of this project is to build an MLB pitch type classification system using play-by-play data from the 2011 season.
MLB play-by-play data includes features such as:
The model described in this project predicts 8 different pitch type classifications:
-----------------------------
Pitch Type      % occurrence
Fastball               0.464
Slider                 0.157
Sinker                 0.117
Changeup               0.097
Curveball              0.090
Cutter                 0.061
Purpose_Pitch          0.007
Off-Speed              0.007
-----------------------------
The training dataset includes data from March to July 2011. The models are evaluated against unseen data from August 2011.
Overall, the model achieved 58.9% accuracy on the unseen data, a 12.5% improvement over just guessing fastball, the most common class!
The original prepared dataset had 29 features available prior to the pitch. [see features]
With the feature engineering methods described here, 29 additional features were created, and only 8 of the original 29 features were used for the classification model. [see engineered features]
The engineered features include:
The target-encoded features capture the mean probability of each pitch_type classification for each pitcher. These features were extremely predictive on unseen data, accounting for a 9% increase in accuracy over predicting the most probable class (see the classification reports for the uniform prediction model and baseline heuristic model in the section below).
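A minimal sketch of this style of per-pitcher target encoding on toy data. The `pitcher_id` and `pitch_type` column names are assumptions standing in for the project's actual schema, and in practice the encoding must be fit on the training split only to avoid leaking evaluation data:

```python
import pandas as pd

# Toy play-by-play training data (illustrative, not the real dataset).
train = pd.DataFrame({
    'pitcher_id': [1, 1, 1, 1, 2, 2, 2, 2],
    'pitch_type': ['Fastball', 'Fastball', 'Slider', 'Sinker',
                   'Fastball', 'Changeup', 'Changeup', 'Changeup'],
})

# p(pitch_type | pitcher_id): each row of the crosstab is one pitcher's
# observed pitch mix, normalized to sum to 1.
encoded = (pd.crosstab(train['pitcher_id'], train['pitch_type'],
                       normalize='index')
             .add_prefix('p_'))
print(encoded)
```

Joining these per-pitcher probabilities back onto each pitch row gives the model a strong prior on what each pitcher tends to throw.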
A gradient boosted model was trained on the feature engineered training dataset. See this notebook for modeling details.
The gradient boosting model had an overall accuracy of 58.9%, a 3.6% improvement over a non-machine learning baseline heuristic model that guesses based on which pitch the pitcher throws the most, and a 12.5% improvement over a model that guesses the most common class (fastball).
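The baseline heuristic amounts to a per-pitcher majority vote. A minimal sketch on toy data (the `heuristic_predict` helper and the fallback behavior for unseen pitchers are assumptions, not the project's actual implementation):

```python
from collections import Counter, defaultdict

# Toy training history of (pitcher_id, pitch_type) pairs, standing in
# for the March-July 2011 training split.
history = [(1, 'Fastball'), (1, 'Fastball'), (1, 'Sinker'),
           (2, 'Slider'), (2, 'Slider'), (2, 'Fastball')]

# Count how often each pitcher throws each pitch type.
counts = defaultdict(Counter)
for pitcher, pitch in history:
    counts[pitcher][pitch] += 1

def heuristic_predict(pitcher, fallback='Fastball'):
    """Guess the pitch this pitcher throws most often; fall back to the
    overall most common class for pitchers unseen in training."""
    if pitcher in counts:
        return counts[pitcher].most_common(1)[0][0]
    return fallback

print(heuristic_predict(1))   # Fastball
print(heuristic_predict(99))  # Fastball (unseen pitcher)
```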
The model is especially strong at predicting Off-Speed and Purpose_Pitch (>89% precision), performs fairly well for Fastball and Sinker (>57% precision, >66% f1 score), and performs poorly at predicting Changeup, Curveball, Cutter, and Slider pitches (<50% precision).
The end-to-end transformation and prediction pipeline outputs a result from a row of JSON data in roughly 100 ms. See the benchmark reports here for more details.
______________________________________________________________________
Classification Report: Uniform Prediction Model
                precision  recall  f1-score  support  n_predictions
Changeup            0.000   0.000     0.000    11832            nan
Curveball           0.000   0.000     0.000    10983            nan
Cutter              0.000   0.000     0.000     7405            nan
Fastball            0.464   1.000     0.634    56519         121886
Off-Speed           0.000   0.000     0.000      818            nan
Purpose_Pitch       0.000   0.000     0.000      835            nan
Sinker              0.000   0.000     0.000    14305            nan
Slider              0.000   0.000     0.000    19189            nan

avg / total         0.215   0.464     0.294   121886

accuracy: 0.464
______________________________________________________________________
Classification Report: Baseline Heuristic Model
                precision  recall  f1-score  support  n_predictions
Changeup            0.209   0.011     0.021    11832            640
Curveball           0.371   0.016     0.032    10983            488
Cutter              0.543   0.243     0.336     7405           3315
Fastball            0.571   0.919     0.705    56519          90935
Off-Speed           0.828   0.965     0.891      818            953
Purpose_Pitch       0.000   0.000     0.000      835            nan
Sinker              0.496   0.797     0.612    14305          22988
Slider              0.449   0.060     0.106    19189           2567

avg / total         0.486   0.553     0.447   121886

accuracy: 0.553
______________________________________________________________________
Classification Report: Machine Learning Model
                precision  recall  f1-score  support  n_predictions
Changeup            0.393   0.108     0.170    11832           3261
Curveball           0.492   0.142     0.220    10983           3170
Cutter              0.535   0.314     0.396     7405           4341
Fastball            0.614   0.874     0.722    56519          80417
Off-Speed           0.888   0.927     0.907      818            854
Purpose_Pitch       0.914   0.614     0.735      835            561
Sinker              0.568   0.796     0.663    14305          20026
Slider              0.494   0.238     0.322    19189           9256

avg / total         0.557   0.589     0.535   121886

accuracy: 0.589
______________________________________________________________________
The classification model is an instance of ShapleyClassifier, a custom API I wrote that integrates with the shap library to explain which features contribute to the probability output.
The shap library uses an additive feature attribution method based on game theory concepts to estimate how to divide a collective payoff.
More info on shap library:
source
research paper
______________________________________________________________________
A quick primer on Shapley values, as summarized by Michael Sweeney: imagine a card game where \$19 is won when all 3 players contribute, and ≤ \$19 is won when only a subset of the players is involved. Each player's Shapley value is their average marginal contribution to the winnings across all orders in which the players could have joined.
______________________________________________________________________
In the context of classification, the shap library calculates feature attribution in a similar manner: by averaging each feature's contribution to the predicted probability over the permutations describing the order in which features could be added. A missing feature is replaced with a substitute "best guess" value (median for numerical features, mode for categorical features) to calculate the classification probability in the absence of that feature.
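The permutation-weighted average can be made concrete with a brute-force computation of exact Shapley values for a 3-player game, echoing the card-game primer above. The coalition payoffs below are made-up illustrative numbers:

```python
from itertools import permutations

# Characteristic function: payoff won by each coalition of players.
# The full coalition wins $19; smaller coalitions win less.
v = {
    frozenset(): 0,
    frozenset('A'): 5,
    frozenset('B'): 3,
    frozenset('C'): 0,
    frozenset('AB'): 12,
    frozenset('AC'): 8,
    frozenset('BC'): 6,
    frozenset('ABC'): 19,
}

def shapley_values(players, v):
    """Average each player's marginal contribution over all join orders."""
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            phi[p] += v[coalition | {p}] - v[coalition]
            coalition = coalition | {p}
    return {p: total / len(orders) for p, total in phi.items()}

values = shapley_values('ABC', v)
print({p: round(x, 2) for p, x in values.items()})
# → {'A': 8.83, 'B': 6.83, 'C': 3.33}
```

The payouts sum exactly to the full \$19 pot; this additivity is the same property shap relies on when dividing a predicted probability among features.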
%%capture
%cd ..

import dill
import json
import pandas as pd
import warnings

# load models
with open('dill/feature_engineering_transformer.dill', 'rb') as f:
    feat_eng = dill.load(f)
with open('dill/pitch_classifier.dill', 'rb') as f:
    clf = dill.load(f)

# load sample JSON data
with open('sample_data/json_sample_offspeed.json', 'rb') as f:
    json_sample_offspeed = json.load(f)
with open('sample_data/json_sample_purposepitch.json', 'rb') as f:
    json_sample_purposepitch = json.load(f)
with open('sample_data/json_sample_sinker.json', 'rb') as f:
    json_sample_sinker = json.load(f)
with open('sample_data/json_sample_fastball.json', 'rb') as f:
    json_sample_fastball = json.load(f)

# load sample datasets
sample_offspeed = pd.read_csv('sample_data/sample_offspeed.csv')
sample_purposepitch = pd.read_csv('sample_data/sample_purposepitch.csv')
sample_sinker = pd.read_csv('sample_data/sample_sinker.csv')
sample_fastball = pd.read_csv('sample_data/sample_fastball.csv')

# suppress warnings in notebook output
warnings.filterwarnings("ignore")
For this data point, the prediction was Off-Speed with 0.981 probability. The most important features contributing to the probability are:
This intuitively makes sense, as the pitcher has a very high probability of throwing Off-Speed pitches and 8 out of the last 10 pitches were also Off-Speed.
x = feat_eng.transform(json_sample_offspeed)
clf.predict_explain_one(x)
For this data point, the prediction was Purpose Pitch with 0.987 probability. The most important features contributing to the probability are:
This tells us that if the last pitch was a Purpose Pitch and a runner is on base, the model will likely predict that the current pitch is also a Purpose Pitch.
x = feat_eng.transform(json_sample_purposepitch)
clf.predict_explain_one(x)
For this data point, the prediction was Sinker with 0.511 probability. The most important features contributing to the probability are:
This intuitively makes sense: this pitcher has a high likelihood of throwing Sinkers and a low likelihood of throwing Fastballs, which differs significantly from the median pitcher's probability rates.
x = feat_eng.transform(json_sample_sinker)
clf.predict_explain_one(x)
For this data point, the prediction was Fastball with 0.946 probability. The most important features contributing to the probability are:
all of which differ from the baseline (median) values of the training set features.
x = feat_eng.transform(json_sample_fastball)
clf.predict_explain_one(x)
The visualizations below are interactive!
Play around with the x-axis and y-axis dropdowns to explore the Shapley value contributions for the dataset.
Each description below refers to a visualization using the following syntax; please adjust the dropdowns to the specified values to follow the interpretation.
y-axis slider: [ value ]
x-axis slider: [ value ]
----------------------------------------------------------------------
y-axis slider: [ output value ]
x-axis slider: [ sample order by similarity ]
The most impactful feature contributing to a prediction of Off-Speed tends to be p(pitch_type_Off-Speed | pitcher_id).
----------------------------------------------------------------------
y-axis slider: [ p(pitch_type_Off-Speed | pitcher_id) effects ]
x-axis slider: [ p(pitch_type_Off-Speed | pitcher_id) ]
The median value for p(pitch_type_Off-Speed | pitcher_id) is 0, so any positive value of this feature increases the predicted probability of Off-Speed.
----------------------------------------------------------------------
X = feat_eng.transform(sample_offspeed)
clf.predict_explain(X, 'Off-Speed')
----------------------------------------------------------------------
y-axis slider: [ output value ]
x-axis slider: [ sample order by similarity ]
The most impactful features contributing to a prediction of Purpose Pitch tend to be on_base_any and last_pitch. This is consistent with the observation found in cell 4.
----------------------------------------------------------------------
X = feat_eng.transform(sample_purposepitch)
clf.predict_explain(X, 'Purpose_Pitch')
----------------------------------------------------------------------
y-axis slider: [ output value ]
x-axis slider: [ sample order by similarity ]
The most impactful feature contributing to a prediction of Sinker tends to be p(pitch_type_Sinker | pitcher_id).
----------------------------------------------------------------------
y-axis slider: [ p(pitch_type_Sinker | pitcher_id) effects ]
x-axis slider: [ p(pitch_type_Sinker | pitcher_id) ]
The median value for p(pitch_type_Sinker | pitcher_id) is 0, so any positive value of this feature increases the predicted probability of Sinker.
----------------------------------------------------------------------
X = feat_eng.transform(sample_sinker)
clf.predict_explain(X, 'Sinker')
----------------------------------------------------------------------
y-axis slider: [ output value ]
x-axis slider: [ sample order by similarity ]
Fastball is the most probable class (46.4% of all data points). There is a lot of noise in the Shapley value feature contributions.
----------------------------------------------------------------------
y-axis slider: [ p(pitch_type_Fastball | pitcher_id) effects ]
x-axis slider: [ p(pitch_type_Fastball | pitcher_id) ]
The median value for p(pitch_type_Fastball | pitcher_id) is 0.53. Any value greater than 0.53 increases the predicted probability of Fastball, and any value less than 0.53 (especially below 0.23) decreases it.
----------------------------------------------------------------------
X = feat_eng.transform(sample_fastball)
clf.predict_explain(X, 'Fastball')
The code used in this repository is tested on Python 2.7.13 and requires the following packages:
* = required for ShapleyClassifier
Interested in using ShapleyClassifier for your own classification problems?
Its API generalizes to any sklearn-compatible classification model.
Questions? Contact me at alvinthai@gmail.com