MLB Pitch Type Predictor

Multi-class classification predictions and explanations for MLB play-by-play data


Introduction

The purpose of this project is to build an MLB pitch type classification system using play-by-play data from the 2011 season.

MLB play-by-play data includes features such as:

  • How many balls, strikes, fouls, and outs have occurred in the current plate appearance?
  • What is the score of the game? What inning is it?
  • Which bases are currently occupied?
  • Which teams are playing each other?
  • Who are the batter and pitcher? What hand does the pitcher throw with? Which side of the plate does the batter stand on? How tall is the batter?

The model described in this project predicts 8 different pitch type classifications.

-----------------------------
Pitch Type       % occurrence

Fastball                0.464
Slider                  0.157
Sinker                  0.117
Changeup                0.097
Curveball               0.090
Cutter                  0.061
Purpose_Pitch           0.007
Off-Speed               0.007
-----------------------------

The training dataset includes data from March to July 2011. The models are evaluated against unseen data from August 2011.

Overall, the model achieved 58.9% accuracy on the unseen data, a 12.5-percentage-point improvement over always guessing fastball, the most common class!


Feature Engineering

The original prepared dataset had 29 features available prior to the pitch. [see features]

With the feature engineering methods described here, 29 additional features were created, and only 8 of the original 29 features were used for the classification model. [see engineered features]

The engineered features include:

  • Ballpark data from the previous season (2010) scraped from ESPN
  • Batter data from the previous season (2010) scraped from baseball-reference
  • Derived features calculated directly from a row of input data
  • Historical near-real-time features describing the outcomes of the last 1-10 pitches
  • Aggregated features in the training dataset grouped by pitcher
  • Target-encoded features in the training dataset grouped by pitcher
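As an illustration, the last-10-pitch count features could be derived with a shifted rolling window, as in the sketch below. The toy data and the `_L10` column naming mirror features shown later in this document, but the exact implementation is an assumption; `shift(1)` excludes the current pitch so the feature never leaks its own label.

```python
import pandas as pd

# Toy pitch log for a single pitcher, in chronological order.
df = pd.DataFrame({
    "pitcher_id": [1] * 12,
    "pitch_type": ["Fastball", "Slider", "Fastball", "Fastball", "Sinker",
                   "Fastball", "Slider", "Fastball", "Fastball", "Fastball",
                   "Sinker", "Fastball"],
})

# One-hot encode pitch_type, then count occurrences over the previous 10
# pitches; shift(1) drops the current pitch to avoid target leakage.
onehot = pd.get_dummies(df["pitch_type"]).astype(int)
l10 = (onehot.groupby(df["pitcher_id"])
             .transform(lambda col: col.shift(1).rolling(10, min_periods=1).sum())
             .add_suffix("_L10"))
print(l10.iloc[-1].to_dict())  # {'Fastball_L10': 6.0, 'Sinker_L10': 2.0, 'Slider_L10': 2.0}
```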

The target-encoded features calculate the mean probability of each pitch_type classification for each pitcher. These features were extremely predictive on unseen data, accounting for a 9-percentage-point increase in accuracy over predicting the most probable class (see the classification reports for the uniform prediction model and baseline heuristic model in the section below).
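A minimal sketch of this per-pitcher target encoding on toy data, mirroring the p(pitch_type_* | pitcher_id) feature names that appear later in this document (the project's actual implementation may differ):

```python
import pandas as pd

# Toy training rows: one row per pitch thrown.
train = pd.DataFrame({
    "pitcher_id": [100, 100, 100, 100, 200, 200],
    "pitch_type": ["Fastball", "Fastball", "Slider", "Sinker",
                   "Changeup", "Changeup"],
})

# Mean probability of each pitch_type per pitcher: p(pitch_type | pitcher_id).
rates = (pd.crosstab(train["pitcher_id"], train["pitch_type"], normalize="index")
           .add_prefix("p(pitch_type_")
           .rename(columns=lambda c: c + " | pitcher_id)"))

# Join the encoded rates back onto each row by pitcher_id.
encoded = train.join(rates, on="pitcher_id")
print(rates.loc[100, "p(pitch_type_Fastball | pitcher_id)"])  # 0.5
```

Because the rates are computed on the training set only, unseen August rows reuse each pitcher's training-time rates rather than peeking at their own labels.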


Evaluation Report

A gradient boosted model was trained on the feature engineered training dataset. See this notebook for modeling details.

The gradient boosting model had an overall accuracy of 58.9%: a 3.6-percentage-point improvement over a non-machine-learning baseline heuristic that guesses the pitch each pitcher throws most often, and a 12.5-percentage-point improvement over a model that always guesses the most common class (fastball).
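The baseline heuristic amounts to predicting each pitcher's most frequent training pitch. A minimal sketch on toy data (the global-mode fallback for unseen pitchers is my assumption, not necessarily what the project does):

```python
import pandas as pd

# Toy training data and a test set containing one unseen pitcher (id 3).
train = pd.DataFrame({
    "pitcher_id": [1, 1, 1, 2, 2],
    "pitch_type": ["Sinker", "Sinker", "Fastball", "Slider", "Slider"],
})
test = pd.DataFrame({"pitcher_id": [1, 2, 3]})

# Each pitcher's most frequent training pitch; unseen pitchers fall back
# to the most common pitch overall.
favorite = train.groupby("pitcher_id")["pitch_type"].agg(lambda s: s.mode()[0])
overall = train["pitch_type"].mode()[0]
preds = test["pitcher_id"].map(favorite).fillna(overall)
print(preds.tolist())  # ['Sinker', 'Slider', 'Sinker']
```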

The model is especially strong at predicting Off-Speed and Purpose_Pitch (>88% precision), performs fairly well for Fastball and Sinker (>56% precision, >66% f1-score), and performs poorly at predicting Changeup, Curveball, Cutter, and Slider pitches (<40% f1-score).

The end-to-end transformation and prediction pipeline outputs a result from a row of JSON data in roughly 100 ms. See the benchmark reports here for more details.
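Per-row latency like this can be measured by wrapping the pipeline call in a small timing helper. The `benchmark_ms` helper and the stand-in workload below are illustrative assumptions; in this project the timed call would be something like `lambda row: clf.predict(feat_eng.transform(row))` (names assumed).

```python
import time

def benchmark_ms(fn, arg, n_runs=50):
    """Return the median wall-clock latency of fn(arg), in milliseconds."""
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn(arg)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    return timings[len(timings) // 2]

# Stand-in workload; replace with the real transform + predict call.
latency = benchmark_ms(lambda row: sum(range(10000)), {"balls": 1})
print("median latency: %.3f ms" % latency)
```

The median is reported rather than the mean so a single garbage-collection pause or cold start does not skew the result.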

______________________________________________________________________

Classification Report: Uniform Prediction Model

               precision   recall  f1-score   support  n_predictions

Changeup           0.000    0.000     0.000     11832            nan
Curveball          0.000    0.000     0.000     10983            nan
Cutter             0.000    0.000     0.000      7405            nan
Fastball           0.464    1.000     0.634     56519         121886
Off-Speed          0.000    0.000     0.000       818            nan
Purpose_Pitch      0.000    0.000     0.000       835            nan
Sinker             0.000    0.000     0.000     14305            nan
Slider             0.000    0.000     0.000     19189            nan

avg / total        0.215    0.464     0.294    121886

accuracy: 0.464

______________________________________________________________________

Classification Report: Baseline Heuristic Model

               precision   recall  f1-score   support  n_predictions

Changeup           0.209    0.011     0.021     11832            640
Curveball          0.371    0.016     0.032     10983            488
Cutter             0.543    0.243     0.336      7405           3315
Fastball           0.571    0.919     0.705     56519          90935
Off-Speed          0.828    0.965     0.891       818            953
Purpose_Pitch      0.000    0.000     0.000       835            nan
Sinker             0.496    0.797     0.612     14305          22988
Slider             0.449    0.060     0.106     19189           2567

avg / total        0.486    0.553     0.447    121886

accuracy: 0.553

______________________________________________________________________

Classification Report: Machine Learning Model

               precision   recall  f1-score   support  n_predictions

Changeup           0.393    0.108     0.170     11832           3261
Curveball          0.492    0.142     0.220     10983           3170
Cutter             0.535    0.314     0.396      7405           4341
Fastball           0.614    0.874     0.722     56519          80417
Off-Speed          0.888    0.927     0.907       818            854
Purpose_Pitch      0.914    0.614     0.735       835            561
Sinker             0.568    0.796     0.663     14305          20026
Slider             0.494    0.238     0.322     19189           9256

avg / total        0.557    0.589     0.535    121886

accuracy: 0.589

______________________________________________________________________


Feature Interpretation

The classification model is an instance of ShapleyClassifier, a custom API I wrote that integrates with the shap library to explain which features contribute to the probability output.

The shap library uses an additive feature attribution method based on game theory concepts to estimate how to divide a collective payoff.
More info on the shap library:
  • source
  • research paper

______________________________________________________________________

A quick primer on Shapley values, as summarized by Michael Sweeney. Imagine playing a card game where $19 is won when all 3 players contribute, and ≤ $19 is won when a different combination of players is involved.

  1. Start by identifying each player's contribution when they play individually, when 2 play together, and when all 3 play together.
  2. Then, consider all possible orders and calculate each player's marginal value – e.g. what value does each player add when player A enters the game first, followed by player B, and then player C? There are 6 possible orders, and each player adds a different marginal value depending on the combination.
  3. Now that each player's marginal value has been calculated across all 6 possible orders, add them up and work out the Shapley value (i.e. the average) for each player.
  4. With the Shapley value for each player worked out, we can clearly see the true contribution each player made to the game and assign credit fairly. In this example, player C contributed the most, followed by A, then B.
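The four steps above can be worked through numerically. The coalition payoffs in `v` below are illustrative assumptions, chosen only so that the full three-player game pays $19:

```python
from itertools import permutations

# Illustrative coalition payoffs (assumed): v maps each set of players
# to the amount that coalition wins on its own.
v = {frozenset(): 0,
     frozenset("A"): 6, frozenset("B"): 3, frozenset("C"): 8,
     frozenset("AB"): 10, frozenset("AC"): 16, frozenset("BC"): 12,
     frozenset("ABC"): 19}

players = "ABC"
orders = list(permutations(players))          # the 6 possible arrival orders
shapley = dict.fromkeys(players, 0.0)
for order in orders:
    coalition = frozenset()
    for p in order:
        # Marginal value p adds when joining the players already playing.
        shapley[p] += (v[coalition | {p}] - v[coalition]) / len(orders)
        coalition = coalition | {p}

# The Shapley values sum to the full $19 payoff; here C > A > B.
print({p: round(val, 3) for p, val in shapley.items()})
```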

______________________________________________________________________

In the context of classification, the shap library calculates feature attribution in a similar manner: by averaging the probability contribution of each feature, weighted by the number of permutations describing the sequence of contributing features. A missing feature uses a substitute "best guess" value (the median for numerical features, the mode for categorical features) to calculate the classification probability in the absence of that feature.
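Those substitute baseline values could be computed as follows (toy data chosen so the medians line up with baseline values such as strikes = 1 and inning = 5 shown in the outputs later; the real training set is far larger):

```python
import pandas as pd

# Toy slice of training features.
X = pd.DataFrame({
    "strikes": [0, 1, 1, 2],
    "inning": [1, 5, 5, 9],
    "last_pitch": ["Fastball", "Fastball", "Slider", "Fastball"],
})

# "Best guess" substitute per feature: median if numeric, mode if categorical.
baseline = {
    col: (X[col].median() if pd.api.types.is_numeric_dtype(X[col])
          else X[col].mode()[0])
    for col in X.columns
}
print(baseline)  # strikes -> 1.0, inning -> 5.0, last_pitch -> 'Fastball'
```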

Import Sample Data

In [1]:
%%capture
%cd ..
In [2]:
import dill
import json
import pandas as pd
import warnings

# load models
with open('dill/feature_engineering_transformer.dill', 'rb') as f:
    feat_eng = dill.load(f)
with open('dill/pitch_classifier.dill', 'rb') as f:
    clf = dill.load(f)

# load sample JSON data
with open('sample_data/json_sample_offspeed.json', 'rb') as f:
    json_sample_offspeed = json.load(f)
with open('sample_data/json_sample_purposepitch.json', 'rb') as f:
    json_sample_purposepitch = json.load(f)
with open('sample_data/json_sample_sinker.json', 'rb') as f:
    json_sample_sinker = json.load(f)
with open('sample_data/json_sample_fastball.json', 'rb') as f:
    json_sample_fastball = json.load(f)

# load sample datasets
sample_offspeed = pd.read_csv('sample_data/sample_offspeed.csv')
sample_purposepitch = pd.read_csv('sample_data/sample_purposepitch.csv')
sample_sinker = pd.read_csv('sample_data/sample_sinker.csv')
sample_fastball = pd.read_csv('sample_data/sample_fastball.csv')

warnings.filterwarnings("ignore")

Explaining a Single Prediction

Off-Speed

For this data point, the prediction was Off-Speed with 0.981 probability. The most important features contributing to the probability are:

  • p(pitch_type_Off-Speed | pitcher_id) = 0.77
  • pcount_pitcher = 106
  • Off-Speed_L10 = 8

This intuitively makes sense, as the pitcher has a very high probability of throwing Off-Speed pitches and 8 of the last 10 pitches were also Off-Speed.

In [3]:
x = feat_eng.transform(json_sample_offspeed)
clf.predict_explain_one(x)
                                      baseline_val  input_val
prediction for Off-Speed                  0.000128     0.9812

                                      baseline_val  input_val  effect
p(pitch_type_Off-Speed | pitcher_id)             0   0.771576   0.332
pcount_pitcher                                  29        106   0.159
Off-Speed_L10                                    0          8   0.112
p(pitch_type_Changeup | pitcher_id)      0.0982587          0   0.105
p(pitch_type_Fastball | pitcher_id)       0.526892   0.226079   0.093
Out[3]:
[Interactive shap visualization omitted: run initjs() in a live notebook to render it.]

Purpose Pitch

For this data point, the prediction was Purpose Pitch with 0.987 probability. The most important features contributing to the probability are:

  • on_base_any = True
  • last_pitch = Purpose_Pitch

This tells us that when the last pitch was a Purpose Pitch and a runner is on base, the model will likely predict another Purpose Pitch.

In [4]:
x = feat_eng.transform(json_sample_purposepitch)
clf.predict_explain_one(x)
                              baseline_val      input_val
prediction for Purpose_Pitch      0.000161       0.986993

                              baseline_val      input_val  effect
on_base_any                              0           True   0.419
last_pitch                        Fastball  Purpose_Pitch   0.334
strikes                                  1              0   0.076
balls                                    1              3   0.044
on_base_2                                0           True   0.037
Out[4]:
[Interactive shap visualization omitted: run initjs() in a live notebook to render it.]

Sinker

For this data point, the prediction was Sinker with 0.511 probability. The most important features contributing to the probability are:

  • p(pitch_type_Sinker | pitcher_id) = 0.63
  • p(pitch_type_Fastball | pitcher_id) = 0.04

This intuitively makes sense: this pitcher has a high likelihood of throwing Sinkers and a low likelihood of throwing Fastballs, both of which differ significantly from the median pitcher's probability rates.

In [5]:
x = feat_eng.transform(json_sample_sinker)
clf.predict_explain_one(x)
                                      baseline_val  input_val
prediction for Sinker                     0.000173   0.510577

                                      baseline_val  input_val  effect
p(pitch_type_Sinker | pitcher_id)                0    0.62988   0.384
p(pitch_type_Fastball | pitcher_id)       0.526892  0.0369599   0.098
strikes                                          1          2  -0.041
Sinker_L10                                       0          6   0.036
p(pitch_type_Slider | pitcher_id)         0.152896   0.276939   -0.03
Out[5]:
[Interactive shap visualization omitted: run initjs() in a live notebook to render it.]

Fastball

For this data point, the prediction was Fastball with 0.946 probability. The most important features contributing to the probability are:

  • strikes = 0
  • p(pitch_type_Fastball | pitcher_id) = 0.72
  • pcount_pitcher = 2
  • inning = 1
  • p(pitch_type_Curveball | pitcher_id) = 0

All of these values differ from the baseline (median) values of the training set features.

In [6]:
x = feat_eng.transform(json_sample_fastball)
clf.predict_explain_one(x)
                                      baseline_val  input_val
prediction for Fastball                   0.533462   0.946142

                                      baseline_val  input_val  effect
strikes                                          1          0   0.171
p(pitch_type_Fastball | pitcher_id)       0.526892   0.712991   0.066
pcount_pitcher                                  29          2   0.056
inning                                           5          1   0.055
p(pitch_type_Curveball | pitcher_id)     0.0885057          0   0.034
Out[6]:
[Interactive shap visualization omitted: run initjs() in a live notebook to render it.]

Explaining Multiple Predictions

The below visualizations are interactive!
Play around with the dropdowns for the x and y axes to explore the Shapley value contributions for the dataset.

The descriptions below specify each visualization with the following syntax. Adjust the dropdowns to the specified values to follow the interpretation.
y-axis slider: [ value ]
x-axis slider: [ value ]

Off-Speed

----------------------------------------------------------------------

y-axis slider: [ output value ]
x-axis slider: [ sample order by similarity ]

The most impactful feature contributing to a prediction of Off-Speed tends to be p(pitch_type_Off-Speed | pitcher_id).

----------------------------------------------------------------------

y-axis slider: [ p(pitch_type_Off-Speed | pitcher_id) effects ]
x-axis slider: [ p(pitch_type_Off-Speed | pitcher_id) ]

The median value for p(pitch_type_Off-Speed | pitcher_id) is 0, so any positive contribution from this feature increases the predicted probability of Off-Speed.

----------------------------------------------------------------------

In [7]:
X = feat_eng.transform(sample_offspeed)
clf.predict_explain(X, 'Off-Speed')
Out[7]:
[Interactive shap visualization omitted: run initjs() in a live notebook to render it.]

Purpose Pitch

----------------------------------------------------------------------

y-axis slider: [ output value ]
x-axis slider: [ sample order by similarity ]

The most impactful features contributing to a prediction of Purpose Pitch tend to be on_base_any and last_pitch. This is consistent with the observation found in cell 4.

----------------------------------------------------------------------

In [8]:
X = feat_eng.transform(sample_purposepitch)
clf.predict_explain(X, 'Purpose_Pitch')
Out[8]:
[Interactive shap visualization omitted: run initjs() in a live notebook to render it.]

Sinker

----------------------------------------------------------------------

y-axis slider: [ output value ]
x-axis slider: [ sample order by similarity ]

The most impactful feature contributing to a prediction of Sinker tends to be p(pitch_type_Sinker | pitcher_id).

----------------------------------------------------------------------

y-axis slider: [ p(pitch_type_Sinker | pitcher_id) effects ]
x-axis slider: [ p(pitch_type_Sinker | pitcher_id) ]

The median value for p(pitch_type_Sinker | pitcher_id) is 0, so any positive contribution from this feature increases the predicted probability of Sinker.

----------------------------------------------------------------------

In [9]:
X = feat_eng.transform(sample_sinker)
clf.predict_explain(X, 'Sinker')
Out[9]:
[Interactive shap visualization omitted: run initjs() in a live notebook to render it.]

Fastball

----------------------------------------------------------------------

y-axis slider: [ output value ]
x-axis slider: [ sample order by similarity ]

Fastball is the most probable class (46.4% of all data points), so there is a lot of noise in the Shapley value feature contributions.

----------------------------------------------------------------------

y-axis slider: [ p(pitch_type_Fastball | pitcher_id) effects ]
x-axis slider: [ p(pitch_type_Fastball | pitcher_id) ]

The median value for p(pitch_type_Fastball | pitcher_id) is 0.53. Any value greater than 0.53 increases the predicted probability of Fastball, and any value less than 0.53 (especially below 0.23) decreases it.

----------------------------------------------------------------------

In [10]:
X = feat_eng.transform(sample_fastball)
clf.predict_explain(X, 'Fastball')
Out[10]:
[Interactive shap visualization omitted: run initjs() in a live notebook to render it.]

Dependencies

The code used in this repository is tested on Python 2.7.13 and requires the following packages:

  • dill (≥0.2.7.1)
  • featuretools (≥0.1.16)
  • iml (==0.3.5)*
  • lightgbm (≥2.0.11)
  • numpy (≥1.14.0)*
  • pandas (≥0.22.0)*
  • pandas-profiling (≥1.4.0)
  • scikit-learn (≥0.19.1)*
  • selenium (≥3.7.0)
  • shap (==0.8.5)*

* = required for ShapleyClassifier

Interested in using ShapleyClassifier for your own classification problems?
Its API generalizes to any classification model with sklearn compatibility.

Questions? contact me at alvinthai@gmail.com