Item Response Theory
Tutorial: Modeling NBA foul calls using Item Response Theory
This tutorial demonstrates how to use Bean Machine to predict when NBA players will receive a foul call from a referee. The model and exposition are based on Austin Rochford's 2018 analysis of the 2015/2016 NBA season games data.
Learning Outcomes
On completion of this tutorial, you should be able to:
- prepare data for running an Item Response Theory (IRT) model with Bean Machine;
- execute an IRT model with Bean Machine;
- show the advantage of an IRT model over regular regression;
- run diagnostics to understand what Bean Machine is doing; and
- generalize the techniques demonstrated here to build an IRT model based on new data.
Prerequisites
We will be using the following packages within this tutorial.
# Install Bean Machine in Colab if using Colab.
import sys
if "google.colab" in sys.modules and "beanmachine" not in sys.modules:
    !pip install beanmachine
import os
import warnings
import arviz as az
import beanmachine.ppl as bm
import torch
import torch.distributions as dist
from arviz.utils import Numba
from beanmachine.ppl.model import RVIdentifier
from beanmachine.tutorials.utils import nba
from bokeh.io import output_notebook
from bokeh.plotting import gridplot, show
from IPython.display import Markdown
The next cell includes convenient configuration settings to improve the notebook presentation, and sets a manual seed for torch for reproducibility.
# Eliminate excess UserWarnings from ArviZ.
warnings.simplefilter("ignore")
# Plotting settings
az.rcParams["plot.backend"] = "bokeh"
az.rcParams["stats.hdi_prob"] = 0.89
# Manual seed for torch
torch.manual_seed(1199)
# Other settings for the notebook.
Numba.disable_numba()
smoke_test = "SANDCASTLE_NEXUS" in os.environ or "CI" in os.environ
Data
Our data consist of every play in the last two minutes of every NBA game from the 2015/16 and 2016/17 seasons that was reviewed for a potential foul. Each record includes when the play happened, which player committed the potential foul and against which other player, whether a foul was actually called, the score at the time, and the team each player was playing for.
Raw data can be found at 2015/2016 NBA season games, and the official data is from Last 2 minute reports. Here we will load data that has been cleaned for the tutorial. As we treat many of the columns as categoricals, an _id column is often added that is the categorical encoding of the respective column. The columns and their descriptions are given in the table below.
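For illustration, such an encoding can be produced with pandas category codes; this is a hypothetical sketch (the loaded data already contains the _id columns), applied after df is loaded below.

# Hypothetical sketch: derive an `_id` column as the categorical encoding of a column.
df["call_type_id"] = df["call_type"].astype("category").cat.codes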
Column name | Description |
---|---|
seconds_left | Number of seconds remaining in the game. |
call_type | Type of call made (all are classified as fouls). |
call_type_id | ID for the call. |
foul_called | 0 = no-foul, 1 = foul. |
committing_player | Name of the committing player. |
committing_player_id | ID of the committing player. |
disadvantaged_player | Name of the disadvantaged player. |
disadvantaged_player_id | ID of the disadvantaged player. |
score_committing | Team's score of the committing player. |
score_disadvantaged | Team's score of the disadvantaged player. |
season | Season name; "2015-2016" or "2016-2017". |
season_id | ID of the season. |
trailing_committing | Flag to indicate if the trailing team committed the foul. |
score_diff | Difference between the disadvantaged and committing team's scores. |
trailing_poss | Number of trailing possessions. |
trailing_poss_id | ID for trailing possessions. |
remaining_poss | Remaining number of possessions. |
remaining_poss_id | ID for the remaining number of possessions. |
df = nba.load_data()
Markdown(df.head().to_markdown())
seconds_left | call_type | call_type_id | foul_called | committing_player | committing_player_id | disadvantaged_player | disadvantaged_player_id | score_committing | score_disadvantaged | season | season_id | trailing_committing | score_diff | trailing_poss | trailing_poss_id | remaining_poss | remaining_poss_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 89 | Shooting | 4 | 1 | Ian Mahinmi | 162 | DeMar DeRozan | 98 | 99 | 106 | 2015-2016 | 0 | 1 | 7 | 3 | 7 | 4 | 3 |
1 | 73 | Shooting | 4 | 0 | Bismack Biyombo | 36 | Paul George | 358 | 106 | 99 | 2015-2016 | 0 | 0 | -7 | -2 | 2 | 3 | 2 |
2 | 38 | Loose Ball | 1 | 1 | Jordan Hill | 229 | Jonas Valanciunas | 222 | 99 | 106 | 2015-2016 | 0 | 1 | 7 | 3 | 7 | 2 | 1 |
3 | 30 | Offensive | 2 | 0 | Jordan Hill | 229 | DeMar DeRozan | 98 | 99 | 106 | 2015-2016 | 0 | 1 | 7 | 3 | 7 | 2 | 1 |
4 | 24 | Offensive | 2 | 0 | Jordan Hill | 229 | DeMarre Carroll | 100 | 99 | 106 | 2015-2016 | 0 | 1 | 7 | 3 | 7 | 1 | 0 |
Below are figures showing the number of each call type within our data, and the observed foul-call rate for each call type.
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
foul_types_plot = nba.plot_foul_types(df["call_type"].value_counts())
show(foul_types_plot)
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
call_type_plot = nba.plot_call_type_means(df.groupby("call_type").mean()["foul_called"])
show(call_type_plot)
Next we investigate the frequency with which fouls are called in each of the NBA seasons in our data. The difference between the foul-call rates in the 2015/16 and 2016/17 seasons is clearly visible below. We will account for this in our first model.
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
foul_freq_plot = nba.plot_foul_frequency(df.groupby("season").mean()["foul_called"])
show(foul_freq_plot)
Basic model
The initial model is a hierarchical logistic regression with a per-season latent variable to predict foul calls. This model tries to predict how likely a play is to lead to a foul based purely on which season it occurred in. There are two variables $\beta_{season}$, one for each season. We include a variable $\eta_{game}$ for every play, but for now set it equal to the $\beta_{season}$ associated with the season during which the play happened. Finally, we pass $\eta_{game}$ into a sigmoid function to turn it into a probability $p$, which we pass to a Bernoulli distribution so we can observe $y$, which models whether the play led to a foul.
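Written out compactly with the variable names used in the code below (and with $s(k)$ denoting the season of play $k$), the model is:

$$\beta_{season,j} \sim \text{Normal}(0, 5), \qquad p_j = \mathrm{sigmoid}(\beta_{season,j}), \qquad y_k \sim \text{Bernoulli}\left(p_{s(k)}\right).$$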
This model is roughly equivalent to using LogisticRegression in scikit-learn, where the only feature is the season during which the foul occurred.
NOTE
We can implement this model in Bean Machine by defining random variable objects with the @bm.random_variable and @bm.functional decorators. These functions behave differently from ordinary Python functions.

@bm.random_variable functions:

- They must return PyTorch Distribution objects.
- Though they return distributions, callees actually receive samples from the distribution. The machinery for obtaining samples from distributions is handled internally by Bean Machine.
- Inference runs the model through many iterations. During a particular inference iteration, a distinct random variable will correspond to exactly one sampled value: calls to the same random variable function with the same arguments will receive the same sampled value within one inference iteration. This makes it easy for multiple components of your model to refer to the same logical random variable.
- Consequently, to define distinct random variables that correspond to different sampled values during a particular inference iteration, an effective practice is to add a dummy "indexing" parameter to the function. Distinct random variables can be referred to with different values for this index; see the sketch after this list.
- Please see the documentation for more information about this decorator.

@bm.functional:

- This decorator is used to deterministically transform the results of one or more random variables.
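Here is a minimal sketch of the indexing idiom mentioned above (the function name coin_weight is hypothetical and unrelated to this tutorial's model):

import beanmachine.ppl as bm
import torch.distributions as dist

@bm.random_variable
def coin_weight(i: int):
    # Distinct values of `i` define distinct random variables; repeated calls
    # with the same `i` return the same sample within one inference iteration.
    return dist.Beta(2.0, 2.0)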
class BasicModel:
    def __init__(self, df):
        self.df = df
        self.n_season = len(self.df["season"].unique())

    def __repr__(self):
        return ""

    @bm.random_variable
    def beta(self) -> RVIdentifier:
        return dist.Normal(0.0, 5.0).expand((self.n_season,))

    @bm.functional
    def p(self) -> RVIdentifier:
        b_season = self.beta()
        return torch.sigmoid(b_season)

    @bm.random_variable
    def y(self) -> RVIdentifier:
        return dist.Bernoulli(self.p()[self.df["season_id"]])
We make use of the expand method on PyTorch distribution objects to perform batched sampling. This has the advantage of being more computationally efficient than explicit iteration. Bean Machine is also able to take advantage of the implicit broadcasting of PyTorch tensors, and PyTorch samplers like dist.Bernoulli allow us to sample or observe all values at once.
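As a quick standalone illustration of batched sampling with expand (independent of the model above):

import torch.distributions as dist

# A scalar Normal expanded to a batch of three: one call draws three values at once.
d = dist.Normal(0.0, 5.0).expand((3,))
print(d.sample().shape)  # torch.Size([3])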
Inference
Inference is the process of combining model and data to obtain insights, in the form of probability distributions over values of interest. Bean Machine offers a powerful and general inference framework to enable fitting arbitrary models to data.
Our inference algorithms expect a list of queried variables and observations in the form of a dictionary. The queries list should consist of @bm.random_variable or @bm.functional invocations, and the observations dictionary should map @bm.random_variable invocations to tensor data. You can see this in the example below.
basic_model = BasicModel(df)
basic_model_queries = [basic_model.p()]
basic_model_observations = {
    basic_model.y(): torch.tensor(df["foul_called"].astype(float).values)
}
Now, we're ready to run inference! Although Bean Machine supports a rich library of
inference methods, they all support a common infer
method, with these arguments:
Name | Usage |
---|---|
queries | List of @bm.random_variable targets to fit posterior distributions for. |
observations | A dictionary of observations, as built above. |
num_samples | Number of Monte Carlo samples to approximate the posterior distributions for the variables in queries. |
num_chains | Number of separate inference runs to use. Multiple chains can help verify that inference ran correctly. |
For the models in this tutorial we will use the GlobalNoUTurnSampler
inference
algorithm, as it is particularly well-suited to handle hierarchical models with partial
pooling and without discrete latent variables.
num_samples = 1 if smoke_test else 1000
num_adaptive_samples = 0 if smoke_test else num_samples // 2
num_chains = 1 if smoke_test else 3
# Experimental backend using the Pytorch NNC compiler
nnc_compile = "SANDCASTLE_NEXUS" not in os.environ
basic_samples = bm.GlobalNoUTurnSampler(nnc_compile=nnc_compile).infer(
    queries=basic_model_queries,
    observations=basic_model_observations,
    num_samples=num_samples,
    num_chains=num_chains,
    num_adaptive_samples=num_adaptive_samples,
)
Analysis
basic_samples now contains our inference results. We begin our analysis by printing out summary statistics. Two important statistics to take note of are the $\hat{R}$ (r_hat) and effective sample size (ess) values in the dataframe below; see Vehtari et al.
basic_trace = basic_samples.to_inference_data()
basic_summary_df = az.summary(basic_trace, round_to=3)
Markdown(basic_summary_df.to_markdown())
mean | sd | hdi_5.5% | hdi_94.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
p(,)[0] | 0.414 | 0.008 | 0.402 | 0.426 | 0 | 0 | 2686.41 | 2195.17 | 1.001 |
p(,)[1] | 0.31 | 0.007 | 0.299 | 0.32 | 0 | 0 | 2501.73 | 1954.62 | 1.005 |
Measuring variance between and within chains with $\hat{R}$
$\hat{R}$ is a diagnostic tool that measures the between- and within-chain variances. It is a test that indicates a lack of convergence by comparing the variance between multiple chains to the variance within each chain. If the parameters are successfully exploring the full space for each chain, then $\hat{R} \approx 1$, since the between-chain and within-chain variance should be equal. $\hat{R}$ is calculated from $N$ samples as

$$\hat{R} = \sqrt{\frac{\hat{V}}{W}}, \qquad \hat{V} = \frac{N - 1}{N}\, W + \frac{1}{N}\, B,$$

where $W$ is the within-chain variance, $B$ is the between-chain variance, and $\hat{V}$ is the estimate of the posterior variance of the samples. The take-away here is that $\hat{R}$ converges to 1 when each of the chains begins to empirically approximate the same posterior distribution. We do not recommend using inference results if $\hat{R} > 1.01$. More information about $\hat{R}$ can be found in the Vehtari et al paper.
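As a small standalone check of this diagnostic (synthetic, well-mixed chains rather than the tutorial's samples, so the statistic should be close to 1):

import numpy as np
import arviz as az

# Four synthetic chains of 500 draws each from the same distribution.
chains = np.random.default_rng(0).normal(size=(4, 500))
print(az.rhat(chains))  # expect a value very close to 1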
Effective sample size
MCMC samplers do not draw truly independent samples from the target distribution, which means that our samples are correlated. In an ideal situation all samples would be independent, but we do not have that luxury. We can, however, measure the number of effectively independent samples we draw, which is called the effective sample size. You can read more about how this value is calculated in the Vehtari et al paper. In brief, it is a measure that combines information from the value with the autocorrelation estimates within the chains.
ESS estimates come in two variants, ess_bulk and ess_tail. The former is the default, but the latter can be useful if you need good estimates of the tails of your posterior distribution. The rule of thumb for ess_bulk is for this value to be greater than 100 per chain on average. Since we ran three chains, we need ess_bulk to be greater than 300 for each parameter. The ess_tail is an estimate of the number of effectively independent samples among the more extreme values of the posterior. This is not the number of samples that landed in the tails of the posterior, but rather a measure of the number of effectively independent samples if we sampled the tails of the posterior. The rule of thumb for this value is also to be greater than 100 per chain on average.
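ArviZ exposes both variants through az.ess; a standalone sketch with synthetic chains (not the tutorial's samples):

import numpy as np
import arviz as az

draws = np.random.default_rng(1).normal(size=(3, 1000))  # 3 chains x 1000 draws
print(az.ess(draws, method="bulk"))  # rule of thumb: > 100 per chain, so > 300 here
print(az.ess(draws, method="tail"))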
Posterior plot of the Basic model
With samples in hand, we can inspect the posteriors of the per-season probabilities $p$ and see that fouls are slightly more likely during the earlier NBA season.
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
basic_model_posterior_plot = gridplot(list(az.plot_posterior(basic_trace, show=False)))
show(basic_model_posterior_plot)
Below we have two diagnostic plots for individual random variables: Rank plots and autocorrelation plots.
Rank plots are histograms of the samples over time. All samples across all chains are ranked and then we plot the average rank for each chain at regular intervals. If the chains are mixing well, this histogram should look roughly uniform. If it looks highly irregular, that suggests chains might be getting stuck and not adequately exploring the sample space.
Autocorrelation plots measure how predictive the last several samples are of the current sample. Autocorrelation may vary between -1.0 (deterministically anticorrelated) and 1.0 (deterministically correlated). We compute autocorrelation approximately, so it may sometimes exceed these bounds. In an ideal world, the current sample is chosen independently of the previous samples: an autocorrelation of zero. This is not possible in practice, due to stochastic noise and the mechanics of how inference works.
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
str_trace = basic_trace.rename({basic_model.p(): str(basic_model.p())})
str_trace_plot = gridplot(list(az.plot_trace(str_trace, kind="rank_bars", show=False)))
show(str_trace_plot)
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
basic_autocorr_plot = gridplot(list(az.plot_autocorr(basic_trace, show=False)))
show(basic_autocorr_plot)
Now let's take our basic model and measure how well it predicts when fouls are called. As we can see below, because the prediction can only depend on the season, it does not perform particularly well.
phat = basic_samples[basic_model.p()]
resid_df = df.assign(
    p_hat=phat.numpy().mean(axis=0)[:, df.season_id].mean(axis=0)
).assign(resid=lambda df: df["foul_called"] - df["p_hat"])
idx = resid_df.index.tolist()[:5] + resid_df.index.tolist()[-5:]
Markdown(resid_df[["season", "foul_called", "p_hat", "resid"]].loc[idx].to_markdown())
season | foul_called | p_hat | resid | |
---|---|---|---|---|
0 | 2015-2016 | 1 | 0.414419 | 0.585581 |
1 | 2015-2016 | 0 | 0.414419 | -0.414419 |
2 | 2015-2016 | 1 | 0.414419 | 0.585581 |
3 | 2015-2016 | 0 | 0.414419 | -0.414419 |
4 | 2015-2016 | 0 | 0.414419 | -0.414419 |
8625 | 2016-2017 | 1 | 0.309528 | 0.690472 |
8626 | 2016-2017 | 0 | 0.309528 | -0.309528 |
8627 | 2016-2017 | 1 | 0.309528 | 0.690472 |
8628 | 2016-2017 | 0 | 0.309528 | -0.309528 |
8629 | 2016-2017 | 1 | 0.309528 | 0.690472 |
basic_model_resid_df = resid_df.pivot_table("resid", "season")
Markdown(basic_model_resid_df.to_markdown())
season | resid |
---|---|
2015-2016 | -0.000241572 |
2016-2017 | -0.000126373 |
So what other features could we use? Let's plot residuals versus how many seconds are left in the game. If there is any trend in this plot, we can try to incorporate it into our model and likely improve the quality of our predictions.
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
temp_df = resid_df.groupby("seconds_left").mean()
basic_resid_plot = nba.plot_basic_model_residuals(temp_df)
show(basic_resid_plot)
Improving on the Basic Model
We can get a sense of how to improve the model by noticing that it is the team that is behind that has the greater incentive to commit a foul.
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
trailing_team_committing_plot = nba.plot_trailing_team_committing(df)
show(trailing_team_committing_plot)
Let's model that difference as a feature. To drive home how significant the trailing team's observed foul rate is within the final two minutes of a game, we will plot the foul rate against the number of trailing possessions over time. Trailing possessions converts the score difference into possessions, at up to three points per possession: it is the number of possessions the trailing team needs in order to catch up.
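For illustration, a trailing-possessions column could be derived from the score difference roughly like this (a hypothetical sketch assuming three points per possession; the cleaned data already contains trailing_poss):

import numpy as np

# Hypothetical: convert the score difference into possessions, three points each.
trailing_poss_sketch = np.ceil(df["score_diff"] / 3)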
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
trailing_poss_plot = nba.plot_trailing_possessions(df)
show(trailing_poss_plot)
As we can see, teams behind by a single possession have a lower observed foul rate than those that are behind by three possessions.
We also notice that certain calls are much more associated with fouls. We will include this in our model as well.
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
type_plot = nba.plot_call_type_means(df.groupby("call_type").mean()["foul_called"])
show(type_plot)
Possession model
In the improved model, we explore whether who has possession of the ball affects how likely a foul is to be committed. This model incorporates a latent variable $\beta_{call}$ for each call type, as well as a latent variable $\beta_{poss}$ for each combination of call type, the number of possessions the committing team is trailing by, and the number of possessions left in the game. All of these variables are summed to create a score $\eta_{game}$, which is converted into a probability of a foul call $p$, which is finally passed to a Bernoulli distribution to model our observation $y$.
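With $s(k)$, $c(k)$, $t(k)$, and $r(k)$ denoting the season, call type, trailing possessions, and remaining possessions of play $k$, the score used in the code below is:

$$\eta_{game,k} = \beta_{season,s(k)} + \beta_{call,c(k)} + \beta_{poss,t(k),r(k),c(k)}, \qquad p_k = \mathrm{sigmoid}(\eta_{game,k}), \qquad y_k \sim \text{Bernoulli}(p_k).$$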
class PossessionModel:
    def __init__(self, df):
        self.df = df
        self.n_season = len(self.df["season"].unique())
        self.n_call_type = len(self.df["call_type"].unique())
        self.n_trailing_poss = len(self.df["trailing_poss"].unique())
        self.n_remaining_poss = len(self.df["remaining_poss"].unique())

    def __repr__(self):
        return ""

    @bm.random_variable
    def beta_season(self) -> RVIdentifier:
        return dist.Normal(0.0, 5.0).expand((self.n_season,))

    @bm.random_variable
    def sigma_call(self) -> RVIdentifier:
        return dist.HalfNormal(5.0)

    @bm.random_variable
    def beta_call(self) -> RVIdentifier:
        return dist.Normal(0, 1).expand((self.n_call_type,))

    @bm.random_variable
    def sigma_poss(self) -> RVIdentifier:
        return dist.HalfNormal(5.0).expand((1, 1, self.n_call_type))

    @bm.random_variable
    def beta_poss(self) -> RVIdentifier:
        return dist.Normal(0, 1).expand(
            (self.n_trailing_poss, self.n_remaining_poss, self.n_call_type)
        )

    @bm.functional
    def p(self) -> RVIdentifier:
        b_season = self.beta_season()
        b_call = self.beta_call() * self.sigma_call()
        b_poss = self.beta_poss() * self.sigma_poss()
        eta_game = (
            b_season[self.df["season_id"].values]
            + b_call[self.df["call_type_id"].values]
            + b_poss[
                self.df["trailing_poss_id"].values,
                self.df["remaining_poss_id"].values,
                self.df["call_type_id"].values,
            ]
        )
        return torch.sigmoid(eta_game)

    @bm.random_variable
    def y(self) -> RVIdentifier:
        return dist.Bernoulli(self.p())
You may have noticed that the Bean Machine program is not a direct translation of the model. We do this because the NUTS sampler works best when most of the latent variables are at roughly the same scale. So instead of sampling $\beta \sim \text{Normal}(0, \sigma)$, we sample $\beta' \sim \text{Normal}(0, 1)$ and assign $\beta = \sigma \beta'$. The effective distribution of $\beta$ remains the same, but the scale of the algorithmically sampled random variable is no longer directly dependent on $\sigma$. This is called a non-centered reparameterization, and it is an encouraged change to make to any model you intend to use with NUTS.
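As a minimal standalone sketch of this pattern (hypothetical names, separate from the model above):

import beanmachine.ppl as bm
import torch.distributions as dist

@bm.random_variable
def sigma():
    return dist.HalfNormal(5.0)

@bm.random_variable
def beta_raw():
    # Sampled at unit scale, independent of sigma.
    return dist.Normal(0.0, 1.0)

@bm.functional
def beta():
    # Deterministic transform: beta is effectively distributed as Normal(0, sigma).
    return sigma() * beta_raw()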
poss_model = PossessionModel(df)
poss_model_queries = [poss_model.p()]
poss_model_observations = {
    poss_model.y(): torch.tensor(df["foul_called"].astype(float).values)
}
poss_samples = bm.GlobalNoUTurnSampler(target_accept_prob=0.90, nnc_compile=nnc_compile).infer(
    queries=poss_model_queries,
    observations=poss_model_observations,
    num_samples=num_samples,
    num_chains=num_chains,
    num_adaptive_samples=num_adaptive_samples,
)
poss_trace = poss_samples.to_inference_data()
When we check diagnostics, we see the effective sample size is decent and $\hat{R}$ isn't too high.
poss_summary_df = az.summary(poss_trace, round_to=3)
idx = poss_summary_df.index.tolist()[:5] + poss_summary_df.index.tolist()[-5:]
Markdown(poss_summary_df.loc[idx].to_markdown())
mean | sd | hdi_5.5% | hdi_94.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
p(,)[0] | 0.509 | 0.056 | 0.42 | 0.598 | 0.014 | 0.01 | 16.027 | 235.105 | 1.124 |
p(,)[1] | 0.416 | 0.049 | 0.335 | 0.497 | 0.011 | 0.008 | 19.563 | 58.273 | 1.109 |
p(,)[2] | 0.406 | 0.044 | 0.342 | 0.468 | 0.004 | 0.003 | 123.779 | 99.481 | 1.032 |
p(,)[3] | 0.213 | 0.04 | 0.15 | 0.266 | 0.01 | 0.007 | 14.438 | 433.81 | 1.152 |
p(,)[4] | 0.211 | 0.042 | 0.154 | 0.274 | 0.011 | 0.008 | 16.576 | 71.769 | 1.126 |
p(,)[8625] | 0.849 | 0.018 | 0.82 | 0.875 | 0.001 | 0.001 | 231.829 | 561.624 | 1.02 |
p(,)[8626] | 0.27 | 0.031 | 0.215 | 0.316 | 0.003 | 0.002 | 84.273 | 163.17 | 1.032 |
p(,)[8627] | 0.849 | 0.018 | 0.82 | 0.875 | 0.001 | 0.001 | 231.829 | 561.624 | 1.02 |
p(,)[8628] | 0.258 | 0.035 | 0.208 | 0.314 | 0.009 | 0.006 | 15.719 | 32.625 | 1.124 |
p(,)[8629] | 0.299 | 0.031 | 0.253 | 0.349 | 0.004 | 0.003 | 55.64 | 272.819 | 1.058 |
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
poss_autocorr_plot = gridplot(list(az.plot_autocorr(poss_trace, show=False)))
show(poss_autocorr_plot)
phat = poss_samples[poss_model.p()]
We also check if the per-season residuals have improved.
resid_df = df.assign(p_hat=phat.numpy().mean(axis=0).mean(axis=0)).assign(
    resid=lambda df: df.foul_called - df.p_hat
)
print("Possession model")
possession_model_resid_df = resid_df.pivot_table("resid", "season")
display(Markdown(possession_model_resid_df.to_markdown()))
print("Basic model")
display(Markdown(basic_model_resid_df.to_markdown()))
season | resid |
---|---|
2015-2016 | 0.000231191 |
2016-2017 | -0.000212147 |
season | resid |
---|---|
2015-2016 | -0.000241572 |
2016-2017 | -0.000126373 |
We also check if our residuals are sensitive to how many seconds are left in the game. Since they are not, it means our possession model better captures the connection between fouls and how many seconds remain in the game.
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
poss_resid_plot = nba.plot_possession_model_residuals(resid_df)
show(poss_resid_plot)
We now compare the possession model to the basic model. To do this we will use the Watanabe-Akaike, or widely applicable, information criterion (WAIC). An information criterion is a way of scoring how well a model explains some data. On the log scale used below, WAIC is defined as

$$\text{elpd}_{\text{WAIC}} = \sum_{i=1}^{n} \log \mathbb{E}_{post}\left[p(y_i \mid \theta)\right] - \sum_{i=1}^{n} \operatorname{Var}_{post}\left[\log p(y_i \mid \theta)\right].$$

We can approximate the expectations in this equation using samples from our posterior.
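To make the approximation concrete, here is an illustrative sketch (not part of the tutorial's pipeline) that estimates this quantity from a matrix of pointwise log-likelihood samples:

import numpy as np
from scipy.special import logsumexp

def elpd_waic(log_lik):
    """log_lik: array of shape (num_posterior_samples, num_observations)."""
    num_samples = log_lik.shape[0]
    # Log pointwise predictive density: log of the posterior-averaged likelihood.
    lppd = logsumexp(log_lik, axis=0) - np.log(num_samples)
    # Effective number of parameters: posterior variance of the pointwise log-likelihood.
    p_waic = log_lik.var(axis=0, ddof=1)
    return float((lppd - p_waic).sum())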
When we compare these models, we see that the possession model better explains the data and is ranked first.
Markdown(az.compare({"possession": poss_trace, "basic": basic_trace}, ic="waic").to_markdown())
rank | waic | p_waic | d_waic | weight | se | dse | warning | waic_scale | |
---|---|---|---|---|---|---|---|---|---|
possession | 0 | -4849.7 | 76.6483 | 0 | 0.5 | 0 | 0 | True | log |
basic | 1 | -5576.19 | 0.911201 | 726.494 | 0.5 | 0 | 0 | True | log |
Player item-response theory model
We now try to improve our model once more by modeling the players involved in the foul calls. To do this we will associate with each player two latent variables: $a$, a propensity to have fouls called in their favor when they are the disadvantaged player, and $b$, a propensity to avoid foul calls when they are the committing player. For a play in which player $i$ is disadvantaged and player $j$ is committing, their interaction is modeled as $\eta_{player} = a_i - b_j$. We further model these propensities on a per-season basis. This new term is added to the $\eta_{game}$ from the possession model, and the sum is passed into a sigmoid to turn it into a probability $p$. As in all the other models, $p$ parameterizes a Bernoulli distribution, which is observed.
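With $i(k)$ the disadvantaged player, $j(k)$ the committing player, and $s(k)$ the season of play $k$, the full model is:

$$\eta_{player,k} = a_{i(k),s(k)} - b_{j(k),s(k)}, \qquad p_k = \mathrm{sigmoid}\left(\eta_{game,k} + \eta_{player,k}\right), \qquad y_k \sim \text{Bernoulli}(p_k).$$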
class IRTModel:
    def __init__(self, df):
        self.df = df
        self.n_season = len(self.df["season"].unique())
        self.n_player = len(
            set(df["committing_player"].tolist() + df["disadvantaged_player"].tolist())
        )
        self.n_call_type = len(self.df["call_type"].unique())
        self.n_trailing_poss = len(self.df["trailing_poss"].unique())
        self.n_remaining_poss = len(self.df["remaining_poss"].unique())

    def __repr__(self):
        return ""

    @bm.random_variable
    def sigma_a(self) -> RVIdentifier:
        return dist.HalfNormal(5.0)

    @bm.random_variable
    def a_player(self) -> RVIdentifier:
        return dist.Normal(0.0, 1.0).expand((self.n_player, self.n_season))

    @bm.random_variable
    def sigma_b(self) -> RVIdentifier:
        return dist.HalfNormal(1.0)

    @bm.random_variable
    def b_player(self) -> RVIdentifier:
        return dist.Normal(0.0, 1.0).expand((self.n_player, self.n_season))

    @bm.random_variable
    def beta_season(self) -> RVIdentifier:
        return dist.Normal(0.0, 5.0).expand((self.n_season,))

    @bm.random_variable
    def sigma_call(self) -> RVIdentifier:
        return dist.HalfNormal(5.0)

    @bm.random_variable
    def beta_call(self) -> RVIdentifier:
        return dist.Normal(0, 1).expand((self.n_call_type,))

    @bm.random_variable
    def sigma_poss(self) -> RVIdentifier:
        return dist.HalfNormal(5.0).expand((1, 1, self.n_call_type))

    @bm.random_variable
    def beta_poss(self) -> RVIdentifier:
        return dist.Normal(0, 1).expand(
            (self.n_trailing_poss, self.n_remaining_poss, self.n_call_type)
        )

    @bm.functional
    def p(self) -> RVIdentifier:
        b_season = self.beta_season()
        b_call = self.beta_call() * self.sigma_call()
        b_poss = self.beta_poss() * self.sigma_poss()
        a_player = self.a_player() * self.sigma_a()
        b_player = self.b_player() * self.sigma_b()
        season = self.df["season_id"].values
        player_disadvantaged = self.df["disadvantaged_player_id"].values
        player_committing = self.df["committing_player_id"].values
        eta_player = (
            a_player[player_disadvantaged, season] - b_player[player_committing, season]
        )
        eta_game = (
            b_season[season]
            + b_call[self.df["call_type_id"].values]
            + b_poss[
                self.df["trailing_poss_id"].values,
                self.df["remaining_poss_id"].values,
                self.df["call_type_id"].values,
            ]
        ) + eta_player
        return torch.sigmoid(eta_game)

    @bm.random_variable
    def y(self) -> RVIdentifier:
        return dist.Bernoulli(self.p())
num_samples = 1 if smoke_test else 1500
num_adaptive_samples = 1 if smoke_test else num_samples // 2
irt_model = IRTModel(df)
observations = {irt_model.y(): torch.tensor(df["foul_called"].astype(float).values)}
irt_samples = bm.GlobalNoUTurnSampler(target_accept_prob=0.90, nnc_compile=nnc_compile).infer(
    queries=[irt_model.p()],
    observations=observations,
    num_samples=num_samples,
    num_chains=num_chains,
    num_adaptive_samples=num_adaptive_samples,
)
irt_trace = irt_samples.to_inference_data()
irt_summary_df = az.summary(irt_trace)
idx = irt_summary_df.index.tolist()[:5] + irt_summary_df.index.tolist()[-5:]
Markdown(irt_summary_df.loc[idx].to_markdown())
mean | sd | hdi_5.5% | hdi_94.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
p(,)[0] | 0.414 | 0.086 | 0.279 | 0.551 | 0.001 | 0.001 | 5265 | 3289 | 1 |
p(,)[1] | 0.361 | 0.077 | 0.233 | 0.481 | 0.001 | 0.001 | 7964 | 3489 | 1 |
p(,)[2] | 0.398 | 0.079 | 0.265 | 0.515 | 0.001 | 0.001 | 6713 | 3376 | 1 |
p(,)[3] | 0.164 | 0.051 | 0.084 | 0.238 | 0.001 | 0.001 | 4292 | 3323 | 1 |
p(,)[4] | 0.18 | 0.053 | 0.091 | 0.255 | 0.001 | 0.001 | 5795 | 3182 | 1 |
p(,)[8625] | 0.851 | 0.043 | 0.787 | 0.917 | 0.001 | 0 | 7310 | 2997 | 1 |
p(,)[8626] | 0.254 | 0.066 | 0.146 | 0.356 | 0.001 | 0.001 | 8382 | 3477 | 1 |
p(,)[8627] | 0.85 | 0.043 | 0.787 | 0.918 | 0 | 0 | 8100 | 3317 | 1 |
p(,)[8628] | 0.278 | 0.068 | 0.161 | 0.378 | 0.001 | 0.001 | 7627 | 3248 | 1 |
p(,)[8629] | 0.267 | 0.064 | 0.163 | 0.365 | 0.001 | 0.001 | 7294 | 3129 | 1 |
The effective sample size is decently large and $\hat{R}$ is close to 1, so it seems likely that NUTS fit this model well.
Let's see how it compares with the previous models we defined and explored.
phat = irt_samples[irt_model.p()]
resid_df = df.assign(p_hat=phat.numpy().mean(axis=0).mean(axis=0)).assign(
    resid=lambda df: df.foul_called - df.p_hat
)
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
irt_resid_plot = nba.plot_irt_residuals(resid_df)
show(irt_resid_plot)
irt_resid_df = resid_df.pivot_table("resid", "season")
print("Possession model")
display(Markdown(possession_model_resid_df.to_markdown()))
print("Basic model")
display(Markdown(basic_model_resid_df.to_markdown()))
print("IRT model")
display(Markdown(irt_resid_df.to_markdown()))
season | resid |
---|---|
2015-2016 | 0.000231191 |
2016-2017 | -0.000212147 |
season | resid |
---|---|
2015-2016 | -0.000241572 |
2016-2017 | -0.000126373 |
season | resid |
---|---|
2015-2016 | -0.000191984 |
2016-2017 | 0.000177422 |
The residuals are not demonstrably smaller than the possession model's, but they are still smaller than the basic model's. Finally, let's compare all three models using WAIC.
Markdown(
    az.compare(
        {"irt": irt_trace, "possession": poss_trace, "basic": basic_trace},
        ic="waic",
    ).to_markdown()
)
rank | waic | p_waic | d_waic | weight | se | dse | warning | waic_scale | |
---|---|---|---|---|---|---|---|---|---|
possession | 0 | -4849.7 | 76.6483 | 0 | 0.333333 | 0 | 0 | True | log |
irt | 1 | -5087.1 | 439.29 | 237.402 | 0.333333 | 0 | 0 | True | log |
basic | 2 | -5576.19 | 0.911201 | 726.494 | 0.333333 | 0 | 0 | True | log |
From the table, the possession model narrowly outranks the IRT model on WAIC for this run, and both are substantially better than the basic model.
Conclusion
With this tutorial we analyzed NBA foul data and built three models to predict when fouls happen. The most expressive model explicitly models the players themselves and their propensities to commit fouls and to draw them. IRT models have a rich history in psychometrics and educational testing. If you have ever taken a standardized test, an IRT model was used to convert your test answers into an inference about your aptitude in math or reading comprehension.
References
- Austin Rochford's 2018 analysis https://austinrochford.com/posts/2018-02-04-nba-irt-2.html
- 2015/2016 NBA season games https://github.com/polygraph-cool/last-two-minute-report
- Last 2 minute reports http://official.nba.com/2017-18-nba-officiating-last-two-minute-reports/
- Practical Issues in Implementing and Understanding Bayesian Ideal Point Estimation
- Bayesian Item Response Modeling—Theory and Applications
- NBA's Last Two Minute Report
- Vehtari A, Gelman A, Simpson D, Carpenter B, Bürkner PC (2021) Rank-Normalization, Folding, and Localization: An Improved $\hat{R}$ for Assessing Convergence of MCMC (with Discussion). Bayesian Analysis 16(2) 667–718. doi: 10.1214/20-BA1221.