Item Response Theory

Tutorial: Modeling NBA foul calls using Item Response Theory

This tutorial demonstrates how to use Bean Machine to predict when NBA players will receive a foul call from a referee. This model and exposition is based on Austin Rochford's 2018 analysis of the 2015/2016 NBA season games data.

Learning Outcomes

On completion of this tutorial, you should be able:

to prepare data for running an Item Response Theory model (IRT) with Bean Machine;
to execute an IRT model with Bean Machine;
to show the advantage of an IRT model over regular regression;
to run diagnostics to understand what Bean Machine is doing; and
to generalize the techniques demonstrated to build an IRT model based on new data.

Prerequisites

We will be using the following packages within this tutorial.

arviz and bokeh for interactive visualizations.

# Install Bean Machine in Colab if using Colab.
import sys


if "google.colab" in sys.modules and "beanmachine" not in sys.modules:
    !pip install beanmachine

import os
import warnings

import arviz as az
import beanmachine.ppl as bm
import torch
import torch.distributions as dist
from arviz.utils import Numba
from beanmachine.ppl.model import RVIdentifier
from beanmachine.tutorials.utils import nba
from bokeh.io import output_notebook
from bokeh.plotting import gridplot, show
from IPython.display import Markdown

The next cell includes convenient configuration settings to improve the notebook presentation as well as setting a manual seed for torch for reproducibility.

# Eliminate excess UserWarnings from ArviZ.
warnings.simplefilter("ignore")

# Plotting settings
az.rcParams["plot.backend"] = "bokeh"
az.rcParams["stats.hdi_prob"] = 0.89

# Manual seed for torch
torch.manual_seed(1199)

# Other settings for the notebook.\
Numba.disable_numba()
smoke_test = "SANDCASTLE_NEXUS" in os.environ or "CI" in os.environ

Data

Our data consist of every foul recorded in the last two minutes of every NBA game for the 2015/16 and 2016/17 seasons. The data consists of records for when the foul happened, which player committed it and against which other player, the current score when it happened, and the team each player was playing for.

Raw data can be found at 2015/2016 NBA season games, and the official data is from Last 2 minute reports. Here we will load data that has been cleaned for the tutorial. The following columns and their descriptions are given below. As we treat many of the columns as categoricals, an _id column is often added that is the categorical encoding of the respective column.

Column name	Description
seconds_left	Number of seconds remaining in the game.
call_type	Type of call made (all are classified as fouls).
call_type_id	ID for the call.
foul_called	0 = no-foul, 1 = foul.
committing_player	Name of the committing player.
committing_player_id	ID of the committing player.
disadvantaged_player	Name of the disadvantaged player.
disadvantaged_player_id	ID of the disadvantaged player.
score_committing	Team's score of the committing player.
score_disadvantaged	Team's score of the disadvantaged player.
season	Season name; "2015-2016" or "2016-2017".
season_id	ID of the seson.
trailing_committing	Flag to indicate if the trailing team committed the foul.
score_diff	Difference between the disadvantaged and committing team's scores.
trailing_poss	Number of trailing possessions.
trailing_poss_id	ID for trailing possessions.
remaining_poss	Remaining number of possessions.
remaining_poss_id	ID for the remaining number of possessions.

df = nba.load_data()
Markdown(df.head().to_markdown())

	seconds_left	call_type	call_type_id	foul_called	committing_player	committing_player_id	disadvantaged_player	disadvantaged_player_id	score_committing	score_disadvantaged	season	trailing_committing	score_diff	trailing_poss	trailing_poss_id	remaining_poss	remaining_poss_id
0	89	Shooting	4	1	Ian Mahinmi	162	DeMar DeRozan	98	99	106	2015-2016	1	7	3	7	4	3
1	73	Shooting	4	0	Bismack Biyombo	36	Paul George	358	106	99	2015-2016	0	-7	-2	2	3	2
2	38	Loose Ball	1	1	Jordan Hill	229	Jonas Valanciunas	222	99	106	2015-2016	1	7	3	7	2	1
3	30	Offensive	2	0	Jordan Hill	229	DeMar DeRozan	98	99	106	2015-2016	1	7	3	7	2	1
4	24	Offensive	2	0	Jordan Hill	229	DeMarre Carroll	100	99	106	2015-2016	1	7	3	7	1	0

Below is a figure showing the number of fouls within our data, and the observed foul call rate.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

foul_types_plot = nba.plot_foul_types(df["call_type"].value_counts())
show(foul_types_plot)

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

call_type_plot = nba.plot_call_type_means(df.groupby("call_type").mean()["foul_called"])
show(call_type_plot)

Next we investigate the frequency with which fouls are called for each of the NBA seasons in our data. The difference between the foul rates in the 2015/16 and 2016/17 seasons is easily shown below. We will account for this in our first model.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

foul_freq_plot = nba.plot_foul_frequency(df.groupby("season").mean()["foul_called"])
show(foul_freq_plot)

Basic model

The initial model is a hierarchical logistic regression with a per-season latent variable to predict foul calls. This model just tries to predict how likely a play $k$ is to lead to a foul purely based on what season it is. There are two $\beta^{\textrm{season}}_s$ variables, one for each season. We include a $\eta^{\textrm{game}}_k$ variable for every game, but set it for now equal to $\beta^{\textrm{season}}_{s(k)}$ associated with which season the play happened during. Finally, we pass $eta^{\textrm{game}}_k$ into a sigmoid function to effectively turn it into a probability $p_k$ which we pass to a Bernoulli distribution so we can observe $y_k$ which models whether play $k$ lead to a foul.

\begin{aligned} \beta^{\textrm{season}}_s & \sim N(0, 5) \\ \eta^{\textrm{game}}_k & = \beta^{\textrm{season}}_{s(k)} \\ p_k & = \textrm{sigmoid}\left(\eta^{\textrm{game}}_k\right) \\ y_k & \sim \textrm{Bernoulli}(p_k) . \end{aligned}

This is model is now roughly equivalent to using LogisticRegression in scikit-learn where the only feature is which season did the foul occur during.

NOTE We can implement this model in Bean Machine by defining random variable objects with the @bm.random_variable and @bm.functional decorators. These functions behave differently than ordinary Python functions.

Semantics for @bm.random_variable functions:

They must return PyTorch Distribution objects.
Though they return distributions, callees actually receive samples from the distribution. The machinery for obtaining samples from distributions is handled internally by Bean Machine.
Inference runs the model through many iterations. During a particular inference iteration, a distinct random variable will correspond to exactly one sampled value: calls to the same random variable function with the same arguments will receive the same sampled value within one inference iteration. This makes it easy for multiple components of your model to refer to the same logical random variable.
Consequently, to define distinct random variables that correspond to different sampled values during a particular inference iteration, an effective practice is to add a dummy "indexing" parameter to the function. Distinct random variables can be referred to with different values for this index.
Please see the documentation for more information about this decorator.

Semantics for @bm.functional:

This decorator is used to deterministically transform the results of one or more random variables.

class BasicModel:
    def __init__(self, df):
        self.df = df
        self.n_season = len(self.df["season"].unique())

    def __repr__(self):
        return ""

    @bm.random_variable
    def beta(self) -> RVIdentifier:
        return dist.Normal(0.0, 5.0).expand((self.n_season,))

    @bm.functional
    def p(self) -> RVIdentifier:
        b_season = self.beta()
        return torch.sigmoid(b_season)

    @bm.random_variable
    def y(self) -> RVIdentifier:
        return dist.Bernoulli(self.p()[self.df["season_id"]])

We make use of the expand method on pytorch distribution objects to perform batch sampling. This has the advantage of being more computationally-efficient than explicit iteration. Bean Machine is also able to take advantage of the implicit broadcasting of pytorch tensors and pytorch samplers like dist.Bernoulli allowing us to sample or observe all $y$ values at once.

Inference

Inference is the process of combining model and data to obtain insights, in the form of probability distributions over values of interest. Bean Machine offers a powerful and general inference framework to enable fitting arbitrary models to data.

Our inference algorithms expect a list of query variables and observations in the form of a dictionary. The query list and the keys for the observation dictionary should consist of @bm.random_variable invocations as keys, and tensor data as values. You can see this in the example below.

basic_model = BasicModel(df)
basic_model_queries = [basic_model.p()]
basic_model_observations = {
    basic_model.y(): torch.tensor(df["foul_called"].astype(float).values)
}

Now, we're ready to run inference! Although Bean Machine supports a rich library of inference methods, they all support a common infer method, with these arguments:

Name	Usage
`queries`	List of `@bm.random_variable` targets to fit posterior distributions for.
`observations`	A dictionary of observations, as built above.
`num_samples`	Number of Monte Carlo samples to approximate the posterior distributions for the variables in queries.
`num_chains`	Number of separate inference runs to use. Multiple chains can help verify that inference ran correctly.

For the models in this tutorial we will use the GlobalNoUTurnSampler inference algorithm, as it is particularly well-suited to handle hierarchical models with partial pooling and without discrete latent variables.

num_samples = 1 if smoke_test else 1000
num_adaptive_samples = 0 if smoke_test else num_samples // 2
num_chains = 1 if smoke_test else 3
# Experimental backend using the Pytorch NNC compiler
nnc_compile = "SANDCASTLE_NEXUS" not in os.environ

basic_samples = bm.GlobalNoUTurnSampler(nnc_compile=nnc_compile).infer(
    queries=basic_model_queries,
    observations=basic_model_observations,
    num_samples=num_samples,
    num_chains=num_chains,
    num_adaptive_samples=num_adaptive_samples,
)

Out:

Samples collected:   0%|          | 0/1500 [00:00<?, ?it/s]
Samples collected:   0%|          | 0/1500 [00:00<?, ?it/s]
Samples collected:   0%|          | 0/1500 [00:00<?, ?it/s]

Analysis

basic_samples now contains our inference results. We begin our analysis by printing out summary statistics. Two important statistics to take note of are the $\hat{R}$ (r_hat) and effective sample size (ess) values in the below dataframe, see Vehtari et al.

basic_trace = basic_samples.to_inference_data()

basic_summary_df = az.summary(basic_trace, round_to=3)
Markdown(basic_summary_df.to_markdown())

	mean	sd	hdi_5.5%	hdi_94.5%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
p(,)[0]	0.414	0.008	0.402	0.426	0	0	2686.41	2195.17	1.001
p(,)[1]	0.31	0.007	0.299	0.32	0	0	2501.73	1954.62	1.005

Measuring variance between and within chains with $\hat{R}$

$\hat{R}$ is a diagnostic tool that measures the between- and within-chain variances. It is a test that indicates a lack of convergence by comparing the variance between multiple chains to the variance within each chain. If the parameters are successfully exploring the full space for each chain, then $\hat{R}\approx 1$ , since the between-chain and within-chain variance should be equal. $\hat{R}$ is calculated from $N$ samples as

\hat{R}=\frac{\hat{V}}{W} \\ \hat{V} = \frac{N-1}{N} W + \frac{1}{N} B

where $W$ is the within-chain variance, $B$ is the between-chain variance and $\hat{V}$ is the estimate of the posterior variance of the samples. The take-away here is that $\hat{R}$ converges to 1 when each of the chains begins to empirically approximate the same posterior distribution. We do not recommend using inference results if $\hat{R}>1.01$ . More information about $\hat{R}$ can be found in the Vehtari et al paper.

Effective sample size $ess$

MCMC samplers do not draw truly independent samples from the target distribution, which means that our samples are correlated. In an ideal situation all samples would be independent, but we do not have that luxury. We can, however, measure the number of effectively independent samples we draw, which is called the effective sample size. You can read more about how this value is calculated in the Vehtari et al paper. In brief, it is a measure that combines information from the $\hat{R}$ value with the autocorrelation estimates within the chains.

ESS estimates come in two variants, ess_bulk and ess_tail. The former is the default, but the latter can be useful if you need good estimates of the tails of your posterior distribution. The rule of thumb for ess_bulk is for this value to be greater than 100 per chain on average. Since we ran four chains, we need ess_bulk to be greater than 400 for each parameter. The ess_tail is an estimate for effectively independent samples considering the more extreme values of the posterior. This is not the number of samples that landed in the tails of the posterior, but rather a measure of the number of effectively independent samples if we sampled the tails of the posterior. The rule of thumb for this value is also to be greater than 100 per chain on average.

Posterior plot of the Basic model

With samples in hand, we can observe the posteriors of $\beta$ s and see that fouls are slightly more likely during the earlier NBA season.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

basic_model_posterior_plot = gridplot(list(az.plot_posterior(basic_trace, show=False)))
show(basic_model_posterior_plot)

Below we have two diagnostic plots for individual random variables: Rank plots and autocorrelation plots.

Rank plots are a histogram of the samples over time. All samples across all chains are ranked and then we plot the average rank for each chain on regular intervals. If the chains are mixing well this histogram should look roughly uniform. If it looks highly irregular that suggests chains might be getting stuck and not adequately exploring the sample space.
Autocorrelation plots measure how predictive the last several samples are of the current sample. Autocorrelation may vary between -1.0 (deterministically anticorrelated) and 1.0 (deterministically correlated). We compute autocorrelation approximately, so it may sometimes exceed these bounds. In an ideal world, the current sample is chosen independently of the previous samples: an autocorrelation of zero. This is not possible in practice, due to stochastic noise and the mechanics of how inference works.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

str_trace = basic_trace.rename({basic_model.p(): str(basic_model.p())})
str_trace_plot = gridplot(list(az.plot_trace(str_trace, kind="rank_bars", show=False)))
show(str_trace_plot)

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

basic_autocorr_plot = gridplot(list(az.plot_autocorr(basic_trace, show=False)))
show(basic_autocorr_plot)

Now let's take our basic model and measure how well it predicts when fouls are called. As we can see, because the prediction can only depend on the season it doesn't do the greatest.

phat = basic_samples[basic_model.p()]

resid_df = df.assign(
    p_hat=phat.numpy().mean(axis=0)[:, df.season_id].mean(axis=0)
).assign(resid=lambda df: df["foul_called"] - df["p_hat"])

idx = resid_df.index.tolist()[:5] + resid_df.index.tolist()[-5:]
Markdown(resid_df[["season", "foul_called", "p_hat", "resid"]].loc[idx].to_markdown())

	season	foul_called	p_hat	resid
0	2015-2016	1	0.414419	0.585581
1	2015-2016	0	0.414419	-0.414419
2	2015-2016	1	0.414419	0.585581
3	2015-2016	0	0.414419	-0.414419
4	2015-2016	0	0.414419	-0.414419
8625	2016-2017	1	0.309528	0.690472
8626	2016-2017	0	0.309528	-0.309528
8627	2016-2017	1	0.309528	0.690472
8628	2016-2017	0	0.309528	-0.309528
8629	2016-2017	1	0.309528	0.690472

basic_model_resid_df = resid_df.pivot_table("resid", "season")
Markdown(basic_model_resid_df.to_markdown())

season	resid
2015-2016	-0.000241572
2016-2017	-0.000126373

So what other features could we use? Let's plot residuals versus how many seconds are left in the game. If there is any trend in this plot, we can try to incorporate it into our model and likely improve the quality of our predictions.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

temp_df = resid_df.groupby("seconds_left").mean()
basic_resid_plot = nba.plot_basic_model_residuals(temp_df)
show(basic_resid_plot)

Improving on the Basic Model

We can get a sense of how to improve the model by noticing that it is the team that is behind that has a greater incentive to make a foul.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

trailing_team_committing_plot = nba.plot_trailing_team_committing(df)
show(trailing_team_committing_plot)

Let's model that difference as a feature. To drive home how significant the trailing team's observed foul rate is within the final two minutes of a game, we will plot the foul rate vs the number of trailing possessions over time. Trailing possessions is the number of points the trailing team would need to make before the shot-clock times out, which is 20 seconds.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

trailing_poss_plot = nba.plot_trailing_possessions(df)
show(trailing_poss_plot)

As we can see, teams behind by a single possession have a lower observed foul rate than those that are behind by three possessions.

We also notice that certain calls are much more associated with fouls. We will include this in our model as well.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

type_plot = nba.plot_call_type_means(df.groupby("call_type").mean()["foul_called"])
show(type_plot)

Possesion model

In the improved model, we explore if who has possession of the ball determines how likely it is for a foul to be committed. This model incorporates a latent variable $\beta^{\textrm{call}}_{c}$ for each call-type, as well as latent variable $\beta^{\textrm{poss}}_{\textrm{t,r,c}}$ for each combination of call type $c$ , how many possessions the committing team is trailing by $t$ , and how many possessions $r$ are left in the game. Together all these $\beta$ variables are added to create a score $\eta^{\textrm{game}}_k$ which is converted into a probability of a foul call $p_k$ , which is finally called into a Bernoulli to make our observation.

\begin{aligned} \beta^{\textrm{season}}_s & \sim N(0, 5) \\ \sigma_{\textrm{call}} & \sim \textrm{HalfNormal}(5) \\ \beta^{\textrm{call}}_{c} & \sim N(0, \sigma_{\textrm{call}}) \\ \sigma_{\textrm{poss,c}} & \sim \textrm{HalfNormal}(5) \\ \beta^{\textrm{poss}}_{\textrm{t,r,c}} & \sim N(0, \sigma_{\textrm{poss,c}}) \\ \eta^{\textrm{game}}_k & = \beta^{\textrm{season}}_{s(k)} + \beta^{\textrm{call}}_{c(k)} + \beta^{\textrm{poss}}_{\textrm{t(k),r(k),c(k)}}\\ p_k & = \textrm{sigmoid}\left(\eta^{\textrm{game}}_k\right) \\ y_k & \sim \textrm{Bernoulli}(p_k) . \end{aligned}

class PossesionModel:
    def __init__(self, df):
        self.df = df
        self.n_season = len(self.df["season"].unique())
        self.n_call_type = len(self.df["call_type"].unique())
        self.n_trailing_poss = len(self.df["trailing_poss"].unique())
        self.n_remaining_poss = len(self.df["remaining_poss"].unique())

    def __repr__(self):
        return ""

    @bm.random_variable
    def beta_season(self) -> RVIdentifier:
        return dist.Normal(0.0, 5.0).expand((self.n_season,))

    @bm.random_variable
    def sigma_call(self) -> RVIdentifier:
        return dist.HalfNormal(5.0)

    @bm.random_variable
    def beta_call(self) -> RVIdentifier:
        return dist.Normal(0, 1).expand((self.n_call_type,))

    @bm.random_variable
    def sigma_poss(self) -> RVIdentifier:
        return dist.HalfNormal(5.0).expand((1, 1, self.n_call_type))

    @bm.random_variable
    def beta_poss(self) -> RVIdentifier:
        return dist.Normal(0, 1).expand(
            (self.n_trailing_poss, self.n_remaining_poss, self.n_call_type)
        )

    @bm.functional
    def p(self) -> RVIdentifier:
        b_season = self.beta_season()
        b_call = self.beta_call() * self.sigma_call()
        b_poss = self.beta_poss() * self.sigma_poss()
        eta_game = (
            b_season[self.df["season_id"].values]
            + b_call[self.df["call_type_id"].values]
            + b_poss[
                self.df["trailing_poss_id"].values,
                self.df["remaining_poss_id"].values,
                self.df["call_type_id"].values,
            ]
        )
        return torch.sigmoid(eta_game)

    @bm.random_variable
    def y(self) -> RVIdentifier:
        return dist.Bernoulli(self.p())

You may have noticed that the Bean Machine program is not a direct translation of the model. We do this since the NUTS sampler works best when most of the latent variables are at roughly the same scale. So instead of sampling from $x \sim \mathcal{N}(0, \sigma)$ , we sample from $y \sim \mathcal{N}(0,1)$ and assign $x = y \cdot \sigma$ . The effective distribution of $x$ remains the same, but the scale of the algorithmically sampled random variable $y$ is no longer directly dependent on $\sigma$ . This is called a non-centered reparameterization and is an encouraged change to make to any model you intend to use with NUTS.

poss_model = PossesionModel(df)
poss_model_queries = [poss_model.p()]
poss_model_observations = {
    poss_model.y(): torch.tensor(df["foul_called"].astype(float).values)
}

poss_samples = bm.GlobalNoUTurnSampler(target_accept_prob=0.90, nnc_compile=nnc_compile).infer(
    queries=poss_model_queries,
    observations=poss_model_observations,
    num_samples=num_samples,
    num_chains=num_chains,
    num_adaptive_samples=num_adaptive_samples,
)

Out:

Samples collected:   0%|          | 0/1500 [00:00<?, ?it/s]
Samples collected:   0%|          | 0/1500 [00:00<?, ?it/s]
Samples collected:   0%|          | 0/1500 [00:00<?, ?it/s]

poss_trace = poss_samples.to_inference_data()

When we check diagnostics, we see the $ess$ is decent and $\hat{R}$ isn't too high.

poss_summary_df = az.summary(poss_trace, round_to=3)

idx = poss_summary_df.index.tolist()[:5] + poss_summary_df.index.tolist()[-5:]
Markdown(poss_summary_df.loc[idx].to_markdown())

	mean	sd	hdi_5.5%	hdi_94.5%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
p(,)[0]	0.509	0.056	0.42	0.598	0.014	0.01	16.027	235.105	1.124
p(,)[1]	0.416	0.049	0.335	0.497	0.011	0.008	19.563	58.273	1.109
p(,)[2]	0.406	0.044	0.342	0.468	0.004	0.003	123.779	99.481	1.032
p(,)[3]	0.213	0.04	0.15	0.266	0.01	0.007	14.438	433.81	1.152
p(,)[4]	0.211	0.042	0.154	0.274	0.011	0.008	16.576	71.769	1.126
p(,)[8625]	0.849	0.018	0.82	0.875	0.001	0.001	231.829	561.624	1.02
p(,)[8626]	0.27	0.031	0.215	0.316	0.003	0.002	84.273	163.17	1.032
p(,)[8627]	0.849	0.018	0.82	0.875	0.001	0.001	231.829	561.624	1.02
p(,)[8628]	0.258	0.035	0.208	0.314	0.009	0.006	15.719	32.625	1.124
p(,)[8629]	0.299	0.031	0.253	0.349	0.004	0.003	55.64	272.819	1.058

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

poss_autocorr_plot = gridplot(list(az.plot_autocorr(poss_trace, show=False)))
show(poss_autocorr_plot)

phat = poss_samples[poss_model.p()]

We also check if the per-season residuals have improved.

resid_df = df.assign(p_hat=phat.numpy().mean(axis=0).mean(axis=0)).assign(
    resid=lambda df: df.foul_called - df.p_hat
)

print("Possession model")
possession_model_resid_df = resid_df.pivot_table("resid", "season")
display(Markdown(possession_model_resid_df.to_markdown()))
print("Basic model")
display(Markdown(basic_model_resid_df.to_markdown()))

Out:

Possession model

season	resid
2015-2016	0.000231191
2016-2017	-0.000212147

Out:

Basic model

season	resid
2015-2016	-0.000241572
2016-2017	-0.000126373

We also check if our residuals are sensitive to how many seconds are left in the game. Since they are not, it means our possession model better captures the connection between fouls and how many seconds remain in the game.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

poss_resid_plot = nba.plot_possession_model_residuals(resid_df)
show(poss_resid_plot)

We now compare these the possession model to the basic model. To do this we will use the Watanabe-Akaike or widely available information criterion (WAIC). An information criterion is a way of scoring how well does a model explain some data. WAIC in particular is defined as:

\text{WAIC} = -2 \frac{1}{n} \sum_{i=1}^n \log \mu(y_i) - \sigma_{\text{log}}^2(y_i) \\ \mu(y_i) = \mathbb{E}_{p(\theta \mid y)}\left[p(y_i \mid \theta)\right] \\ \sigma_{\text{log}}^2(y_i) = \mathbb{E}_{p(\theta \mid y)}(\log p(y_i \mid \theta))

We can approximate the expectations in this equation using samples from our posterior.

When we compare these models we see that the possession model better explains the data and is given a higher rank.

Markdown(az.compare({"possession": poss_trace, "basic": basic_trace}, ic="waic").to_markdown())

	rank	waic	p_waic	d_waic	weight	se	dse	warning	waic_scale
possession	0	-4849.7	76.6483	0	0.5	0	0	True	log
basic	1	-5576.19	0.911201	726.494	0.5	0	0	True	log

Player item-response theory model

We now try to improve our model once more by modeling the players involved in the foul calls. To do this we will associate with each player two latent variables, a propensity to foul another player $a$ , and a propensity to be fouled by another player $b$ . We can then have of a latent variable that for pairs of players $i$ and $j$ that models that interaction $\eta^{\textrm{player}}_{i,j} = a_i - b_j$ . We further model this propensity on a per-season basis. This new factor is added to the $\eta$ from the possession model where the sum is passed into a sigmoid to turn it into a probability $p_k$ . As in all the other models, this $p_k$ is passed to a Bernoulli which is observed.

\begin{aligned} \sigma_{\textrm{a}} & \sim \textrm{HalfNormal}(5) \\ a^{\textrm{player}}_{\textrm{i,s}} & \sim N(0, \sigma_{\textrm{a}}) \\ \sigma_{\textrm{b}} & \sim \textrm{HalfNormal}(5) \\ b^{\textrm{player}}_{\textrm{j,s}} & \sim N(0, \sigma_{\textrm{b}}) \\ \beta^{\textrm{season}}_s & \sim N(0, 5) \\ \sigma_{\textrm{call}} & \sim \textrm{HalfNormal}(5) \\ \beta^{\textrm{call}}_{c} & \sim N(0, \sigma_{\textrm{call}}) \\ \sigma_{\textrm{poss,c}} & \sim \textrm{HalfNormal}(5) \\ \beta^{\textrm{poss}}_{\textrm{t,r,c}} & \sim N(0, \sigma_{\textrm{poss,c}}) \\ \eta^{\textrm{player}}_{l, s} & = a^{\textrm{player}}_{l, s} - b^{\textrm{player}}_{l, s} \\ \eta^{\textrm{game}}_k & = \beta^{\textrm{season}}_{s(k)} + \beta^{\textrm{call}}_{c(k)} + \beta^{\textrm{poss}}_{\textrm{t(k),r(k),c(k)}} + \eta^{\textrm{player}}_{l(k), s(k)}\\ p_k & = \textrm{sigmoid}\left(\eta^{\textrm{game}}_k\right) \\ y_k & \sim \textrm{Bernoulli}(p_k) . \end{aligned}

class IRTModel:
    def __init__(self, df):
        self.df = df
        self.n_season = len(self.df["season"].unique())
        self.n_player = len(
            set(df["committing_player"].tolist() + df["disadvantaged_player"].tolist())
        )
        self.n_call_type = len(self.df["call_type"].unique())
        self.n_trailing_poss = len(self.df["trailing_poss"].unique())
        self.n_remaining_poss = len(self.df["remaining_poss"].unique())

    def __repr__(self):
        return ""

    @bm.random_variable
    def sigma_a(self) -> RVIdentifier:
        return dist.HalfNormal(5.0)

    @bm.random_variable
    def a_player(self) -> RVIdentifier:
        return dist.Normal(0.0, 1.0).expand((self.n_player, self.n_season))

    @bm.random_variable
    def sigma_b(self) -> RVIdentifier:
        return dist.HalfNormal(1.0)

    @bm.random_variable
    def b_player(self) -> RVIdentifier:
        return dist.Normal(0.0, 1.0).expand((self.n_player, self.n_season))

    @bm.random_variable
    def beta_season(self) -> RVIdentifier:
        return dist.Normal(0.0, 5.0).expand((self.n_season,))

    @bm.random_variable
    def sigma_call(self) -> RVIdentifier:
        return dist.HalfNormal(5.0)

    @bm.random_variable
    def beta_call(self) -> RVIdentifier:
        return dist.Normal(0, 1).expand((self.n_call_type,))

    @bm.random_variable
    def sigma_poss(self) -> RVIdentifier:
        return dist.HalfNormal(5.0).expand((1, 1, self.n_call_type))

    @bm.random_variable
    def beta_poss(self) -> RVIdentifier:
        return dist.Normal(0, 1).expand(
            (self.n_trailing_poss, self.n_remaining_poss, self.n_call_type)
        )

    @bm.functional
    def p(self) -> RVIdentifier:
        b_season = self.beta_season()
        b_call = self.beta_call() * self.sigma_call()
        b_poss = self.beta_poss() * self.sigma_poss()

        a_player = self.a_player() * self.sigma_a()
        b_player = self.b_player() * self.sigma_b()

        season = self.df["season_id"].values
        player_disadvantaged = self.df["disadvantaged_player_id"].values
        player_committing = df["committing_player_id"].values
        eta_player = (
            a_player[player_disadvantaged, season] - b_player[player_committing, season]
        )

        eta_game = (
            b_season[season]
            + b_call[self.df["call_type_id"].values]
            + b_poss[
                self.df["trailing_poss_id"].values,
                self.df["remaining_poss_id"].values,
                self.df["call_type_id"].values,
            ]
        ) + eta_player
        return torch.sigmoid(eta_game)

    @bm.random_variable
    def y(self) -> RVIdentifier:
        return dist.Bernoulli(self.p())

num_samples = 1 if smoke_test else 1500
num_adaptive_samples = 1 if smoke_test else num_samples // 2

irt_model = IRTModel(df)
observations = {irt_model.y(): torch.tensor(df["foul_called"].astype(float).values)}

irt_samples = bm.GlobalNoUTurnSampler(target_accept_prob=0.90, nnc_compile=nnc_compile).infer(
    queries=[irt_model.p()],
    observations=observations,
    num_samples=num_samples,
    num_chains=num_chains,
    num_adaptive_samples=num_adaptive_samples,
)

Out:

Samples collected:   0%|          | 0/2250 [00:00<?, ?it/s]
Samples collected:   0%|          | 0/2250 [00:00<?, ?it/s]
Samples collected:   0%|          | 0/2250 [00:00<?, ?it/s]

irt_trace = irt_samples.to_inference_data()

irt_summary_df = az.summary(irt_trace)

idx = irt_summary_df.index.tolist()[:5] + irt_summary_df.index.tolist()[-5:]
Markdown(irt_summary_df.loc[idx].to_markdown())

	mean	sd	hdi_5.5%	hdi_94.5%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
p(,)[0]	0.414	0.086	0.279	0.551	0.001	0.001	5265	3289	1
p(,)[1]	0.361	0.077	0.233	0.481	0.001	0.001	7964	3489	1
p(,)[2]	0.398	0.079	0.265	0.515	0.001	0.001	6713	3376	1
p(,)[3]	0.164	0.051	0.084	0.238	0.001	0.001	4292	3323	1
p(,)[4]	0.18	0.053	0.091	0.255	0.001	0.001	5795	3182	1
p(,)[8625]	0.851	0.043	0.787	0.917	0.001	0	7310	2997	1
p(,)[8626]	0.254	0.066	0.146	0.356	0.001	0.001	8382	3477	1
p(,)[8627]	0.85	0.043	0.787	0.918	0	0	8100	3317	1
p(,)[8628]	0.278	0.068	0.161	0.378	0.001	0.001	7627	3248	1
p(,)[8629]	0.267	0.064	0.163	0.365	0.001	0.001	7294	3129	1

The effective sample size is decently large and the $\hat{R}$ is close to 1, hence it seems likely that this model fitted the data well using NUTS.

Let's see how it compares with the previous models we defined and explored.

phat = irt_samples[irt_model.p()]
resid_df = df.assign(p_hat=phat.numpy().mean(axis=0).mean(axis=0)).assign(
    resid=lambda df: df.foul_called - df.p_hat
)

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

irt_resid_plot = nba.plot_irt_residuals(resid_df)
show(irt_resid_plot)

irt_resid_df = resid_df.pivot_table("resid", "season")

print("Possession model")
display(Markdown(possession_model_resid_df.to_markdown()))
print("Basic model")
display(Markdown(basic_model_resid_df.to_markdown()))
print("IRT model")
display(Markdown(irt_resid_df.to_markdown()))

Out:

Possession model

season	resid
2015-2016	0.000231191
2016-2017	-0.000212147

Out:

Basic model

season	resid
2015-2016	-0.000241572
2016-2017	-0.000126373

Out:

IRT model

season	resid
2015-2016	-0.000191984
2016-2017	0.000177422

The residuals are not demonstratably smaller than the possession model, but they are still smaller than the basic model. Finally, let's compare all three models using WAIC.

Markdown(
    az.compare(
        {"irt": irt_trace, "possession": poss_trace, "basic": basic_trace},
        ic="waic",
    ).to_markdown()
)

	rank	waic	p_waic	d_waic	weight	warning	waic_scale
possession	0	-4849.7	76.6483	0	0.333333	True	log
irt	1	-5087.1	439.29	237.402	0.333333	True	log
basic	2	-5576.19	0.911201	726.494	0.333333	True	log

We can see the IRT model is slightly better than the possession model, and both are better than the basic model.

Conclusion

With this tutorial we analyzed NBA foul data and came up with three models to predict when fouls happen. The best model explicitly models the players themselves and their own propensity to foul and be fouled. IRT models have a rich history in psychometrics and educational testing. If you have ever taken a standardized test, an IRT model was used to convert your test answers into an inference about your aptitude in math or reading comprehension.

References

Austin Rochford's 2018 analysis https://austinrochford.com/posts/2018-02-04-nba-irt-2.html
2015/2016 NBA season games https://github.com/polygraph-cool/last-two-minute-report
Last 2 minute reports http://official.nba.com/2017-18-nba-officiating-last-two-minute-reports/
Practical Issues in Implementing and Understanding Bayesian Ideal Point Estimation
Bayesian Item Response Modeling—Theory and Applications
NBA's Last Two Minute Report
Vehtari A, Gelman A, Simpson D, Carpenter B, Bürkner PC (2021) Rank-Normalization, Folding, and Localization: An Improved $\hat{R}$ for Assessing Convergence of MCMC (with Discussion). Bayesian Analysis 16(2) 667–718. doi: 10.1214/20-BA1221.

Tutorial: Modeling NBA foul calls using Item Response Theory​

Learning Outcomes​

Prerequisites​

Data​

Basic model​

Inference​

Analysis​

Measuring variance between and within chains with R^\hat{R}R^​

Effective sample size essessess​

Posterior plot of the Basic model​

Improving on the Basic Model​

Possesion model​

Player item-response theory model​

Conclusion​

References​

Tutorial: Modeling NBA foul calls using Item Response Theory

Learning Outcomes

Prerequisites

Data

Basic model

Inference

Analysis

Measuring variance between and within chains with $\hat{R}$

Effective sample size $ess$

Posterior plot of the Basic model

Improving on the Basic Model

Possesion model

Player item-response theory model

Conclusion

References