# Item Response Theory

## Tutorial: Modeling NBA foul calls using Item Response Theory​

This tutorial demonstrates how to use Bean Machine to predict when NBA players will receive a foul call from a referee. This model and exposition is based on Austin Rochford's 2018 analysis of the 2015/2016 NBA season games data.

## Learning Outcomes​

On completion of this tutorial, you should be able:

• to prepare data for running an Item Response Theory model (IRT) with Bean Machine;
• to execute an IRT model with Bean Machine;
• to show the advantage of an IRT model over regular regression;
• to run diagnostics to understand what Bean Machine is doing; and
• to generalize the techniques demonstrated to build an IRT model based on new data.

## Prerequisites​

We will be using the following packages within this tutorial.

# Install Bean Machine in Colab if using Colab.import sysif "google.colab" in sys.modules and "beanmachine" not in sys.modules:    !pip install beanmachine
import osimport warningsimport arviz as azimport beanmachine.ppl as bmimport torchimport torch.distributions as distfrom arviz.utils import Numbafrom beanmachine.ppl.model import RVIdentifierfrom beanmachine.tutorials.utils import nbafrom bokeh.io import output_notebookfrom bokeh.plotting import gridplot, showfrom IPython.display import Markdown

The next cell includes convenient configuration settings to improve the notebook presentation as well as setting a manual seed for torch for reproducibility.

# Eliminate excess UserWarnings from ArviZ.warnings.simplefilter("ignore")# Plotting settingsaz.rcParams["plot.backend"] = "bokeh"az.rcParams["stats.hdi_prob"] = 0.89# Manual seed for torchtorch.manual_seed(1199)# Other settings for the notebook.\Numba.disable_numba()smoke_test = "SANDCASTLE_NEXUS" in os.environ or "CI" in os.environ

## Data​

Our data consist of every foul recorded in the last two minutes of every NBA game for the 2015/16 and 2016/17 seasons. The data consists of records for when the foul happened, which player committed it and against which other player, the current score when it happened, and the team each player was playing for.

Raw data can be found at 2015/2016 NBA season games, and the official data is from Last 2 minute reports. Here we will load data that has been cleaned for the tutorial. The following columns and their descriptions are given below. As we treat many of the columns as categoricals, an _id column is often added that is the categorical encoding of the respective column.

Column nameDescription
seconds_leftNumber of seconds remaining in the game.
call_typeType of call made (all are classified as fouls).
call_type_idID for the call.
foul_called0 = no-foul, 1 = foul.
committing_playerName of the committing player.
committing_player_idID of the committing player.
score_committingTeam's score of the committing player.
seasonSeason name; "2015-2016" or "2016-2017".
season_idID of the seson.
trailing_committingFlag to indicate if the trailing team committed the foul.
score_diffDifference between the disadvantaged and committing team's scores.
trailing_possNumber of trailing possessions.
trailing_poss_idID for trailing possessions.
remaining_possRemaining number of possessions.
remaining_poss_idID for the remaining number of possessions.
df = nba.load_data()Markdown(df.head().to_markdown())
089Shooting41Ian Mahinmi162DeMar DeRozan98991062015-20160173743
173Shooting40Bismack Biyombo36Paul George358106992015-201600-7-2232
238Loose Ball11Jordan Hill229Jonas Valanciunas222991062015-20160173721
330Offensive20Jordan Hill229DeMar DeRozan98991062015-20160173721
424Offensive20Jordan Hill229DeMarre Carroll100991062015-20160173710

Below is a figure showing the number of fouls within our data, and the observed foul call rate.

# Required for visualizing in Colab.output_notebook(hide_banner=True)foul_types_plot = nba.plot_foul_types(df["call_type"].value_counts())show(foul_types_plot)
# Required for visualizing in Colab.output_notebook(hide_banner=True)call_type_plot = nba.plot_call_type_means(df.groupby("call_type").mean()["foul_called"])show(call_type_plot)

Next we investigate the frequency with which fouls are called for each of the NBA seasons in our data. The difference between the foul rates in the 2015/16 and 2016/17 seasons is easily shown below. We will account for this in our first model.

# Required for visualizing in Colab.output_notebook(hide_banner=True)foul_freq_plot = nba.plot_foul_frequency(df.groupby("season").mean()["foul_called"])show(foul_freq_plot)

## Basic model​

The initial model is a hierarchical logistic regression with a per-season latent variable to predict foul calls. This model just tries to predict how likely a play $k$ is to lead to a foul purely based on what season it is. There are two $\beta^{\textrm{season}}_s$ variables, one for each season. We include a $\eta^{\textrm{game}}_k$ variable for every game, but set it for now equal to $\beta^{\textrm{season}}_{s(k)}$ associated with which season the play happened during. Finally, we pass $eta^{\textrm{game}}_k$ into a sigmoid function to effectively turn it into a probability $p_k$ which we pass to a Bernoulli distribution so we can observe $y_k$ which models whether play $k$ lead to a foul.

\begin{aligned} \beta^{\textrm{season}}_s & \sim N(0, 5) \\ \eta^{\textrm{game}}_k & = \beta^{\textrm{season}}_{s(k)} \\ p_k & = \textrm{sigmoid}\left(\eta^{\textrm{game}}_k\right) \\ y_k & \sim \textrm{Bernoulli}(p_k) . \end{aligned}

This is model is now roughly equivalent to using LogisticRegression in scikit-learn where the only feature is which season did the foul occur during.

NOTE We can implement this model in Bean Machine by defining random variable objects with the @bm.random_variable and @bm.functional decorators. These functions behave differently than ordinary Python functions.

Semantics for @bm.random_variable functions:
• They must return PyTorch Distribution objects.
• Though they return distributions, callees actually receive samples from the distribution. The machinery for obtaining samples from distributions is handled internally by Bean Machine.
• Inference runs the model through many iterations. During a particular inference iteration, a distinct random variable will correspond to exactly one sampled value: calls to the same random variable function with the same arguments will receive the same sampled value within one inference iteration. This makes it easy for multiple components of your model to refer to the same logical random variable.
• Consequently, to define distinct random variables that correspond to different sampled values during a particular inference iteration, an effective practice is to add a dummy "indexing" parameter to the function. Distinct random variables can be referred to with different values for this index.
Semantics for @bm.functional:
• This decorator is used to deterministically transform the results of one or more random variables.
class BasicModel:    def __init__(self, df):        self.df = df        self.n_season = len(self.df["season"].unique())    def __repr__(self):        return ""    @bm.random_variable    def beta(self) -> RVIdentifier:        return dist.Normal(0.0, 5.0).expand((self.n_season,))    @bm.functional    def p(self) -> RVIdentifier:        b_season = self.beta()        return torch.sigmoid(b_season)    @bm.random_variable    def y(self) -> RVIdentifier:        return dist.Bernoulli(self.p()[self.df["season_id"]])

We make use of the expand method on pytorch distribution objects to perform batch sampling. This has the advantage of being more computationally-efficient than explicit iteration. Bean Machine is also able to take advantage of the implicit broadcasting of pytorch tensors and pytorch samplers like dist.Bernoulli allowing us to sample or observe all $y$ values at once.

## Inference​

Inference is the process of combining model and data to obtain insights, in the form of probability distributions over values of interest. Bean Machine offers a powerful and general inference framework to enable fitting arbitrary models to data.

Our inference algorithms expect a list of query variables and observations in the form of a dictionary. The query list and the keys for the observation dictionary should consist of @bm.random_variable invocations as keys, and tensor data as values. You can see this in the example below.

basic_model = BasicModel(df)basic_model_queries = [basic_model.p()]basic_model_observations = {    basic_model.y(): torch.tensor(df["foul_called"].astype(float).values)}

Now, we're ready to run inference! Although Bean Machine supports a rich library of inference methods, they all support a common infer method, with these arguments:

NameUsage
queriesList of @bm.random_variable targets to fit posterior distributions for.
observationsA dictionary of observations, as built above.
num_samplesNumber of Monte Carlo samples to approximate the posterior distributions for the variables in queries.
num_chainsNumber of separate inference runs to use. Multiple chains can help verify that inference ran correctly.

For the models in this tutorial we will use the GlobalNoUTurnSampler inference algorithm, as it is particularly well-suited to handle hierarchical models with partial pooling and without discrete latent variables.

num_samples = 1 if smoke_test else 1000num_adaptive_samples = 0 if smoke_test else num_samples // 2num_chains = 1 if smoke_test else 3# Experimental backend using the Pytorch NNC compilernnc_compile = "SANDCASTLE_NEXUS" not in os.environbasic_samples = bm.GlobalNoUTurnSampler(nnc_compile=nnc_compile).infer(    queries=basic_model_queries,    observations=basic_model_observations,    num_samples=num_samples,    num_chains=num_chains,    num_adaptive_samples=num_adaptive_samples,)
Out:
Samples collected:   0%|          | 0/1500 [00:00<?, ?it/s]Samples collected:   0%|          | 0/1500 [00:00<?, ?it/s]Samples collected:   0%|          | 0/1500 [00:00<?, ?it/s]

## Analysis​

basic_samples now contains our inference results. We begin our analysis by printing out summary statistics. Two important statistics to take note of are the $\hat{R}$ (r_hat) and effective sample size (ess) values in the below dataframe, see Vehtari et al.

basic_trace = basic_samples.to_inference_data()basic_summary_df = az.summary(basic_trace, round_to=3)Markdown(basic_summary_df.to_markdown())
meansdhdi_5.5%hdi_94.5%mcse_meanmcse_sdess_bulkess_tailr_hat
p(,)[0]0.4140.0080.4020.426002686.412195.171.001
p(,)[1]0.310.0070.2990.32002501.731954.621.005

### Measuring variance between and within chains with $\hat{R}$​

$\hat{R}$ is a diagnostic tool that measures the between- and within-chain variances. It is a test that indicates a lack of convergence by comparing the variance between multiple chains to the variance within each chain. If the parameters are successfully exploring the full space for each chain, then $\hat{R}\approx 1$, since the between-chain and within-chain variance should be equal. $\hat{R}$ is calculated from $N$ samples as

$\hat{R}=\frac{\hat{V}}{W} \\ \hat{V} = \frac{N-1}{N} W + \frac{1}{N} B$

where $W$ is the within-chain variance, $B$ is the between-chain variance and $\hat{V}$ is the estimate of the posterior variance of the samples. The take-away here is that $\hat{R}$ converges to 1 when each of the chains begins to empirically approximate the same posterior distribution. We do not recommend using inference results if $\hat{R}>1.01$. More information about $\hat{R}$ can be found in the Vehtari et al paper.

### Effective sample size $ess$​

MCMC samplers do not draw truly independent samples from the target distribution, which means that our samples are correlated. In an ideal situation all samples would be independent, but we do not have that luxury. We can, however, measure the number of effectively independent samples we draw, which is called the effective sample size. You can read more about how this value is calculated in the Vehtari et al paper. In brief, it is a measure that combines information from the $\hat{R}$ value with the autocorrelation estimates within the chains.

ESS estimates come in two variants, ess_bulk and ess_tail. The former is the default, but the latter can be useful if you need good estimates of the tails of your posterior distribution. The rule of thumb for ess_bulk is for this value to be greater than 100 per chain on average. Since we ran four chains, we need ess_bulk to be greater than 400 for each parameter. The ess_tail is an estimate for effectively independent samples considering the more extreme values of the posterior. This is not the number of samples that landed in the tails of the posterior, but rather a measure of the number of effectively independent samples if we sampled the tails of the posterior. The rule of thumb for this value is also to be greater than 100 per chain on average.

### Posterior plot of the Basic model​

With samples in hand, we can observe the posteriors of $\beta$s and see that fouls are slightly more likely during the earlier NBA season.

# Required for visualizing in Colab.output_notebook(hide_banner=True)basic_model_posterior_plot = gridplot(list(az.plot_posterior(basic_trace, show=False)))show(basic_model_posterior_plot)

Below we have two diagnostic plots for individual random variables: Rank plots and autocorrelation plots.

• Rank plots are a histogram of the samples over time. All samples across all chains are ranked and then we plot the average rank for each chain on regular intervals. If the chains are mixing well this histogram should look roughly uniform. If it looks highly irregular that suggests chains might be getting stuck and not adequately exploring the sample space.

• Autocorrelation plots measure how predictive the last several samples are of the current sample. Autocorrelation may vary between -1.0 (deterministically anticorrelated) and 1.0 (deterministically correlated). We compute autocorrelation approximately, so it may sometimes exceed these bounds. In an ideal world, the current sample is chosen independently of the previous samples: an autocorrelation of zero. This is not possible in practice, due to stochastic noise and the mechanics of how inference works.

# Required for visualizing in Colab.output_notebook(hide_banner=True)str_trace = basic_trace.rename({basic_model.p(): str(basic_model.p())})str_trace_plot = gridplot(list(az.plot_trace(str_trace, kind="rank_bars", show=False)))show(str_trace_plot)
# Required for visualizing in Colab.output_notebook(hide_banner=True)basic_autocorr_plot = gridplot(list(az.plot_autocorr(basic_trace, show=False)))show(basic_autocorr_plot)

Now let's take our basic model and measure how well it predicts when fouls are called. As we can see, because the prediction can only depend on the season it doesn't do the greatest.

phat = basic_samples[basic_model.p()]
resid_df = df.assign(    p_hat=phat.numpy().mean(axis=0)[:, df.season_id].mean(axis=0)).assign(resid=lambda df: df["foul_called"] - df["p_hat"])
idx = resid_df.index.tolist()[:5] + resid_df.index.tolist()[-5:]Markdown(resid_df[["season", "foul_called", "p_hat", "resid"]].loc[idx].to_markdown())
seasonfoul_calledp_hatresid
02015-201610.4144190.585581
12015-201600.414419-0.414419
22015-201610.4144190.585581
32015-201600.414419-0.414419
42015-201600.414419-0.414419
86252016-201710.3095280.690472
86262016-201700.309528-0.309528
86272016-201710.3095280.690472
86282016-201700.309528-0.309528
86292016-201710.3095280.690472
basic_model_resid_df = resid_df.pivot_table("resid", "season")Markdown(basic_model_resid_df.to_markdown())
seasonresid
2015-2016-0.000241572
2016-2017-0.000126373

So what other features could we use? Let's plot residuals versus how many seconds are left in the game. If there is any trend in this plot, we can try to incorporate it into our model and likely improve the quality of our predictions.

# Required for visualizing in Colab.output_notebook(hide_banner=True)temp_df = resid_df.groupby("seconds_left").mean()basic_resid_plot = nba.plot_basic_model_residuals(temp_df)show(basic_resid_plot)

## Improving on the Basic Model​

We can get a sense of how to improve the model by noticing that it is the team that is behind that has a greater incentive to make a foul.

# Required for visualizing in Colab.output_notebook(hide_banner=True)trailing_team_committing_plot = nba.plot_trailing_team_committing(df)show(trailing_team_committing_plot)

Let's model that difference as a feature. To drive home how significant the trailing team's observed foul rate is within the final two minutes of a game, we will plot the foul rate vs the number of trailing possessions over time. Trailing possessions is the number of points the trailing team would need to make before the shot-clock times out, which is 20 seconds.

# Required for visualizing in Colab.output_notebook(hide_banner=True)trailing_poss_plot = nba.plot_trailing_possessions(df)show(trailing_poss_plot)

As we can see, teams behind by a single possession have a lower observed foul rate than those that are behind by three possessions.

We also notice that certain calls are much more associated with fouls. We will include this in our model as well.

# Required for visualizing in Colab.output_notebook(hide_banner=True)type_plot = nba.plot_call_type_means(df.groupby("call_type").mean()["foul_called"])show(type_plot)

## Possesion model​

In the improved model, we explore if who has possession of the ball determines how likely it is for a foul to be committed. This model incorporates a latent variable $\beta^{\textrm{call}}_{c}$ for each call-type, as well as latent variable $\beta^{\textrm{poss}}_{\textrm{t,r,c}}$ for each combination of call type $c$, how many possessions the committing team is trailing by $t$, and how many possessions $r$ are left in the game. Together all these $\beta$ variables are added to create a score $\eta^{\textrm{game}}_k$ which is converted into a probability of a foul call $p_k$, which is finally called into a Bernoulli to make our observation.

\begin{aligned} \beta^{\textrm{season}}_s & \sim N(0, 5) \\ \sigma_{\textrm{call}} & \sim \textrm{HalfNormal}(5) \\ \beta^{\textrm{call}}_{c} & \sim N(0, \sigma_{\textrm{call}}) \\ \sigma_{\textrm{poss,c}} & \sim \textrm{HalfNormal}(5) \\ \beta^{\textrm{poss}}_{\textrm{t,r,c}} & \sim N(0, \sigma_{\textrm{poss,c}}) \\ \eta^{\textrm{game}}_k & = \beta^{\textrm{season}}_{s(k)} + \beta^{\textrm{call}}_{c(k)} + \beta^{\textrm{poss}}_{\textrm{t(k),r(k),c(k)}}\\ p_k & = \textrm{sigmoid}\left(\eta^{\textrm{game}}_k\right) \\ y_k & \sim \textrm{Bernoulli}(p_k) . \end{aligned}
class PossesionModel:    def __init__(self, df):        self.df = df        self.n_season = len(self.df["season"].unique())        self.n_call_type = len(self.df["call_type"].unique())        self.n_trailing_poss = len(self.df["trailing_poss"].unique())        self.n_remaining_poss = len(self.df["remaining_poss"].unique())    def __repr__(self):        return ""    @bm.random_variable    def beta_season(self) -> RVIdentifier:        return dist.Normal(0.0, 5.0).expand((self.n_season,))    @bm.random_variable    def sigma_call(self) -> RVIdentifier:        return dist.HalfNormal(5.0)    @bm.random_variable    def beta_call(self) -> RVIdentifier:        return dist.Normal(0, 1).expand((self.n_call_type,))    @bm.random_variable    def sigma_poss(self) -> RVIdentifier:        return dist.HalfNormal(5.0).expand((1, 1, self.n_call_type))    @bm.random_variable    def beta_poss(self) -> RVIdentifier:        return dist.Normal(0, 1).expand(            (self.n_trailing_poss, self.n_remaining_poss, self.n_call_type)        )    @bm.functional    def p(self) -> RVIdentifier:        b_season = self.beta_season()        b_call = self.beta_call() * self.sigma_call()        b_poss = self.beta_poss() * self.sigma_poss()        eta_game = (            b_season[self.df["season_id"].values]            + b_call[self.df["call_type_id"].values]            + b_poss[                self.df["trailing_poss_id"].values,                self.df["remaining_poss_id"].values,                self.df["call_type_id"].values,            ]        )        return torch.sigmoid(eta_game)    @bm.random_variable    def y(self) -> RVIdentifier:        return dist.Bernoulli(self.p())

You may have noticed that the Bean Machine program is not a direct translation of the model. We do this since the NUTS sampler works best when most of the latent variables are at roughly the same scale. So instead of sampling from $x \sim \mathcal{N}(0, \sigma)$, we sample from $y \sim \mathcal{N}(0,1)$ and assign $x = y \cdot \sigma$. The effective distribution of $x$ remains the same, but the scale of the algorithmically sampled random variable $y$ is no longer directly dependent on $\sigma$. This is called a non-centered reparameterization and is an encouraged change to make to any model you intend to use with NUTS.

poss_model = PossesionModel(df)poss_model_queries = [poss_model.p()]poss_model_observations = {    poss_model.y(): torch.tensor(df["foul_called"].astype(float).values)}
poss_samples = bm.GlobalNoUTurnSampler(target_accept_prob=0.90, nnc_compile=nnc_compile).infer(    queries=poss_model_queries,    observations=poss_model_observations,    num_samples=num_samples,    num_chains=num_chains,    num_adaptive_samples=num_adaptive_samples,)
Out:
Samples collected:   0%|          | 0/1500 [00:00<?, ?it/s]Samples collected:   0%|          | 0/1500 [00:00<?, ?it/s]Samples collected:   0%|          | 0/1500 [00:00<?, ?it/s]
poss_trace = poss_samples.to_inference_data()

When we check diagnostics, we see the $ess$ is decent and $\hat{R}$ isn't too high.

poss_summary_df = az.summary(poss_trace, round_to=3)
idx = poss_summary_df.index.tolist()[:5] + poss_summary_df.index.tolist()[-5:]Markdown(poss_summary_df.loc[idx].to_markdown())
meansdhdi_5.5%hdi_94.5%mcse_meanmcse_sdess_bulkess_tailr_hat
p(,)[0]0.5090.0560.420.5980.0140.0116.027235.1051.124
p(,)[1]0.4160.0490.3350.4970.0110.00819.56358.2731.109
p(,)[2]0.4060.0440.3420.4680.0040.003123.77999.4811.032
p(,)[3]0.2130.040.150.2660.010.00714.438433.811.152
p(,)[4]0.2110.0420.1540.2740.0110.00816.57671.7691.126
p(,)[8625]0.8490.0180.820.8750.0010.001231.829561.6241.02
p(,)[8626]0.270.0310.2150.3160.0030.00284.273163.171.032
p(,)[8627]0.8490.0180.820.8750.0010.001231.829561.6241.02
p(,)[8628]0.2580.0350.2080.3140.0090.00615.71932.6251.124
p(,)[8629]0.2990.0310.2530.3490.0040.00355.64272.8191.058
# Required for visualizing in Colab.output_notebook(hide_banner=True)poss_autocorr_plot = gridplot(list(az.plot_autocorr(poss_trace, show=False)))show(poss_autocorr_plot)
phat = poss_samples[poss_model.p()]

We also check if the per-season residuals have improved.

resid_df = df.assign(p_hat=phat.numpy().mean(axis=0).mean(axis=0)).assign(    resid=lambda df: df.foul_called - df.p_hat)
print("Possession model")possession_model_resid_df = resid_df.pivot_table("resid", "season")display(Markdown(possession_model_resid_df.to_markdown()))print("Basic model")display(Markdown(basic_model_resid_df.to_markdown()))
Out:
Possession model
seasonresid
2015-20160.000231191
2016-2017-0.000212147
Out:
Basic model
seasonresid
2015-2016-0.000241572
2016-2017-0.000126373

We also check if our residuals are sensitive to how many seconds are left in the game. Since they are not, it means our possession model better captures the connection between fouls and how many seconds remain in the game.

# Required for visualizing in Colab.output_notebook(hide_banner=True)poss_resid_plot = nba.plot_possession_model_residuals(resid_df)show(poss_resid_plot)

We now compare these the possession model to the basic model. To do this we will use the Watanabe-Akaike or widely available information criterion (WAIC). An information criterion is a way of scoring how well does a model explain some data. WAIC in particular is defined as:

$\text{WAIC} = -2 \frac{1}{n} \sum_{i=1}^n \log \mu(y_i) - \sigma_{\text{log}}^2(y_i) \\ \mu(y_i) = \mathbb{E}_{p(\theta \mid y)}\left[p(y_i \mid \theta)\right] \\ \sigma_{\text{log}}^2(y_i) = \mathbb{E}_{p(\theta \mid y)}(\log p(y_i \mid \theta))$

We can approximate the expectations in this equation using samples from our posterior.

When we compare these models we see that the possession model better explains the data and is given a higher rank.

Markdown(az.compare({"possession": poss_trace, "basic": basic_trace}, ic="waic").to_markdown())
rankwaicp_waicd_waicweightsedsewarningwaic_scale
possession0-4849.776.648300.500Truelog
basic1-5576.190.911201726.4940.500Truelog

## Player item-response theory model​

We now try to improve our model once more by modeling the players involved in the foul calls. To do this we will associate with each player two latent variables, a propensity to foul another player $a$, and a propensity to be fouled by another player $b$. We can then have of a latent variable that for pairs of players $i$ and $j$ that models that interaction $\eta^{\textrm{player}}_{i,j} = a_i - b_j$. We further model this propensity on a per-season basis. This new factor is added to the $\eta$ from the possession model where the sum is passed into a sigmoid to turn it into a probability $p_k$. As in all the other models, this $p_k$ is passed to a Bernoulli which is observed.

\begin{aligned} \sigma_{\textrm{a}} & \sim \textrm{HalfNormal}(5) \\ a^{\textrm{player}}_{\textrm{i,s}} & \sim N(0, \sigma_{\textrm{a}}) \\ \sigma_{\textrm{b}} & \sim \textrm{HalfNormal}(5) \\ b^{\textrm{player}}_{\textrm{j,s}} & \sim N(0, \sigma_{\textrm{b}}) \\ \beta^{\textrm{season}}_s & \sim N(0, 5) \\ \sigma_{\textrm{call}} & \sim \textrm{HalfNormal}(5) \\ \beta^{\textrm{call}}_{c} & \sim N(0, \sigma_{\textrm{call}}) \\ \sigma_{\textrm{poss,c}} & \sim \textrm{HalfNormal}(5) \\ \beta^{\textrm{poss}}_{\textrm{t,r,c}} & \sim N(0, \sigma_{\textrm{poss,c}}) \\ \eta^{\textrm{player}}_{l, s} & = a^{\textrm{player}}_{l, s} - b^{\textrm{player}}_{l, s} \\ \eta^{\textrm{game}}_k & = \beta^{\textrm{season}}_{s(k)} + \beta^{\textrm{call}}_{c(k)} + \beta^{\textrm{poss}}_{\textrm{t(k),r(k),c(k)}} + \eta^{\textrm{player}}_{l(k), s(k)}\\ p_k & = \textrm{sigmoid}\left(\eta^{\textrm{game}}_k\right) \\ y_k & \sim \textrm{Bernoulli}(p_k) . \end{aligned}
class IRTModel:    def __init__(self, df):        self.df = df        self.n_season = len(self.df["season"].unique())        self.n_player = len(            set(df["committing_player"].tolist() + df["disadvantaged_player"].tolist())        )        self.n_call_type = len(self.df["call_type"].unique())        self.n_trailing_poss = len(self.df["trailing_poss"].unique())        self.n_remaining_poss = len(self.df["remaining_poss"].unique())    def __repr__(self):        return ""    @bm.random_variable    def sigma_a(self) -> RVIdentifier:        return dist.HalfNormal(5.0)    @bm.random_variable    def a_player(self) -> RVIdentifier:        return dist.Normal(0.0, 1.0).expand((self.n_player, self.n_season))    @bm.random_variable    def sigma_b(self) -> RVIdentifier:        return dist.HalfNormal(1.0)    @bm.random_variable    def b_player(self) -> RVIdentifier:        return dist.Normal(0.0, 1.0).expand((self.n_player, self.n_season))    @bm.random_variable    def beta_season(self) -> RVIdentifier:        return dist.Normal(0.0, 5.0).expand((self.n_season,))    @bm.random_variable    def sigma_call(self) -> RVIdentifier:        return dist.HalfNormal(5.0)    @bm.random_variable    def beta_call(self) -> RVIdentifier:        return dist.Normal(0, 1).expand((self.n_call_type,))    @bm.random_variable    def sigma_poss(self) -> RVIdentifier:        return dist.HalfNormal(5.0).expand((1, 1, self.n_call_type))    @bm.random_variable    def beta_poss(self) -> RVIdentifier:        return dist.Normal(0, 1).expand(            (self.n_trailing_poss, self.n_remaining_poss, self.n_call_type)        )    @bm.functional    def p(self) -> RVIdentifier:        b_season = self.beta_season()        b_call = self.beta_call() * self.sigma_call()        b_poss = self.beta_poss() * self.sigma_poss()        a_player = self.a_player() * self.sigma_a()        b_player = self.b_player() * self.sigma_b()        season = self.df["season_id"].values        player_disadvantaged = self.df["disadvantaged_player_id"].values        player_committing = df["committing_player_id"].values        eta_player = (            a_player[player_disadvantaged, season] - b_player[player_committing, season]        )        eta_game = (            b_season[season]            + b_call[self.df["call_type_id"].values]            + b_poss[                self.df["trailing_poss_id"].values,                self.df["remaining_poss_id"].values,                self.df["call_type_id"].values,            ]        ) + eta_player        return torch.sigmoid(eta_game)    @bm.random_variable    def y(self) -> RVIdentifier:        return dist.Bernoulli(self.p())
num_samples = 1 if smoke_test else 1500num_adaptive_samples = 1 if smoke_test else num_samples // 2irt_model = IRTModel(df)observations = {irt_model.y(): torch.tensor(df["foul_called"].astype(float).values)}irt_samples = bm.GlobalNoUTurnSampler(target_accept_prob=0.90, nnc_compile=nnc_compile).infer(    queries=[irt_model.p()],    observations=observations,    num_samples=num_samples,    num_chains=num_chains,    num_adaptive_samples=num_adaptive_samples,)
Out:
Samples collected:   0%|          | 0/2250 [00:00<?, ?it/s]Samples collected:   0%|          | 0/2250 [00:00<?, ?it/s]Samples collected:   0%|          | 0/2250 [00:00<?, ?it/s]
irt_trace = irt_samples.to_inference_data()
irt_summary_df = az.summary(irt_trace)
idx = irt_summary_df.index.tolist()[:5] + irt_summary_df.index.tolist()[-5:]Markdown(irt_summary_df.loc[idx].to_markdown())
meansdhdi_5.5%hdi_94.5%mcse_meanmcse_sdess_bulkess_tailr_hat
p(,)[0]0.4140.0860.2790.5510.0010.001526532891
p(,)[1]0.3610.0770.2330.4810.0010.001796434891
p(,)[2]0.3980.0790.2650.5150.0010.001671333761
p(,)[3]0.1640.0510.0840.2380.0010.001429233231
p(,)[4]0.180.0530.0910.2550.0010.001579531821
p(,)[8625]0.8510.0430.7870.9170.0010731029971
p(,)[8626]0.2540.0660.1460.3560.0010.001838234771
p(,)[8627]0.850.0430.7870.91800810033171
p(,)[8628]0.2780.0680.1610.3780.0010.001762732481
p(,)[8629]0.2670.0640.1630.3650.0010.001729431291

The effective sample size is decently large and the $\hat{R}$ is close to 1, hence it seems likely that this model fitted the data well using NUTS.

Let's see how it compares with the previous models we defined and explored.

phat = irt_samples[irt_model.p()]resid_df = df.assign(p_hat=phat.numpy().mean(axis=0).mean(axis=0)).assign(    resid=lambda df: df.foul_called - df.p_hat)
# Required for visualizing in Colab.output_notebook(hide_banner=True)irt_resid_plot = nba.plot_irt_residuals(resid_df)show(irt_resid_plot)
irt_resid_df = resid_df.pivot_table("resid", "season")print("Possession model")display(Markdown(possession_model_resid_df.to_markdown()))print("Basic model")display(Markdown(basic_model_resid_df.to_markdown()))print("IRT model")display(Markdown(irt_resid_df.to_markdown()))
Out:
Possession model
seasonresid
2015-20160.000231191
2016-2017-0.000212147
Out:
Basic model
seasonresid
2015-2016-0.000241572
2016-2017-0.000126373
Out:
IRT model
seasonresid
2015-2016-0.000191984
2016-20170.000177422

The residuals are not demonstratably smaller than the possession model, but they are still smaller than the basic model. Finally, let's compare all three models using WAIC.

Markdown(    az.compare(        {"irt": irt_trace, "possession": poss_trace, "basic": basic_trace},        ic="waic",    ).to_markdown())
rankwaicp_waicd_waicweightsedsewarningwaic_scale
possession0-4849.776.648300.33333300Truelog
irt1-5087.1439.29237.4020.33333300Truelog
basic2-5576.190.911201726.4940.33333300Truelog

We can see the IRT model is slightly better than the possession model, and both are better than the basic model.

## Conclusion​

With this tutorial we analyzed NBA foul data and came up with three models to predict when fouls happen. The best model explicitly models the players themselves and their own propensity to foul and be fouled. IRT models have a rich history in psychometrics and educational testing. If you have ever taken a standardized test, an IRT model was used to convert your test answers into an inference about your aptitude in math or reading comprehension.

## References​

1. Austin Rochford's 2018 analysis https://austinrochford.com/posts/2018-02-04-nba-irt-2.html
2. 2015/2016 NBA season games https://github.com/polygraph-cool/last-two-minute-report
3. Last 2 minute reports http://official.nba.com/2017-18-nba-officiating-last-two-minute-reports/
4. Practical Issues in Implementing and Understanding Bayesian Ideal Point Estimation
5. Bayesian Item Response Modeling—Theory and Applications
6. NBA's Last Two Minute Report
7. Vehtari A, Gelman A, Simpson D, Carpenter B, Bürkner PC (2021) Rank-Normalization, Folding, and Localization: An Improved $\hat{R}$ for Assessing Convergence of MCMC (with Discussion). Bayesian Analysis 16(2) 667–718. doi: 10.1214/20-BA1221.