Hierarchical regression

Tutorial: Hierarchical regression

In this tutorial we will explore linear regression in combination with hierarchical priors. We will be using data from Gelman and Hill on radon levels found in buildings in Minnesota; Hill J and Gelman A. The original paper about radon concentrations in Minnesota can be found in Price. Measurements on radon activity were taken for many of the states, however, we will only focus on Minnesota in order to keep the tutorial simple. The observations we wish to model are the radon activity found in buildings, dependent on the floor the measurement was taken and the mean county's concentration of uranium found in the soil. The covariates, floor-level and uranium concentration, are predictors for the amount of radon activity found in buildings. Each building's floor-level will be used as individual-level predictors for our model and the county's uranium concentration will be used as our group-level predictor.

Learning Outcomes

On completion of this tutorial, you should be able:

to prepare data for running a hierarchical regression with Bean Machine;
to execute a hierarchical regression with beanmachine;
to run diagnostics to understand what beanmachine is doing.

Problem

Our problem is to perform posterior inference over a linear model based on a hierarchical prior. Our data contains individual-level predictor measurements along with group-level predictor measurements. We will use concepts taken from the Hierarchical Model tutorial, and apply them here.

Prerequisites

We will be using the following packages within this tutorial.

arviz and bokeh for interactive visualizations; and
pandas for data manipulation.

# Install Bean Machine in Colab if using Colab.
import sys


if "google.colab" in sys.modules and "beanmachine" not in sys.modules:
    !pip install beanmachine

import os
import arviz as az
import beanmachine.ppl as bm
import pandas as pd
import torch
import torch.distributions as dist
from beanmachine.ppl.model import RVIdentifier
from beanmachine.tutorials.utils import radon
from bokeh.io import output_notebook
from bokeh.plotting import gridplot, show
from IPython.display import Markdown
from torch import tensor

The next cell includes convenient configuration settings to improve the notebook presentation as well as setting a manual seed for torch for reproducibility.

# Plotting settings
az.rcParams["plot.backend"] = "bokeh"
az.rcParams["stats.hdi_prob"] = 0.89
az.rcParams["plot.max_subplots"] = 85

# Manual seed for torch
torch.manual_seed(1199)

# Other settings for the notebook
smoke_test = "SANDCASTLE_NEXUS" in os.environ or "CI" in os.environ

Data

We will explore the data for the tutorial first so we can create a data story around them. The story will be used to inform decisions we will make about priors and sampling distributions. The unprocessed data consist of two different files, the srrs2.dat and the cty.dat, which can be found on Gelman's website. We will import a preprocessed version ready for our Bean Machine model here.

df = radon.load_data()
Markdown(df.head().set_index("county_index").to_markdown())

county_index	county	floor	activity	log_activity	Uppm	log_Uppm
0	AITKIN	1	2.2	0.832909	0.502054	-0.689048
0	AITKIN	0	2.2	0.832909	0.502054	-0.689048
0	AITKIN	0	2.9	1.09861	0.502054	-0.689048
0	AITKIN	0	1	0.0953102	0.502054	-0.689048
1	ANOKA	0	2.8	1.06471	0.428565	-0.847313

The data consist of the following columns.

Column name	Description
county_index	An index associated with each county in Minnesota.
county	The name of the county in Minnesota.
floor	The floor on which radon activity was measured.
activity	The measured radon activity (pCi / L)
log_activity	log(activity) `individual-level predictor`
Uppm	Mean concentration of uranium in a county's soil.
log_Uppm	log(Uppm) `group-level predictor`

The activity measurement has units of pico-Curies per liter, while the uranium is measured in parts-per-million. We will be using the log(activity) and log(Uppm) in our model, not the raw activity or Uppm values. Below we plot the raw data and the log of the raw data.

# Needed for Colab
output_notebook()

log_plot_comparison = radon.log_plot_comparison(df["activity"])
show(log_plot_comparison)

Both plots show the histogram and the kernel density estimate of the radon activity data. The left plot shows the raw activity data, while the right plot shows the logarithm of the activity data. The shape of the data in the right plot appears to be normally distributed, and we will use this observation in our model, when we describe the model below.

We can see how the logarithm of radon activity changes as you change your floor level in any particular building in our data. We plot the log_activity vs floor below.

The floor plot is showing all measurements within the data (the markers) for each floor. We have added a small amount of horizontal jitter to the data to prevent all markers from lining up vertically. The lines show the kernel density estimate (KDE) with the horizontal dashed lines indicating the maximum value of the KDE. We can see from this plot that radon levels decrease as you move away from the basement.

# Needed for Colab
output_notebook()

floor_plot = radon.floor_plot(df)
show(floor_plot)

Model

Our model will estimate the effects that individual-level predictors and group-level predictors have on estimating radon activity. We will incorporate the individual-level predictors found from the floor values in the srrs data, and the group-level predictors found from the uranium values in the cty data. In order to do that, we turn our earlier observation—that the response $y$ (the logarithmic activity level) is approximately normally distributed—into a modeling assumption in terms of a prior over $y$ . Thus we can state that

y\sim\text{Normal}(\mu_y, \sigma_y).

We will incorporate the different predictors as linear effects into our model by noting that any effect a predictor has on our response $y$ , will effect the mean value $\mu_y$ . Thus

\mu_y=\alpha + \beta x + \gamma u + \epsilon_y,

where

\begin{aligned} \alpha &= \text{intercept} \\ \beta &= \text{coefficient for log(activity)} \\ x &= \text{individual-level predictor (floor)}\in\mathbb{R} \\ \gamma &= \text{coefficient for log(Uppm)} \\ u &= \text{group-level predictors}\in\mathbb{R} \\ \epsilon_y &= \text{residual}. \end{aligned}

The current model description above pools all data together and assumes that parameters for each county $i$ $(\alpha_i, \beta_i, \gamma_i)$ are being sampled from the same distribution for that parameter. We know that each county will be different so we include a partial-pooling step in the model for just the $\alpha$ parameter. We could create a hierarchical level for each parameter, but we will keep the model simple.

The full generative model is given below.

\begin{aligned} \sigma_{\alpha} &\sim \text{HalfCauchy}(5)\\ \mu_{\alpha} &\sim \text{Normal}(0, 10)\\ \alpha_{i[n]} &\sim \text{Normal}(\mu_{\alpha}, \sigma_{\alpha})\\ \beta &\sim \text{Normal}(0, 10)\\ \gamma &\sim \text{Normal}(0, 10)\\ \mu_{y_n} &= \alpha_{i[n]} + \beta x_n + \gamma u_{i[n]}\\ \sigma_{y_n} &\sim \text{HalfCauchy}(5)\\ y_n &\sim \text{Normal}(\mu_{y_n}, \sigma_y), \end{aligned}

NOTE We can implement this model in Bean Machine by defining random variable objects with the decorators @bm.random_variable and @bm.functional. These functions behave differently than ordinary Python functions.

Semantics for @bm.random_variable functions:

They must return PyTorch Distribution objects.
Though they return distributions, callees actually receive samples from the distribution. The machinery for obtaining samples from distributions is handled internally by Bean Machine.
Inference runs the model through many iterations. During a particular inference iteration, a distinct random variable will correspond to exactly one sampled value: calls to the same random variable function with the same arguments will receive the same sampled value within one inference iteration. This makes it easy for multiple components of your model to refer to the same logical random variable.
Consequently, to define distinct random variables that correspond to different sampled values during a particular inference iteration, an effective practice is to add a dummy "indexing" parameter to the function. Distinct random variables can be referred to with different values for this index.
Please see the documentation for more information about this decorator.

Semantics for @bm.functional:

This decorator is used to deterministically transform the results of one or more random variables.

In the below Bean Machine model we are taking advantage of batching in pytorch by using the .expand() attribute on our distributions. In our definition of alpha, we expand the random variable to be the size of the number of counties in our data; $J=85$ . We then use indexing on alpha in our definition of alpha_hat to create a tensor that is the size of our observations.

J = df["county"].unique().shape[0]
county_index = df["county_index"].tolist()
floor = tensor(df["floor"].tolist())
log_uranium = tensor(df["log_Uppm"].tolist())


@bm.random_variable
def mu_alpha() -> RVIdentifier:
    return dist.Normal(0, 10)


@bm.random_variable
def sigma_alpha() -> RVIdentifier:
    return dist.HalfCauchy(5)


@bm.random_variable
def alpha() -> RVIdentifier:
    return dist.Normal(mu_alpha(), sigma_alpha()).expand((J,))


@bm.random_variable
def gamma() -> RVIdentifier:
    return dist.Normal(0, 10)


@bm.random_variable
def beta() -> RVIdentifier:
    return dist.Normal(0, 10)


@bm.random_variable
def sigma_y() -> RVIdentifier:
    return dist.HalfCauchy(5)


@bm.functional
def alpha_hat():
    return alpha()[county_index] + gamma() * log_uranium


@bm.functional
def y_hat():
    return alpha_hat() + beta() * floor


@bm.random_variable
def y() -> RVIdentifier:
    return dist.Normal(y_hat(), sigma_y())

Note that the $\alpha$ random variable is expanding the returned tensor so that it is the same size as the number of counties we have in the data (J). Defining the model in this way allows us to run inference on $y$ in batches.

Priors

The priors we have chosen for the model consist of the Normal and HalfCauchy distributions. Both of these priors are weakly informative to our model. A few plots showing what these priors look like are shown below.

# Needed for Colab
output_notebook(hide_banner=True)

sample_of_priors = radon.sample_of_priors()
show(sample_of_priors)

Inference

We are ready to run inference on our model and observations. Bean Machine supports a rich library of inference algorithms that share a common infer method with the following arguments:

Name	Usage
`queries`	List of `@bm.random_variable` targets to fit posterior distributions for.
`observations`	A dictionary of observations, as built above.
`num_samples`	Number of Monte Carlo samples to approximate the posterior distributions for the variables in `queries`.
`num_chains`	Number of separate inference runs to use. Multiple chains can help verify that inference ran correctly.

For this particular problem, we will use the GlobalNoUTurnSampler inference method. Bean Machine's GlobalNoUTurnSampler is the NUTS sampler you have probably used before with stan or pymc3. We have chosen to use the NUTS sampler here because it can be easily compared to other probabilistic tools.

num_samples = 2 if smoke_test else 5000
num_adaptive_samples = 0 if smoke_test else num_samples // 2
num_chains = 4
# Experimental backend using the Pytorch NNC compiler
nnc_compile = "SANDCASTLE_NEXUS" not in os.environ

# Data for the model.
observations = {y(): tensor(df["log_activity"].values)}

# The parameters to our model are our queries.
queries = [
    alpha(),
    mu_alpha(),
    sigma_alpha(),
    alpha_hat(),
    beta(),
    gamma(),
    sigma_y(),
    y_hat(),
]

# Run the inference.
samples = bm.GlobalNoUTurnSampler(nnc_compile=nnc_compile).infer(
    queries=queries,
    observations=observations,
    num_samples=num_samples,
    num_chains=num_chains,
    num_adaptive_samples=num_adaptive_samples,
)

Out:

Samples collected:   0%|          | 0/7500 [00:00<?, ?it/s]
Samples collected:   0%|          | 0/7500 [00:00<?, ?it/s]
Samples collected:   0%|          | 0/7500 [00:00<?, ?it/s]
Samples collected:   0%|          | 0/7500 [00:00<?, ?it/s]

Out:

/Users/jpchen/beanmachine/src/beanmachine/ppl/experimental/nnc/utils.py:21: UserWarning: The support of NNC compiler is experimental and the API is subject tochange in the future releases of Bean Machine. For questions regarding NNC, pleasecheckout the functorch project (https://github.com/pytorch/functorch).

"The support of NNC compiler is experimental and the API is subject to"

Analysis

We begin our analysis by printing out summary statistics. Two important statistics to take note of are the $\hat{R}$ (r_hat) and effective sample size (ess) values in the below dataframe, see Vehtari et al.

summary_df = az.summary(samples.to_xarray(), round_to=3)
Markdown(summary_df.sample(20).T.to_markdown())

	alpha_hat()[565]	y_hat()[467]	alpha_hat()[248]	alpha_hat()[364]	y_hat()[509]	y_hat()[918]	y_hat()[746]	alpha_hat()[568]	alpha_hat()[94]	y_hat()[829]	y_hat()[260]	y_hat()[43]	alpha_hat()[825]	y_hat()[268]	alpha_hat()[576]	y_hat()[742]	alpha_hat()[376]	y_hat()[552]	alpha_hat()[518]	y_hat()[632]
mean	1.689	0.855	1.779	1.398	1.748	1.7	0.966	1.659	1.196	1.651	1.113	0.917	1.651	1.398	1.196	0.966	1.222	0.771	1.639	1.458
sd	0.146	0.139	0.142	0.063	0.149	0.158	0.065	0.153	0.13	0.139	0.139	0.086	0.139	0.063	0.098	0.065	0.142	0.158	0.155	0.132
hdi_5.5%	1.455	0.631	1.559	1.3	1.515	1.45	0.861	1.414	1.003	1.438	0.884	0.779	1.438	1.3	1.044	0.861	0.998	0.52	1.393	1.259
hdi_94.5%	1.913	1.069	2.007	1.499	1.979	1.947	1.071	1.898	1.414	1.875	1.326	1.052	1.875	1.499	1.355	1.071	1.452	1.019	1.875	1.669
mcse_mean	0.002	0.004	0.003	0.001	0.003	0.002	0.001	0.002	0.002	0.004	0.002	0.001	0.004	0.001	0.001	0.001	0.002	0.003	0.004	0.006
mcse_sd	0.001	0.003	0.002	0.001	0.002	0.002	0.001	0.002	0.002	0.003	0.002	0.001	0.003	0.001	0.001	0.001	0.001	0.002	0.003	0.004
ess_bulk	4414.84	1421.16	1702.2	6119.03	2941.53	5056.55	2508.64	5112.09	3544.69	1088.88	3278.22	5764.55	1088.88	6119.03	4347.95	2508.64	5918.05	4067.78	1534.74	532.076
ess_tail	5613.32	4614.7	5194.52	6294.34	5529.94	6458.84	4256.39	4780.85	5693.09	6977.78	2828.78	10452.3	6977.78	6294.34	7064.77	4256.39	7357.53	4266.7	6050	2130.18
r_hat	1.002	1.002	1.011	1.009	1.008	1.001	1.013	1.008	1.002	1.005	1.001	1.001	1.005	1.009	1.006	1.013	1.002	1.009	1.005	1.006

$\hat{R}$ diagnostic

$\hat{R}$ is a diagnostic tool that measures the between- and within-chain variances. It is a test that indicates a lack of convergence by comparing the variance between multiple chains to the variance within each chain. If the parameters are successfully exploring the full space for each chain, then $\hat{R}\approx 1$ , since the between-chain and within-chain variance should be equal. $\hat{R}$ is calculated as

\hat{R}=\frac{\hat{V}}{W}

where $W$ is the within-chain variance and $\hat{V}$ is the posterior variance estimate for the pooled rank-traces. The take-away here is that $\hat{R}$ converges towards 1 when each of the Markov chains approaches perfect adaptation to the true posterior distribution. We do not recommend using inference results if $\hat{R}>1.01$ . More information about $\hat{R}$ can be found in the Vehtari et al paper.

Effective sample size $ess$ diagnostic

MCMC samplers do not draw independent samples from the target distribution, which means that our samples are correlated. In an ideal situation all samples would be independent, but we do not have that luxury. We can, however, measure the number of effectively independent samples we draw, which is called the effective sample size. You can read more about how this value is calculated in the Vehtari et al paper, briefly it is a measure that combines information from the $\hat{R}$ value with the autocorrelation estimates within the chains. There are many ways to estimate effective samples sizes, however, we will be using the method defined in the Vehtari et al paper.

The rule of thumb for ess_bulk is for this value to be greater than 100 per chain on average. Since we ran four chains, we need ess_bulk to be greater than 400 for each parameter. The ess_tail is an estimate for effectively independent samples considering the more extreme values of the posterior. This is not the number of samples that landed in the tails of the posterior. It is a measure of the number of effectively independent samples if we sampled the tails of the posterior. The rule of thumb for this value is also to be greater than 100 per chain on average.

We can use arviz to plot the evolution of the effective sample size for each of our parameters. The below plots show over each successive draw from our model, the estimated effective sample size for our parameters. The red horizontal dashed line shows the rule-of-thumb value needed for both the bulk and tail ess estimates. When the model is converging properly, both the bulk and tail lines should be roughly linear.

# Needed for Colab
output_notebook(hide_banner=True)

keys = ["μ_α", "σ_α", "β", "γ", "σ_y"]
values = [mu_alpha(), sigma_alpha(), beta(), gamma(), sigma_y()]
parameters = dict(zip(keys, values))
plots = []
for title, parameter in parameters.items():
    data = {title: samples.get(parameter).numpy()}
    f = az.plot_ess(data, kind="evolution", show=False)[0][0]
    f.plot_width = 275
    f.plot_height = 275
    f.y_range.start = 0
    f.outline_line_color = "black"
    f.grid.grid_line_alpha = 0.2
    f.grid.grid_line_color = "grey"
    f.grid.grid_line_width = 0.2
    plots.append(f)
query_ess = gridplot([plots[:3], plots[3:]])
show(query_ess)

Rather than plot all effective sample size evolutions for each county's $\alpha$ parameter, we will do so for a sample.

# Needed for Colab
output_notebook(hide_banner=True)

sample_counties = {
    item["county_index"]: item["county"]
    for item in df[
        df["county"].isin(
            [
                "LAC QUI PARLE",
                "AITKIN",
                "KOOCHICHING",
                "DOUGLAS",
                "CLAY",
                "STEARNS",
                "RAMSEY",
                "ST. LOUIS",
            ]
        )
    ][["county_index", "county"]]
    .drop_duplicates()
    .to_dict(orient="records")
}
data = {"α": samples.get(alpha()).numpy()}
alpha_ess = az.plot_ess(data, kind="evolution", show=False).reshape(-1)

plots = []
for index, county in sample_counties.items():
    f = alpha_ess[index]
    f.title.text = f"α[{county}]"
    f.plot_width = 275
    f.plot_height = 275
    f.y_range.start = 0
    f.outline_line_color = "black"
    f.grid.grid_line_alpha = 0.2
    f.grid.grid_line_color = "grey"
    f.grid.grid_line_width = 0.2
    plots.append(f)
sample_counties_ess = gridplot([plots[:3], plots[3:6], plots[6:]])
show(sample_counties_ess)

Combined with our above analysis, we can show the rank plots of our model parameters. You might be familiar with trace plots that show parameter posterior distributions along with the "fuzzy caterpillar" plots of sampled values. Rank plots perform the same function as trace plots in that they show problems in the mixing of the chains. Rank plots for our model's parameters are shown below.

# Needed for Colab
output_notebook()

trace_ranks = radon.plot_trace_ranks(
    keys=["μ_α", "σ_α", "β", "γ", "σ_y"],
    values=[mu_alpha(), sigma_alpha(), beta(), gamma(), sigma_y()],
    samples=samples,
)
show(trace_ranks)

Each of the above rank plots shows fairly uniform histograms for each parameter.

Recall that $\beta$ is the coefficient for the log_activity data. When we plotted the logarithm of radon activity vs the floor measurement we saw that there was a decrease in $log(radon)$ as we moved from the basement to the first floor of a building. From our initial data analysis, we expected the coefficient to be negative. Plotting the posterior for $\beta$ (below) we can see that it is always less than zero with a mean value of -0.64 and the 89% highest density interval between [-0.75, -0.54]. Similarly we can plot the posterior distribution for the coefficient of $log(uranium)$ , which were our group-level predictors.

Why an 89% highest density interval (HDI)? To prevent you from thinking about a 95% confidence interval. See McElreath for further discussion on this subject.

# Needed for Colab
output_notebook()

beta_posterior = az.plot_posterior({"β": samples.get(beta()).numpy()}, show=False,)[
    0
][0]
gamma_posterior = az.plot_posterior({"γ": samples.get(gamma()).numpy()}, show=False,)[
    0
][0]
beta_gamma = gridplot([[beta_posterior, gamma_posterior]])
show(beta_gamma)

Again we only show a sampling of the $\alpha$ parameters for the different counties in the model.

# Needed for Colab
output_notebook()

alphas = samples.get(alpha()).numpy()
data = {
    f"α[{county}]": alphas[:, :, index] for index, county in sample_counties.items()
}
county_posteriors = gridplot([*az.plot_posterior(data, show=False)])
show(county_posteriors)

We can inspect how the uranium concentration in a county's soil affects the intercept $\alpha$ by plotting our intercept estimate against the logarithm of uranium counts for the county. The whiskers/error-bars for each marker is its standard deviation from the posterior distribution.

# Needed for Colab
output_notebook()

uranium = radon.uranium(summary_df, df)
show(uranium)

References

Gelman's website http://www.stat.columbia.edu/~gelman/arm/examples/radon/
Hill J and Gelman A (2006) Data Analysis Using Regression and Multilevel/Hierarchical Models. United States: Cambridge University Press. ISBN: 9781139460934.
McElreath R (2020) Statistical Rethinking: A Bayesian Course with Examples in R and Stan 2nd edition. Chapman and Hall/CRC. doi: 10.1201/9780429029608
Price PN, Nero AV, Gelman A (1996) Bayesian Prediction of Mean Indoor Radon Concentrations for Minnesota Counties. Health Physics 71(6). doi: 10.1097/00004032-199612000-00009
Wikipedia (Decay chain) https://en.wikipedia.org/wiki/Decay_chain#Uranium_series
Vehtari A, Gelman A, Simpson D, Carpenter B, Bürkner PC (2021) Rank-Normalization, Folding, and Localization: An Improved $\hat{R}$ for Assessing Convergence of MCMC (with Discussion). Bayesian Analysis 16(2) 667–718. doi: 10.1214/20-BA1221.

Tutorial: Hierarchical regression​

Learning Outcomes​

Problem​

Prerequisites​

Data​

Model​

Priors​

Inference​

Analysis​

R^\hat{R}R^ diagnostic​

Effective sample size essessess diagnostic​

References​