Skip to main content

Hierarchical regression

Tutorial: Hierarchical regression

In this tutorial we will explore linear regression in combination with hierarchical priors. We will be using data from Gelman and Hill on radon levels found in buildings in Minnesota; Hill J and Gelman A. The original paper about radon concentrations in Minnesota can be found in Price. Measurements on radon activity were taken for many of the states, however, we will only focus on Minnesota in order to keep the tutorial simple. The observations we wish to model are the radon activity found in buildings, dependent on the floor the measurement was taken and the mean county's concentration of uranium found in the soil. The covariates, floor-level and uranium concentration, are predictors for the amount of radon activity found in buildings. Each building's floor-level will be used as individual-level predictors for our model and the county's uranium concentration will be used as our group-level predictor.

Learning Outcomes

On completion of this tutorial, you should be able:

  • to prepare data for running a hierarchical regression with Bean Machine;
  • to execute a hierarchical regression with beanmachine;
  • to run diagnostics to understand what beanmachine is doing.


Our problem is to perform posterior inference over a linear model based on a hierarchical prior. Our data contains individual-level predictor measurements along with group-level predictor measurements. We will use concepts taken from the Hierarchical Model tutorial, and apply them here.


We will be using the following packages within this tutorial.

  • arviz and bokeh for interactive visualizations; and
  • pandas for data manipulation.
# Install Bean Machine in Colab if using Colab.
import sys

if "google.colab" in sys.modules and "beanmachine" not in sys.modules:
!pip install beanmachine
import os
import arviz as az
import beanmachine.ppl as bm
import pandas as pd
import torch
import torch.distributions as dist
from beanmachine.ppl.model import RVIdentifier
from beanmachine.tutorials.utils import radon
from import output_notebook
from bokeh.plotting import gridplot, show
from IPython.display import Markdown
from torch import tensor

The next cell includes convenient configuration settings to improve the notebook presentation as well as setting a manual seed for torch for reproducibility.

# Plotting settings
az.rcParams["plot.backend"] = "bokeh"
az.rcParams["stats.hdi_prob"] = 0.89
az.rcParams["plot.max_subplots"] = 85

# Manual seed for torch

# Other settings for the notebook
smoke_test = "SANDCASTLE_NEXUS" in os.environ or "CI" in os.environ


We will explore the data for the tutorial first so we can create a data story around them. The story will be used to inform decisions we will make about priors and sampling distributions. The unprocessed data consist of two different files, the srrs2.dat and the cty.dat, which can be found on Gelman's website. We will import a preprocessed version ready for our Bean Machine model here.

df = radon.load_data()

The data consist of the following columns.

Column nameDescription
county_indexAn index associated with each county in Minnesota.
countyThe name of the county in Minnesota.
floorThe floor on which radon activity was measured.
activityThe measured radon activity (pCi / L)
log_activitylog(activity) individual-level predictor
UppmMean concentration of uranium in a county's soil.
log_Uppmlog(Uppm) group-level predictor

The activity measurement has units of pico-Curies per liter, while the uranium is measured in parts-per-million. We will be using the log(activity) and log(Uppm) in our model, not the raw activity or Uppm values. Below we plot the raw data and the log of the raw data.

# Needed for Colab

log_plot_comparison = radon.log_plot_comparison(df["activity"])

Both plots show the histogram and the kernel density estimate of the radon activity data. The left plot shows the raw activity data, while the right plot shows the logarithm of the activity data. The shape of the data in the right plot appears to be normally distributed, and we will use this observation in our model, when we describe the model below.

We can see how the logarithm of radon activity changes as you change your floor level in any particular building in our data. We plot the log_activity vs floor below.

The floor plot is showing all measurements within the data (the markers) for each floor. We have added a small amount of horizontal jitter to the data to prevent all markers from lining up vertically. The lines show the kernel density estimate (KDE) with the horizontal dashed lines indicating the maximum value of the KDE. We can see from this plot that radon levels decrease as you move away from the basement.

# Needed for Colab

floor_plot = radon.floor_plot(df)


Our model will estimate the effects that individual-level predictors and group-level predictors have on estimating radon activity. We will incorporate the individual-level predictors found from the floor values in the srrs data, and the group-level predictors found from the uranium values in the cty data. In order to do that, we turn our earlier observation—that the response yy (the logarithmic activity level) is approximately normally distributed—into a modeling assumption in terms of a prior over yy. Thus we can state that

yNormal(μy,σy).y\sim\text{Normal}(\mu_y, \sigma_y).

We will incorporate the different predictors as linear effects into our model by noting that any effect a predictor has on our response yy, will effect the mean value μy\mu_y. Thus

μy=α+βx+γu+ϵy,\mu_y=\alpha + \beta x + \gamma u + \epsilon_y,


α=interceptβ=coefficient for log(activity)x=individual-level predictor (floor)Rγ=coefficient for log(Uppm)u=group-level predictorsRϵy=residual.\begin{aligned} \alpha &= \text{intercept} \\ \beta &= \text{coefficient for log(activity)} \\ x &= \text{individual-level predictor (floor)}\in\mathbb{R} \\ \gamma &= \text{coefficient for log(Uppm)} \\ u &= \text{group-level predictors}\in\mathbb{R} \\ \epsilon_y &= \text{residual}. \end{aligned}

The current model description above pools all data together and assumes that parameters for each county ii (αi,βi,γi)(\alpha_i, \beta_i, \gamma_i) are being sampled from the same distribution for that parameter. We know that each county will be different so we include a partial-pooling step in the model for just the α\alpha parameter. We could create a hierarchical level for each parameter, but we will keep the model simple.

The full generative model is given below.

σαHalfCauchy(5)μαNormal(0,10)αi[n]Normal(μα,σα)βNormal(0,10)γNormal(0,10)μyn=αi[n]+βxn+γui[n]σynHalfCauchy(5)ynNormal(μyn,σy),\begin{aligned} \sigma_{\alpha} &\sim \text{HalfCauchy}(5)\\ \mu_{\alpha} &\sim \text{Normal}(0, 10)\\ \alpha_{i[n]} &\sim \text{Normal}(\mu_{\alpha}, \sigma_{\alpha})\\ \beta &\sim \text{Normal}(0, 10)\\ \gamma &\sim \text{Normal}(0, 10)\\ \mu_{y_n} &= \alpha_{i[n]} + \beta x_n + \gamma u_{i[n]}\\ \sigma_{y_n} &\sim \text{HalfCauchy}(5)\\ y_n &\sim \text{Normal}(\mu_{y_n}, \sigma_y), \end{aligned}

NOTE We can implement this model in Bean Machine by defining random variable objects with the decorators @bm.random_variable and @bm.functional. These functions behave differently than ordinary Python functions.

Semantics for @bm.random_variable functions:
  • They must return PyTorch Distribution objects.
  • Though they return distributions, callees actually receive samples from the distribution. The machinery for obtaining samples from distributions is handled internally by Bean Machine.
  • Inference runs the model through many iterations. During a particular inference iteration, a distinct random variable will correspond to exactly one sampled value: calls to the same random variable function with the same arguments will receive the same sampled value within one inference iteration. This makes it easy for multiple components of your model to refer to the same logical random variable.
  • Consequently, to define distinct random variables that correspond to different sampled values during a particular inference iteration, an effective practice is to add a dummy "indexing" parameter to the function. Distinct random variables can be referred to with different values for this index.
  • Please see the documentation for more information about this decorator.
Semantics for @bm.functional:
  • This decorator is used to deterministically transform the results of one or more random variables.

In the below Bean Machine model we are taking advantage of batching in pytorch by using the .expand() attribute on our distributions. In our definition of alpha, we expand the random variable to be the size of the number of counties in our data; J=85J=85. We then use indexing on alpha in our definition of alpha_hat to create a tensor that is the size of our observations.

J = df["county"].unique().shape[0]
county_index = df["county_index"].tolist()
floor = tensor(df["floor"].tolist())
log_uranium = tensor(df["log_Uppm"].tolist())

def mu_alpha() -> RVIdentifier:
return dist.Normal(0, 10)

def sigma_alpha() -> RVIdentifier:
return dist.HalfCauchy(5)

def alpha() -> RVIdentifier:
return dist.Normal(mu_alpha(), sigma_alpha()).expand((J,))

def gamma() -> RVIdentifier:
return dist.Normal(0, 10)

def beta() -> RVIdentifier:
return dist.Normal(0, 10)

def sigma_y() -> RVIdentifier:
return dist.HalfCauchy(5)

def alpha_hat():
return alpha()[county_index] + gamma() * log_uranium

def y_hat():
return alpha_hat() + beta() * floor

def y() -> RVIdentifier:
return dist.Normal(y_hat(), sigma_y())

Note that the α\alpha random variable is expanding the returned tensor so that it is the same size as the number of counties we have in the data (J). Defining the model in this way allows us to run inference on yy in batches.


The priors we have chosen for the model consist of the Normal and HalfCauchy distributions. Both of these priors are weakly informative to our model. A few plots showing what these priors look like are shown below.

# Needed for Colab

sample_of_priors = radon.sample_of_priors()


We are ready to run inference on our model and observations. Bean Machine supports a rich library of inference algorithms that share a common infer method with the following arguments:

queriesList of @bm.random_variable targets to fit posterior distributions for.
observationsA dictionary of observations, as built above.
num_samplesNumber of Monte Carlo samples to approximate the posterior distributions for the variables in queries.
num_chainsNumber of separate inference runs to use. Multiple chains can help verify that inference ran correctly.

For this particular problem, we will use the GlobalNoUTurnSampler inference method. Bean Machine's GlobalNoUTurnSampler is the NUTS sampler you have probably used before with stan or pymc3. We have chosen to use the NUTS sampler here because it can be easily compared to other probabilistic tools.

num_samples = 2 if smoke_test else 5000
num_adaptive_samples = 0 if smoke_test else num_samples // 2
num_chains = 4
# Experimental backend using the Pytorch NNC compiler
nnc_compile = "SANDCASTLE_NEXUS" not in os.environ

# Data for the model.
observations = {y(): tensor(df["log_activity"].values)}

# The parameters to our model are our queries.
queries = [

# Run the inference.
samples = bm.GlobalNoUTurnSampler(nnc_compile=nnc_compile).infer(

Samples collected: 0%| | 0/7500 [00:00<?, ?it/s]

Samples collected: 0%| | 0/7500 [00:00<?, ?it/s]

Samples collected: 0%| | 0/7500 [00:00<?, ?it/s]

Samples collected: 0%| | 0/7500 [00:00<?, ?it/s]


We begin our analysis by printing out summary statistics. Two important statistics to take note of are the R^\hat{R} (r_hat) and effective sample size (ess) values in the below dataframe, see Vehtari et al.

summary_df = az.summary(samples.to_xarray(), round_to=3)

R^\hat{R} diagnostic

R^\hat{R} is a diagnostic tool that measures the between- and within-chain variances. It is a test that indicates a lack of convergence by comparing the variance between multiple chains to the variance within each chain. If the parameters are successfully exploring the full space for each chain, then R^1\hat{R}\approx 1, since the between-chain and within-chain variance should be equal. R^\hat{R} is calculated as


where WW is the within-chain variance and V^\hat{V} is the posterior variance estimate for the pooled rank-traces. The take-away here is that R^\hat{R} converges towards 1 when each of the Markov chains approaches perfect adaptation to the true posterior distribution. We do not recommend using inference results if R^>1.01\hat{R}>1.01. More information about R^\hat{R} can be found in the Vehtari et al paper.

Effective sample size essess diagnostic

MCMC samplers do not draw independent samples from the target distribution, which means that our samples are correlated. In an ideal situation all samples would be independent, but we do not have that luxury. We can, however, measure the number of effectively independent samples we draw, which is called the effective sample size. You can read more about how this value is calculated in the Vehtari et al paper, briefly it is a measure that combines information from the R^\hat{R} value with the autocorrelation estimates within the chains. There are many ways to estimate effective samples sizes, however, we will be using the method defined in the Vehtari et al paper.

The rule of thumb for ess_bulk is for this value to be greater than 100 per chain on average. Since we ran four chains, we need ess_bulk to be greater than 400 for each parameter. The ess_tail is an estimate for effectively independent samples considering the more extreme values of the posterior. This is not the number of samples that landed in the tails of the posterior. It is a measure of the number of effectively independent samples if we sampled the tails of the posterior. The rule of thumb for this value is also to be greater than 100 per chain on average.

We can use arviz to plot the evolution of the effective sample size for each of our parameters. The below plots show over each successive draw from our model, the estimated effective sample size for our parameters. The red horizontal dashed line shows the rule-of-thumb value needed for both the bulk and tail ess estimates. When the model is converging properly, both the bulk and tail lines should be roughly linear.

# Needed for Colab

keys = ["μ_α", "σ_α", "β", "γ", "σ_y"]
values = [mu_alpha(), sigma_alpha(), beta(), gamma(), sigma_y()]
parameters = dict(zip(keys, values))
plots = []
for title, parameter in parameters.items():
data = {title: samples.get(parameter).numpy()}
f = az.plot_ess(data, kind="evolution", show=False)[0][0]
f.plot_width = 275
f.plot_height = 275
f.y_range.start = 0
f.outline_line_color = "black"
f.grid.grid_line_alpha = 0.2
f.grid.grid_line_color = "grey"
f.grid.grid_line_width = 0.2
query_ess = gridplot([plots[:3], plots[3:]])

Rather than plot all effective sample size evolutions for each county's α\alpha parameter, we will do so for a sample.

# Needed for Colab

sample_counties = {
item["county_index"]: item["county"]
for item in df[
][["county_index", "county"]]
data = {"α": samples.get(alpha()).numpy()}
alpha_ess = az.plot_ess(data, kind="evolution", show=False).reshape(-1)

plots = []
for index, county in sample_counties.items():
f = alpha_ess[index]
f.title.text = f"α[{county}]"
f.plot_width = 275
f.plot_height = 275
f.y_range.start = 0
f.outline_line_color = "black"
f.grid.grid_line_alpha = 0.2
f.grid.grid_line_color = "grey"
f.grid.grid_line_width = 0.2
sample_counties_ess = gridplot([plots[:3], plots[3:6], plots[6:]])

Combined with our above analysis, we can show the rank plots of our model parameters. You might be familiar with trace plots that show parameter posterior distributions along with the "fuzzy caterpillar" plots of sampled values. Rank plots perform the same function as trace plots in that they show problems in the mixing of the chains. Rank plots for our model's parameters are shown below.

# Needed for Colab

trace_ranks = radon.plot_trace_ranks(
keys=["μ_α", "σ_α", "β", "γ", "σ_y"],
values=[mu_alpha(), sigma_alpha(), beta(), gamma(), sigma_y()],

Each of the above rank plots shows fairly uniform histograms for each parameter.

Recall that β\beta is the coefficient for the log_activity data. When we plotted the logarithm of radon activity vs the floor measurement we saw that there was a decrease in log(radon)log(radon) as we moved from the basement to the first floor of a building. From our initial data analysis, we expected the coefficient to be negative. Plotting the posterior for β\beta (below) we can see that it is always less than zero with a mean value of -0.64 and the 89% highest density interval between [-0.75, -0.54]. Similarly we can plot the posterior distribution for the coefficient of log(uranium)log(uranium), which were our group-level predictors.

Why an 89% highest density interval (HDI)? To prevent you from thinking about a 95% confidence interval. See McElreath for further discussion on this subject.

# Needed for Colab

beta_posterior = az.plot_posterior({"β": samples.get(beta()).numpy()}, show=False,)[
gamma_posterior = az.plot_posterior({"γ": samples.get(gamma()).numpy()}, show=False,)[
beta_gamma = gridplot([[beta_posterior, gamma_posterior]])

Again we only show a sampling of the α\alpha parameters for the different counties in the model.

# Needed for Colab

alphas = samples.get(alpha()).numpy()
data = {
f"α[{county}]": alphas[:, :, index] for index, county in sample_counties.items()
county_posteriors = gridplot([*az.plot_posterior(data, show=False)])

We can inspect how the uranium concentration in a county's soil affects the intercept α\alpha by plotting our intercept estimate against the logarithm of uranium counts for the county. The whiskers/error-bars for each marker is its standard deviation from the posterior distribution.

# Needed for Colab

uranium = radon.uranium(summary_df, df)