Hierarchical regression
Tutorial: Hierarchical regression​
In this tutorial we will explore linear regression in combination with hierarchical priors. We will be using data from Gelman and Hill on radon levels found in buildings in Minnesota; Hill J and Gelman A. The original paper about radon concentrations in Minnesota can be found in Price. Measurements on radon activity were taken for many of the states, however, we will only focus on Minnesota in order to keep the tutorial simple. The observations we wish to model are the radon activity found in buildings, dependent on the floor the measurement was taken and the mean county's concentration of uranium found in the soil. The covariates, floor-level and uranium concentration, are predictors for the amount of radon activity found in buildings. Each building's floor-level will be used as individual-level predictors for our model and the county's uranium concentration will be used as our group-level predictor.
Learning Outcomes​
On completion of this tutorial, you should be able:
- to prepare data for running a hierarchical regression with Bean Machine;
- to execute a hierarchical regression with beanmachine;
- to run diagnostics to understand what beanmachine is doing.
Problem​
Our problem is to perform posterior inference over a linear model based on a
hierarchical prior. Our data contains individual-level predictor measurements along with
group-level predictor measurements. We will use concepts taken from the
Hierarchical Model
tutorial, and apply them here.
Prerequisites​
We will be using the following packages within this tutorial.
# Install Bean Machine in Colab if using Colab.
import sys
if "google.colab" in sys.modules and "beanmachine" not in sys.modules:
!pip install beanmachine
import os
import arviz as az
import beanmachine.ppl as bm
import pandas as pd
import torch
import torch.distributions as dist
from beanmachine.ppl.model import RVIdentifier
from beanmachine.tutorials.utils import radon
from bokeh.io import output_notebook
from bokeh.plotting import gridplot, show
from IPython.display import Markdown
from torch import tensor
The next cell includes convenient configuration settings to improve the notebook
presentation as well as setting a manual seed for torch
for reproducibility.
# Plotting settings
az.rcParams["plot.backend"] = "bokeh"
az.rcParams["stats.hdi_prob"] = 0.89
az.rcParams["plot.max_subplots"] = 85
# Manual seed for torch
torch.manual_seed(1199)
# Other settings for the notebook
smoke_test = "SANDCASTLE_NEXUS" in os.environ or "CI" in os.environ
Data​
We will explore the data for the tutorial first so we can create a data story around
them. The story will be used to inform decisions we will make about priors and sampling
distributions. The unprocessed data consist of two different files, the srrs2.dat
and
the cty.dat
, which can be found on Gelman's website. We will import a
preprocessed version ready for our Bean Machine model here.
df = radon.load_data()
Markdown(df.head().set_index("county_index").to_markdown())
county_index | county | floor | activity | log_activity | Uppm | log_Uppm |
---|---|---|---|---|---|---|
0 | AITKIN | 1 | 2.2 | 0.832909 | 0.502054 | -0.689048 |
0 | AITKIN | 0 | 2.2 | 0.832909 | 0.502054 | -0.689048 |
0 | AITKIN | 0 | 2.9 | 1.09861 | 0.502054 | -0.689048 |
0 | AITKIN | 0 | 1 | 0.0953102 | 0.502054 | -0.689048 |
1 | ANOKA | 0 | 2.8 | 1.06471 | 0.428565 | -0.847313 |
The data consist of the following columns.
Column name | Description |
---|---|
county_index | An index associated with each county in Minnesota. |
county | The name of the county in Minnesota. |
floor | The floor on which radon activity was measured. |
activity | The measured radon activity (pCi / L) |
log_activity | log(activity) individual-level predictor |
Uppm | Mean concentration of uranium in a county's soil. |
log_Uppm | log(Uppm) group-level predictor |
The activity measurement has units of pico-Curies per liter, while the uranium is
measured in parts-per-million. We will be using the log(activity)
and log(Uppm)
in
our model, not the raw activity
or Uppm
values. Below we plot the raw data and the
log of the raw data.
# Needed for Colab
output_notebook()
log_plot_comparison = radon.log_plot_comparison(df["activity"])
show(log_plot_comparison)
Both plots show the histogram and the kernel density estimate of the radon activity data. The left plot shows the raw activity data, while the right plot shows the logarithm of the activity data. The shape of the data in the right plot appears to be normally distributed, and we will use this observation in our model, when we describe the model below.
We can see how the logarithm of radon activity changes as you change your floor level in
any particular building in our data. We plot the log_activity
vs floor
below.
The floor plot is showing all measurements within the data (the markers) for each floor. We have added a small amount of horizontal jitter to the data to prevent all markers from lining up vertically. The lines show the kernel density estimate (KDE) with the horizontal dashed lines indicating the maximum value of the KDE. We can see from this plot that radon levels decrease as you move away from the basement.
# Needed for Colab
output_notebook()
floor_plot = radon.floor_plot(df)
show(floor_plot)
Model​
Our model will estimate the effects that individual-level predictors and group-level
predictors have on estimating radon activity. We will incorporate the individual-level
predictors found from the floor values in the srrs
data, and the group-level
predictors found from the uranium values in the cty
data. In order to do that, we turn
our earlier observation—that the response (the logarithmic activity level) is
approximately normally distributed—into a modeling assumption in terms of a prior over
. Thus we can state that
We will incorporate the different predictors as linear effects into our model by noting that any effect a predictor has on our response , will effect the mean value . Thus
where
The current model description above pools all data together and assumes that parameters for each county are being sampled from the same distribution for that parameter. We know that each county will be different so we include a partial-pooling step in the model for just the parameter. We could create a hierarchical level for each parameter, but we will keep the model simple.
The full generative model is given below.
NOTE
We can implement this model in Bean Machine by defining random variable
objects with the decorators @bm.random_variable
and @bm.functional
. These functions
behave differently than ordinary Python functions.
@bm.random_variable
functions:- They must return PyTorch
Distribution
objects. - Though they return distributions, callees actually receive samples from the distribution. The machinery for obtaining samples from distributions is handled internally by Bean Machine.
- Inference runs the model through many iterations. During a particular inference iteration, a distinct random variable will correspond to exactly one sampled value: calls to the same random variable function with the same arguments will receive the same sampled value within one inference iteration. This makes it easy for multiple components of your model to refer to the same logical random variable.
- Consequently, to define distinct random variables that correspond to different sampled values during a particular inference iteration, an effective practice is to add a dummy "indexing" parameter to the function. Distinct random variables can be referred to with different values for this index.
- Please see the documentation for more information about this decorator.
@bm.functional
:- This decorator is used to deterministically transform the results of one or more random variables.
In the below Bean Machine model we are taking advantage of batching in pytorch
by
using the .expand()
attribute on our distributions. In our definition of alpha
, we
expand the random variable to be the size of the number of counties in our data; .
We then use indexing on alpha
in our definition of alpha_hat
to create a tensor that
is the size of our observations.
J = df["county"].unique().shape[0]
county_index = df["county_index"].tolist()
floor = tensor(df["floor"].tolist())
log_uranium = tensor(df["log_Uppm"].tolist())
@bm.random_variable
def mu_alpha() -> RVIdentifier:
return dist.Normal(0, 10)
@bm.random_variable
def sigma_alpha() -> RVIdentifier:
return dist.HalfCauchy(5)
@bm.random_variable
def alpha() -> RVIdentifier:
return dist.Normal(mu_alpha(), sigma_alpha()).expand((J,))
@bm.random_variable
def gamma() -> RVIdentifier:
return dist.Normal(0, 10)
@bm.random_variable
def beta() -> RVIdentifier:
return dist.Normal(0, 10)
@bm.random_variable
def sigma_y() -> RVIdentifier:
return dist.HalfCauchy(5)
@bm.functional
def alpha_hat():
return alpha()[county_index] + gamma() * log_uranium
@bm.functional
def y_hat():
return alpha_hat() + beta() * floor
@bm.random_variable
def y() -> RVIdentifier:
return dist.Normal(y_hat(), sigma_y())
Note that the random variable is expanding the returned tensor so that it is
the same size as the number of counties we have in the data (J
). Defining the model in
this way allows us to run inference on in batches.
Priors​
The priors we have chosen for the model consist of the Normal and HalfCauchy distributions. Both of these priors are weakly informative to our model. A few plots showing what these priors look like are shown below.
# Needed for Colab
output_notebook(hide_banner=True)
sample_of_priors = radon.sample_of_priors()
show(sample_of_priors)
Inference​
We are ready to run inference on our model and observations. Bean Machine supports a rich library of inference algorithms that share a common infer method with the following arguments:
Name | Usage |
---|---|
queries | List of @bm.random_variable targets to fit posterior distributions for. |
observations | A dictionary of observations, as built above. |
num_samples | Number of Monte Carlo samples to approximate the posterior distributions for the variables in queries . |
num_chains | Number of separate inference runs to use. Multiple chains can help verify that inference ran correctly. |
For this particular problem, we will use the GlobalNoUTurnSampler
inference method.
Bean Machine's GlobalNoUTurnSampler
is the NUTS sampler you have probably used before
with stan
or pymc3
. We have chosen to use the NUTS sampler here because it can be
easily compared to other probabilistic tools.
num_samples = 2 if smoke_test else 5000
num_adaptive_samples = 0 if smoke_test else num_samples // 2
num_chains = 4
# Experimental backend using the Pytorch NNC compiler
nnc_compile = "SANDCASTLE_NEXUS" not in os.environ
# Data for the model.
observations = {y(): tensor(df["log_activity"].values)}
# The parameters to our model are our queries.
queries = [
alpha(),
mu_alpha(),
sigma_alpha(),
alpha_hat(),
beta(),
gamma(),
sigma_y(),
y_hat(),
]
# Run the inference.
samples = bm.GlobalNoUTurnSampler(nnc_compile=nnc_compile).infer(
queries=queries,
observations=observations,
num_samples=num_samples,
num_chains=num_chains,
num_adaptive_samples=num_adaptive_samples,
)
Analysis​
We begin our analysis by printing out summary statistics. Two important statistics to
take note of are the (r_hat
) and effective sample size (ess
) values in the
below dataframe, see Vehtari et al.
summary_df = az.summary(samples.to_xarray(), round_to=3)
Markdown(summary_df.sample(20).T.to_markdown())
alpha_hat()[565] | y_hat()[467] | alpha_hat()[248] | alpha_hat()[364] | y_hat()[509] | y_hat()[918] | y_hat()[746] | alpha_hat()[568] | alpha_hat()[94] | y_hat()[829] | y_hat()[260] | y_hat()[43] | alpha_hat()[825] | y_hat()[268] | alpha_hat()[576] | y_hat()[742] | alpha_hat()[376] | y_hat()[552] | alpha_hat()[518] | y_hat()[632] | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | 1.689 | 0.855 | 1.779 | 1.398 | 1.748 | 1.7 | 0.966 | 1.659 | 1.196 | 1.651 | 1.113 | 0.917 | 1.651 | 1.398 | 1.196 | 0.966 | 1.222 | 0.771 | 1.639 | 1.458 |
sd | 0.146 | 0.139 | 0.142 | 0.063 | 0.149 | 0.158 | 0.065 | 0.153 | 0.13 | 0.139 | 0.139 | 0.086 | 0.139 | 0.063 | 0.098 | 0.065 | 0.142 | 0.158 | 0.155 | 0.132 |
hdi_5.5% | 1.455 | 0.631 | 1.559 | 1.3 | 1.515 | 1.45 | 0.861 | 1.414 | 1.003 | 1.438 | 0.884 | 0.779 | 1.438 | 1.3 | 1.044 | 0.861 | 0.998 | 0.52 | 1.393 | 1.259 |
hdi_94.5% | 1.913 | 1.069 | 2.007 | 1.499 | 1.979 | 1.947 | 1.071 | 1.898 | 1.414 | 1.875 | 1.326 | 1.052 | 1.875 | 1.499 | 1.355 | 1.071 | 1.452 | 1.019 | 1.875 | 1.669 |
mcse_mean | 0.002 | 0.004 | 0.003 | 0.001 | 0.003 | 0.002 | 0.001 | 0.002 | 0.002 | 0.004 | 0.002 | 0.001 | 0.004 | 0.001 | 0.001 | 0.001 | 0.002 | 0.003 | 0.004 | 0.006 |
mcse_sd | 0.001 | 0.003 | 0.002 | 0.001 | 0.002 | 0.002 | 0.001 | 0.002 | 0.002 | 0.003 | 0.002 | 0.001 | 0.003 | 0.001 | 0.001 | 0.001 | 0.001 | 0.002 | 0.003 | 0.004 |
ess_bulk | 4414.84 | 1421.16 | 1702.2 | 6119.03 | 2941.53 | 5056.55 | 2508.64 | 5112.09 | 3544.69 | 1088.88 | 3278.22 | 5764.55 | 1088.88 | 6119.03 | 4347.95 | 2508.64 | 5918.05 | 4067.78 | 1534.74 | 532.076 |
ess_tail | 5613.32 | 4614.7 | 5194.52 | 6294.34 | 5529.94 | 6458.84 | 4256.39 | 4780.85 | 5693.09 | 6977.78 | 2828.78 | 10452.3 | 6977.78 | 6294.34 | 7064.77 | 4256.39 | 7357.53 | 4266.7 | 6050 | 2130.18 |
r_hat | 1.002 | 1.002 | 1.011 | 1.009 | 1.008 | 1.001 | 1.013 | 1.008 | 1.002 | 1.005 | 1.001 | 1.001 | 1.005 | 1.009 | 1.006 | 1.013 | 1.002 | 1.009 | 1.005 | 1.006 |
diagnostic​
is a diagnostic tool that measures the between- and within-chain variances. It is a test that indicates a lack of convergence by comparing the variance between multiple chains to the variance within each chain. If the parameters are successfully exploring the full space for each chain, then , since the between-chain and within-chain variance should be equal. is calculated as
where is the within-chain variance and is the posterior variance estimate for the pooled rank-traces. The take-away here is that converges towards 1 when each of the Markov chains approaches perfect adaptation to the true posterior distribution. We do not recommend using inference results if . More information about can be found in the Vehtari et al paper.
Effective sample size diagnostic​
MCMC samplers do not draw independent samples from the target distribution, which means that our samples are correlated. In an ideal situation all samples would be independent, but we do not have that luxury. We can, however, measure the number of effectively independent samples we draw, which is called the effective sample size. You can read more about how this value is calculated in the Vehtari et al paper, briefly it is a measure that combines information from the value with the autocorrelation estimates within the chains. There are many ways to estimate effective samples sizes, however, we will be using the method defined in the Vehtari et al paper.
The rule of thumb for ess_bulk
is for this value to be greater than 100 per chain on
average. Since we ran four chains, we need ess_bulk
to be greater than 400 for each
parameter. The ess_tail
is an estimate for effectively independent samples considering
the more extreme values of the posterior. This is not the number of samples that landed
in the tails of the posterior. It is a measure of the number of effectively independent
samples if we sampled the tails of the posterior. The rule of thumb for this value is
also to be greater than 100 per chain on average.
We can use arviz
to plot the evolution of the effective sample size for each of our
parameters. The below plots show over each successive draw from our model, the estimated
effective sample size for our parameters. The red horizontal dashed line shows the
rule-of-thumb value needed for both the bulk and tail ess
estimates. When the model is
converging properly, both the bulk and tail lines should be roughly linear.
# Needed for Colab
output_notebook(hide_banner=True)
keys = ["μ_α", "σ_α", "β", "γ", "σ_y"]
values = [mu_alpha(), sigma_alpha(), beta(), gamma(), sigma_y()]
parameters = dict(zip(keys, values))
plots = []
for title, parameter in parameters.items():
data = {title: samples.get(parameter).numpy()}
f = az.plot_ess(data, kind="evolution", show=False)[0][0]
f.plot_width = 275
f.plot_height = 275
f.y_range.start = 0
f.outline_line_color = "black"
f.grid.grid_line_alpha = 0.2
f.grid.grid_line_color = "grey"
f.grid.grid_line_width = 0.2
plots.append(f)
query_ess = gridplot([plots[:3], plots[3:]])
show(query_ess)
Rather than plot all effective sample size evolutions for each county's parameter, we will do so for a sample.
# Needed for Colab
output_notebook(hide_banner=True)
sample_counties = {
item["county_index"]: item["county"]
for item in df[
df["county"].isin(
[
"LAC QUI PARLE",
"AITKIN",
"KOOCHICHING",
"DOUGLAS",
"CLAY",
"STEARNS",
"RAMSEY",
"ST. LOUIS",
]
)
][["county_index", "county"]]
.drop_duplicates()
.to_dict(orient="records")
}
data = {"α": samples.get(alpha()).numpy()}
alpha_ess = az.plot_ess(data, kind="evolution", show=False).reshape(-1)
plots = []
for index, county in sample_counties.items():
f = alpha_ess[index]
f.title.text = f"α[{county}]"
f.plot_width = 275
f.plot_height = 275
f.y_range.start = 0
f.outline_line_color = "black"
f.grid.grid_line_alpha = 0.2
f.grid.grid_line_color = "grey"
f.grid.grid_line_width = 0.2
plots.append(f)
sample_counties_ess = gridplot([plots[:3], plots[3:6], plots[6:]])
show(sample_counties_ess)
Combined with our above analysis, we can show the rank plots of our model parameters.
You might be familiar with trace plots
that show parameter posterior distributions
along with the "fuzzy caterpillar" plots of sampled values. Rank plots perform the same
function as trace plots in that they show problems in the mixing of the chains. Rank
plots for our model's parameters are shown below.
# Needed for Colab
output_notebook()
trace_ranks = radon.plot_trace_ranks(
keys=["μ_α", "σ_α", "β", "γ", "σ_y"],
values=[mu_alpha(), sigma_alpha(), beta(), gamma(), sigma_y()],
samples=samples,
)
show(trace_ranks)
Each of the above rank plots shows fairly uniform histograms for each parameter.
Recall that is the coefficient for the log_activity
data. When we plotted the
logarithm of radon activity vs the floor measurement we saw that there was a decrease
in as we moved from the basement to the first floor of a building. From our
initial data analysis, we expected the coefficient to be negative. Plotting the
posterior for (below) we can see that it is always less than zero with a mean
value of -0.64 and the 89% highest density interval between [-0.75, -0.54]. Similarly
we can plot the posterior distribution for the coefficient of , which were
our group-level predictors.
Why an 89% highest density interval (HDI)? To prevent you from thinking about a 95% confidence interval. See McElreath for further discussion on this subject.
# Needed for Colab
output_notebook()
beta_posterior = az.plot_posterior({"β": samples.get(beta()).numpy()}, show=False,)[
0
][0]
gamma_posterior = az.plot_posterior({"γ": samples.get(gamma()).numpy()}, show=False,)[
0
][0]
beta_gamma = gridplot([[beta_posterior, gamma_posterior]])
show(beta_gamma)
Again we only show a sampling of the parameters for the different counties in the model.
# Needed for Colab
output_notebook()
alphas = samples.get(alpha()).numpy()
data = {
f"α[{county}]": alphas[:, :, index] for index, county in sample_counties.items()
}
county_posteriors = gridplot([*az.plot_posterior(data, show=False)])
show(county_posteriors)
We can inspect how the uranium concentration in a county's soil affects the intercept by plotting our intercept estimate against the logarithm of uranium counts for the county. The whiskers/error-bars for each marker is its standard deviation from the posterior distribution.
# Needed for Colab
output_notebook()
uranium = radon.uranium(summary_df, df)
show(uranium)
References​
- Gelman's website http://www.stat.columbia.edu/~gelman/arm/examples/radon/
- Hill J and Gelman A (2006) Data Analysis Using Regression and Multilevel/Hierarchical Models. United States: Cambridge University Press. ISBN: 9781139460934.
- McElreath R (2020) Statistical Rethinking: A Bayesian Course with Examples in R and Stan 2nd edition. Chapman and Hall/CRC. doi: 10.1201/9780429029608
- Price PN, Nero AV, Gelman A (1996) Bayesian Prediction of Mean Indoor Radon Concentrations for Minnesota Counties. Health Physics 71(6). doi: 10.1097/00004032-199612000-00009
- Wikipedia (Decay chain) https://en.wikipedia.org/wiki/Decay_chain#Uranium_series
- Vehtari A, Gelman A, Simpson D, Carpenter B, Bürkner PC (2021) Rank-Normalization, Folding, and Localization: An Improved for Assessing Convergence of MCMC (with Discussion). Bayesian Analysis 16(2) 667–718. doi: 10.1214/20-BA1221.