# Hierarchical regression

## Tutorial: **Hierarchical regression**â€‹

In this tutorial we will explore linear regression in combination with hierarchical priors. We will be using data from Gelman and Hill on radon levels found in buildings in Minnesota; Hill J and Gelman A. The original paper about radon concentrations in Minnesota can be found in Price. Measurements on radon activity were taken for many of the states, however, we will only focus on Minnesota in order to keep the tutorial simple. The observations we wish to model are the radon activity found in buildings, dependent on the floor the measurement was taken and the mean county's concentration of uranium found in the soil. The covariates, floor-level and uranium concentration, are predictors for the amount of radon activity found in buildings. Each building's floor-level will be used as individual-level predictors for our model and the county's uranium concentration will be used as our group-level predictor.

## Learning Outcomesâ€‹

On completion of this tutorial, you should be able:

- to prepare data for running a hierarchical regression with Bean Machine;
- to execute a hierarchical regression with beanmachine;
- to run diagnostics to understand what beanmachine is doing.

## Problemâ€‹

Our problem is to perform posterior inference over a linear model based on a
hierarchical prior. Our data contains individual-level predictor measurements along with
group-level predictor measurements. We will use concepts taken from the
`Hierarchical Model`

tutorial, and apply them here.

## Prerequisitesâ€‹

We will be using the following packages within this tutorial.

`# Install Bean Machine in Colab if using Colab.`

import sys

if "google.colab" in sys.modules and "beanmachine" not in sys.modules:

!pip install beanmachine

`import os`

import arviz as az

import beanmachine.ppl as bm

import pandas as pd

import torch

import torch.distributions as dist

from beanmachine.ppl.model import RVIdentifier

from beanmachine.tutorials.utils import radon

from bokeh.io import output_notebook

from bokeh.plotting import gridplot, show

from IPython.display import Markdown

from torch import tensor

The next cell includes convenient configuration settings to improve the notebook
presentation as well as setting a manual seed for `torch`

for reproducibility.

`# Plotting settings`

az.rcParams["plot.backend"] = "bokeh"

az.rcParams["stats.hdi_prob"] = 0.89

az.rcParams["plot.max_subplots"] = 85

# Manual seed for torch

torch.manual_seed(1199)

# Other settings for the notebook

smoke_test = "SANDCASTLE_NEXUS" in os.environ or "CI" in os.environ

## Dataâ€‹

We will explore the data for the tutorial first so we can create a data story around
them. The story will be used to inform decisions we will make about priors and sampling
distributions. The unprocessed data consist of two different files, the `srrs2.dat`

and
the `cty.dat`

, which can be found on Gelman's website. We will import a
preprocessed version ready for our Bean Machine model here.

`df = radon.load_data()`

Markdown(df.head().set_index("county_index").to_markdown())

county_index | county | floor | activity | log_activity | Uppm | log_Uppm |
---|---|---|---|---|---|---|

0 | AITKIN | 1 | 2.2 | 0.832909 | 0.502054 | -0.689048 |

0 | AITKIN | 0 | 2.2 | 0.832909 | 0.502054 | -0.689048 |

0 | AITKIN | 0 | 2.9 | 1.09861 | 0.502054 | -0.689048 |

0 | AITKIN | 0 | 1 | 0.0953102 | 0.502054 | -0.689048 |

1 | ANOKA | 0 | 2.8 | 1.06471 | 0.428565 | -0.847313 |

The data consist of the following columns.

Column name | Description |
---|---|

county_index | An index associated with each county in Minnesota. |

county | The name of the county in Minnesota. |

floor | The floor on which radon activity was measured. |

activity | The measured radon activity (pCi / L) |

log_activity | log(activity) `individual-level predictor` |

Uppm | Mean concentration of uranium in a county's soil. |

log_Uppm | log(Uppm) `group-level predictor` |

The activity measurement has units of pico-Curies per liter, while the uranium is
measured in parts-per-million. We will be using the `log(activity)`

and `log(Uppm)`

in
our model, not the raw `activity`

or `Uppm`

values. Below we plot the raw data and the
log of the raw data.

`# Needed for Colab`

output_notebook()

log_plot_comparison = radon.log_plot_comparison(df["activity"])

show(log_plot_comparison)

Both plots show the histogram and the kernel density estimate of the radon activity data. The left plot shows the raw activity data, while the right plot shows the logarithm of the activity data. The shape of the data in the right plot appears to be normally distributed, and we will use this observation in our model, when we describe the model below.

We can see how the logarithm of radon activity changes as you change your floor level in
any particular building in our data. We plot the `log_activity`

*vs* `floor`

below.

The floor plot is showing all measurements within the data (the markers) for each floor.
We have added a small amount of *horizontal jitter* to the data to prevent all markers
from lining up vertically. The lines show the kernel density estimate (KDE) with the
horizontal dashed lines indicating the maximum value of the KDE. We can see from this
plot that radon levels decrease as you move away from the basement.

`# Needed for Colab`

output_notebook()

floor_plot = radon.floor_plot(df)

show(floor_plot)

## Modelâ€‹

Our model will estimate the effects that individual-level predictors and group-level
predictors have on estimating radon activity. We will incorporate the individual-level
predictors found from the floor values in the `srrs`

data, and the group-level
predictors found from the uranium values in the `cty`

data. In order to do that, we turn
our earlier observationâ€”that the response $y$ (the logarithmic activity level) is
approximately normally distributedâ€”into a modeling assumption in terms of a prior over
$y$. Thus we can state that

We will incorporate the different predictors as linear effects into our model by noting that any effect a predictor has on our response $y$, will effect the mean value $\mu_y$. Thus

where

The current model description above pools all data together and assumes that parameters for each county $i$ $(\alpha_i, \beta_i, \gamma_i)$ are being sampled from the same distribution for that parameter. We know that each county will be different so we include a partial-pooling step in the model for just the $\alpha$ parameter. We could create a hierarchical level for each parameter, but we will keep the model simple.

The full generative model is given below.

** NOTE** We can implement this model in Bean Machine by defining random variable
objects with the decorators

`@bm.random_variable`

and `@bm.functional`

. These functions
behave differently than ordinary Python functions.`@bm.random_variable`

functions:- They must return PyTorch
`Distribution`

objects. - Though they return distributions, callees actually receive
*samples*from the distribution. The machinery for obtaining samples from distributions is handled internally by Bean Machine. - Inference runs the model through many iterations. During a particular inference iteration, a distinct random variable will correspond to exactly one sampled value:
**calls to the same random variable function with the same arguments will receive the same sampled value within one inference iteration**. This makes it easy for multiple components of your model to refer to the same logical random variable. - Consequently, to define distinct random variables that correspond to different sampled values during a particular inference iteration, an effective practice is to add a dummy "indexing" parameter to the function. Distinct random variables can be referred to with different values for this index.
- Please see the documentation for more information about this decorator.

`@bm.functional`

:- This decorator is used to deterministically transform the results of one or more random variables.

In the below Bean Machine model we are taking advantage of batching in `pytorch`

by
using the `.expand()`

attribute on our distributions. In our definition of `alpha`

, we
expand the random variable to be the size of the number of counties in our data; $J=85$.
We then use indexing on `alpha`

in our definition of `alpha_hat`

to create a tensor that
is the size of our observations.

`J = df["county"].unique().shape[0]`

county_index = df["county_index"].tolist()

floor = tensor(df["floor"].tolist())

log_uranium = tensor(df["log_Uppm"].tolist())

@bm.random_variable

def mu_alpha() -> RVIdentifier:

return dist.Normal(0, 10)

@bm.random_variable

def sigma_alpha() -> RVIdentifier:

return dist.HalfCauchy(5)

@bm.random_variable

def alpha() -> RVIdentifier:

return dist.Normal(mu_alpha(), sigma_alpha()).expand((J,))

@bm.random_variable

def gamma() -> RVIdentifier:

return dist.Normal(0, 10)

@bm.random_variable

def beta() -> RVIdentifier:

return dist.Normal(0, 10)

@bm.random_variable

def sigma_y() -> RVIdentifier:

return dist.HalfCauchy(5)

@bm.functional

def alpha_hat():

return alpha()[county_index] + gamma() * log_uranium

@bm.functional

def y_hat():

return alpha_hat() + beta() * floor

@bm.random_variable

def y() -> RVIdentifier:

return dist.Normal(y_hat(), sigma_y())

Note that the $\alpha$ random variable is expanding the returned tensor so that it is
the same size as the number of counties we have in the data (`J`

). Defining the model in
this way allows us to run inference on $y$ in *batches*.

### Priorsâ€‹

The priors we have chosen for the model consist of the Normal and HalfCauchy distributions. Both of these priors are weakly informative to our model. A few plots showing what these priors look like are shown below.

`# Needed for Colab`

output_notebook(hide_banner=True)

sample_of_priors = radon.sample_of_priors()

show(sample_of_priors)

### Inferenceâ€‹

We are ready to run inference on our model and observations. Bean Machine supports a rich library of inference algorithms that share a common infer method with the following arguments:

Name | Usage |
---|---|

`queries` | List of `@bm.random_variable` targets to fit posterior distributions for. |

`observations` | A dictionary of observations, as built above. |

`num_samples` | Number of Monte Carlo samples to approximate the posterior distributions for the variables in `queries` . |

`num_chains` | Number of separate inference runs to use. Multiple chains can help verify that inference ran correctly. |

For this particular problem, we will use the `GlobalNoUTurnSampler`

inference method.
Bean Machine's `GlobalNoUTurnSampler`

is the NUTS sampler you have probably used before
with `stan`

or `pymc3`

. We have chosen to use the NUTS sampler here because it can be
easily compared to other probabilistic tools.

`num_samples = 2 if smoke_test else 5000`

num_adaptive_samples = 0 if smoke_test else num_samples // 2

num_chains = 4

# Experimental backend using the Pytorch NNC compiler

nnc_compile = "SANDCASTLE_NEXUS" not in os.environ

# Data for the model.

observations = {y(): tensor(df["log_activity"].values)}

# The parameters to our model are our queries.

queries = [

alpha(),

mu_alpha(),

sigma_alpha(),

alpha_hat(),

beta(),

gamma(),

sigma_y(),

y_hat(),

]

# Run the inference.

samples = bm.GlobalNoUTurnSampler(nnc_compile=nnc_compile).infer(

queries=queries,

observations=observations,

num_samples=num_samples,

num_chains=num_chains,

num_adaptive_samples=num_adaptive_samples,

)

### Analysisâ€‹

We begin our analysis by printing out summary statistics. Two important statistics to
take note of are the $\hat{R}$ (`r_hat`

) and effective sample size (`ess`

) values in the
below dataframe, see Vehtari *et al*.

`summary_df = az.summary(samples.to_xarray(), round_to=3)`

Markdown(summary_df.sample(20).T.to_markdown())

alpha_hat()[565] | y_hat()[467] | alpha_hat()[248] | alpha_hat()[364] | y_hat()[509] | y_hat()[918] | y_hat()[746] | alpha_hat()[568] | alpha_hat()[94] | y_hat()[829] | y_hat()[260] | y_hat()[43] | alpha_hat()[825] | y_hat()[268] | alpha_hat()[576] | y_hat()[742] | alpha_hat()[376] | y_hat()[552] | alpha_hat()[518] | y_hat()[632] | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

mean | 1.689 | 0.855 | 1.779 | 1.398 | 1.748 | 1.7 | 0.966 | 1.659 | 1.196 | 1.651 | 1.113 | 0.917 | 1.651 | 1.398 | 1.196 | 0.966 | 1.222 | 0.771 | 1.639 | 1.458 |

sd | 0.146 | 0.139 | 0.142 | 0.063 | 0.149 | 0.158 | 0.065 | 0.153 | 0.13 | 0.139 | 0.139 | 0.086 | 0.139 | 0.063 | 0.098 | 0.065 | 0.142 | 0.158 | 0.155 | 0.132 |

hdi_5.5% | 1.455 | 0.631 | 1.559 | 1.3 | 1.515 | 1.45 | 0.861 | 1.414 | 1.003 | 1.438 | 0.884 | 0.779 | 1.438 | 1.3 | 1.044 | 0.861 | 0.998 | 0.52 | 1.393 | 1.259 |

hdi_94.5% | 1.913 | 1.069 | 2.007 | 1.499 | 1.979 | 1.947 | 1.071 | 1.898 | 1.414 | 1.875 | 1.326 | 1.052 | 1.875 | 1.499 | 1.355 | 1.071 | 1.452 | 1.019 | 1.875 | 1.669 |

mcse_mean | 0.002 | 0.004 | 0.003 | 0.001 | 0.003 | 0.002 | 0.001 | 0.002 | 0.002 | 0.004 | 0.002 | 0.001 | 0.004 | 0.001 | 0.001 | 0.001 | 0.002 | 0.003 | 0.004 | 0.006 |

mcse_sd | 0.001 | 0.003 | 0.002 | 0.001 | 0.002 | 0.002 | 0.001 | 0.002 | 0.002 | 0.003 | 0.002 | 0.001 | 0.003 | 0.001 | 0.001 | 0.001 | 0.001 | 0.002 | 0.003 | 0.004 |

ess_bulk | 4414.84 | 1421.16 | 1702.2 | 6119.03 | 2941.53 | 5056.55 | 2508.64 | 5112.09 | 3544.69 | 1088.88 | 3278.22 | 5764.55 | 1088.88 | 6119.03 | 4347.95 | 2508.64 | 5918.05 | 4067.78 | 1534.74 | 532.076 |

ess_tail | 5613.32 | 4614.7 | 5194.52 | 6294.34 | 5529.94 | 6458.84 | 4256.39 | 4780.85 | 5693.09 | 6977.78 | 2828.78 | 10452.3 | 6977.78 | 6294.34 | 7064.77 | 4256.39 | 7357.53 | 4266.7 | 6050 | 2130.18 |

r_hat | 1.002 | 1.002 | 1.011 | 1.009 | 1.008 | 1.001 | 1.013 | 1.008 | 1.002 | 1.005 | 1.001 | 1.001 | 1.005 | 1.009 | 1.006 | 1.013 | 1.002 | 1.009 | 1.005 | 1.006 |

#### $\hat{R}$ diagnosticâ€‹

$\hat{R}$ is a diagnostic tool that measures the between- and within-chain variances. It is a test that indicates a lack of convergence by comparing the variance between multiple chains to the variance within each chain. If the parameters are successfully exploring the full space for each chain, then $\hat{R}\approx 1$, since the between-chain and within-chain variance should be equal. $\hat{R}$ is calculated as

where $W$ is the within-chain variance and $\hat{V}$ is the posterior variance estimate
for the pooled rank-traces. The take-away here is that $\hat{R}$ converges towards 1
when each of the Markov chains approaches perfect adaptation to the true posterior
distribution. We do not recommend using inference results if $\hat{R}>1.01$. More
information about $\hat{R}$ can be found in the Vehtari *et al* paper.

#### Effective sample size $ess$ diagnosticâ€‹

MCMC samplers do not draw independent samples from the target distribution, which means
that our samples are correlated. In an ideal situation all samples would be independent,
but we do not have that luxury. We can, however, measure the number of *effectively
independent* samples we draw, which is called the effective sample size. You can read
more about how this value is calculated in the Vehtari *et al* paper,
briefly it is a measure that combines information from the $\hat{R}$ value with the
autocorrelation estimates within the chains. There are many ways to estimate effective
samples sizes, however, we will be using the method defined in the
Vehtari *et al* paper.

The rule of thumb for `ess_bulk`

is for this value to be greater than 100 per chain on
average. Since we ran four chains, we need `ess_bulk`

to be greater than 400 for each
parameter. The `ess_tail`

is an estimate for effectively independent samples considering
the more extreme values of the posterior. This is not the number of samples that landed
in the tails of the posterior. It is a measure of the number of effectively independent
samples if we sampled the tails of the posterior. The rule of thumb for this value is
also to be greater than 100 per chain on average.

We can use `arviz`

to plot the evolution of the effective sample size for each of our
parameters. The below plots show over each successive draw from our model, the estimated
effective sample size for our parameters. The red horizontal dashed line shows the
rule-of-thumb value needed for both the bulk and tail `ess`

estimates. When the model is
converging properly, both the bulk and tail lines should be *roughly* linear.

`# Needed for Colab`

output_notebook(hide_banner=True)

keys = ["Î¼_Î±", "Ïƒ_Î±", "Î²", "Î³", "Ïƒ_y"]

values = [mu_alpha(), sigma_alpha(), beta(), gamma(), sigma_y()]

parameters = dict(zip(keys, values))

plots = []

for title, parameter in parameters.items():

data = {title: samples.get(parameter).numpy()}

f = az.plot_ess(data, kind="evolution", show=False)[0][0]

f.plot_width = 275

f.plot_height = 275

f.y_range.start = 0

f.outline_line_color = "black"

f.grid.grid_line_alpha = 0.2

f.grid.grid_line_color = "grey"

f.grid.grid_line_width = 0.2

plots.append(f)

query_ess = gridplot([plots[:3], plots[3:]])

show(query_ess)

Rather than plot all effective sample size evolutions for each county's $\alpha$ parameter, we will do so for a sample.

`# Needed for Colab`

output_notebook(hide_banner=True)

sample_counties = {

item["county_index"]: item["county"]

for item in df[

df["county"].isin(

[

"LAC QUI PARLE",

"AITKIN",

"KOOCHICHING",

"DOUGLAS",

"CLAY",

"STEARNS",

"RAMSEY",

"ST. LOUIS",

]

)

][["county_index", "county"]]

.drop_duplicates()

.to_dict(orient="records")

}

data = {"Î±": samples.get(alpha()).numpy()}

alpha_ess = az.plot_ess(data, kind="evolution", show=False).reshape(-1)

plots = []

for index, county in sample_counties.items():

f = alpha_ess[index]

f.title.text = f"Î±[{county}]"

f.plot_width = 275

f.plot_height = 275

f.y_range.start = 0

f.outline_line_color = "black"

f.grid.grid_line_alpha = 0.2

f.grid.grid_line_color = "grey"

f.grid.grid_line_width = 0.2

plots.append(f)

sample_counties_ess = gridplot([plots[:3], plots[3:6], plots[6:]])

show(sample_counties_ess)

Combined with our above analysis, we can show the rank plots of our model parameters.
You might be familiar with `trace plots`

that show parameter posterior distributions
along with the "fuzzy caterpillar" plots of sampled values. Rank plots perform the same
function as trace plots in that they show problems in the mixing of the chains. Rank
plots for our model's parameters are shown below.

`# Needed for Colab`

output_notebook()

trace_ranks = radon.plot_trace_ranks(

keys=["Î¼_Î±", "Ïƒ_Î±", "Î²", "Î³", "Ïƒ_y"],

values=[mu_alpha(), sigma_alpha(), beta(), gamma(), sigma_y()],

samples=samples,

)

show(trace_ranks)

Each of the above rank plots shows fairly uniform histograms for each parameter.

Recall that $\beta$ is the coefficient for the `log_activity`

data. When we plotted the
logarithm of radon activity *vs* the floor measurement we saw that there was a decrease
in $log(radon)$ as we moved from the basement to the first floor of a building. From our
initial data analysis, we expected the coefficient to be negative. Plotting the
posterior for $\beta$ (below) we can see that it is always less than zero with a mean
value of -0.64 and the 89% highest density interval between [-0.75, -0.54]. Similarly
we can plot the posterior distribution for the coefficient of $log(uranium)$, which were
our group-level predictors.

Why an 89% highest density interval (HDI)? To prevent you from thinking about a 95% confidence interval. See McElreath for further discussion on this subject.

`# Needed for Colab`

output_notebook()

beta_posterior = az.plot_posterior({"Î²": samples.get(beta()).numpy()}, show=False,)[

0

][0]

gamma_posterior = az.plot_posterior({"Î³": samples.get(gamma()).numpy()}, show=False,)[

0

][0]

beta_gamma = gridplot([[beta_posterior, gamma_posterior]])

show(beta_gamma)

Again we only show a sampling of the $\alpha$ parameters for the different counties in the model.

`# Needed for Colab`

output_notebook()

alphas = samples.get(alpha()).numpy()

data = {

f"Î±[{county}]": alphas[:, :, index] for index, county in sample_counties.items()

}

county_posteriors = gridplot([*az.plot_posterior(data, show=False)])

show(county_posteriors)

We can inspect how the uranium concentration in a county's soil affects the intercept $\alpha$ by plotting our intercept estimate against the logarithm of uranium counts for the county. The whiskers/error-bars for each marker is its standard deviation from the posterior distribution.

`# Needed for Colab`

output_notebook()

uranium = radon.uranium(summary_df, df)

show(uranium)

## Referencesâ€‹

- Gelman's website http://www.stat.columbia.edu/~gelman/arm/examples/radon/
- HillÂ J andÂ GelmanÂ AÂ (2006)Â
**Data Analysis Using Regression and Multilevel/Hierarchical Models**.Â United States:Â Cambridge University Press. ISBN: 9781139460934. - McElreath R (2020)
**Statistical Rethinking: A Bayesian Course with Examples in R and Stan**2nd edition. Chapman and Hall/CRC. doi: 10.1201/9780429029608 - Price PN, Nero AV, Gelman A (1996)
**Bayesian Prediction of Mean Indoor Radon Concentrations for Minnesota Counties**. Health Physics 71(6). doi: 10.1097/00004032-199612000-00009 - Wikipedia (Decay chain) https://en.wikipedia.org/wiki/Decay_chain#Uranium_series
- Vehtari A, Gelman A, Simpson D, Carpenter B, BÃ¼rkner PC (2021)
**Rank-Normalization, Folding, and Localization: An Improved $\hat{R}$ for Assessing Convergence of MCMC (with Discussion)**. Bayesian Analysis 16(2) 667â€“718. doi: 10.1214/20-BA1221.