Skip to main content

Sparse logistic regression

Tutorial: Sparse Logistic Regression

This tutorial demonstrates modeling and running inference on a sparse logistic regression model in Bean Machine. This tutorial showcases the inference techniques in Bean Machine, and applies the model to a public dataset to evaluate performance. It also introduces the @bm.functional decorator, which can be used to deterministically transform random variables which can be convenient for post-processing.

Problem

Logistic regression is a commonly used statistical method that allows us to predict a binary output from a set of independent variables. The sparse logistic regression is a type of logistic regression model which embeds feature selection in classification by adding overall and per-dimension scale factors. It is very applicable when dealing with high-dimensional data, such as classifying credit scores.

Sparse logistic regression is a hierarchical model with a sparse prior. We will use a horseshoe prior, Carvalho CM, Polson NG, Scott JG which induces sparsity using a combination of a global shrinkage scale factor that pushes the posterior mass of most model parameters to zero and a local scale that allows some of them to escape shrinkage.

Prerequisites

We will be using the following packages within this tutorial.

Please import the following code packages for the rest of the code in the tutorial to work.

# Install Bean Machine in Colab if using Colab.
import sys


if "google.colab" in sys.modules and "beanmachine" not in sys.modules:
!pip install beanmachine
import os
import warnings

import arviz as az
import beanmachine.ppl as bm
import numpy as np
import sklearn
import sklearn.model_selection
import torch
import torch.distributions as dist
from beanmachine.ppl.inference.monte_carlo_samples import MonteCarloSamples
from beanmachine.tutorials.utils import plots
from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource, Span
from bokeh.plotting import show
from IPython.display import Markdown
from sklearn.metrics import RocCurveDisplay

The next cell includes convenient configuration settings to improve the notebook presentation as well as setting a manual seed for reproducibility.

# Eliminate excess UserWarnings from ArviZ.
warnings.filterwarnings("ignore")

# Plotting settings
az.rcParams["plot.backend"] = "bokeh"
az.rcParams["stats.hdi_prob"] = 0.89

# Manual seed
bm.seed(17)

# Other settings for the notebook
smoke_test = "SANDCASTLE_NEXUS" in os.environ or "CI" in os.environ

Model

We have the following definitions.

  • NN: Size of the dataset.
  • DD: Number of features of the dataset.
  • τ\tau: Global shrinkage of the model (input from the user).
  • βd\beta_d: Coefficient corresponding to dimension dDd\in D.
  • λd\lambda_d: Local shrinkage for the coefficient corresponding to dimension dDd\in D.

The model is defined mathematically as follows:

  • λdiidHalfCauchy(0,1)\lambda_d\stackrel{iid}{\sim}\text{HalfCauchy}(0,1)
  • βdiidN(0,τλ)\beta_d\stackrel{iid}{\sim}\mathcal{N}(0,\tau\lambda)
  • yniidBernoulli(σ(XTβ))y_n\stackrel{iid}{\sim}\text{Bernoulli}(\sigma({X}^\textsf{T}\beta))

A few notes:

  • σ(s)=11+es\sigma(s)=\frac{1}{1+e^{-s}} is the logistic function. Its purpose is to translate an unconstrained score s(,)s\in(-\infty,\infty) predicted by the model into a probability σ(s)(0,1)\sigma(s)\in(0,1).
  • τ\tau can be an input from the user which depends on the expected number of non-zeros coefficients in the model. Alternatively, we can have a full Bayesian treatment of τ\tau as suggested by Piironen and Vehtari. For simplicity, this is simply a constant in this tutorial but this can be replaced with a HalfNormal prior with the same scale as τ\tau. This parameter is responsible for global shrinkage, whereas λd\lambda_d tends to shrink the influence of βd\beta_d but since the Cauchy distribution has fat tails, it can also help coefficients escape shrinkage.

We can implement this model in Bean Machine by defining random variable objects with the @bm.random_variable decorator. These functions behave differently than ordinary Python functions.

Semantics for @bm.random_variable functions:
  • They must return PyTorch Distribution objects.
  • Though they return distributions, callees actually receive samples from the distribution. The machinery for obtaining samples from distributions is handled internally by Bean Machine.
  • Inference runs the model through many iterations. During a particular inference iteration, a distinct random variable will correspond to exactly one sampled value: calls to the same random variable function with the same arguments will receive the same sampled value within one inference iteration. This makes it easy for multiple components of your model to refer to the same logical random variable.
  • Consequently, to define distinct random variables that correspond to different sampled values during a particular inference iteration, an effective practice is to add a dummy "indexing" parameter to the function. Distinct random variables can be referred to with different values for this index.
  • Please see the documentation for more information about this decorator.
Semantics for @bm.functional:
  • This decorator is used to deterministically transform the results of one or more random variables.

To implement sparse logistic regression model in Bean Machine, we provide @bm.random_variable definitions for λ\lambda, β\beta, and yy. The value of τ\tau is estimated from the number of expected non-zero coefficients and is an input from the user.

This is all you have to do to define the model. However, we'll also make use of the @bm.functional decorator to make it very easy to store the predictions (rather than the binary outcome, we'll store the probability score of being labeled 1) on the test data and for computing the log likelihood. This decorator has the same semantics as @bm.random_variable, except that it does not return a distribution. Instead, it returns a deterministically-computed function from other random variables. It can be used to conveniently compute values that would typically be computed in a post-processing pass. Here, we use it to compute the log probability of test data, using inferences made on training data.

class SparseLogisticRegression:
def __init__(self, X_train, X_test, Y_test, nonzero_frac=0.3):
self.X_train = X_train
self.X_test = X_test
self.Y_test = Y_test
# See: Piironen and Vehtari
self.tau = (
2 * nonzero_frac / ((1 - nonzero_frac) * np.sqrt(self.X_train.shape[1]))
)

@bm.random_variable
def lambda_(self):
return dist.HalfCauchy(1.0).expand([self.X_train.shape[1], 1])

@bm.random_variable
def eps(self):
return dist.Normal(0, 1).expand([self.X_train.shape[1], 1])

@bm.random_variable
def y(self):
return dist.Bernoulli(logits=self.X_train @ self.beta())

@bm.functional
def beta(self):
return self.tau * self.lambda_() * self.eps()

@bm.functional
def log_prob_test(self):
return (
dist.Bernoulli(logits=self.X_test @ self.beta()).log_prob(self.Y_test).sum()
)

@bm.functional
def y_probs(self):
return torch.sigmoid(self.X_test @ self.beta())

def __repr__(self):
return f"SparseLogisticRegression with {self.X_train.shape[1]} covariates"

Data

With the model defined, we need to collect some observed data in order to learn about values of interest in our model. For this tutorial, we'll use 1,000 five-dimensional data points where the response only depends on the first two, so that we can visualize what's going on.

We will generate a dataset where items have true label 0 and true label 1. For demonstrative purposes, we will use a synthetically generated dataset of observed values. In practice, you would gather real data and classify results by-hand, for example using human labelers.

X = dist.Normal(0, 5).expand([1000, 5]).sample()

tau0 = 2 * 0.3 / (0.7 * np.sqrt(1000))
lambda0 = torch.tensor([-70, 40, 0, 0, 0]).unsqueeze(-1)
true_beta = tau0 * lambda0

Y = dist.Bernoulli(logits=X @ true_beta).sample()

print(f"τ0: {tau0}")
print(f" β: {true_beta}")
Out:

τ0: 0.02710523708715754

β: tensor([[-1.8974],

[ 1.0842],

[ 0.0000],

[ 0.0000],

[ 0.0000]])

Let's take a moment to visualize our dataset.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

red_x = X[:, :2][Y[:, 0] == 0, 0]
red_y = X[:, :2][Y[:, 0] == 0, 1]
red_label = ["red"] * len(red_x)
red_cds = ColumnDataSource(
{"x": red_x.tolist(), "y": red_y.tolist(), "label": red_label}
)
red_tips = [("y", "@y{0.000}"), ("x", "@x{0.000}"), ("Label", "@label")]
green_x = X[:, :2][Y[:, 0] == 1, 0]
green_y = X[:, :2][Y[:, 0] == 1, 1]
green_label = ["green"] * len(green_x)
green_cds = ColumnDataSource(
{"x": green_x.tolist(), "y": green_y.tolist(), "label": green_label}
)
green_tips = [("y", "@y{0.000}"), ("x", "@x{0.000}"), ("Label", "@label")]

synthetic_data_plot = plots.scatter_plot(
plot_sources=[red_cds, green_cds],
tooltips=[red_tips, green_tips],
figure_kwargs={
"title": "Synthetic data",
"x_axis_label": "x",
"y_axis_label": "y",
},
legend_items=["Label 0", "Label 1"],
plot_kwargs={"fill_color": "label", "fill_alpha": 0.5},
)

show(synthetic_data_plot)
loading...

Now, we will split the dataset into a training set and a test set.

X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y)

Now that we've got our data defined, we can instantiate an instance of the model.

model = SparseLogisticRegression(X_train, X_test, Y_test)

Our inference algorithms expect observations in the form of a dictionary. This dictionary should consist of @bm.random_variable invocations as keys, and tensor data as values. In order to bind this data, first we'll instantiate the model.

observations = {model.y(): Y_train}

Inference

Inference is the process of combining model with data to obtain insights, in the form of probability distributions over values of interest. Bean Machine offers a powerful and general inference framework to enable fitting arbitrary models to data.

Since this model is comprised entirely of differentiable random variables, we'll make use of the No U-Turn Sampler (NUTS) (Hoffman & Gelman, 2011).

Running inference consists of a few arguments:

NameUsage
queriesList of @bm.random_variable targets to fit posterior distributions for.
observationsA dictionary of observations, as built above.
num_samplesNumber of Monte Carlo samples to approximate the posterior distributions for the variables in queries.
num_chainsNumber of separate inference runs to use. Multiple chains can help verify that inference ran correctly.

Let's run inference:

num_samples = 2 if smoke_test else 1000
num_adaptive_samples = 0 if smoke_test else num_samples // 2
num_chains = 1 if smoke_test else 2
# Experimental backend using the Pytorch NNC compiler
nnc_compile = "SANDCASTLE_NEXUS" not in os.environ
queries = [model.lambda_(), model.beta(), model.log_prob_test(), model.y_probs()]

samples = bm.GlobalNoUTurnSampler(nnc_compile=nnc_compile).infer(
queries=queries,
observations=observations,
num_adaptive_samples=num_adaptive_samples,
num_samples=num_samples,
num_chains=num_chains,
)
Out:

Samples collected: 0%| | 0/1500 [00:00<?, ?it/s]

Samples collected: 0%| | 0/1500 [00:00<?, ?it/s]

Analysis

samples now contains our inference results.

First, we'll just print previews of the results. This should give a sense of how to work with the samples object, and also an idea of the shapes of the inferred values.

lambda_marginal = samples[model.lambda_()].flatten(start_dim=0, end_dim=1).detach()
beta_marginal = samples[model.beta()].flatten(start_dim=0, end_dim=1).detach()
log_prob_test_results = samples[model.log_prob_test()][0].detach()

print(
f"lambda_marginal:\n{lambda_marginal}\n\n"
f"beta_marginal:\n{beta_marginal}\n\n"
f"log_prob_test_results:\n{log_prob_test_results[:20]}"
)
Out:

lambda_marginal:

tensor([[[2.5968e+00],

[4.3340e+00],

[3.2614e-01],

[6.6289e-01],

[2.0303e-01]],

[[4.0693e+00],

[5.4920e+00],

[6.1293e-01],

[6.7453e-01],

[7.8361e-01]],

[[3.3681e+00],

[5.0646e+00],

[6.6694e-01],

[4.4641e-01],

[1.0607e+00]],

...,

[[4.2111e+00],

[7.1516e+00],

[8.1101e-01],

[3.8489e-02],

[5.8832e-01]],

[[7.4676e+00],

[2.9920e+00],

[1.5182e-01],

[6.4247e-03],

[4.9245e-01]],

[[2.7034e+00],

[3.3743e+00],

[8.3389e-02],

[2.8964e-02],

[6.7897e-01]]])

beta_marginal:

tensor([[[-1.6024],

[ 0.9802],

[ 0.0034],

[ 0.0079],

[-0.0106]],

[[-2.1407],

[ 1.3242],

[ 0.0538],

[ 0.0720],

[-0.0625]],

[[-1.6937],

[ 1.0745],

[ 0.0660],

[ 0.0489],

[-0.0571]],

...,

[[-1.7365],

[ 1.1237],

[ 0.0771],

[-0.0349],

[-0.0976]],

[[-2.0167],

[ 1.2773],

[ 0.0269],

[-0.0059],

[-0.0655]],

[[-1.9680],

[ 1.1761],

[-0.0377],

[-0.0356],

[-0.0355]]])

log_prob_test_results:

tensor([-29.5386, -33.0372, -32.8408, -32.0504, -32.3691, -33.4062, -30.8081,

-30.1592, -31.7247, -30.3864, -30.9796, -31.5982, -31.6301, -33.7440,

-34.1124, -31.4896, -33.1069, -31.1966, -34.1167, -33.6170])

Next, let's visualize the inferred random variables.

Below we plot the joint of the first two components of β\beta and λ\lambda (as a reminder, the remaining components are noise which the outcome variable doesn't depend on).

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

beta_marginals_plot = plots.marginal_2d(
x=beta_marginal[:, :2][:, 0].flatten().numpy(),
y=beta_marginal[:, :2][:, 1].flatten().numpy(),
x_label="β0",
y_label="β1",
title="β0 - β1 joint plot",
true_x=float(true_beta[:2, :].flatten()[0]),
true_y=float(true_beta[:2, :].flatten()[1]),
)
show(beta_marginals_plot)
loading...
# Required for visualizing in Colab.
output_notebook(hide_banner=True)

lambda_marginals_plot = plots.marginal_2d(
x=lambda_marginal[:, :2][:, 0].flatten().numpy(),
y=lambda_marginal[:, :2][:, 1].flatten().numpy(),
x_label="λ0",
y_label="λ1",
title="λ0 - λ1 joint plot",
)
show(lambda_marginals_plot)
loading...

The marginals and the joint for β\beta and λ\lambda look reasonable. We note that posterior mean for β\beta closely matches the true values used to generate the data. Note that the λ\lambda values can be fairly large if needed since we place a HalfCauchy prior over it. Even in the presence of strong global shrinkage, this lets local parameters take on larger values if needed.

Now let us look at the histogram for the remaining values of β\beta, i.e. the coefficients for variables which have no effect on the outcome.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

betas = samples[model.beta()].squeeze(-1).reshape(-1, 5)
plot_sources = []
labels = []
tooltips = []
for i in range(5):
b = betas[:, i].numpy()
support, density = az.stats.density_utils.kde(b)
density /= density.max()
cds = ColumnDataSource({"x": support.tolist(), "y": density.tolist()})
label = f"β{i}"
tips = [(f"{label}", "@support{0.000}")]
plot_sources.append(cds)
labels.append(label)
tooltips.append(tips)

betas_density_plot = plots.line_plot(
plot_sources=plot_sources,
labels=labels,
tooltips=tooltips,
plot_kwargs={"line_width": 2},
figure_kwargs={
"title": "Histogram of some sample model coefficients",
"x_axis_label": "β",
},
)
betas_density_plot.yaxis.visible = False
betas_density_plot.legend.click_policy = "mute"
show(betas_density_plot)
loading...

We note that the majority of the posterior mass on all the variables that the outcome doesn't depend on is centered at 0 and the only non-zero values are the first two variables. This validates how the horseshoe prior helps induce sparsity.

Lastly, let's plot the first thousand log probabilities on test data that we generated per-iteration during inference. We'll also overlay the log probability that the test dataset would score on the ground truth parameters.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

ground_truth_log_prob = float(
dist.Bernoulli(logits=X_test @ true_beta).log_prob(Y_test).sum()
)
y = log_prob_test_results.numpy()
x = np.arange(len(y))
cds = ColumnDataSource({"x": x.tolist(), "y": y.tolist()})
tips = [("Log probability", "@y{0.000}"), ("Iteration", "@x")]
log_prob_plot = plots.line_plot(
plot_sources=[cds],
labels=[f"Using true params = {ground_truth_log_prob :.2f}"],
tooltips=[tips],
figure_kwargs={
"title": "Log probability",
"plot_width": 800,
"x_axis_label": "Iteration",
"y_axis_label": "Log probability",
},
plot_kwargs={"line_alpha": 0.5},
)
# Add a line showing the ground truth.
span = Span(
location=ground_truth_log_prob,
dimension="width",
line_color="black",
line_width=3,
)
log_prob_plot.add_layout(span)
log_prob_plot.legend.location = "bottom_left"
show(log_prob_plot)
loading...

Let us plot the predictions from the model on the test dataset as a sanity check. For simplicity, we will compute the mean of the probability score (computed as y_probs in the model) on each test data point and assume a label of 0 if the score is less than 0.5 and conversely a label of 1 if the score is above 0.5.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

y_probs = samples[model.y_probs()].mean([0, 1])
predicted_red_x = X_test[y_probs[:, 0] < 0.5, 0]
predicted_red_y = X_test[y_probs[:, 0] < 0.5, 1]
predicted_red_label = ["red"] * len(predicted_red_x)
predicted_red_cds = ColumnDataSource(
{
"x": predicted_red_x.tolist(),
"y": predicted_red_y.tolist(),
"label": predicted_red_label,
}
)
predicted_red_tips = [("y", "@y{0.000}"), ("x", "@x{0.000}"), ("Label", "@label")]
predicted_green_x = X_test[y_probs[:, 0] >= 0.5, 0]
predicted_green_y = X_test[y_probs[:, 0] >= 0.5, 1]
predicted_green_label = ["green"] * len(predicted_green_x)
predicted_green_cds = ColumnDataSource(
{
"x": predicted_green_x.tolist(),
"y": predicted_green_y.tolist(),
"label": predicted_green_label,
}
)
predicted_green_tips = [("y", "@y{0.000}"), ("x", "@x{0.000}"), ("Label", "@label")]
predicted_data_plot = plots.scatter_plot(
plot_sources=[predicted_red_cds, predicted_green_cds],
tooltips=[predicted_red_tips, predicted_green_tips],
figure_kwargs={
"title": "Synthetic data",
"x_axis_label": "x",
"y_axis_label": "y",
},
legend_items=["Predicted 0", "Predicted 1"],
plot_kwargs={"fill_color": "label", "fill_alpha": 0.5},
)
# Select one point and annotate it.
predicted_data_plot.circle(
x=[float(X_test[11, 0])],
y=[float(X_test[11, 1])],
size=20,
fill_color=None,
line_color="steelblue",
line_width=2,
line_alpha=1,
)
predicted_data_plot.legend.location = "bottom_left"
show(predicted_data_plot)
loading...

As we can see, the model seems to have recovered the same decision boundary, and MCMC has converged to parameters that correctly predicted the test dataset. This was just a sanity check. Since we have built a probabilistic model, the probability that each data point has label 1 is itself a distribution! Let us see this for one of the test data points (marked by a blue circle above).

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

data_point11_plot = plots.histogram_plot(
data=samples[model.y_probs()][..., 11, 0].reshape(-1).numpy(),
figure_kwargs={"title": "Probability of test datapoint 11 having Label 1"},
)
show(data_point11_plot)
loading...

The MonteCarloSamples from Bean Machine's inference can be converted to arviz's internal InferenceData format so that we can make use of its inference statistics and plotting utilities for diagnostics.

filtered_samples = {
k: v for k, v in samples.items() if k in {model.beta(), model.lambda_()}
}
az_data = MonteCarloSamples(filtered_samples).to_inference_data()
summary_df = az.summary(az_data, round_to=3)
Markdown(summary_df.to_markdown())
meansdhdi_5.5%hdi_94.5%mcse_meanmcse_sdess_bulkess_tailr_hat
beta(SparseLogisticRegression with 5 covariates,)[0, 0]-1.8340.195-2.112-1.5060.0040.0031928.091492.710.999
beta(SparseLogisticRegression with 5 covariates,)[1, 0]1.1240.1210.9391.3140.0030.0021942.431465.611
beta(SparseLogisticRegression with 5 covariates,)[2, 0]0.020.03-0.0270.0680.0010.0011384.371871.511.002
beta(SparseLogisticRegression with 5 covariates,)[3, 0]0.0350.033-0.010.090.0010.001826.0911277.331.001
beta(SparseLogisticRegression with 5 covariates,)[4, 0]-0.0550.037-0.1140.0040.0010.0011291.371073.181
lambda_(SparseLogisticRegression with 5 covariates,)[0, 0]5.2033.8941.5029.2010.1490.105750.873843.6181.002
lambda_(SparseLogisticRegression with 5 covariates,)[1, 0]3.4532.6290.8566.0090.1190.084634.313619.9521.004
lambda_(SparseLogisticRegression with 5 covariates,)[2, 0]0.4990.6930.0011.1370.0240.017512.808721.7961.003
lambda_(SparseLogisticRegression with 5 covariates,)[3, 0]0.631.3601.2650.0480.034413.859566.3791.002
lambda_(SparseLogisticRegression with 5 covariates,)[4, 0]0.6410.85901.2970.0270.019595.7732.4011.002

The summary above includes useful statistics about each marginal distribution. Below we describe what some of these useful statistics are.

R^\hat{R} diagnostic

R^\hat{R} is a diagnostic tool that measures the between- and within-chain variances. It is a test that indicates a lack of convergence by comparing the variance between multiple chains to the variance within each chain. If the parameters are successfully exploring the full space for each chain, then R^1\hat{R}\approx 1, since the between-chain and within-chain variance should be equal. R^\hat{R} is calculated as

R^=V^W\hat{R}=\frac{\hat{V}}{W}

where WW is the within-chain variance and V^\hat{V} is the posterior variance estimate for the pooled rank-traces. The take-away here is that R^\hat{R} converges towards 1 when each of the Markov chains approaches perfect adaptation to the true posterior distribution. We do not recommend using inference results if R^>1.01\hat{R}>1.01. More information about R^\hat{R} can be found in the Vehtari et al paper.

Effective sample size essess diagnostic

MCMC samplers do not draw independent samples from the target distribution, which means that our samples are correlated. In an ideal situation all samples would be independent, but we do not have that luxury. We can, however, measure the number of effectively independent samples we draw, which is called the effective sample size. You can read more about how this value is calculated in the Vehtari et al paper, briefly it is a measure that combines information from the R^\hat{R} value with the autocorrelation estimates within the chains. There are many ways to estimate effective samples sizes, however, we will be using the method defined in the Vehtari et al paper.

The rule of thumb for ess_bulk is for this value to be greater than 100 per chain on average. Since we ran four chains, we need ess_bulk to be greater than 200 for each parameter. The ess_tail is an estimate for effectively independent samples considering the more extreme values of the posterior. This is not the number of samples that landed in the tails of the posterior. It is a measure of the number of effectively independent samples if we sampled the tails of the posterior. The rule of thumb for this value is also to be greater than 100 per chain on average.

For comparison, let's check out the model's performance using a couple of different inference methods.

Single-site Metropolis-Hastings

Let's retry this problem, using ancestral Metropolis-Hastings as the inference algorithm to compare performance.

Ancestral Metropolis-Hastings is a simple inference algorithm, which proposes child random variables conditional on values for the parent random variables. The most ancestral random variables are simply sampled from the prior distribution.

queries_mh = [model.lambda_(), model.beta(), model.log_prob_test()]
num_samples_mh = 2 * num_samples

samples_mh = bm.SingleSiteAncestralMetropolisHastings().infer(
queries=queries_mh,
observations=observations,
num_samples=num_samples_mh,
num_chains=num_chains,
)
Out:

Samples collected: 0%| | 0/2000 [00:00<?, ?it/s]

Samples collected: 0%| | 0/2000 [00:00<?, ?it/s]

lambda_marginal_mh = (
samples_mh[model.lambda_()].flatten(start_dim=0, end_dim=1).detach()
)
beta_marginal_mh = samples_mh[model.beta()].flatten(start_dim=0, end_dim=1).detach()
log_prob_test_results_mh = (
samples_mh[model.log_prob_test()].flatten(start_dim=0, end_dim=1).detach()
)
# Required for visualizing in Colab.
output_notebook(hide_banner=True)

beta_marginals_mh_plot = plots.marginal_2d(
x=beta_marginal_mh[:, :2][:, 0].flatten().numpy(),
y=beta_marginal_mh[:, :2][:, 1].flatten().numpy(),
x_label="β0",
y_label="β1",
title="β0 - β1 joint plot / Single Site Ancestral Metropolis Hastings",
true_x=float(true_beta[:2, :].flatten()[0]),
true_y=float(true_beta[:2, :].flatten()[1]),
)
show(beta_marginals_mh_plot)
loading...
# Required for visualizing in Colab.
output_notebook(hide_banner=True)

lambda_marginals_mh_plot = plots.marginal_2d(
x=lambda_marginal_mh[:, :2][:, 0].flatten().numpy(),
y=lambda_marginal_mh[:, :2][:, 1].flatten().numpy(),
x_label="λ0",
y_label="λ1",
title="λ0 - λ1 joint plot / Single Site Ancestral Metropolis Hastings",
)
show(lambda_marginals_mh_plot)
loading...

From all of the above plots, we see that ancestral Metropolis-Hastings does a significantly worse job at recovering the true parameters! Not only do regions of uncertainty tend to exclude the true values, the samples that are actually drawn are very sparse. This means that the algorithm is achieving a very poor representation of the posterior surface.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

y = log_prob_test_results_mh.numpy()[:1000]
x = np.arange(len(y))
cds = ColumnDataSource({"x": x.tolist(), "y": y.tolist()})
tips = [("Log probability", "@y{0.000}"), ("Iteration", "@x")]
log_prob_mh_plot = plots.line_plot(
plot_sources=[cds],
labels=[f"Using true params = {ground_truth_log_prob :.2f}"],
tooltips=[tips],
figure_kwargs={
"title": "Log probability",
"plot_width": 800,
"x_axis_label": "Iteration",
"y_axis_label": "Log probability",
},
plot_kwargs={"line_alpha": 0.5},
)
# Add a line showing the ground truth.
span = Span(
location=ground_truth_log_prob,
dimension="width",
line_color="black",
line_width=3,
)
log_prob_mh_plot.add_layout(span)
log_prob_mh_plot.legend.location = "bottom_left"
show(log_prob_mh_plot)
loading...

Surprisingly, the algorithm does eventually discover — and settle in on — parameters that seem to describe the test data well.

summary_mh_df = az.summary(samples_mh.to_inference_data(), round_to=3)
Markdown(summary_mh_df.to_markdown())
meansdhdi_5.5%hdi_94.5%mcse_meanmcse_sdess_bulkess_tailr_hat
log_prob_test(SparseLogisticRegression with 5 covariates,)-34.39118.654-37.696-30.9671.7731.2574.10414.8361.517
beta(SparseLogisticRegression with 5 covariates,)[0, 0]-1.6020.23-1.785-1.1970.0950.0813.7352.3121.485
beta(SparseLogisticRegression with 5 covariates,)[1, 0]0.9890.1880.7251.1520.0960.083.3269.7391.604
beta(SparseLogisticRegression with 5 covariates,)[2, 0]00.111-0.1170.1150.0740.0612.3132.3122.825
beta(SparseLogisticRegression with 5 covariates,)[3, 0]0.0410.0450.0170.1130.0110.0086.9092.8361.477
beta(SparseLogisticRegression with 5 covariates,)[4, 0]-0.0040.09-0.1010.0750.0590.0482.6262.3232.682
lambda_(SparseLogisticRegression with 5 covariates,)[0, 0]2.3890.6741.7543.1410.4570.3812.2242.023.968
lambda_(SparseLogisticRegression with 5 covariates,)[1, 0]4.0160.5593.9094.3770.2980.2552.2897.6132.915
lambda_(SparseLogisticRegression with 5 covariates,)[2, 0]0.1850.0790.1760.2170.0110.0084.4948.6932.887
lambda_(SparseLogisticRegression with 5 covariates,)[3, 0]1.0851.6010.3151.0490.7110.5352.2612.0083.069
lambda_(SparseLogisticRegression with 5 covariates,)[4, 0]0.4390.3140.1490.6960.2090.1732.2842.0082.941

R^\hat{R} values are extremely far from 1.01.0. Further, the essess and values for our sampled random variables are extremely small. As a result, these inference results would be unusable for any real application.

In comparison, NUTS seems to have developed a much more complete representation of the posterior surface. Now that we have validated the model on a synthetic dataset, we will look at a real-world example.

German-numeric data

In the above examples, we used a 2D dataset, simply to make visualizing the inferences more intuitive. Let's retry our sparse logistic regression on a real-world dataset, the German credit dataset.

The German credit dataset is a collection of 1,000 data points. Each data point represents a person who borrows from a bank. Each person is classified as either a good or a bad credit risk according to the bank. Each person contains 24 numeric covariates that may or may not be useful predictors for credit risk. Example covariates include age, sex, and savings. The response variable is either 1, indicating good credit, or 2, indicating bad credit. You can read more about this dataset, and in particular what each of the first 24 covariate columns represent in the documentation.

input_data = np.genfromtxt(
"https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data-numeric"
)

Let's scale covariates to range between -1 and 1, and add a constant factor. This yields a total of 25 covariates. We'll also translate the response variable to [0, 1].

X_all = torch.from_numpy(
np.hstack(
[
np.ones((input_data.shape[0], 1)),
sklearn.preprocessing.minmax_scale(
input_data[:, :-1],
feature_range=(-1, 1),
),
]
)
).float()
Y_all = torch.from_numpy(sklearn.preprocessing.minmax_scale(input_data[:, -1:])).float()

Now, we'll split into training and test data.

X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(
X_all,
Y_all,
)

The rest of the setup is the same as the tutorial with synthetic data.

First, we'll instantiate our model.

model = SparseLogisticRegression(X_train, X_test, Y_test)

Next, we'll bind our observations.

observations = {model.y(): Y_train}

Now, we're ready to fit the model.

german_queries = [
model.lambda_(),
model.beta(),
model.log_prob_test(),
model.y_probs(),
]
german_samples = bm.GlobalNoUTurnSampler(nnc_compile=nnc_compile).infer(
queries=german_queries,
observations=observations,
num_adaptive_samples=num_adaptive_samples,
num_samples=num_samples,
num_chains=num_chains,
)
Out:

Samples collected: 0%| | 0/1500 [00:00<?, ?it/s]

Samples collected: 0%| | 0/1500 [00:00<?, ?it/s]

german_lambda_marginal = (
german_samples[model.lambda_()].flatten(start_dim=0, end_dim=1).detach()
)
german_beta_marginal = (
german_samples[model.beta()].flatten(start_dim=0, end_dim=1).detach()
)
german_log_prob_test_results = (
german_samples[model.log_prob_test()].flatten(start_dim=0, end_dim=1).detach()
)

This model is too high-dimensional to plot the same way that we did with the previous examples. Let us look at the log density of the samples generated by MCMC. Sometimes the chains can get stuck making little progress, which can be seen from this plot. We observe that NUTS is able to draw samples quite efficiently from across the posterior distribution (compare with Metropolis-Hastings above).

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

german_y = german_log_prob_test_results[:1000]
german_x = np.arange(len(y))
german_cds = ColumnDataSource({"x": german_x.tolist(), "y": german_y.tolist()})
german_tips = [("Log probability", "@y{0.000}"), ("Iteration", "@x")]
german_log_prob_plot = plots.line_plot(
plot_sources=[german_cds],
tooltips=[german_tips],
figure_kwargs={
"title": "Log probability for German credit risk data",
"plot_width": 800,
"x_axis_label": "Iteration",
"y_axis_label": "Log probability",
},
plot_kwargs={"line_alpha": 0.5},
)
show(german_log_prob_plot)
loading...

Let us also look at the diagnostics to ensure that the MCMC chains have converged to the posterior distribution. We note that R^\hat{R} are all close to 1 and the essess values look reasonable.

filtered_samples = {
k: v for k, v in german_samples.items() if k in {model.beta(), model.lambda_()}
}
az_data = MonteCarloSamples(filtered_samples).to_inference_data()
german_summary_df = az.summary(az_data, round_to=3)
Markdown(german_summary_df.to_markdown())
meansdhdi_5.5%hdi_94.5%mcse_meanmcse_sdess_bulkess_tailr_hat
beta(SparseLogisticRegression with 25 covariates,)[0, 0]-0.2180.356-0.7560.230.0120.008953.991267.421.003
beta(SparseLogisticRegression with 25 covariates,)[1, 0]-0.7820.12-0.965-0.5880.0030.0021616.331357.341.001
beta(SparseLogisticRegression with 25 covariates,)[2, 0]1.3470.2990.8671.8190.0070.0051915.531591.961.002
beta(SparseLogisticRegression with 25 covariates,)[3, 0]-0.4750.205-0.799-0.1560.0060.005984.341782.1781.001
beta(SparseLogisticRegression with 25 covariates,)[4, 0]0.0570.189-0.1890.3860.0040.0031863.571583.561.001
beta(SparseLogisticRegression with 25 covariates,)[5, 0]-0.2910.144-0.537-0.070.0040.0031077.02591.5821.003
beta(SparseLogisticRegression with 25 covariates,)[6, 0]-0.3480.179-0.5870.0060.0070.005785.977456.2591
beta(SparseLogisticRegression with 25 covariates,)[7, 0]-0.1230.158-0.3580.1060.0040.0031294.971653.241.001
beta(SparseLogisticRegression with 25 covariates,)[8, 0]-0.0110.083-0.1550.1080.0020.0011682.911639.81
beta(SparseLogisticRegression with 25 covariates,)[9, 0]0.1830.154-0.0360.4050.0050.003894.5531343.351.001
beta(SparseLogisticRegression with 25 covariates,)[10, 0]-0.0680.147-0.3480.1060.0040.0031349.611360.981
beta(SparseLogisticRegression with 25 covariates,)[11, 0]-0.2140.128-0.3910.0110.0040.0031046.93507.061.005
beta(SparseLogisticRegression with 25 covariates,)[12, 0]0.1110.174-0.1140.3910.0040.0031545.61841.11.001
beta(SparseLogisticRegression with 25 covariates,)[13, 0]0.0130.079-0.1070.1450.0020.0011694.561724.961.002
beta(SparseLogisticRegression with 25 covariates,)[14, 0]-0.0880.089-0.2330.0340.0030.0021096.821181.231.002
beta(SparseLogisticRegression with 25 covariates,)[15, 0]-0.2750.294-0.7270.0730.0090.007832.8491529.491.007
beta(SparseLogisticRegression with 25 covariates,)[16, 0]0.2430.1230.0470.4460.0050.003686.886199.9851.003
beta(SparseLogisticRegression with 25 covariates,)[17, 0]-0.310.202-0.5940.0180.0070.005717.282510.9551.002
beta(SparseLogisticRegression with 25 covariates,)[18, 0]0.2210.212-0.0430.560.0070.005629.2291192.161.009
beta(SparseLogisticRegression with 25 covariates,)[19, 0]0.2630.285-0.1040.6890.0090.006781.1961412.121.008
beta(SparseLogisticRegression with 25 covariates,)[20, 0]0.1510.138-0.0560.3590.0040.0031081.011508.051
beta(SparseLogisticRegression with 25 covariates,)[21, 0]-0.0620.108-0.2530.0820.0030.002929.6261422.191.002
beta(SparseLogisticRegression with 25 covariates,)[22, 0]0.0150.14-0.1920.2490.0040.0031312.97809.9341.003
beta(SparseLogisticRegression with 25 covariates,)[23, 0]0.0210.086-0.1170.1530.0020.0011606.221499.371.001
beta(SparseLogisticRegression with 25 covariates,)[24, 0]-0.0160.072-0.1310.10.0020.002886.206968.0571.001
lambda_(SparseLogisticRegression with 25 covariates,)[0, 0]2.1012.94304.7950.0980.071039.311018.091.002
lambda_(SparseLogisticRegression with 25 covariates,)[1, 0]5.3074.3081.2699.5950.1580.112726.059940.2541
lambda_(SparseLogisticRegression with 25 covariates,)[2, 0]8.7857.2651.75616.1590.2580.183627.7591031.121.004
lambda_(SparseLogisticRegression with 25 covariates,)[3, 0]3.6324.0560.2767.0880.1310.093695.994859.2731.006
lambda_(SparseLogisticRegression with 25 covariates,)[4, 0]1.2941.8450.0012.7950.0440.0331544.61963.1831.003
lambda_(SparseLogisticRegression with 25 covariates,)[5, 0]2.5923.2260.0054.6820.120.085768.199695.6691
lambda_(SparseLogisticRegression with 25 covariates,)[6, 0]2.8363.3570.0035.1790.1170.083603.47590.5620.999
lambda_(SparseLogisticRegression with 25 covariates,)[7, 0]1.4152.1020.0052.8450.060.046912.9851057.831.006
lambda_(SparseLogisticRegression with 25 covariates,)[8, 0]0.9511.4720.0021.9420.0720.051842.951239.8971.003
lambda_(SparseLogisticRegression with 25 covariates,)[9, 0]1.7322.2880.0013.3450.070.049692.473735.361.003
lambda_(SparseLogisticRegression with 25 covariates,)[10, 0]1.3212.2680.0012.6050.0660.0471168.431075.631.001
lambda_(SparseLogisticRegression with 25 covariates,)[11, 0]1.7971.9080.0153.3920.060.042667.351630.8211.002
lambda_(SparseLogisticRegression with 25 covariates,)[12, 0]1.4172.0740.0032.9410.0590.0451363.81021.441.005
lambda_(SparseLogisticRegression with 25 covariates,)[13, 0]0.911.23801.850.0330.0231213.41759.3931.005
lambda_(SparseLogisticRegression with 25 covariates,)[14, 0]1.1611.6070.0012.3310.0470.033912.611700.1831.001
lambda_(SparseLogisticRegression with 25 covariates,)[15, 0]2.2452.4970.0015.2060.0720.051844.69724.2331.007
lambda_(SparseLogisticRegression with 25 covariates,)[16, 0]2.2092.7140.0234.0230.0790.056952.933671.0741.002
lambda_(SparseLogisticRegression with 25 covariates,)[17, 0]2.5082.7870.0034.8570.1030.073544.242342.131.002
lambda_(SparseLogisticRegression with 25 covariates,)[18, 0]2.112.6970.0074.1870.0790.056653.669699.5921.008
lambda_(SparseLogisticRegression with 25 covariates,)[19, 0]2.5594.1160.0014.9160.4180.383279.064212.1121.014
lambda_(SparseLogisticRegression with 25 covariates,)[20, 0]1.5681.8940.0023.0970.060.042661.574711.9951.002
lambda_(SparseLogisticRegression with 25 covariates,)[21, 0]1.0531.3290.0022.090.0360.0251098.9798.2291.002
lambda_(SparseLogisticRegression with 25 covariates,)[22, 0]1.181.7060.0032.3940.0560.0431126.21024.511.003
lambda_(SparseLogisticRegression with 25 covariates,)[23, 0]0.9171.43801.870.0420.032958.948792.4671.003
lambda_(SparseLogisticRegression with 25 covariates,)[24, 0]0.861.3690.0021.750.0440.0311157.461089.541.002

To check for sparsity, let us plot the histogram of the marginals for a selection of coefficients (β\beta) in the model. We note that many of the parameters have their posterior probability mass concentrated at 0, resulting in a sparse model, as we desired.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

german_betas = german_samples[model.beta()].squeeze(-1).reshape(-1, 25)
german_plot_sources = []
german_labels = []
german_tooltips = []
for i in range(25):
b = german_betas[:, i].numpy()
support, density = az.stats.density_utils.kde(b)
density /= density.max()
cds = ColumnDataSource({"x": support.tolist(), "y": density.tolist()})
label = f"β{i}"
tips = [(f"{label}", "@support{0.000}")]
german_plot_sources.append(cds)
german_labels.append(label)
german_tooltips.append(tips)

german_betas_density_plot = plots.line_plot(
plot_sources=german_plot_sources,
labels=german_labels,
tooltips=german_tooltips,
plot_kwargs={"line_width": 1, "line_alpha": 0.4},
figure_kwargs={
"title": "Histogram of some sample model coefficients",
"x_axis_label": "β",
"plot_height": 700,
},
)
german_betas_density_plot.yaxis.visible = False
german_betas_density_plot.legend.click_policy = "mute"
show(german_betas_density_plot)
loading...

Finally, let us evaluate the mean predictions from the model using the ROC (Receiver Operating Characteristic) plot. The area under the curve (AUC) is 0.83 which shows that the model is able to achieve a good classification accuracy. This is merely evaluating the mean forecast from the model, but as we saw earlier, we have a full distribution of the probability scores corresponding to each test data.

y_true = Y_test.squeeze(-1).numpy()
y_pred_mean = german_samples[model.y_probs()].squeeze(-1).mean([0, 1]).numpy()
false_positive_rate, true_positive_rate, _ = sklearn.metrics._ranking.roc_curve(
y_true,
y_pred_mean,
)
roc_score = sklearn.metrics.roc_auc_score(y_true, y_pred_mean)
roc_cds = ColumnDataSource(
{
"x": false_positive_rate.tolist(),
"y": true_positive_rate.tolist(),
}
)
roc_tips = [("TPF", "@y{0.000}"), ("FPR", "@x{0.000}")]
roc_plot = plots.line_plot(
plot_sources=[roc_cds],
tooltips=[roc_tips],
labels=[f"Classifier (AUC = {roc_score:.2f})"],
figure_kwargs={
"title": "Receiver Operator Characteristic",
"x_axis_label": "False positive rate (positive label: 1)",
"y_axis_label": "True positive rate (positive label: 1)",
},
plot_kwargs={
"line_width": 3,
"line_alpha": 0.7,
"hover_line_color": "orange",
"hover_line_alpha": 1,
},
)
roc_plot.legend.location = "bottom_right"
show(roc_plot)
loading...

References

  • Carvalho CM, Polson NG, Scott JG. Handling sparsity via the horseshoe. In: van Dyk D, Welling M, editors. Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics; 2009. pp. 73–80. Available from http://proceedings.mlr.press/v5/carvalho09a/carvalho09a.pdf.
  • Piironen J, Vehtari A. Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics. 2017;11(2) 5018–5051. doi: 10.1214/17-EJS1337SI.