Sparse logistic regression

Tutorial: Sparse Logistic Regression

This tutorial demonstrates modeling and running inference on a sparse logistic regression model in Bean Machine. This tutorial showcases the inference techniques in Bean Machine, and applies the model to a public dataset to evaluate performance. It also introduces the @bm.functional decorator, which can be used to deterministically transform random variables which can be convenient for post-processing.

Problem

Logistic regression is a commonly used statistical method that allows us to predict a binary output from a set of independent variables. The sparse logistic regression is a type of logistic regression model which embeds feature selection in classification by adding overall and per-dimension scale factors. It is very applicable when dealing with high-dimensional data, such as classifying credit scores.

Sparse logistic regression is a hierarchical model with a sparse prior. We will use a horseshoe prior, Carvalho CM, Polson NG, Scott JG which induces sparsity using a combination of a global shrinkage scale factor that pushes the posterior mass of most model parameters to zero and a local scale that allows some of them to escape shrinkage.

Prerequisites

We will be using the following packages within this tutorial.

arviz and bokeh for interactive visualizations;
pandas, numpy, and scikit-learn for data manipulation.

Please import the following code packages for the rest of the code in the tutorial to work.

# Install Bean Machine in Colab if using Colab.
import sys


if "google.colab" in sys.modules and "beanmachine" not in sys.modules:
    !pip install beanmachine

import os
import warnings

import arviz as az
import beanmachine.ppl as bm
import numpy as np
import sklearn
import sklearn.model_selection
import torch
import torch.distributions as dist
from beanmachine.ppl.inference.monte_carlo_samples import MonteCarloSamples
from beanmachine.tutorials.utils import plots
from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource, Span
from bokeh.plotting import show
from IPython.display import Markdown
from sklearn.metrics import RocCurveDisplay

The next cell includes convenient configuration settings to improve the notebook presentation as well as setting a manual seed for reproducibility.

# Eliminate excess UserWarnings from ArviZ.
warnings.filterwarnings("ignore")

# Plotting settings
az.rcParams["plot.backend"] = "bokeh"
az.rcParams["stats.hdi_prob"] = 0.89

# Manual seed
bm.seed(17)

# Other settings for the notebook
smoke_test = "SANDCASTLE_NEXUS" in os.environ or "CI" in os.environ

Model

We have the following definitions.

$N$ : Size of the dataset.
$D$ : Number of features of the dataset.
$\tau$ : Global shrinkage of the model (input from the user).
$\beta_d$ : Coefficient corresponding to dimension $d\in D$ .
$\lambda_d$ : Local shrinkage for the coefficient corresponding to dimension $d\in D$ .

The model is defined mathematically as follows:

$\lambda_d\stackrel{iid}{\sim}\text{HalfCauchy}(0,1)$
$\beta_d\stackrel{iid}{\sim}\mathcal{N}(0,\tau\lambda)$
$y_n\stackrel{iid}{\sim}\text{Bernoulli}(\sigma({X}^\textsf{T}\beta))$

A few notes:

$\sigma(s)=\frac{1}{1+e^{-s}}$ is the logistic function. Its purpose is to translate an unconstrained score $s\in(-\infty,\infty)$ predicted by the model into a probability $\sigma(s)\in(0,1)$ .
$\tau$ can be an input from the user which depends on the expected number of non-zeros coefficients in the model. Alternatively, we can have a full Bayesian treatment of $\tau$ as suggested by Piironen and Vehtari. For simplicity, this is simply a constant in this tutorial but this can be replaced with a HalfNormal prior with the same scale as $\tau$ . This parameter is responsible for global shrinkage, whereas $\lambda_d$ tends to shrink the influence of $\beta_d$ but since the Cauchy distribution has fat tails, it can also help coefficients escape shrinkage.

We can implement this model in Bean Machine by defining random variable objects with the @bm.random_variable decorator. These functions behave differently than ordinary Python functions.

Semantics for @bm.random_variable functions:

They must return PyTorch Distribution objects.
Though they return distributions, callees actually receive samples from the distribution. The machinery for obtaining samples from distributions is handled internally by Bean Machine.
Inference runs the model through many iterations. During a particular inference iteration, a distinct random variable will correspond to exactly one sampled value: calls to the same random variable function with the same arguments will receive the same sampled value within one inference iteration. This makes it easy for multiple components of your model to refer to the same logical random variable.
Consequently, to define distinct random variables that correspond to different sampled values during a particular inference iteration, an effective practice is to add a dummy "indexing" parameter to the function. Distinct random variables can be referred to with different values for this index.
Please see the documentation for more information about this decorator.

Semantics for @bm.functional:

This decorator is used to deterministically transform the results of one or more random variables.

To implement sparse logistic regression model in Bean Machine, we provide @bm.random_variable definitions for $\lambda$ , $\beta$ , and $y$ . The value of $\tau$ is estimated from the number of expected non-zero coefficients and is an input from the user.

This is all you have to do to define the model. However, we'll also make use of the @bm.functional decorator to make it very easy to store the predictions (rather than the binary outcome, we'll store the probability score of being labeled 1) on the test data and for computing the log likelihood. This decorator has the same semantics as @bm.random_variable, except that it does not return a distribution. Instead, it returns a deterministically-computed function from other random variables. It can be used to conveniently compute values that would typically be computed in a post-processing pass. Here, we use it to compute the log probability of test data, using inferences made on training data.

class SparseLogisticRegression:
    def __init__(self, X_train, X_test, Y_test, nonzero_frac=0.3):
        self.X_train = X_train
        self.X_test = X_test
        self.Y_test = Y_test
        # See: Piironen and Vehtari
        self.tau = (
            2 * nonzero_frac / ((1 - nonzero_frac) * np.sqrt(self.X_train.shape[1]))
        )

    @bm.random_variable
    def lambda_(self):
        return dist.HalfCauchy(1.0).expand([self.X_train.shape[1], 1])

    @bm.random_variable
    def eps(self):
        return dist.Normal(0, 1).expand([self.X_train.shape[1], 1])

    @bm.random_variable
    def y(self):
        return dist.Bernoulli(logits=self.X_train @ self.beta())

    @bm.functional
    def beta(self):
        return self.tau * self.lambda_() * self.eps()

    @bm.functional
    def log_prob_test(self):
        return (
            dist.Bernoulli(logits=self.X_test @ self.beta()).log_prob(self.Y_test).sum()
        )

    @bm.functional
    def y_probs(self):
        return torch.sigmoid(self.X_test @ self.beta())

    def __repr__(self):
        return f"SparseLogisticRegression with {self.X_train.shape[1]} covariates"

Data

With the model defined, we need to collect some observed data in order to learn about values of interest in our model. For this tutorial, we'll use 1,000 five-dimensional data points where the response only depends on the first two, so that we can visualize what's going on.

We will generate a dataset where items have true label 0 and true label 1. For demonstrative purposes, we will use a synthetically generated dataset of observed values. In practice, you would gather real data and classify results by-hand, for example using human labelers.

X = dist.Normal(0, 5).expand([1000, 5]).sample()

tau0 = 2 * 0.3 / (0.7 * np.sqrt(1000))
lambda0 = torch.tensor([-70, 40, 0, 0, 0]).unsqueeze(-1)
true_beta = tau0 * lambda0

Y = dist.Bernoulli(logits=X @ true_beta).sample()

print(f"τ0: {tau0}")
print(f" β: {true_beta}")

Out:

τ0: 0.02710523708715754
 β: tensor([[-1.8974],
        [ 1.0842],
        [ 0.0000],
        [ 0.0000],
        [ 0.0000]])

Let's take a moment to visualize our dataset.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

red_x = X[:, :2][Y[:, 0] == 0, 0]
red_y = X[:, :2][Y[:, 0] == 0, 1]
red_label = ["red"] * len(red_x)
red_cds = ColumnDataSource(
    {"x": red_x.tolist(), "y": red_y.tolist(), "label": red_label}
)
red_tips = [("y", "@y{0.000}"), ("x", "@x{0.000}"), ("Label", "@label")]
green_x = X[:, :2][Y[:, 0] == 1, 0]
green_y = X[:, :2][Y[:, 0] == 1, 1]
green_label = ["green"] * len(green_x)
green_cds = ColumnDataSource(
    {"x": green_x.tolist(), "y": green_y.tolist(), "label": green_label}
)
green_tips = [("y", "@y{0.000}"), ("x", "@x{0.000}"), ("Label", "@label")]

synthetic_data_plot = plots.scatter_plot(
    plot_sources=[red_cds, green_cds],
    tooltips=[red_tips, green_tips],
    figure_kwargs={
        "title": "Synthetic data",
        "x_axis_label": "x",
        "y_axis_label": "y",
    },
    legend_items=["Label 0", "Label 1"],
    plot_kwargs={"fill_color": "label", "fill_alpha": 0.5},
)

show(synthetic_data_plot)

Now, we will split the dataset into a training set and a test set.

X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y)

Now that we've got our data defined, we can instantiate an instance of the model.

model = SparseLogisticRegression(X_train, X_test, Y_test)

Our inference algorithms expect observations in the form of a dictionary. This dictionary should consist of @bm.random_variable invocations as keys, and tensor data as values. In order to bind this data, first we'll instantiate the model.

observations = {model.y(): Y_train}

Inference

Inference is the process of combining model with data to obtain insights, in the form of probability distributions over values of interest. Bean Machine offers a powerful and general inference framework to enable fitting arbitrary models to data.

Since this model is comprised entirely of differentiable random variables, we'll make use of the No U-Turn Sampler (NUTS) (Hoffman & Gelman, 2011).

Running inference consists of a few arguments:

Name	Usage
`queries`	List of `@bm.random_variable` targets to fit posterior distributions for.
`observations`	A dictionary of observations, as built above.
`num_samples`	Number of Monte Carlo samples to approximate the posterior distributions for the variables in `queries`.
`num_chains`	Number of separate inference runs to use. Multiple chains can help verify that inference ran correctly.

Let's run inference:

num_samples = 2 if smoke_test else 1000
num_adaptive_samples = 0 if smoke_test else num_samples // 2
num_chains = 1 if smoke_test else 2
# Experimental backend using the Pytorch NNC compiler
nnc_compile = "SANDCASTLE_NEXUS" not in os.environ

queries = [model.lambda_(), model.beta(), model.log_prob_test(), model.y_probs()]

samples = bm.GlobalNoUTurnSampler(nnc_compile=nnc_compile).infer(
    queries=queries,
    observations=observations,
    num_adaptive_samples=num_adaptive_samples,
    num_samples=num_samples,
    num_chains=num_chains,
)

Out:

Samples collected:   0%|          | 0/1500 [00:00<?, ?it/s]
Samples collected:   0%|          | 0/1500 [00:00<?, ?it/s]

Analysis

samples now contains our inference results.

First, we'll just print previews of the results. This should give a sense of how to work with the samples object, and also an idea of the shapes of the inferred values.

lambda_marginal = samples[model.lambda_()].flatten(start_dim=0, end_dim=1).detach()
beta_marginal = samples[model.beta()].flatten(start_dim=0, end_dim=1).detach()
log_prob_test_results = samples[model.log_prob_test()][0].detach()

print(
    f"lambda_marginal:\n{lambda_marginal}\n\n"
    f"beta_marginal:\n{beta_marginal}\n\n"
    f"log_prob_test_results:\n{log_prob_test_results[:20]}"
)

Out:

lambda_marginal:
tensor([[[2.5968e+00],
         [4.3340e+00],
         [3.2614e-01],
         [6.6289e-01],
         [2.0303e-01]],
        [[4.0693e+00],
         [5.4920e+00],
         [6.1293e-01],
         [6.7453e-01],
         [7.8361e-01]],
        [[3.3681e+00],
         [5.0646e+00],
         [6.6694e-01],
         [4.4641e-01],
         [1.0607e+00]],
        ...,
        [[4.2111e+00],
         [7.1516e+00],
         [8.1101e-01],
         [3.8489e-02],
         [5.8832e-01]],
        [[7.4676e+00],
         [2.9920e+00],
         [1.5182e-01],
         [6.4247e-03],
         [4.9245e-01]],
        [[2.7034e+00],
         [3.3743e+00],
         [8.3389e-02],
         [2.8964e-02],
         [6.7897e-01]]])
beta_marginal:
tensor([[[-1.6024],
         [ 0.9802],
         [ 0.0034],
         [ 0.0079],
         [-0.0106]],
        [[-2.1407],
         [ 1.3242],
         [ 0.0538],
         [ 0.0720],
         [-0.0625]],
        [[-1.6937],
         [ 1.0745],
         [ 0.0660],
         [ 0.0489],
         [-0.0571]],
        ...,
        [[-1.7365],
         [ 1.1237],
         [ 0.0771],
         [-0.0349],
         [-0.0976]],
        [[-2.0167],
         [ 1.2773],
         [ 0.0269],
         [-0.0059],
         [-0.0655]],
        [[-1.9680],
         [ 1.1761],
         [-0.0377],
         [-0.0356],
         [-0.0355]]])
log_prob_test_results:
tensor([-29.5386, -33.0372, -32.8408, -32.0504, -32.3691, -33.4062, -30.8081,
        -30.1592, -31.7247, -30.3864, -30.9796, -31.5982, -31.6301, -33.7440,
        -34.1124, -31.4896, -33.1069, -31.1966, -34.1167, -33.6170])

Next, let's visualize the inferred random variables.

Below we plot the joint of the first two components of $\beta$ and $\lambda$ (as a reminder, the remaining components are noise which the outcome variable doesn't depend on).

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

beta_marginals_plot = plots.marginal_2d(
    x=beta_marginal[:, :2][:, 0].flatten().numpy(),
    y=beta_marginal[:, :2][:, 1].flatten().numpy(),
    x_label="β0",
    y_label="β1",
    title="β0 - β1 joint plot",
    true_x=float(true_beta[:2, :].flatten()[0]),
    true_y=float(true_beta[:2, :].flatten()[1]),
)
show(beta_marginals_plot)

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

lambda_marginals_plot = plots.marginal_2d(
    x=lambda_marginal[:, :2][:, 0].flatten().numpy(),
    y=lambda_marginal[:, :2][:, 1].flatten().numpy(),
    x_label="λ0",
    y_label="λ1",
    title="λ0 - λ1 joint plot",
)
show(lambda_marginals_plot)

The marginals and the joint for $\beta$ and $\lambda$ look reasonable. We note that posterior mean for $\beta$ closely matches the true values used to generate the data. Note that the $\lambda$ values can be fairly large if needed since we place a HalfCauchy prior over it. Even in the presence of strong global shrinkage, this lets local parameters take on larger values if needed.

Now let us look at the histogram for the remaining values of $\beta$ , i.e. the coefficients for variables which have no effect on the outcome.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

betas = samples[model.beta()].squeeze(-1).reshape(-1, 5)
plot_sources = []
labels = []
tooltips = []
for i in range(5):
    b = betas[:, i].numpy()
    support, density = az.stats.density_utils.kde(b)
    density /= density.max()
    cds = ColumnDataSource({"x": support.tolist(), "y": density.tolist()})
    label = f"β{i}"
    tips = [(f"{label}", "@support{0.000}")]
    plot_sources.append(cds)
    labels.append(label)
    tooltips.append(tips)

betas_density_plot = plots.line_plot(
    plot_sources=plot_sources,
    labels=labels,
    tooltips=tooltips,
    plot_kwargs={"line_width": 2},
    figure_kwargs={
        "title": "Histogram of some sample model coefficients",
        "x_axis_label": "β",
    },
)
betas_density_plot.yaxis.visible = False
betas_density_plot.legend.click_policy = "mute"
show(betas_density_plot)

We note that the majority of the posterior mass on all the variables that the outcome doesn't depend on is centered at 0 and the only non-zero values are the first two variables. This validates how the horseshoe prior helps induce sparsity.

Lastly, let's plot the first thousand log probabilities on test data that we generated per-iteration during inference. We'll also overlay the log probability that the test dataset would score on the ground truth parameters.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

ground_truth_log_prob = float(
    dist.Bernoulli(logits=X_test @ true_beta).log_prob(Y_test).sum()
)
y = log_prob_test_results.numpy()
x = np.arange(len(y))
cds = ColumnDataSource({"x": x.tolist(), "y": y.tolist()})
tips = [("Log probability", "@y{0.000}"), ("Iteration", "@x")]
log_prob_plot = plots.line_plot(
    plot_sources=[cds],
    labels=[f"Using true params = {ground_truth_log_prob :.2f}"],
    tooltips=[tips],
    figure_kwargs={
        "title": "Log probability",
        "plot_width": 800,
        "x_axis_label": "Iteration",
        "y_axis_label": "Log probability",
    },
    plot_kwargs={"line_alpha": 0.5},
)
# Add a line showing the ground truth.
span = Span(
    location=ground_truth_log_prob,
    dimension="width",
    line_color="black",
    line_width=3,
)
log_prob_plot.add_layout(span)
log_prob_plot.legend.location = "bottom_left"
show(log_prob_plot)

Let us plot the predictions from the model on the test dataset as a sanity check. For simplicity, we will compute the mean of the probability score (computed as y_probs in the model) on each test data point and assume a label of 0 if the score is less than 0.5 and conversely a label of 1 if the score is above 0.5.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

y_probs = samples[model.y_probs()].mean([0, 1])
predicted_red_x = X_test[y_probs[:, 0] < 0.5, 0]
predicted_red_y = X_test[y_probs[:, 0] < 0.5, 1]
predicted_red_label = ["red"] * len(predicted_red_x)
predicted_red_cds = ColumnDataSource(
    {
        "x": predicted_red_x.tolist(),
        "y": predicted_red_y.tolist(),
        "label": predicted_red_label,
    }
)
predicted_red_tips = [("y", "@y{0.000}"), ("x", "@x{0.000}"), ("Label", "@label")]
predicted_green_x = X_test[y_probs[:, 0] >= 0.5, 0]
predicted_green_y = X_test[y_probs[:, 0] >= 0.5, 1]
predicted_green_label = ["green"] * len(predicted_green_x)
predicted_green_cds = ColumnDataSource(
    {
        "x": predicted_green_x.tolist(),
        "y": predicted_green_y.tolist(),
        "label": predicted_green_label,
    }
)
predicted_green_tips = [("y", "@y{0.000}"), ("x", "@x{0.000}"), ("Label", "@label")]
predicted_data_plot = plots.scatter_plot(
    plot_sources=[predicted_red_cds, predicted_green_cds],
    tooltips=[predicted_red_tips, predicted_green_tips],
    figure_kwargs={
        "title": "Synthetic data",
        "x_axis_label": "x",
        "y_axis_label": "y",
    },
    legend_items=["Predicted 0", "Predicted 1"],
    plot_kwargs={"fill_color": "label", "fill_alpha": 0.5},
)
# Select one point and annotate it.
predicted_data_plot.circle(
    x=[float(X_test[11, 0])],
    y=[float(X_test[11, 1])],
    size=20,
    fill_color=None,
    line_color="steelblue",
    line_width=2,
    line_alpha=1,
)
predicted_data_plot.legend.location = "bottom_left"
show(predicted_data_plot)

As we can see, the model seems to have recovered the same decision boundary, and MCMC has converged to parameters that correctly predicted the test dataset. This was just a sanity check. Since we have built a probabilistic model, the probability that each data point has label 1 is itself a distribution! Let us see this for one of the test data points (marked by a blue circle above).

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

data_point11_plot = plots.histogram_plot(
    data=samples[model.y_probs()][..., 11, 0].reshape(-1).numpy(),
    figure_kwargs={"title": "Probability of test datapoint 11 having Label 1"},
)
show(data_point11_plot)

The MonteCarloSamples from Bean Machine's inference can be converted to arviz's internal InferenceData format so that we can make use of its inference statistics and plotting utilities for diagnostics.

filtered_samples = {
    k: v for k, v in samples.items() if k in {model.beta(), model.lambda_()}
}
az_data = MonteCarloSamples(filtered_samples).to_inference_data()
summary_df = az.summary(az_data, round_to=3)
Markdown(summary_df.to_markdown())

	mean	sd	hdi_5.5%	hdi_94.5%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
beta(SparseLogisticRegression with 5 covariates,)[0, 0]	-1.834	0.195	-2.112	-1.506	0.004	0.003	1928.09	1492.71	0.999
beta(SparseLogisticRegression with 5 covariates,)[1, 0]	1.124	0.121	0.939	1.314	0.003	0.002	1942.43	1465.61	1
beta(SparseLogisticRegression with 5 covariates,)[2, 0]	0.02	0.03	-0.027	0.068	0.001	0.001	1384.37	1871.51	1.002
beta(SparseLogisticRegression with 5 covariates,)[3, 0]	0.035	0.033	-0.01	0.09	0.001	0.001	826.091	1277.33	1.001
beta(SparseLogisticRegression with 5 covariates,)[4, 0]	-0.055	0.037	-0.114	0.004	0.001	0.001	1291.37	1073.18	1
lambda_(SparseLogisticRegression with 5 covariates,)[0, 0]	5.203	3.894	1.502	9.201	0.149	0.105	750.873	843.618	1.002
lambda_(SparseLogisticRegression with 5 covariates,)[1, 0]	3.453	2.629	0.856	6.009	0.119	0.084	634.313	619.952	1.004
lambda_(SparseLogisticRegression with 5 covariates,)[2, 0]	0.499	0.693	0.001	1.137	0.024	0.017	512.808	721.796	1.003
lambda_(SparseLogisticRegression with 5 covariates,)[3, 0]	0.63	1.36	0	1.265	0.048	0.034	413.859	566.379	1.002
lambda_(SparseLogisticRegression with 5 covariates,)[4, 0]	0.641	0.859	0	1.297	0.027	0.019	595.7	732.401	1.002

The summary above includes useful statistics about each marginal distribution. Below we describe what some of these useful statistics are.

$\hat{R}$ diagnostic

$\hat{R}$ is a diagnostic tool that measures the between- and within-chain variances. It is a test that indicates a lack of convergence by comparing the variance between multiple chains to the variance within each chain. If the parameters are successfully exploring the full space for each chain, then $\hat{R}\approx 1$ , since the between-chain and within-chain variance should be equal. $\hat{R}$ is calculated as

\hat{R}=\frac{\hat{V}}{W}

where $W$ is the within-chain variance and $\hat{V}$ is the posterior variance estimate for the pooled rank-traces. The take-away here is that $\hat{R}$ converges towards 1 when each of the Markov chains approaches perfect adaptation to the true posterior distribution. We do not recommend using inference results if $\hat{R}>1.01$ . More information about $\hat{R}$ can be found in the Vehtari et al paper.

Effective sample size $ess$ diagnostic

MCMC samplers do not draw independent samples from the target distribution, which means that our samples are correlated. In an ideal situation all samples would be independent, but we do not have that luxury. We can, however, measure the number of effectively independent samples we draw, which is called the effective sample size. You can read more about how this value is calculated in the Vehtari et al paper, briefly it is a measure that combines information from the $\hat{R}$ value with the autocorrelation estimates within the chains. There are many ways to estimate effective samples sizes, however, we will be using the method defined in the Vehtari et al paper.

The rule of thumb for ess_bulk is for this value to be greater than 100 per chain on average. Since we ran four chains, we need ess_bulk to be greater than 200 for each parameter. The ess_tail is an estimate for effectively independent samples considering the more extreme values of the posterior. This is not the number of samples that landed in the tails of the posterior. It is a measure of the number of effectively independent samples if we sampled the tails of the posterior. The rule of thumb for this value is also to be greater than 100 per chain on average.

For comparison, let's check out the model's performance using a couple of different inference methods.

Single-site Metropolis-Hastings

Let's retry this problem, using ancestral Metropolis-Hastings as the inference algorithm to compare performance.

Ancestral Metropolis-Hastings is a simple inference algorithm, which proposes child random variables conditional on values for the parent random variables. The most ancestral random variables are simply sampled from the prior distribution.

queries_mh = [model.lambda_(), model.beta(), model.log_prob_test()]
num_samples_mh = 2 * num_samples

samples_mh = bm.SingleSiteAncestralMetropolisHastings().infer(
    queries=queries_mh,
    observations=observations,
    num_samples=num_samples_mh,
    num_chains=num_chains,
)

Out:

Samples collected:   0%|          | 0/2000 [00:00<?, ?it/s]
Samples collected:   0%|          | 0/2000 [00:00<?, ?it/s]

lambda_marginal_mh = (
    samples_mh[model.lambda_()].flatten(start_dim=0, end_dim=1).detach()
)
beta_marginal_mh = samples_mh[model.beta()].flatten(start_dim=0, end_dim=1).detach()
log_prob_test_results_mh = (
    samples_mh[model.log_prob_test()].flatten(start_dim=0, end_dim=1).detach()
)

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

beta_marginals_mh_plot = plots.marginal_2d(
    x=beta_marginal_mh[:, :2][:, 0].flatten().numpy(),
    y=beta_marginal_mh[:, :2][:, 1].flatten().numpy(),
    x_label="β0",
    y_label="β1",
    title="β0 - β1 joint plot / Single Site Ancestral Metropolis Hastings",
    true_x=float(true_beta[:2, :].flatten()[0]),
    true_y=float(true_beta[:2, :].flatten()[1]),
)
show(beta_marginals_mh_plot)

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

lambda_marginals_mh_plot = plots.marginal_2d(
    x=lambda_marginal_mh[:, :2][:, 0].flatten().numpy(),
    y=lambda_marginal_mh[:, :2][:, 1].flatten().numpy(),
    x_label="λ0",
    y_label="λ1",
    title="λ0 - λ1 joint plot / Single Site Ancestral Metropolis Hastings",
)
show(lambda_marginals_mh_plot)

From all of the above plots, we see that ancestral Metropolis-Hastings does a significantly worse job at recovering the true parameters! Not only do regions of uncertainty tend to exclude the true values, the samples that are actually drawn are very sparse. This means that the algorithm is achieving a very poor representation of the posterior surface.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

y = log_prob_test_results_mh.numpy()[:1000]
x = np.arange(len(y))
cds = ColumnDataSource({"x": x.tolist(), "y": y.tolist()})
tips = [("Log probability", "@y{0.000}"), ("Iteration", "@x")]
log_prob_mh_plot = plots.line_plot(
    plot_sources=[cds],
    labels=[f"Using true params = {ground_truth_log_prob :.2f}"],
    tooltips=[tips],
    figure_kwargs={
        "title": "Log probability",
        "plot_width": 800,
        "x_axis_label": "Iteration",
        "y_axis_label": "Log probability",
    },
    plot_kwargs={"line_alpha": 0.5},
)
# Add a line showing the ground truth.
span = Span(
    location=ground_truth_log_prob,
    dimension="width",
    line_color="black",
    line_width=3,
)
log_prob_mh_plot.add_layout(span)
log_prob_mh_plot.legend.location = "bottom_left"
show(log_prob_mh_plot)

Surprisingly, the algorithm does eventually discover — and settle in on — parameters that seem to describe the test data well.

summary_mh_df = az.summary(samples_mh.to_inference_data(), round_to=3)
Markdown(summary_mh_df.to_markdown())

	mean	sd	hdi_5.5%	hdi_94.5%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
log_prob_test(SparseLogisticRegression with 5 covariates,)	-34.391	18.654	-37.696	-30.967	1.773	1.257	4.104	14.836	1.517
beta(SparseLogisticRegression with 5 covariates,)[0, 0]	-1.602	0.23	-1.785	-1.197	0.095	0.081	3.735	2.312	1.485
beta(SparseLogisticRegression with 5 covariates,)[1, 0]	0.989	0.188	0.725	1.152	0.096	0.08	3.326	9.739	1.604
beta(SparseLogisticRegression with 5 covariates,)[2, 0]	0	0.111	-0.117	0.115	0.074	0.061	2.313	2.312	2.825
beta(SparseLogisticRegression with 5 covariates,)[3, 0]	0.041	0.045	0.017	0.113	0.011	0.008	6.909	2.836	1.477
beta(SparseLogisticRegression with 5 covariates,)[4, 0]	-0.004	0.09	-0.101	0.075	0.059	0.048	2.626	2.323	2.682
lambda_(SparseLogisticRegression with 5 covariates,)[0, 0]	2.389	0.674	1.754	3.141	0.457	0.381	2.224	2.02	3.968
lambda_(SparseLogisticRegression with 5 covariates,)[1, 0]	4.016	0.559	3.909	4.377	0.298	0.255	2.289	7.613	2.915
lambda_(SparseLogisticRegression with 5 covariates,)[2, 0]	0.185	0.079	0.176	0.217	0.011	0.008	4.494	8.693	2.887
lambda_(SparseLogisticRegression with 5 covariates,)[3, 0]	1.085	1.601	0.315	1.049	0.711	0.535	2.261	2.008	3.069
lambda_(SparseLogisticRegression with 5 covariates,)[4, 0]	0.439	0.314	0.149	0.696	0.209	0.173	2.284	2.008	2.941

$\hat{R}$ values are extremely far from $1.0$ . Further, the $ess$ and values for our sampled random variables are extremely small. As a result, these inference results would be unusable for any real application.

In comparison, NUTS seems to have developed a much more complete representation of the posterior surface. Now that we have validated the model on a synthetic dataset, we will look at a real-world example.

German-numeric data

In the above examples, we used a 2D dataset, simply to make visualizing the inferences more intuitive. Let's retry our sparse logistic regression on a real-world dataset, the German credit dataset.

The German credit dataset is a collection of 1,000 data points. Each data point represents a person who borrows from a bank. Each person is classified as either a good or a bad credit risk according to the bank. Each person contains 24 numeric covariates that may or may not be useful predictors for credit risk. Example covariates include age, sex, and savings. The response variable is either 1, indicating good credit, or 2, indicating bad credit. You can read more about this dataset, and in particular what each of the first 24 covariate columns represent in the documentation.

input_data = np.genfromtxt(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data-numeric"
)

Let's scale covariates to range between -1 and 1, and add a constant factor. This yields a total of 25 covariates. We'll also translate the response variable to [0, 1].

X_all = torch.from_numpy(
    np.hstack(
        [
            np.ones((input_data.shape[0], 1)),
            sklearn.preprocessing.minmax_scale(
                input_data[:, :-1],
                feature_range=(-1, 1),
            ),
        ]
    )
).float()
Y_all = torch.from_numpy(sklearn.preprocessing.minmax_scale(input_data[:, -1:])).float()

Now, we'll split into training and test data.

X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(
    X_all,
    Y_all,
)

The rest of the setup is the same as the tutorial with synthetic data.

First, we'll instantiate our model.

model = SparseLogisticRegression(X_train, X_test, Y_test)

Next, we'll bind our observations.

observations = {model.y(): Y_train}

Now, we're ready to fit the model.

german_queries = [
    model.lambda_(),
    model.beta(),
    model.log_prob_test(),
    model.y_probs(),
]
german_samples = bm.GlobalNoUTurnSampler(nnc_compile=nnc_compile).infer(
    queries=german_queries,
    observations=observations,
    num_adaptive_samples=num_adaptive_samples,
    num_samples=num_samples,
    num_chains=num_chains,
)

Out:

Samples collected:   0%|          | 0/1500 [00:00<?, ?it/s]
Samples collected:   0%|          | 0/1500 [00:00<?, ?it/s]

german_lambda_marginal = (
    german_samples[model.lambda_()].flatten(start_dim=0, end_dim=1).detach()
)
german_beta_marginal = (
    german_samples[model.beta()].flatten(start_dim=0, end_dim=1).detach()
)
german_log_prob_test_results = (
    german_samples[model.log_prob_test()].flatten(start_dim=0, end_dim=1).detach()
)

This model is too high-dimensional to plot the same way that we did with the previous examples. Let us look at the log density of the samples generated by MCMC. Sometimes the chains can get stuck making little progress, which can be seen from this plot. We observe that NUTS is able to draw samples quite efficiently from across the posterior distribution (compare with Metropolis-Hastings above).

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

german_y = german_log_prob_test_results[:1000]
german_x = np.arange(len(y))
german_cds = ColumnDataSource({"x": german_x.tolist(), "y": german_y.tolist()})
german_tips = [("Log probability", "@y{0.000}"), ("Iteration", "@x")]
german_log_prob_plot = plots.line_plot(
    plot_sources=[german_cds],
    tooltips=[german_tips],
    figure_kwargs={
        "title": "Log probability for German credit risk data",
        "plot_width": 800,
        "x_axis_label": "Iteration",
        "y_axis_label": "Log probability",
    },
    plot_kwargs={"line_alpha": 0.5},
)
show(german_log_prob_plot)

Let us also look at the diagnostics to ensure that the MCMC chains have converged to the posterior distribution. We note that $\hat{R}$ are all close to 1 and the $ess$ values look reasonable.

filtered_samples = {
    k: v for k, v in german_samples.items() if k in {model.beta(), model.lambda_()}
}
az_data = MonteCarloSamples(filtered_samples).to_inference_data()
german_summary_df = az.summary(az_data, round_to=3)
Markdown(german_summary_df.to_markdown())

	mean	sd	hdi_5.5%	hdi_94.5%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
beta(SparseLogisticRegression with 25 covariates,)[0, 0]	-0.218	0.356	-0.756	0.23	0.012	0.008	953.99	1267.42	1.003
beta(SparseLogisticRegression with 25 covariates,)[1, 0]	-0.782	0.12	-0.965	-0.588	0.003	0.002	1616.33	1357.34	1.001
beta(SparseLogisticRegression with 25 covariates,)[2, 0]	1.347	0.299	0.867	1.819	0.007	0.005	1915.53	1591.96	1.002
beta(SparseLogisticRegression with 25 covariates,)[3, 0]	-0.475	0.205	-0.799	-0.156	0.006	0.005	984.341	782.178	1.001
beta(SparseLogisticRegression with 25 covariates,)[4, 0]	0.057	0.189	-0.189	0.386	0.004	0.003	1863.57	1583.56	1.001
beta(SparseLogisticRegression with 25 covariates,)[5, 0]	-0.291	0.144	-0.537	-0.07	0.004	0.003	1077.02	591.582	1.003
beta(SparseLogisticRegression with 25 covariates,)[6, 0]	-0.348	0.179	-0.587	0.006	0.007	0.005	785.977	456.259	1
beta(SparseLogisticRegression with 25 covariates,)[7, 0]	-0.123	0.158	-0.358	0.106	0.004	0.003	1294.97	1653.24	1.001
beta(SparseLogisticRegression with 25 covariates,)[8, 0]	-0.011	0.083	-0.155	0.108	0.002	0.001	1682.91	1639.8	1
beta(SparseLogisticRegression with 25 covariates,)[9, 0]	0.183	0.154	-0.036	0.405	0.005	0.003	894.553	1343.35	1.001
beta(SparseLogisticRegression with 25 covariates,)[10, 0]	-0.068	0.147	-0.348	0.106	0.004	0.003	1349.61	1360.98	1
beta(SparseLogisticRegression with 25 covariates,)[11, 0]	-0.214	0.128	-0.391	0.011	0.004	0.003	1046.93	507.06	1.005
beta(SparseLogisticRegression with 25 covariates,)[12, 0]	0.111	0.174	-0.114	0.391	0.004	0.003	1545.6	1841.1	1.001
beta(SparseLogisticRegression with 25 covariates,)[13, 0]	0.013	0.079	-0.107	0.145	0.002	0.001	1694.56	1724.96	1.002
beta(SparseLogisticRegression with 25 covariates,)[14, 0]	-0.088	0.089	-0.233	0.034	0.003	0.002	1096.82	1181.23	1.002
beta(SparseLogisticRegression with 25 covariates,)[15, 0]	-0.275	0.294	-0.727	0.073	0.009	0.007	832.849	1529.49	1.007
beta(SparseLogisticRegression with 25 covariates,)[16, 0]	0.243	0.123	0.047	0.446	0.005	0.003	686.886	199.985	1.003
beta(SparseLogisticRegression with 25 covariates,)[17, 0]	-0.31	0.202	-0.594	0.018	0.007	0.005	717.282	510.955	1.002
beta(SparseLogisticRegression with 25 covariates,)[18, 0]	0.221	0.212	-0.043	0.56	0.007	0.005	629.229	1192.16	1.009
beta(SparseLogisticRegression with 25 covariates,)[19, 0]	0.263	0.285	-0.104	0.689	0.009	0.006	781.196	1412.12	1.008
beta(SparseLogisticRegression with 25 covariates,)[20, 0]	0.151	0.138	-0.056	0.359	0.004	0.003	1081.01	1508.05	1
beta(SparseLogisticRegression with 25 covariates,)[21, 0]	-0.062	0.108	-0.253	0.082	0.003	0.002	929.626	1422.19	1.002
beta(SparseLogisticRegression with 25 covariates,)[22, 0]	0.015	0.14	-0.192	0.249	0.004	0.003	1312.97	809.934	1.003
beta(SparseLogisticRegression with 25 covariates,)[23, 0]	0.021	0.086	-0.117	0.153	0.002	0.001	1606.22	1499.37	1.001
beta(SparseLogisticRegression with 25 covariates,)[24, 0]	-0.016	0.072	-0.131	0.1	0.002	0.002	886.206	968.057	1.001
lambda_(SparseLogisticRegression with 25 covariates,)[0, 0]	2.101	2.943	0	4.795	0.098	0.07	1039.31	1018.09	1.002
lambda_(SparseLogisticRegression with 25 covariates,)[1, 0]	5.307	4.308	1.269	9.595	0.158	0.112	726.059	940.254	1
lambda_(SparseLogisticRegression with 25 covariates,)[2, 0]	8.785	7.265	1.756	16.159	0.258	0.183	627.759	1031.12	1.004
lambda_(SparseLogisticRegression with 25 covariates,)[3, 0]	3.632	4.056	0.276	7.088	0.131	0.093	695.994	859.273	1.006
lambda_(SparseLogisticRegression with 25 covariates,)[4, 0]	1.294	1.845	0.001	2.795	0.044	0.033	1544.61	963.183	1.003
lambda_(SparseLogisticRegression with 25 covariates,)[5, 0]	2.592	3.226	0.005	4.682	0.12	0.085	768.199	695.669	1
lambda_(SparseLogisticRegression with 25 covariates,)[6, 0]	2.836	3.357	0.003	5.179	0.117	0.083	603.47	590.562	0.999
lambda_(SparseLogisticRegression with 25 covariates,)[7, 0]	1.415	2.102	0.005	2.845	0.06	0.046	912.985	1057.83	1.006
lambda_(SparseLogisticRegression with 25 covariates,)[8, 0]	0.951	1.472	0.002	1.942	0.072	0.051	842.951	239.897	1.003
lambda_(SparseLogisticRegression with 25 covariates,)[9, 0]	1.732	2.288	0.001	3.345	0.07	0.049	692.473	735.36	1.003
lambda_(SparseLogisticRegression with 25 covariates,)[10, 0]	1.321	2.268	0.001	2.605	0.066	0.047	1168.43	1075.63	1.001
lambda_(SparseLogisticRegression with 25 covariates,)[11, 0]	1.797	1.908	0.015	3.392	0.06	0.042	667.351	630.821	1.002
lambda_(SparseLogisticRegression with 25 covariates,)[12, 0]	1.417	2.074	0.003	2.941	0.059	0.045	1363.8	1021.44	1.005
lambda_(SparseLogisticRegression with 25 covariates,)[13, 0]	0.91	1.238	0	1.85	0.033	0.023	1213.41	759.393	1.005
lambda_(SparseLogisticRegression with 25 covariates,)[14, 0]	1.161	1.607	0.001	2.331	0.047	0.033	912.611	700.183	1.001
lambda_(SparseLogisticRegression with 25 covariates,)[15, 0]	2.245	2.497	0.001	5.206	0.072	0.051	844.69	724.233	1.007
lambda_(SparseLogisticRegression with 25 covariates,)[16, 0]	2.209	2.714	0.023	4.023	0.079	0.056	952.933	671.074	1.002
lambda_(SparseLogisticRegression with 25 covariates,)[17, 0]	2.508	2.787	0.003	4.857	0.103	0.073	544.242	342.13	1.002
lambda_(SparseLogisticRegression with 25 covariates,)[18, 0]	2.11	2.697	0.007	4.187	0.079	0.056	653.669	699.592	1.008
lambda_(SparseLogisticRegression with 25 covariates,)[19, 0]	2.559	4.116	0.001	4.916	0.418	0.383	279.064	212.112	1.014
lambda_(SparseLogisticRegression with 25 covariates,)[20, 0]	1.568	1.894	0.002	3.097	0.06	0.042	661.574	711.995	1.002
lambda_(SparseLogisticRegression with 25 covariates,)[21, 0]	1.053	1.329	0.002	2.09	0.036	0.025	1098.9	798.229	1.002
lambda_(SparseLogisticRegression with 25 covariates,)[22, 0]	1.18	1.706	0.003	2.394	0.056	0.043	1126.2	1024.51	1.003
lambda_(SparseLogisticRegression with 25 covariates,)[23, 0]	0.917	1.438	0	1.87	0.042	0.032	958.948	792.467	1.003
lambda_(SparseLogisticRegression with 25 covariates,)[24, 0]	0.86	1.369	0.002	1.75	0.044	0.031	1157.46	1089.54	1.002

To check for sparsity, let us plot the histogram of the marginals for a selection of coefficients ( $\beta$ ) in the model. We note that many of the parameters have their posterior probability mass concentrated at 0, resulting in a sparse model, as we desired.

# Required for visualizing in Colab.
output_notebook(hide_banner=True)

german_betas = german_samples[model.beta()].squeeze(-1).reshape(-1, 25)
german_plot_sources = []
german_labels = []
german_tooltips = []
for i in range(25):
    b = german_betas[:, i].numpy()
    support, density = az.stats.density_utils.kde(b)
    density /= density.max()
    cds = ColumnDataSource({"x": support.tolist(), "y": density.tolist()})
    label = f"β{i}"
    tips = [(f"{label}", "@support{0.000}")]
    german_plot_sources.append(cds)
    german_labels.append(label)
    german_tooltips.append(tips)

german_betas_density_plot = plots.line_plot(
    plot_sources=german_plot_sources,
    labels=german_labels,
    tooltips=german_tooltips,
    plot_kwargs={"line_width": 1, "line_alpha": 0.4},
    figure_kwargs={
        "title": "Histogram of some sample model coefficients",
        "x_axis_label": "β",
        "plot_height": 700,
    },
)
german_betas_density_plot.yaxis.visible = False
german_betas_density_plot.legend.click_policy = "mute"
show(german_betas_density_plot)

Finally, let us evaluate the mean predictions from the model using the ROC (Receiver Operating Characteristic) plot. The area under the curve (AUC) is 0.83 which shows that the model is able to achieve a good classification accuracy. This is merely evaluating the mean forecast from the model, but as we saw earlier, we have a full distribution of the probability scores corresponding to each test data.

y_true = Y_test.squeeze(-1).numpy()
y_pred_mean = german_samples[model.y_probs()].squeeze(-1).mean([0, 1]).numpy()
false_positive_rate, true_positive_rate, _ = sklearn.metrics._ranking.roc_curve(
    y_true,
    y_pred_mean,
)
roc_score = sklearn.metrics.roc_auc_score(y_true, y_pred_mean)
roc_cds = ColumnDataSource(
    {
        "x": false_positive_rate.tolist(),
        "y": true_positive_rate.tolist(),
    }
)
roc_tips = [("TPF", "@y{0.000}"), ("FPR", "@x{0.000}")]
roc_plot = plots.line_plot(
    plot_sources=[roc_cds],
    tooltips=[roc_tips],
    labels=[f"Classifier (AUC = {roc_score:.2f})"],
    figure_kwargs={
        "title": "Receiver Operator Characteristic",
        "x_axis_label": "False positive rate (positive label: 1)",
        "y_axis_label": "True positive rate (positive label: 1)",
    },
    plot_kwargs={
        "line_width": 3,
        "line_alpha": 0.7,
        "hover_line_color": "orange",
        "hover_line_alpha": 1,
    },
)
roc_plot.legend.location = "bottom_right"
show(roc_plot)

References

Carvalho CM, Polson NG, Scott JG. Handling sparsity via the horseshoe. In: van Dyk D, Welling M, editors. Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics; 2009. pp. 73–80. Available from http://proceedings.mlr.press/v5/carvalho09a/carvalho09a.pdf.
Piironen J, Vehtari A. Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics. 2017;11(2) 5018–5051. doi: 10.1214/17-EJS1337SI.

Tutorial: Sparse Logistic Regression​

Problem​

Prerequisites​

Model​

Data​

Inference​

Analysis​

R^\hat{R}R^ diagnostic​

Effective sample size essessess diagnostic​

Single-site Metropolis-Hastings​

German-numeric data​

References

Tutorial: Sparse Logistic Regression

Problem

Prerequisites

Model

Data

Inference

Analysis

$\hat{R}$ diagnostic

Effective sample size $ess$ diagnostic

Single-site Metropolis-Hastings

German-numeric data