# Sparse logistic regression

## Tutorial: **Sparse Logistic Regression**β

This tutorial demonstrates modeling and running inference on a sparse logistic
regression model in Bean Machine. This tutorial showcases the inference techniques in
Bean Machine, and applies the model to a public dataset to evaluate performance. It also
introduces the `@bm.functional`

decorator, which can be used to deterministically
transform random variables which can be convenient for post-processing.

## Problemβ

Logistic regression is a commonly used statistical method that allows us to predict a binary output from a set of independent variables. The sparse logistic regression is a type of logistic regression model which embeds feature selection in classification by adding overall and per-dimension scale factors. It is very applicable when dealing with high-dimensional data, such as classifying credit scores.

Sparse logistic regression is a hierarchical model with a sparse prior. We will use a horseshoe prior, Carvalho CM, Polson NG, Scott JG which induces sparsity using a combination of a global shrinkage scale factor that pushes the posterior mass of most model parameters to zero and a local scale that allows some of them to escape shrinkage.

## Prerequisitesβ

We will be using the following packages within this tutorial.

- arviz and bokeh for interactive visualizations;
- pandas, numpy, and scikit-learn for data manipulation.

Please import the following code packages for the rest of the code in the tutorial to work.

`# Install Bean Machine in Colab if using Colab.`

import sys

if "google.colab" in sys.modules and "beanmachine" not in sys.modules:

!pip install beanmachine

`import os`

import warnings

import arviz as az

import beanmachine.ppl as bm

import numpy as np

import sklearn

import sklearn.model_selection

import torch

import torch.distributions as dist

from beanmachine.ppl.inference.monte_carlo_samples import MonteCarloSamples

from beanmachine.tutorials.utils import plots

from bokeh.io import output_notebook

from bokeh.models import ColumnDataSource, Span

from bokeh.plotting import show

from IPython.display import Markdown

from sklearn.metrics import RocCurveDisplay

The next cell includes convenient configuration settings to improve the notebook presentation as well as setting a manual seed for reproducibility.

`# Eliminate excess UserWarnings from ArviZ.`

warnings.filterwarnings("ignore")

# Plotting settings

az.rcParams["plot.backend"] = "bokeh"

az.rcParams["stats.hdi_prob"] = 0.89

# Manual seed

bm.seed(17)

# Other settings for the notebook

smoke_test = "SANDCASTLE_NEXUS" in os.environ or "CI" in os.environ

## Modelβ

We have the following definitions.

- $N$: Size of the dataset.
- $D$: Number of features of the dataset.
- $\tau$: Global shrinkage of the model (input from the user).
- $\beta_d$: Coefficient corresponding to dimension $d\in D$.
- $\lambda_d$: Local shrinkage for the coefficient corresponding to dimension $d\in D$.

The model is defined mathematically as follows:

- $\lambda_d\stackrel{iid}{\sim}\text{HalfCauchy}(0,1)$
- $\beta_d\stackrel{iid}{\sim}\mathcal{N}(0,\tau\lambda)$
- $y_n\stackrel{iid}{\sim}\text{Bernoulli}(\sigma({X}^\textsf{T}\beta))$

A few notes:

- $\sigma(s)=\frac{1}{1+e^{-s}}$ is the logistic function. Its purpose is to translate an unconstrained score $s\in(-\infty,\infty)$ predicted by the model into a probability $\sigma(s)\in(0,1)$.
- $\tau$ can be an input from the user which depends on the expected number of non-zeros
coefficients in the model. Alternatively, we can have a full Bayesian treatment of
$\tau$ as suggested by Piironen and Vehtari. For simplicity, this is
simply a constant in this tutorial but this can be replaced with a
`HalfNormal`

prior with the same scale as $\tau$. This parameter is responsible for global shrinkage, whereas $\lambda_d$ tends to shrink the influence of $\beta_d$ but since the Cauchy distribution has fat tails, it can also help coefficients escape shrinkage.

We can implement this model in Bean Machine by defining random variable objects with the
`@bm.random_variable`

decorator. These functions behave differently than ordinary Python
functions.

`@bm.random_variable`

functions:- They must return PyTorch
`Distribution`

objects. - Though they return distributions, callees actually receive
*samples*from the distribution. The machinery for obtaining samples from distributions is handled internally by Bean Machine. - Inference runs the model through many iterations. During a particular inference iteration, a distinct random variable will correspond to exactly one sampled value:
**calls to the same random variable function with the same arguments will receive the same sampled value within one inference iteration**. This makes it easy for multiple components of your model to refer to the same logical random variable. - Consequently, to define distinct random variables that correspond to different sampled values during a particular inference iteration, an effective practice is to add a dummy "indexing" parameter to the function. Distinct random variables can be referred to with different values for this index.
- Please see the documentation for more information about this decorator.

`@bm.functional`

:- This decorator is used to deterministically transform the results of one or more random variables.

To implement sparse logistic regression model in Bean Machine, we provide
`@bm.random_variable`

definitions for $\lambda$, $\beta$, and $y$. The value of $\tau$
is estimated from the number of expected non-zero coefficients and is an input from the
user.

This is all you have to do to define the model. However, we'll also make use of the
`@bm.functional`

decorator to make it very easy to store the predictions (rather than
the binary outcome, we'll store the probability score of being labeled 1) on the test
data and for computing the log likelihood. This decorator has the same semantics as
`@bm.random_variable`

, except that it does not return a distribution. Instead, it
returns a deterministically-computed function from other random variables. It can be
used to conveniently compute values that would typically be computed in a
post-processing pass. Here, we use it to compute the log probability of test data, using
inferences made on training data.

`class SparseLogisticRegression:`

def __init__(self, X_train, X_test, Y_test, nonzero_frac=0.3):

self.X_train = X_train

self.X_test = X_test

self.Y_test = Y_test

# See: Piironen and Vehtari

self.tau = (

2 * nonzero_frac / ((1 - nonzero_frac) * np.sqrt(self.X_train.shape[1]))

)

@bm.random_variable

def lambda_(self):

return dist.HalfCauchy(1.0).expand([self.X_train.shape[1], 1])

@bm.random_variable

def eps(self):

return dist.Normal(0, 1).expand([self.X_train.shape[1], 1])

@bm.random_variable

def y(self):

return dist.Bernoulli(logits=self.X_train @ self.beta())

@bm.functional

def beta(self):

return self.tau * self.lambda_() * self.eps()

@bm.functional

def log_prob_test(self):

return (

dist.Bernoulli(logits=self.X_test @ self.beta()).log_prob(self.Y_test).sum()

)

@bm.functional

def y_probs(self):

return torch.sigmoid(self.X_test @ self.beta())

def __repr__(self):

return f"SparseLogisticRegression with {self.X_train.shape[1]} covariates"

## Dataβ

With the model defined, we need to collect some observed data in order to learn about values of interest in our model. For this tutorial, we'll use 1,000 five-dimensional data points where the response only depends on the first two, so that we can visualize what's going on.

We will generate a dataset where items have true label 0 and true label 1. For demonstrative purposes, we will use a synthetically generated dataset of observed values. In practice, you would gather real data and classify results by-hand, for example using human labelers.

`X = dist.Normal(0, 5).expand([1000, 5]).sample()`

tau0 = 2 * 0.3 / (0.7 * np.sqrt(1000))

lambda0 = torch.tensor([-70, 40, 0, 0, 0]).unsqueeze(-1)

true_beta = tau0 * lambda0

Y = dist.Bernoulli(logits=X @ true_beta).sample()

print(f"Ο0: {tau0}")

print(f" Ξ²: {true_beta}")

Let's take a moment to visualize our dataset.

`# Required for visualizing in Colab.`

output_notebook(hide_banner=True)

red_x = X[:, :2][Y[:, 0] == 0, 0]

red_y = X[:, :2][Y[:, 0] == 0, 1]

red_label = ["red"] * len(red_x)

red_cds = ColumnDataSource(

{"x": red_x.tolist(), "y": red_y.tolist(), "label": red_label}

)

red_tips = [("y", "@y{0.000}"), ("x", "@x{0.000}"), ("Label", "@label")]

green_x = X[:, :2][Y[:, 0] == 1, 0]

green_y = X[:, :2][Y[:, 0] == 1, 1]

green_label = ["green"] * len(green_x)

green_cds = ColumnDataSource(

{"x": green_x.tolist(), "y": green_y.tolist(), "label": green_label}

)

green_tips = [("y", "@y{0.000}"), ("x", "@x{0.000}"), ("Label", "@label")]

synthetic_data_plot = plots.scatter_plot(

plot_sources=[red_cds, green_cds],

tooltips=[red_tips, green_tips],

figure_kwargs={

"title": "Synthetic data",

"x_axis_label": "x",

"y_axis_label": "y",

},

legend_items=["Label 0", "Label 1"],

plot_kwargs={"fill_color": "label", "fill_alpha": 0.5},

)

show(synthetic_data_plot)

Now, we will split the dataset into a training set and a test set.

`X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y)`

Now that we've got our data defined, we can instantiate an instance of the model.

`model = SparseLogisticRegression(X_train, X_test, Y_test)`

Our inference algorithms expect observations in the form of a dictionary. This
dictionary should consist of `@bm.random_variable`

invocations as keys, and tensor data
as values. In order to bind this data, first we'll instantiate the model.

`observations = {model.y(): Y_train}`

## Inferenceβ

Inference is the process of combining *model* with *data* to obtain *insights*, in the
form of probability distributions over values of interest. Bean Machine offers a
powerful and general inference framework to enable fitting arbitrary models to data.

Since this model is comprised entirely of differentiable random variables, we'll make use of the No U-Turn Sampler (NUTS) (Hoffman & Gelman, 2011).

Running inference consists of a few arguments:

Name | Usage |
---|---|

`queries` | List of `@bm.random_variable` targets to fit posterior distributions for. |

`observations` | A dictionary of observations, as built above. |

`num_samples` | Number of Monte Carlo samples to approximate the posterior distributions for the variables in `queries` . |

`num_chains` | Number of separate inference runs to use. Multiple chains can help verify that inference ran correctly. |

Let's run inference:

`num_samples = 2 if smoke_test else 1000`

num_adaptive_samples = 0 if smoke_test else num_samples // 2

num_chains = 1 if smoke_test else 2

# Experimental backend using the Pytorch NNC compiler

nnc_compile = "SANDCASTLE_NEXUS" not in os.environ

`queries = [model.lambda_(), model.beta(), model.log_prob_test(), model.y_probs()]`

samples = bm.GlobalNoUTurnSampler(nnc_compile=nnc_compile).infer(

queries=queries,

observations=observations,

num_adaptive_samples=num_adaptive_samples,

num_samples=num_samples,

num_chains=num_chains,

)

## Analysisβ

`samples`

now contains our inference results.

First, we'll just print previews of the results. This should give a sense of how to work
with the `samples`

object, and also an idea of the shapes of the inferred values.

`lambda_marginal = samples[model.lambda_()].flatten(start_dim=0, end_dim=1).detach()`

beta_marginal = samples[model.beta()].flatten(start_dim=0, end_dim=1).detach()

log_prob_test_results = samples[model.log_prob_test()][0].detach()

print(

f"lambda_marginal:\n{lambda_marginal}\n\n"

f"beta_marginal:\n{beta_marginal}\n\n"

f"log_prob_test_results:\n{log_prob_test_results[:20]}"

)

Next, let's visualize the inferred random variables.

Below we plot the joint of the first two components of $\beta$ and $\lambda$ (as a reminder, the remaining components are noise which the outcome variable doesn't depend on).

`# Required for visualizing in Colab.`

output_notebook(hide_banner=True)

beta_marginals_plot = plots.marginal_2d(

x=beta_marginal[:, :2][:, 0].flatten().numpy(),

y=beta_marginal[:, :2][:, 1].flatten().numpy(),

x_label="Ξ²0",

y_label="Ξ²1",

title="Ξ²0 - Ξ²1 joint plot",

true_x=float(true_beta[:2, :].flatten()[0]),

true_y=float(true_beta[:2, :].flatten()[1]),

)

show(beta_marginals_plot)

`# Required for visualizing in Colab.`

output_notebook(hide_banner=True)

lambda_marginals_plot = plots.marginal_2d(

x=lambda_marginal[:, :2][:, 0].flatten().numpy(),

y=lambda_marginal[:, :2][:, 1].flatten().numpy(),

x_label="Ξ»0",

y_label="Ξ»1",

title="Ξ»0 - Ξ»1 joint plot",

)

show(lambda_marginals_plot)

The marginals and the joint for $\beta$ and $\lambda$ look reasonable. We note that
posterior mean for $\beta$ closely matches the true values used to generate the data.
Note that the $\lambda$ values can be fairly large if needed since we place a
`HalfCauchy`

prior over it. Even in the presence of strong global shrinkage, this lets
local parameters take on larger values if needed.

Now let us look at the histogram for the remaining values of $\beta$, i.e. the coefficients for variables which have no effect on the outcome.

`# Required for visualizing in Colab.`

output_notebook(hide_banner=True)

betas = samples[model.beta()].squeeze(-1).reshape(-1, 5)

plot_sources = []

labels = []

tooltips = []

for i in range(5):

b = betas[:, i].numpy()

support, density = az.stats.density_utils.kde(b)

density /= density.max()

cds = ColumnDataSource({"x": support.tolist(), "y": density.tolist()})

label = f"Ξ²{i}"

tips = [(f"{label}", "@support{0.000}")]

plot_sources.append(cds)

labels.append(label)

tooltips.append(tips)

betas_density_plot = plots.line_plot(

plot_sources=plot_sources,

labels=labels,

tooltips=tooltips,

plot_kwargs={"line_width": 2},

figure_kwargs={

"title": "Histogram of some sample model coefficients",

"x_axis_label": "Ξ²",

},

)

betas_density_plot.yaxis.visible = False

betas_density_plot.legend.click_policy = "mute"

show(betas_density_plot)

We note that the majority of the posterior mass on all the variables that the outcome doesn't depend on is centered at 0 and the only non-zero values are the first two variables. This validates how the horseshoe prior helps induce sparsity.

Lastly, let's plot the first thousand log probabilities on test data that we generated per-iteration during inference. We'll also overlay the log probability that the test dataset would score on the ground truth parameters.

`# Required for visualizing in Colab.`

output_notebook(hide_banner=True)

ground_truth_log_prob = float(

dist.Bernoulli(logits=X_test @ true_beta).log_prob(Y_test).sum()

)

y = log_prob_test_results.numpy()

x = np.arange(len(y))

cds = ColumnDataSource({"x": x.tolist(), "y": y.tolist()})

tips = [("Log probability", "@y{0.000}"), ("Iteration", "@x")]

log_prob_plot = plots.line_plot(

plot_sources=[cds],

labels=[f"Using true params = {ground_truth_log_prob :.2f}"],

tooltips=[tips],

figure_kwargs={

"title": "Log probability",

"plot_width": 800,

"x_axis_label": "Iteration",

"y_axis_label": "Log probability",

},

plot_kwargs={"line_alpha": 0.5},

)

# Add a line showing the ground truth.

span = Span(

location=ground_truth_log_prob,

dimension="width",

line_color="black",

line_width=3,

)

log_prob_plot.add_layout(span)

log_prob_plot.legend.location = "bottom_left"

show(log_prob_plot)

Let us plot the predictions from the model on the test dataset as a sanity check. For
simplicity, we will compute the mean of the probability score (computed as `y_probs`

in
the model) on each test data point and assume a label of 0 if the score is less than 0.5
and conversely a label of 1 if the score is above 0.5.

`# Required for visualizing in Colab.`

output_notebook(hide_banner=True)

y_probs = samples[model.y_probs()].mean([0, 1])

predicted_red_x = X_test[y_probs[:, 0] < 0.5, 0]

predicted_red_y = X_test[y_probs[:, 0] < 0.5, 1]

predicted_red_label = ["red"] * len(predicted_red_x)

predicted_red_cds = ColumnDataSource(

{

"x": predicted_red_x.tolist(),

"y": predicted_red_y.tolist(),

"label": predicted_red_label,

}

)

predicted_red_tips = [("y", "@y{0.000}"), ("x", "@x{0.000}"), ("Label", "@label")]

predicted_green_x = X_test[y_probs[:, 0] >= 0.5, 0]

predicted_green_y = X_test[y_probs[:, 0] >= 0.5, 1]

predicted_green_label = ["green"] * len(predicted_green_x)

predicted_green_cds = ColumnDataSource(

{

"x": predicted_green_x.tolist(),

"y": predicted_green_y.tolist(),

"label": predicted_green_label,

}

)

predicted_green_tips = [("y", "@y{0.000}"), ("x", "@x{0.000}"), ("Label", "@label")]

predicted_data_plot = plots.scatter_plot(

plot_sources=[predicted_red_cds, predicted_green_cds],

tooltips=[predicted_red_tips, predicted_green_tips],

figure_kwargs={

"title": "Synthetic data",

"x_axis_label": "x",

"y_axis_label": "y",

},

legend_items=["Predicted 0", "Predicted 1"],

plot_kwargs={"fill_color": "label", "fill_alpha": 0.5},

)

# Select one point and annotate it.

predicted_data_plot.circle(

x=[float(X_test[11, 0])],

y=[float(X_test[11, 1])],

size=20,

fill_color=None,

line_color="steelblue",

line_width=2,

line_alpha=1,

)

predicted_data_plot.legend.location = "bottom_left"

show(predicted_data_plot)

As we can see, the model seems to have recovered the same decision boundary, and MCMC has converged to parameters that correctly predicted the test dataset. This was just a sanity check. Since we have built a probabilistic model, the probability that each data point has label 1 is itself a distribution! Let us see this for one of the test data points (marked by a blue circle above).

`# Required for visualizing in Colab.`

output_notebook(hide_banner=True)

data_point11_plot = plots.histogram_plot(

data=samples[model.y_probs()][..., 11, 0].reshape(-1).numpy(),

figure_kwargs={"title": "Probability of test datapoint 11 having Label 1"},

)

show(data_point11_plot)

The `MonteCarloSamples`

from Bean Machine's inference can be converted to arviz's
internal `InferenceData`

format so that we can make use of its inference statistics and
plotting utilities for diagnostics.

`filtered_samples = {`

k: v for k, v in samples.items() if k in {model.beta(), model.lambda_()}

}

az_data = MonteCarloSamples(filtered_samples).to_inference_data()

summary_df = az.summary(az_data, round_to=3)

Markdown(summary_df.to_markdown())

mean | sd | hdi_5.5% | hdi_94.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|

beta(SparseLogisticRegression with 5 covariates,)[0, 0] | -1.834 | 0.195 | -2.112 | -1.506 | 0.004 | 0.003 | 1928.09 | 1492.71 | 0.999 |

beta(SparseLogisticRegression with 5 covariates,)[1, 0] | 1.124 | 0.121 | 0.939 | 1.314 | 0.003 | 0.002 | 1942.43 | 1465.61 | 1 |

beta(SparseLogisticRegression with 5 covariates,)[2, 0] | 0.02 | 0.03 | -0.027 | 0.068 | 0.001 | 0.001 | 1384.37 | 1871.51 | 1.002 |

beta(SparseLogisticRegression with 5 covariates,)[3, 0] | 0.035 | 0.033 | -0.01 | 0.09 | 0.001 | 0.001 | 826.091 | 1277.33 | 1.001 |

beta(SparseLogisticRegression with 5 covariates,)[4, 0] | -0.055 | 0.037 | -0.114 | 0.004 | 0.001 | 0.001 | 1291.37 | 1073.18 | 1 |

lambda_(SparseLogisticRegression with 5 covariates,)[0, 0] | 5.203 | 3.894 | 1.502 | 9.201 | 0.149 | 0.105 | 750.873 | 843.618 | 1.002 |

lambda_(SparseLogisticRegression with 5 covariates,)[1, 0] | 3.453 | 2.629 | 0.856 | 6.009 | 0.119 | 0.084 | 634.313 | 619.952 | 1.004 |

lambda_(SparseLogisticRegression with 5 covariates,)[2, 0] | 0.499 | 0.693 | 0.001 | 1.137 | 0.024 | 0.017 | 512.808 | 721.796 | 1.003 |

lambda_(SparseLogisticRegression with 5 covariates,)[3, 0] | 0.63 | 1.36 | 0 | 1.265 | 0.048 | 0.034 | 413.859 | 566.379 | 1.002 |

lambda_(SparseLogisticRegression with 5 covariates,)[4, 0] | 0.641 | 0.859 | 0 | 1.297 | 0.027 | 0.019 | 595.7 | 732.401 | 1.002 |

The summary above includes useful statistics about each marginal distribution. Below we describe what some of these useful statistics are.

#### $\hat{R}$ diagnosticβ

$\hat{R}$ is a diagnostic tool that measures the between- and within-chain variances. It is a test that indicates a lack of convergence by comparing the variance between multiple chains to the variance within each chain. If the parameters are successfully exploring the full space for each chain, then $\hat{R}\approx 1$, since the between-chain and within-chain variance should be equal. $\hat{R}$ is calculated as

where $W$ is the within-chain variance and $\hat{V}$ is the posterior variance estimate
for the pooled rank-traces. The take-away here is that $\hat{R}$ converges towards 1
when each of the Markov chains approaches perfect adaptation to the true posterior
distribution. We do not recommend using inference results if $\hat{R}>1.01$. More
information about $\hat{R}$ can be found in the Vehtari *et al* paper.

#### Effective sample size $ess$ diagnosticβ

MCMC samplers do not draw independent samples from the target distribution, which means
that our samples are correlated. In an ideal situation all samples would be independent,
but we do not have that luxury. We can, however, measure the number of *effectively
independent* samples we draw, which is called the effective sample size. You can read
more about how this value is calculated in the Vehtari *et al* paper,
briefly it is a measure that combines information from the $\hat{R}$ value with the
autocorrelation estimates within the chains. There are many ways to estimate effective
samples sizes, however, we will be using the method defined in the
Vehtari *et al* paper.

The rule of thumb for `ess_bulk`

is for this value to be greater than 100 per chain on
average. Since we ran four chains, we need `ess_bulk`

to be greater than 200 for each
parameter. The `ess_tail`

is an estimate for effectively independent samples considering
the more extreme values of the posterior. This is not the number of samples that landed
in the tails of the posterior. It is a measure of the number of effectively independent
samples if we sampled the tails of the posterior. The rule of thumb for this value is
also to be greater than 100 per chain on average.

For comparison, let's check out the model's performance using a couple of different inference methods.

## Single-site Metropolis-Hastingsβ

Let's retry this problem, using ancestral Metropolis-Hastings as the inference algorithm to compare performance.

Ancestral Metropolis-Hastings is a simple inference algorithm, which proposes child random variables conditional on values for the parent random variables. The most ancestral random variables are simply sampled from the prior distribution.

`queries_mh = [model.lambda_(), model.beta(), model.log_prob_test()]`

num_samples_mh = 2 * num_samples

samples_mh = bm.SingleSiteAncestralMetropolisHastings().infer(

queries=queries_mh,

observations=observations,

num_samples=num_samples_mh,

num_chains=num_chains,

)

`lambda_marginal_mh = (`

samples_mh[model.lambda_()].flatten(start_dim=0, end_dim=1).detach()

)

beta_marginal_mh = samples_mh[model.beta()].flatten(start_dim=0, end_dim=1).detach()

log_prob_test_results_mh = (

samples_mh[model.log_prob_test()].flatten(start_dim=0, end_dim=1).detach()

)

`# Required for visualizing in Colab.`

output_notebook(hide_banner=True)

beta_marginals_mh_plot = plots.marginal_2d(

x=beta_marginal_mh[:, :2][:, 0].flatten().numpy(),

y=beta_marginal_mh[:, :2][:, 1].flatten().numpy(),

x_label="Ξ²0",

y_label="Ξ²1",

title="Ξ²0 - Ξ²1 joint plot / Single Site Ancestral Metropolis Hastings",

true_x=float(true_beta[:2, :].flatten()[0]),

true_y=float(true_beta[:2, :].flatten()[1]),

)

show(beta_marginals_mh_plot)

`# Required for visualizing in Colab.`

output_notebook(hide_banner=True)

lambda_marginals_mh_plot = plots.marginal_2d(

x=lambda_marginal_mh[:, :2][:, 0].flatten().numpy(),

y=lambda_marginal_mh[:, :2][:, 1].flatten().numpy(),

x_label="Ξ»0",

y_label="Ξ»1",

title="Ξ»0 - Ξ»1 joint plot / Single Site Ancestral Metropolis Hastings",

)

show(lambda_marginals_mh_plot)

From all of the above plots, we see that ancestral Metropolis-Hastings does a
significantly worse job at recovering the true parameters! Not only do regions of
uncertainty tend to *exclude* the true values, the samples that are actually drawn are
very sparse. This means that the algorithm is achieving a very poor representation of
the posterior surface.

`# Required for visualizing in Colab.`

output_notebook(hide_banner=True)

y = log_prob_test_results_mh.numpy()[:1000]

x = np.arange(len(y))

cds = ColumnDataSource({"x": x.tolist(), "y": y.tolist()})

tips = [("Log probability", "@y{0.000}"), ("Iteration", "@x")]

log_prob_mh_plot = plots.line_plot(

plot_sources=[cds],

labels=[f"Using true params = {ground_truth_log_prob :.2f}"],

tooltips=[tips],

figure_kwargs={

"title": "Log probability",

"plot_width": 800,

"x_axis_label": "Iteration",

"y_axis_label": "Log probability",

},

plot_kwargs={"line_alpha": 0.5},

)

# Add a line showing the ground truth.

span = Span(

location=ground_truth_log_prob,

dimension="width",

line_color="black",

line_width=3,

)

log_prob_mh_plot.add_layout(span)

log_prob_mh_plot.legend.location = "bottom_left"

show(log_prob_mh_plot)

Surprisingly, the algorithm does eventually discover β and settle in on β parameters that seem to describe the test data well.

`summary_mh_df = az.summary(samples_mh.to_inference_data(), round_to=3)`

Markdown(summary_mh_df.to_markdown())

mean | sd | hdi_5.5% | hdi_94.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|

log_prob_test(SparseLogisticRegression with 5 covariates,) | -34.391 | 18.654 | -37.696 | -30.967 | 1.773 | 1.257 | 4.104 | 14.836 | 1.517 |

beta(SparseLogisticRegression with 5 covariates,)[0, 0] | -1.602 | 0.23 | -1.785 | -1.197 | 0.095 | 0.081 | 3.735 | 2.312 | 1.485 |

beta(SparseLogisticRegression with 5 covariates,)[1, 0] | 0.989 | 0.188 | 0.725 | 1.152 | 0.096 | 0.08 | 3.326 | 9.739 | 1.604 |

beta(SparseLogisticRegression with 5 covariates,)[2, 0] | 0 | 0.111 | -0.117 | 0.115 | 0.074 | 0.061 | 2.313 | 2.312 | 2.825 |

beta(SparseLogisticRegression with 5 covariates,)[3, 0] | 0.041 | 0.045 | 0.017 | 0.113 | 0.011 | 0.008 | 6.909 | 2.836 | 1.477 |

beta(SparseLogisticRegression with 5 covariates,)[4, 0] | -0.004 | 0.09 | -0.101 | 0.075 | 0.059 | 0.048 | 2.626 | 2.323 | 2.682 |

lambda_(SparseLogisticRegression with 5 covariates,)[0, 0] | 2.389 | 0.674 | 1.754 | 3.141 | 0.457 | 0.381 | 2.224 | 2.02 | 3.968 |

lambda_(SparseLogisticRegression with 5 covariates,)[1, 0] | 4.016 | 0.559 | 3.909 | 4.377 | 0.298 | 0.255 | 2.289 | 7.613 | 2.915 |

lambda_(SparseLogisticRegression with 5 covariates,)[2, 0] | 0.185 | 0.079 | 0.176 | 0.217 | 0.011 | 0.008 | 4.494 | 8.693 | 2.887 |

lambda_(SparseLogisticRegression with 5 covariates,)[3, 0] | 1.085 | 1.601 | 0.315 | 1.049 | 0.711 | 0.535 | 2.261 | 2.008 | 3.069 |

lambda_(SparseLogisticRegression with 5 covariates,)[4, 0] | 0.439 | 0.314 | 0.149 | 0.696 | 0.209 | 0.173 | 2.284 | 2.008 | 2.941 |

$\hat{R}$ values are extremely far from $1.0$. Further, the $ess$ and values for our sampled random variables are extremely small. As a result, these inference results would be unusable for any real application.

In comparison, NUTS seems to have developed a much more complete representation of the posterior surface. Now that we have validated the model on a synthetic dataset, we will look at a real-world example.

## German-numeric dataβ

In the above examples, we used a 2D dataset, simply to make visualizing the inferences more intuitive. Let's retry our sparse logistic regression on a real-world dataset, the German credit dataset.

The German credit dataset is a collection of 1,000 data points. Each data point represents a person who borrows from a bank. Each person is classified as either a good or a bad credit risk according to the bank. Each person contains 24 numeric covariates that may or may not be useful predictors for credit risk. Example covariates include age, sex, and savings. The response variable is either 1, indicating good credit, or 2, indicating bad credit. You can read more about this dataset, and in particular what each of the first 24 covariate columns represent in the documentation.

`input_data = np.genfromtxt(`

"https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data-numeric"

)

Let's scale covariates to range between -1 and 1, and add a constant factor. This yields a total of 25 covariates. We'll also translate the response variable to [0, 1].

`X_all = torch.from_numpy(`

np.hstack(

[

np.ones((input_data.shape[0], 1)),

sklearn.preprocessing.minmax_scale(

input_data[:, :-1],

feature_range=(-1, 1),

),

]

)

).float()

Y_all = torch.from_numpy(sklearn.preprocessing.minmax_scale(input_data[:, -1:])).float()

Now, we'll split into training and test data.

`X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(`

X_all,

Y_all,

)

The rest of the setup is the same as the tutorial with synthetic data.

First, we'll instantiate our model.

`model = SparseLogisticRegression(X_train, X_test, Y_test)`

Next, we'll bind our observations.

`observations = {model.y(): Y_train}`

Now, we're ready to fit the model.

`german_queries = [`

model.lambda_(),

model.beta(),

model.log_prob_test(),

model.y_probs(),

]

german_samples = bm.GlobalNoUTurnSampler(nnc_compile=nnc_compile).infer(

queries=german_queries,

observations=observations,

num_adaptive_samples=num_adaptive_samples,

num_samples=num_samples,

num_chains=num_chains,

)

`german_lambda_marginal = (`

german_samples[model.lambda_()].flatten(start_dim=0, end_dim=1).detach()

)

german_beta_marginal = (

german_samples[model.beta()].flatten(start_dim=0, end_dim=1).detach()

)

german_log_prob_test_results = (

german_samples[model.log_prob_test()].flatten(start_dim=0, end_dim=1).detach()

)

This model is too high-dimensional to plot the same way that we did with the previous examples. Let us look at the log density of the samples generated by MCMC. Sometimes the chains can get stuck making little progress, which can be seen from this plot. We observe that NUTS is able to draw samples quite efficiently from across the posterior distribution (compare with Metropolis-Hastings above).

`# Required for visualizing in Colab.`

output_notebook(hide_banner=True)

german_y = german_log_prob_test_results[:1000]

german_x = np.arange(len(y))

german_cds = ColumnDataSource({"x": german_x.tolist(), "y": german_y.tolist()})

german_tips = [("Log probability", "@y{0.000}"), ("Iteration", "@x")]

german_log_prob_plot = plots.line_plot(

plot_sources=[german_cds],

tooltips=[german_tips],

figure_kwargs={

"title": "Log probability for German credit risk data",

"plot_width": 800,

"x_axis_label": "Iteration",

"y_axis_label": "Log probability",

},

plot_kwargs={"line_alpha": 0.5},

)

show(german_log_prob_plot)

Let us also look at the diagnostics to ensure that the MCMC chains have converged to the posterior distribution. We note that $\hat{R}$ are all close to 1 and the $ess$ values look reasonable.

`filtered_samples = {`

k: v for k, v in german_samples.items() if k in {model.beta(), model.lambda_()}

}

az_data = MonteCarloSamples(filtered_samples).to_inference_data()

german_summary_df = az.summary(az_data, round_to=3)

Markdown(german_summary_df.to_markdown())

mean | sd | hdi_5.5% | hdi_94.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|

beta(SparseLogisticRegression with 25 covariates,)[0, 0] | -0.218 | 0.356 | -0.756 | 0.23 | 0.012 | 0.008 | 953.99 | 1267.42 | 1.003 |

beta(SparseLogisticRegression with 25 covariates,)[1, 0] | -0.782 | 0.12 | -0.965 | -0.588 | 0.003 | 0.002 | 1616.33 | 1357.34 | 1.001 |

beta(SparseLogisticRegression with 25 covariates,)[2, 0] | 1.347 | 0.299 | 0.867 | 1.819 | 0.007 | 0.005 | 1915.53 | 1591.96 | 1.002 |

beta(SparseLogisticRegression with 25 covariates,)[3, 0] | -0.475 | 0.205 | -0.799 | -0.156 | 0.006 | 0.005 | 984.341 | 782.178 | 1.001 |

beta(SparseLogisticRegression with 25 covariates,)[4, 0] | 0.057 | 0.189 | -0.189 | 0.386 | 0.004 | 0.003 | 1863.57 | 1583.56 | 1.001 |

beta(SparseLogisticRegression with 25 covariates,)[5, 0] | -0.291 | 0.144 | -0.537 | -0.07 | 0.004 | 0.003 | 1077.02 | 591.582 | 1.003 |

beta(SparseLogisticRegression with 25 covariates,)[6, 0] | -0.348 | 0.179 | -0.587 | 0.006 | 0.007 | 0.005 | 785.977 | 456.259 | 1 |

beta(SparseLogisticRegression with 25 covariates,)[7, 0] | -0.123 | 0.158 | -0.358 | 0.106 | 0.004 | 0.003 | 1294.97 | 1653.24 | 1.001 |

beta(SparseLogisticRegression with 25 covariates,)[8, 0] | -0.011 | 0.083 | -0.155 | 0.108 | 0.002 | 0.001 | 1682.91 | 1639.8 | 1 |

beta(SparseLogisticRegression with 25 covariates,)[9, 0] | 0.183 | 0.154 | -0.036 | 0.405 | 0.005 | 0.003 | 894.553 | 1343.35 | 1.001 |

beta(SparseLogisticRegression with 25 covariates,)[10, 0] | -0.068 | 0.147 | -0.348 | 0.106 | 0.004 | 0.003 | 1349.61 | 1360.98 | 1 |

beta(SparseLogisticRegression with 25 covariates,)[11, 0] | -0.214 | 0.128 | -0.391 | 0.011 | 0.004 | 0.003 | 1046.93 | 507.06 | 1.005 |

beta(SparseLogisticRegression with 25 covariates,)[12, 0] | 0.111 | 0.174 | -0.114 | 0.391 | 0.004 | 0.003 | 1545.6 | 1841.1 | 1.001 |

beta(SparseLogisticRegression with 25 covariates,)[13, 0] | 0.013 | 0.079 | -0.107 | 0.145 | 0.002 | 0.001 | 1694.56 | 1724.96 | 1.002 |

beta(SparseLogisticRegression with 25 covariates,)[14, 0] | -0.088 | 0.089 | -0.233 | 0.034 | 0.003 | 0.002 | 1096.82 | 1181.23 | 1.002 |

beta(SparseLogisticRegression with 25 covariates,)[15, 0] | -0.275 | 0.294 | -0.727 | 0.073 | 0.009 | 0.007 | 832.849 | 1529.49 | 1.007 |

beta(SparseLogisticRegression with 25 covariates,)[16, 0] | 0.243 | 0.123 | 0.047 | 0.446 | 0.005 | 0.003 | 686.886 | 199.985 | 1.003 |

beta(SparseLogisticRegression with 25 covariates,)[17, 0] | -0.31 | 0.202 | -0.594 | 0.018 | 0.007 | 0.005 | 717.282 | 510.955 | 1.002 |

beta(SparseLogisticRegression with 25 covariates,)[18, 0] | 0.221 | 0.212 | -0.043 | 0.56 | 0.007 | 0.005 | 629.229 | 1192.16 | 1.009 |

beta(SparseLogisticRegression with 25 covariates,)[19, 0] | 0.263 | 0.285 | -0.104 | 0.689 | 0.009 | 0.006 | 781.196 | 1412.12 | 1.008 |

beta(SparseLogisticRegression with 25 covariates,)[20, 0] | 0.151 | 0.138 | -0.056 | 0.359 | 0.004 | 0.003 | 1081.01 | 1508.05 | 1 |

beta(SparseLogisticRegression with 25 covariates,)[21, 0] | -0.062 | 0.108 | -0.253 | 0.082 | 0.003 | 0.002 | 929.626 | 1422.19 | 1.002 |

beta(SparseLogisticRegression with 25 covariates,)[22, 0] | 0.015 | 0.14 | -0.192 | 0.249 | 0.004 | 0.003 | 1312.97 | 809.934 | 1.003 |

beta(SparseLogisticRegression with 25 covariates,)[23, 0] | 0.021 | 0.086 | -0.117 | 0.153 | 0.002 | 0.001 | 1606.22 | 1499.37 | 1.001 |

beta(SparseLogisticRegression with 25 covariates,)[24, 0] | -0.016 | 0.072 | -0.131 | 0.1 | 0.002 | 0.002 | 886.206 | 968.057 | 1.001 |

lambda_(SparseLogisticRegression with 25 covariates,)[0, 0] | 2.101 | 2.943 | 0 | 4.795 | 0.098 | 0.07 | 1039.31 | 1018.09 | 1.002 |

lambda_(SparseLogisticRegression with 25 covariates,)[1, 0] | 5.307 | 4.308 | 1.269 | 9.595 | 0.158 | 0.112 | 726.059 | 940.254 | 1 |

lambda_(SparseLogisticRegression with 25 covariates,)[2, 0] | 8.785 | 7.265 | 1.756 | 16.159 | 0.258 | 0.183 | 627.759 | 1031.12 | 1.004 |

lambda_(SparseLogisticRegression with 25 covariates,)[3, 0] | 3.632 | 4.056 | 0.276 | 7.088 | 0.131 | 0.093 | 695.994 | 859.273 | 1.006 |

lambda_(SparseLogisticRegression with 25 covariates,)[4, 0] | 1.294 | 1.845 | 0.001 | 2.795 | 0.044 | 0.033 | 1544.61 | 963.183 | 1.003 |

lambda_(SparseLogisticRegression with 25 covariates,)[5, 0] | 2.592 | 3.226 | 0.005 | 4.682 | 0.12 | 0.085 | 768.199 | 695.669 | 1 |

lambda_(SparseLogisticRegression with 25 covariates,)[6, 0] | 2.836 | 3.357 | 0.003 | 5.179 | 0.117 | 0.083 | 603.47 | 590.562 | 0.999 |

lambda_(SparseLogisticRegression with 25 covariates,)[7, 0] | 1.415 | 2.102 | 0.005 | 2.845 | 0.06 | 0.046 | 912.985 | 1057.83 | 1.006 |

lambda_(SparseLogisticRegression with 25 covariates,)[8, 0] | 0.951 | 1.472 | 0.002 | 1.942 | 0.072 | 0.051 | 842.951 | 239.897 | 1.003 |

lambda_(SparseLogisticRegression with 25 covariates,)[9, 0] | 1.732 | 2.288 | 0.001 | 3.345 | 0.07 | 0.049 | 692.473 | 735.36 | 1.003 |

lambda_(SparseLogisticRegression with 25 covariates,)[10, 0] | 1.321 | 2.268 | 0.001 | 2.605 | 0.066 | 0.047 | 1168.43 | 1075.63 | 1.001 |

lambda_(SparseLogisticRegression with 25 covariates,)[11, 0] | 1.797 | 1.908 | 0.015 | 3.392 | 0.06 | 0.042 | 667.351 | 630.821 | 1.002 |

lambda_(SparseLogisticRegression with 25 covariates,)[12, 0] | 1.417 | 2.074 | 0.003 | 2.941 | 0.059 | 0.045 | 1363.8 | 1021.44 | 1.005 |

lambda_(SparseLogisticRegression with 25 covariates,)[13, 0] | 0.91 | 1.238 | 0 | 1.85 | 0.033 | 0.023 | 1213.41 | 759.393 | 1.005 |

lambda_(SparseLogisticRegression with 25 covariates,)[14, 0] | 1.161 | 1.607 | 0.001 | 2.331 | 0.047 | 0.033 | 912.611 | 700.183 | 1.001 |

lambda_(SparseLogisticRegression with 25 covariates,)[15, 0] | 2.245 | 2.497 | 0.001 | 5.206 | 0.072 | 0.051 | 844.69 | 724.233 | 1.007 |

lambda_(SparseLogisticRegression with 25 covariates,)[16, 0] | 2.209 | 2.714 | 0.023 | 4.023 | 0.079 | 0.056 | 952.933 | 671.074 | 1.002 |

lambda_(SparseLogisticRegression with 25 covariates,)[17, 0] | 2.508 | 2.787 | 0.003 | 4.857 | 0.103 | 0.073 | 544.242 | 342.13 | 1.002 |

lambda_(SparseLogisticRegression with 25 covariates,)[18, 0] | 2.11 | 2.697 | 0.007 | 4.187 | 0.079 | 0.056 | 653.669 | 699.592 | 1.008 |

lambda_(SparseLogisticRegression with 25 covariates,)[19, 0] | 2.559 | 4.116 | 0.001 | 4.916 | 0.418 | 0.383 | 279.064 | 212.112 | 1.014 |

lambda_(SparseLogisticRegression with 25 covariates,)[20, 0] | 1.568 | 1.894 | 0.002 | 3.097 | 0.06 | 0.042 | 661.574 | 711.995 | 1.002 |

lambda_(SparseLogisticRegression with 25 covariates,)[21, 0] | 1.053 | 1.329 | 0.002 | 2.09 | 0.036 | 0.025 | 1098.9 | 798.229 | 1.002 |

lambda_(SparseLogisticRegression with 25 covariates,)[22, 0] | 1.18 | 1.706 | 0.003 | 2.394 | 0.056 | 0.043 | 1126.2 | 1024.51 | 1.003 |

lambda_(SparseLogisticRegression with 25 covariates,)[23, 0] | 0.917 | 1.438 | 0 | 1.87 | 0.042 | 0.032 | 958.948 | 792.467 | 1.003 |

lambda_(SparseLogisticRegression with 25 covariates,)[24, 0] | 0.86 | 1.369 | 0.002 | 1.75 | 0.044 | 0.031 | 1157.46 | 1089.54 | 1.002 |

To check for sparsity, let us plot the histogram of the marginals for a selection of coefficients ($\beta$) in the model. We note that many of the parameters have their posterior probability mass concentrated at 0, resulting in a sparse model, as we desired.

`# Required for visualizing in Colab.`

output_notebook(hide_banner=True)

german_betas = german_samples[model.beta()].squeeze(-1).reshape(-1, 25)

german_plot_sources = []

german_labels = []

german_tooltips = []

for i in range(25):

b = german_betas[:, i].numpy()

support, density = az.stats.density_utils.kde(b)

density /= density.max()

cds = ColumnDataSource({"x": support.tolist(), "y": density.tolist()})

label = f"Ξ²{i}"

tips = [(f"{label}", "@support{0.000}")]

german_plot_sources.append(cds)

german_labels.append(label)

german_tooltips.append(tips)

german_betas_density_plot = plots.line_plot(

plot_sources=german_plot_sources,

labels=german_labels,

tooltips=german_tooltips,

plot_kwargs={"line_width": 1, "line_alpha": 0.4},

figure_kwargs={

"title": "Histogram of some sample model coefficients",

"x_axis_label": "Ξ²",

"plot_height": 700,

},

)

german_betas_density_plot.yaxis.visible = False

german_betas_density_plot.legend.click_policy = "mute"

show(german_betas_density_plot)

Finally, let us evaluate the mean predictions from the model using the ROC (Receiver Operating Characteristic) plot. The area under the curve (AUC) is 0.83 which shows that the model is able to achieve a good classification accuracy. This is merely evaluating the mean forecast from the model, but as we saw earlier, we have a full distribution of the probability scores corresponding to each test data.

`y_true = Y_test.squeeze(-1).numpy()`

y_pred_mean = german_samples[model.y_probs()].squeeze(-1).mean([0, 1]).numpy()

false_positive_rate, true_positive_rate, _ = sklearn.metrics._ranking.roc_curve(

y_true,

y_pred_mean,

)

roc_score = sklearn.metrics.roc_auc_score(y_true, y_pred_mean)

roc_cds = ColumnDataSource(

{

"x": false_positive_rate.tolist(),

"y": true_positive_rate.tolist(),

}

)

roc_tips = [("TPF", "@y{0.000}"), ("FPR", "@x{0.000}")]

roc_plot = plots.line_plot(

plot_sources=[roc_cds],

tooltips=[roc_tips],

labels=[f"Classifier (AUC = {roc_score:.2f})"],

figure_kwargs={

"title": "Receiver Operator Characteristic",

"x_axis_label": "False positive rate (positive label: 1)",

"y_axis_label": "True positive rate (positive label: 1)",

},

plot_kwargs={

"line_width": 3,

"line_alpha": 0.7,

"hover_line_color": "orange",

"hover_line_alpha": 1,

},

)

roc_plot.legend.location = "bottom_right"

show(roc_plot)

# References

- Carvalho CM, Polson NG, Scott JG.
**Handling sparsity via the horseshoe**. In: van Dyk D, Welling M, editors.*Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics*; 2009. pp. 73β80. Available from http://proceedings.mlr.press/v5/carvalho09a/carvalho09a.pdf. - Piironen J, Vehtari A.
**Sparsity information and regularization in the horseshoe and other shrinkage priors**.*Electronic Journal of Statistics*. 2017;11(2) 5018β5051. doi: 10.1214/17-EJS1337SI.