Tutorial: Sparse Logistic Regression
This tutorial demonstrates modeling and running inference on a sparse logistic regression model in Bean Machine. It showcases Bean Machine's inference techniques and applies the model to a public dataset to evaluate performance. It also introduces the @bm.functional decorator, which deterministically transforms random variables and is convenient for post-processing.
Problem
Logistic regression is a commonly used statistical method that allows us to predict a binary output from a set of independent variables. Sparse logistic regression is a variant that embeds feature selection in classification by adding overall and per-dimension scale factors. It is particularly applicable to high-dimensional data, such as classifying credit scores.
Sparse logistic regression is a hierarchical model with a sparse prior. We will use a horseshoe prior (Carvalho, Polson, and Scott), which induces sparsity by combining a global shrinkage scale that pushes the posterior mass of most model parameters toward zero with local scales that allow some of them to escape shrinkage.
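Concretely, for each coefficient $\beta_d$ the horseshoe prior of Carvalho, Polson, and Scott has the hierarchical form

$$\beta_d \mid \lambda_d, \tau \sim \mathcal{N}(0, \lambda_d^2 \tau^2), \qquad \lambda_d \sim \text{HalfCauchy}(0, 1),$$

where $\tau$ is the global shrinkage scale shared by all coefficients and $\lambda_d$ is the local scale for dimension $d$. This is the form we will implement in the Model section below.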
Prerequisites
We will be using the following packages within this tutorial.
- arviz and bokeh for interactive visualizations;
- pandas, numpy, and scikit-learn for data manipulation.
Please import the following packages so that the rest of the code in this tutorial works.
# Install Bean Machine in Colab if using Colab.
import sys
if "google.colab" in sys.modules and "beanmachine" not in sys.modules:
    !pip install beanmachine
import os
import warnings
import arviz as az
import beanmachine.ppl as bm
import numpy as np
import sklearn
import sklearn.model_selection
import sklearn.preprocessing
import torch
import torch.distributions as dist
from beanmachine.ppl.inference.monte_carlo_samples import MonteCarloSamples
from beanmachine.tutorials.utils import plots
from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource, Span
from bokeh.plotting import show
from IPython.display import Markdown
from sklearn.metrics import RocCurveDisplay
The next cell includes convenient configuration settings to improve the notebook presentation, as well as a manual seed for reproducibility.
# Eliminate excess UserWarnings from ArviZ.
warnings.filterwarnings("ignore")
# Plotting settings
az.rcParams["plot.backend"] = "bokeh"
az.rcParams["stats.hdi_prob"] = 0.89
# Manual seed
bm.seed(17)
# Other settings for the notebook
smoke_test = "SANDCASTLE_NEXUS" in os.environ or "CI" in os.environ
Model
We have the following definitions.
- $N$: Size of the dataset.
- $D$: Number of features of the dataset.
- $\tau$: Global shrinkage of the model (input from the user).
- $\beta_d$: Coefficient corresponding to dimension $d$.
- $\lambda_d$: Local shrinkage for the coefficient corresponding to dimension $d$.
The model is defined mathematically as follows:

$$\lambda_d \sim \text{HalfCauchy}(1)$$

$$\beta_d \sim \mathcal{N}(0, \tau^2 \lambda_d^2)$$

$$y_i \sim \text{Bernoulli}\left(\sigma(x_i^\top \beta)\right)$$

In the implementation below, $\beta_d$ is written in its non-centered form, $\beta_d = \tau \lambda_d \epsilon_d$ with $\epsilon_d \sim \mathcal{N}(0, 1)$, which is equivalent and typically easier for MCMC to explore.
A few notes:
- $\sigma(t) = 1 / (1 + e^{-t})$ is the logistic function. Its purpose is to translate an unconstrained score $t$ predicted by the model into a probability in $(0, 1)$.
- $\tau$ can be an input from the user which depends on the expected number of non-zero coefficients in the model. Alternatively, we can have a full Bayesian treatment of $\tau$ as suggested by Piironen and Vehtari. For simplicity, $\tau$ is simply a constant in this tutorial, but it can be replaced with a HalfNormal prior with the same scale as the constant $\tau_0$ (a minimal sketch of this alternative appears after the model code below). This parameter is responsible for global shrinkage, whereas $\lambda_d$ tends to shrink the influence of $\beta_d$; but since the Cauchy distribution has fat tails, it can also help coefficients escape shrinkage.
We can implement this model in Bean Machine by defining random variable objects with the
@bm.random_variable
decorator. These functions behave differently than ordinary Python
functions.
@bm.random_variable functions:

- They must return PyTorch Distribution objects.
- Though they return distributions, callees actually receive samples from the distribution. The machinery for obtaining samples from distributions is handled internally by Bean Machine.
- Inference runs the model through many iterations. During a particular inference iteration, a distinct random variable will correspond to exactly one sampled value: calls to the same random variable function with the same arguments will receive the same sampled value within one inference iteration. This makes it easy for multiple components of your model to refer to the same logical random variable.
- Consequently, to define distinct random variables that correspond to different sampled values during a particular inference iteration, an effective practice is to add a dummy "indexing" parameter to the function. Distinct random variables can be referred to with different values for this index.
- Please see the documentation for more information about this decorator.

@bm.functional:

- This decorator is used to deterministically transform the results of one or more random variables. A toy example illustrating both decorators follows.
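To make these semantics concrete, here is a toy sketch of both decorators for a hypothetical coin-flip model (the names weight, flip, and log_odds are illustrative and not part of this tutorial's regression):

import beanmachine.ppl as bm
import torch
import torch.distributions as dist

@bm.random_variable
def weight():
    # A prior on the probability of heads. The function returns a distribution,
    # but callers of weight() receive a sample from it during inference.
    return dist.Beta(2.0, 2.0)

@bm.random_variable
def flip(i):
    # The dummy index i makes flip(0), flip(1), ... distinct random variables.
    return dist.Bernoulli(weight())

@bm.functional
def log_odds():
    # A deterministic transformation of weight(); no distribution is returned.
    return torch.log(weight() / (1 - weight()))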
To implement the sparse logistic regression model in Bean Machine, we provide @bm.random_variable definitions for $\lambda$, $\epsilon$, and $y$. The value of $\tau$ is estimated from the number of expected non-zero coefficients and is an input from the user.
This is all you have to do to define the model. However, we'll also make use of the @bm.functional decorator to make it easy to store the predictions on the test data (rather than the binary outcome, we'll store the probability score of being labeled 1) and to compute the log likelihood. This decorator has the same semantics as @bm.random_variable, except that it does not return a distribution. Instead, it returns a value computed deterministically from other random variables. It can be used to conveniently compute values that would typically be computed in a post-processing pass. Here, we use it to compute the log probability of the test data using inferences made on the training data.
class SparseLogisticRegression:
    def __init__(self, X_train, X_test, Y_test, nonzero_frac=0.3):
        self.X_train = X_train
        self.X_test = X_test
        self.Y_test = Y_test
        # Global scale tau_0, see: Piironen and Vehtari
        self.tau = (
            2 * nonzero_frac / ((1 - nonzero_frac) * np.sqrt(self.X_train.shape[1]))
        )

    @bm.random_variable
    def lambda_(self):
        # Local (per-dimension) shrinkage scales.
        return dist.HalfCauchy(1.0).expand([self.X_train.shape[1], 1])

    @bm.random_variable
    def eps(self):
        # Auxiliary standard normals for the non-centered parameterization of beta.
        return dist.Normal(0, 1).expand([self.X_train.shape[1], 1])

    @bm.random_variable
    def y(self):
        # Likelihood of the training labels.
        return dist.Bernoulli(logits=self.X_train @ self.beta())

    @bm.functional
    def beta(self):
        # Regression coefficients: global scale * local scale * standard normal.
        return self.tau * self.lambda_() * self.eps()

    @bm.functional
    def log_prob_test(self):
        # Log likelihood of the held-out test labels under the current beta.
        return (
            dist.Bernoulli(logits=self.X_test @ self.beta()).log_prob(self.Y_test).sum()
        )

    @bm.functional
    def y_probs(self):
        # Predicted probability of label 1 for each test point.
        return torch.sigmoid(self.X_test @ self.beta())

    def __repr__(self):
        return f"SparseLogisticRegression with {self.X_train.shape[1]} covariates"
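As noted above, $\tau$ is kept constant in this model. For the full Bayesian treatment suggested by Piironen and Vehtari, one possible sketch (hypothetical, and not used in the rest of this tutorial) is to add a random variable for the global scale inside SparseLogisticRegression and have beta consume it; tau_rv is an illustrative name:

    @bm.random_variable
    def tau_rv(self):
        # Hypothetical global-scale prior: a HalfNormal whose scale is the
        # constant computed in __init__ (stored as self.tau).
        return dist.HalfNormal(self.tau)

    @bm.functional
    def beta(self):
        # beta would then use the sampled global scale instead of the constant.
        return self.tau_rv() * self.lambda_() * self.eps()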
Data
With the model defined, we need to collect some observed data in order to learn about values of interest in our model. For this tutorial, we'll use 1,000 five-dimensional data points where the response only depends on the first two, so that we can visualize what's going on.
We will generate a dataset in which each item has a true label of either 0 or 1. For demonstrative purposes, we will use a synthetically generated dataset of observed values. In practice, you would gather real data and classify results by hand, for example using human labelers.
X = dist.Normal(0, 5).expand([1000, 5]).sample()
tau0 = 2 * 0.3 / (0.7 * np.sqrt(1000))
lambda0 = torch.tensor([-70, 40, 0, 0, 0]).unsqueeze(-1)
true_beta = tau0 * lambda0
Y = dist.Bernoulli(logits=X @ true_beta).sample()
print(f"τ0: {tau0}")
print(f" β: {true_beta}")
Let's take a moment to visualize our dataset.
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
red_x = X[:, :2][Y[:, 0] == 0, 0]
red_y = X[:, :2][Y[:, 0] == 0, 1]
red_label = ["red"] * len(red_x)
red_cds = ColumnDataSource(
{"x": red_x.tolist(), "y": red_y.tolist(), "label": red_label}
)
red_tips = [("y", "@y{0.000}"), ("x", "@x{0.000}"), ("Label", "@label")]
green_x = X[:, :2][Y[:, 0] == 1, 0]
green_y = X[:, :2][Y[:, 0] == 1, 1]
green_label = ["green"] * len(green_x)
green_cds = ColumnDataSource(
{"x": green_x.tolist(), "y": green_y.tolist(), "label": green_label}
)
green_tips = [("y", "@y{0.000}"), ("x", "@x{0.000}"), ("Label", "@label")]
synthetic_data_plot = plots.scatter_plot(
plot_sources=[red_cds, green_cds],
tooltips=[red_tips, green_tips],
figure_kwargs={
"title": "Synthetic data",
"x_axis_label": "x",
"y_axis_label": "y",
},
legend_items=["Label 0", "Label 1"],
plot_kwargs={"fill_color": "label", "fill_alpha": 0.5},
)
show(synthetic_data_plot)
Now, we will split the dataset into a training set and a test set.
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y)
Now that we've defined our data, we can instantiate the model.
model = SparseLogisticRegression(X_train, X_test, Y_test)
Our inference algorithms expect observations in the form of a dictionary. This dictionary should consist of @bm.random_variable invocations as keys and tensor data as values. Since the model is already instantiated, we can bind the training labels to its y random variable directly.
observations = {model.y(): Y_train}
Inference
Inference is the process of combining a model with data to obtain insights, in the form of probability distributions over values of interest. Bean Machine offers a powerful and general inference framework to enable fitting arbitrary models to data.
Since this model is comprised entirely of differentiable random variables, we'll make use of the No-U-Turn Sampler (NUTS) (Hoffman & Gelman, 2011).
Running inference requires a few arguments:
Name | Usage |
---|---|
queries | List of @bm.random_variable targets to fit posterior distributions for. |
observations | A dictionary of observations, as built above. |
num_samples | Number of Monte Carlo samples to approximate the posterior distributions for the variables in queries . |
num_chains | Number of separate inference runs to use. Multiple chains can help verify that inference ran correctly. |
Let's run inference:
num_samples = 2 if smoke_test else 1000
num_adaptive_samples = 0 if smoke_test else num_samples // 2
num_chains = 1 if smoke_test else 2
# Experimental backend using the PyTorch NNC compiler
nnc_compile = "SANDCASTLE_NEXUS" not in os.environ
queries = [model.lambda_(), model.beta(), model.log_prob_test(), model.y_probs()]
samples = bm.GlobalNoUTurnSampler(nnc_compile=nnc_compile).infer(
queries=queries,
observations=observations,
num_adaptive_samples=num_adaptive_samples,
num_samples=num_samples,
num_chains=num_chains,
)
Analysis
samples
now contains our inference results.
First, we'll just print previews of the results. This should give a sense of how to work
with the samples
object, and also an idea of the shapes of the inferred values.
lambda_marginal = samples[model.lambda_()].flatten(start_dim=0, end_dim=1).detach()
beta_marginal = samples[model.beta()].flatten(start_dim=0, end_dim=1).detach()
log_prob_test_results = samples[model.log_prob_test()][0].detach()
print(
f"lambda_marginal:\n{lambda_marginal}\n\n"
f"beta_marginal:\n{beta_marginal}\n\n"
f"log_prob_test_results:\n{log_prob_test_results[:20]}"
)
Next, let's visualize the inferred random variables.
Below we plot the joint distribution of the first two components of $\beta$ and of $\lambda$ (as a reminder, the remaining components are noise that the outcome variable doesn't depend on).
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
beta_marginals_plot = plots.marginal_2d(
x=beta_marginal[:, :2][:, 0].flatten().numpy(),
y=beta_marginal[:, :2][:, 1].flatten().numpy(),
    x_label="β0",
    y_label="β1",
    title="β0 - β1 joint plot",
true_x=float(true_beta[:2, :].flatten()[0]),
true_y=float(true_beta[:2, :].flatten()[1]),
)
show(beta_marginals_plot)
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
lambda_marginals_plot = plots.marginal_2d(
x=lambda_marginal[:, :2][:, 0].flatten().numpy(),
y=lambda_marginal[:, :2][:, 1].flatten().numpy(),
    x_label="λ0",
    y_label="λ1",
    title="λ0 - λ1 joint plot",
)
show(lambda_marginals_plot)
The marginals and the joint for $\beta$ and $\lambda$ look reasonable. We note that the posterior mean for $\beta$ closely matches the true values used to generate the data. Note that the $\lambda$ values can be fairly large if needed, since we place a HalfCauchy prior over them. Even in the presence of strong global shrinkage, this lets local parameters take on larger values when needed.
Now let us look at the histogram for the remaining values of $\beta$, i.e. the coefficients for variables which have no effect on the outcome.
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
betas = samples[model.beta()].squeeze(-1).reshape(-1, 5)
plot_sources = []
labels = []
tooltips = []
for i in range(5):
    b = betas[:, i].numpy()
    support, density = az.stats.density_utils.kde(b)
    density /= density.max()
    cds = ColumnDataSource({"x": support.tolist(), "y": density.tolist()})
    label = f"β{i}"
    # The hover tooltip reads from the "x" column, which holds the KDE support.
    tips = [(label, "@x{0.000}")]
    plot_sources.append(cds)
    labels.append(label)
    tooltips.append(tips)
betas_density_plot = plots.line_plot(
plot_sources=plot_sources,
labels=labels,
tooltips=tooltips,
plot_kwargs={"line_width": 2},
    figure_kwargs={
        "title": "Histogram of some sample model coefficients",
        "x_axis_label": "β",
    },
)
betas_density_plot.yaxis.visible = False
betas_density_plot.legend.click_policy = "mute"
show(betas_density_plot)
We note that, for all the variables the outcome doesn't depend on, the majority of the posterior mass is concentrated at 0; only the first two coefficients are meaningfully non-zero. This validates how the horseshoe prior helps induce sparsity.
Lastly, let's plot the first thousand log probabilities on test data that we generated per-iteration during inference. We'll also overlay the log probability of the test dataset under the ground-truth parameters.
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
ground_truth_log_prob = float(
dist.Bernoulli(logits=X_test @ true_beta).log_prob(Y_test).sum()
)
y = log_prob_test_results.numpy()
x = np.arange(len(y))
cds = ColumnDataSource({"x": x.tolist(), "y": y.tolist()})
tips = [("Log probability", "@y{0.000}"), ("Iteration", "@x")]
log_prob_plot = plots.line_plot(
plot_sources=[cds],
labels=[f"Using true params = {ground_truth_log_prob :.2f}"],
tooltips=[tips],
figure_kwargs={
"title": "Log probability",
"plot_width": 800,
"x_axis_label": "Iteration",
"y_axis_label": "Log probability",
},
plot_kwargs={"line_alpha": 0.5},
)
# Add a line showing the ground truth.
span = Span(
location=ground_truth_log_prob,
dimension="width",
line_color="black",
line_width=3,
)
log_prob_plot.add_layout(span)
log_prob_plot.legend.location = "bottom_left"
show(log_prob_plot)
Let us plot the predictions from the model on the test dataset as a sanity check. For simplicity, we will compute the mean of the probability score (computed as y_probs in the model) for each test data point and assign a label of 0 if the score is less than 0.5, and a label of 1 otherwise.
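As a quick numerical sanity check before the plot (a small sketch, not part of the original analysis), the same 0.5 cutoff gives an overall accuracy on the test set:

# Average the posterior probability scores over chains and samples.
mean_probs = samples[model.y_probs()].mean([0, 1])
# Threshold at 0.5 to obtain hard labels and compare against the held-out labels.
predicted_labels = (mean_probs >= 0.5).float()
accuracy = (predicted_labels == Y_test).float().mean()
print(f"Test accuracy at a 0.5 cutoff: {float(accuracy):.3f}")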
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
y_probs = samples[model.y_probs()].mean([0, 1])
predicted_red_x = X_test[y_probs[:, 0] < 0.5, 0]
predicted_red_y = X_test[y_probs[:, 0] < 0.5, 1]
predicted_red_label = ["red"] * len(predicted_red_x)
predicted_red_cds = ColumnDataSource(
{
"x": predicted_red_x.tolist(),
"y": predicted_red_y.tolist(),
"label": predicted_red_label,
}
)
predicted_red_tips = [("y", "@y{0.000}"), ("x", "@x{0.000}"), ("Label", "@label")]
predicted_green_x = X_test[y_probs[:, 0] >= 0.5, 0]
predicted_green_y = X_test[y_probs[:, 0] >= 0.5, 1]
predicted_green_label = ["green"] * len(predicted_green_x)
predicted_green_cds = ColumnDataSource(
{
"x": predicted_green_x.tolist(),
"y": predicted_green_y.tolist(),
"label": predicted_green_label,
}
)
predicted_green_tips = [("y", "@y{0.000}"), ("x", "@x{0.000}"), ("Label", "@label")]
predicted_data_plot = plots.scatter_plot(
plot_sources=[predicted_red_cds, predicted_green_cds],
tooltips=[predicted_red_tips, predicted_green_tips],
figure_kwargs={
"title": "Synthetic data",
"x_axis_label": "x",
"y_axis_label": "y",
},
legend_items=["Predicted 0", "Predicted 1"],
plot_kwargs={"fill_color": "label", "fill_alpha": 0.5},
)
# Select one point and annotate it.
predicted_data_plot.circle(
x=[float(X_test[11, 0])],
y=[float(X_test[11, 1])],
size=20,
fill_color=None,
line_color="steelblue",
line_width=2,
line_alpha=1,
)
predicted_data_plot.legend.location = "bottom_left"
show(predicted_data_plot)
As we can see, the model seems to have recovered the same decision boundary, and MCMC has converged to parameters that correctly predicted the test dataset. This was just a sanity check. Since we have built a probabilistic model, the probability that each data point has label 1 is itself a distribution! Let us see this for one of the test data points (marked by a blue circle above).
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
data_point11_plot = plots.histogram_plot(
data=samples[model.y_probs()][..., 11, 0].reshape(-1).numpy(),
figure_kwargs={"title": "Probability of test datapoint 11 having Label 1"},
)
show(data_point11_plot)
The MonteCarloSamples
from Bean Machine's inference can be converted to arviz's
internal InferenceData
format so that we can make use of its inference statistics and
plotting utilities for diagnostics.
filtered_samples = {
k: v for k, v in samples.items() if k in {model.beta(), model.lambda_()}
}
az_data = MonteCarloSamples(filtered_samples).to_inference_data()
summary_df = az.summary(az_data, round_to=3)
Markdown(summary_df.to_markdown())
Parameter | mean | sd | hdi_5.5% | hdi_94.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
---|---|---|---|---|---|---|---|---|---|
beta(SparseLogisticRegression with 5 covariates,)[0, 0] | -1.834 | 0.195 | -2.112 | -1.506 | 0.004 | 0.003 | 1928.09 | 1492.71 | 0.999 |
beta(SparseLogisticRegression with 5 covariates,)[1, 0] | 1.124 | 0.121 | 0.939 | 1.314 | 0.003 | 0.002 | 1942.43 | 1465.61 | 1 |
beta(SparseLogisticRegression with 5 covariates,)[2, 0] | 0.02 | 0.03 | -0.027 | 0.068 | 0.001 | 0.001 | 1384.37 | 1871.51 | 1.002 |
beta(SparseLogisticRegression with 5 covariates,)[3, 0] | 0.035 | 0.033 | -0.01 | 0.09 | 0.001 | 0.001 | 826.091 | 1277.33 | 1.001 |
beta(SparseLogisticRegression with 5 covariates,)[4, 0] | -0.055 | 0.037 | -0.114 | 0.004 | 0.001 | 0.001 | 1291.37 | 1073.18 | 1 |
lambda_(SparseLogisticRegression with 5 covariates,)[0, 0] | 5.203 | 3.894 | 1.502 | 9.201 | 0.149 | 0.105 | 750.873 | 843.618 | 1.002 |
lambda_(SparseLogisticRegression with 5 covariates,)[1, 0] | 3.453 | 2.629 | 0.856 | 6.009 | 0.119 | 0.084 | 634.313 | 619.952 | 1.004 |
lambda_(SparseLogisticRegression with 5 covariates,)[2, 0] | 0.499 | 0.693 | 0.001 | 1.137 | 0.024 | 0.017 | 512.808 | 721.796 | 1.003 |
lambda_(SparseLogisticRegression with 5 covariates,)[3, 0] | 0.63 | 1.36 | 0 | 1.265 | 0.048 | 0.034 | 413.859 | 566.379 | 1.002 |
lambda_(SparseLogisticRegression with 5 covariates,)[4, 0] | 0.641 | 0.859 | 0 | 1.297 | 0.027 | 0.019 | 595.7 | 732.401 | 1.002 |
The summary above includes useful statistics about each marginal distribution. Below we describe what some of these useful statistics are.
$\hat{R}$ diagnostic
$\hat{R}$ is a diagnostic tool that measures the between- and within-chain variances. It is a test that indicates a lack of convergence by comparing the variance between multiple chains to the variance within each chain. If the parameters are successfully exploring the full space for each chain, then $\hat{R} \approx 1$, since the between-chain and within-chain variance should be equal. $\hat{R}$ is calculated as

$$\hat{R} = \sqrt{\frac{\hat{V}}{W}}$$

where $W$ is the within-chain variance and $\hat{V}$ is the posterior variance estimate for the pooled rank-traces. The take-away here is that $\hat{R}$ converges towards 1 when each of the Markov chains approaches perfect adaptation to the true posterior distribution. We do not recommend using inference results if $\hat{R} > 1.01$. More information about $\hat{R}$ can be found in the Vehtari et al paper.
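If you want the raw $\hat{R}$ values rather than reading them off the summary table, ArviZ exposes them directly; a short sketch using the az_data object built above:

# Rank-normalized split R-hat for every queried parameter.
print(az.rhat(az_data))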
Effective sample size diagnostic
MCMC samplers do not draw independent samples from the target distribution, which means that our samples are correlated. In an ideal situation all samples would be independent, but we do not have that luxury. We can, however, measure the number of effectively independent samples we draw, which is called the effective sample size. You can read more about how this value is calculated in the Vehtari et al paper; briefly, it is a measure that combines information from the $\hat{R}$ value with the autocorrelation estimates within the chains. There are many ways to estimate effective sample sizes, but we will use the method defined in the Vehtari et al paper.
The rule of thumb for ess_bulk is for this value to be greater than 100 per chain on average. Since we ran two chains, we need ess_bulk to be greater than 200 for each parameter. The ess_tail is an estimate of the number of effectively independent samples for the more extreme values of the posterior. This is not the number of samples that landed in the tails of the posterior; it is a measure of the number of effectively independent samples we would have if we sampled the tails of the posterior. The rule of thumb for this value is also to be greater than 100 per chain on average.
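The ess_bulk and ess_tail columns in the summary can also be computed directly with ArviZ; a short sketch:

# Bulk and tail effective sample sizes, following Vehtari et al.
print(az.ess(az_data, method="bulk"))
print(az.ess(az_data, method="tail"))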
For comparison, let's check out the model's performance using a couple of different inference methods.
Single-site Metropolis-Hastings
Let's retry this problem, using ancestral Metropolis-Hastings as the inference algorithm to compare performance.
Ancestral Metropolis-Hastings is a simple inference algorithm, which proposes child random variables conditional on values for the parent random variables. The most ancestral random variables are simply sampled from the prior distribution.
queries_mh = [model.lambda_(), model.beta(), model.log_prob_test()]
num_samples_mh = 2 * num_samples
samples_mh = bm.SingleSiteAncestralMetropolisHastings().infer(
queries=queries_mh,
observations=observations,
num_samples=num_samples_mh,
num_chains=num_chains,
)
lambda_marginal_mh = (
samples_mh[model.lambda_()].flatten(start_dim=0, end_dim=1).detach()
)
beta_marginal_mh = samples_mh[model.beta()].flatten(start_dim=0, end_dim=1).detach()
log_prob_test_results_mh = (
samples_mh[model.log_prob_test()].flatten(start_dim=0, end_dim=1).detach()
)
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
beta_marginals_mh_plot = plots.marginal_2d(
x=beta_marginal_mh[:, :2][:, 0].flatten().numpy(),
y=beta_marginal_mh[:, :2][:, 1].flatten().numpy(),
    x_label="β0",
    y_label="β1",
    title="β0 - β1 joint plot / Single Site Ancestral Metropolis Hastings",
true_x=float(true_beta[:2, :].flatten()[0]),
true_y=float(true_beta[:2, :].flatten()[1]),
)
show(beta_marginals_mh_plot)
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
lambda_marginals_mh_plot = plots.marginal_2d(
x=lambda_marginal_mh[:, :2][:, 0].flatten().numpy(),
y=lambda_marginal_mh[:, :2][:, 1].flatten().numpy(),
    x_label="λ0",
    y_label="λ1",
    title="λ0 - λ1 joint plot / Single Site Ancestral Metropolis Hastings",
)
show(lambda_marginals_mh_plot)
From all of the above plots, we see that ancestral Metropolis-Hastings does a significantly worse job of recovering the true parameters! Not only do the regions of uncertainty tend to exclude the true values, but the samples that are actually drawn are also very sparse. This means that the algorithm achieves a very poor representation of the posterior surface.
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
y = log_prob_test_results_mh.numpy()[:1000]
x = np.arange(len(y))
cds = ColumnDataSource({"x": x.tolist(), "y": y.tolist()})
tips = [("Log probability", "@y{0.000}"), ("Iteration", "@x")]
log_prob_mh_plot = plots.line_plot(
plot_sources=[cds],
labels=[f"Using true params = {ground_truth_log_prob :.2f}"],
tooltips=[tips],
figure_kwargs={
"title": "Log probability",
"plot_width": 800,
"x_axis_label": "Iteration",
"y_axis_label": "Log probability",
},
plot_kwargs={"line_alpha": 0.5},
)
# Add a line showing the ground truth.
span = Span(
location=ground_truth_log_prob,
dimension="width",
line_color="black",
line_width=3,
)
log_prob_mh_plot.add_layout(span)
log_prob_mh_plot.legend.location = "bottom_left"
show(log_prob_mh_plot)
Surprisingly, the algorithm does eventually discover, and settle in on, parameters that seem to describe the test data well.
summary_mh_df = az.summary(samples_mh.to_inference_data(), round_to=3)
Markdown(summary_mh_df.to_markdown())
Parameter | mean | sd | hdi_5.5% | hdi_94.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
---|---|---|---|---|---|---|---|---|---|
log_prob_test(SparseLogisticRegression with 5 covariates,) | -34.391 | 18.654 | -37.696 | -30.967 | 1.773 | 1.257 | 4.104 | 14.836 | 1.517 |
beta(SparseLogisticRegression with 5 covariates,)[0, 0] | -1.602 | 0.23 | -1.785 | -1.197 | 0.095 | 0.081 | 3.735 | 2.312 | 1.485 |
beta(SparseLogisticRegression with 5 covariates,)[1, 0] | 0.989 | 0.188 | 0.725 | 1.152 | 0.096 | 0.08 | 3.326 | 9.739 | 1.604 |
beta(SparseLogisticRegression with 5 covariates,)[2, 0] | 0 | 0.111 | -0.117 | 0.115 | 0.074 | 0.061 | 2.313 | 2.312 | 2.825 |
beta(SparseLogisticRegression with 5 covariates,)[3, 0] | 0.041 | 0.045 | 0.017 | 0.113 | 0.011 | 0.008 | 6.909 | 2.836 | 1.477 |
beta(SparseLogisticRegression with 5 covariates,)[4, 0] | -0.004 | 0.09 | -0.101 | 0.075 | 0.059 | 0.048 | 2.626 | 2.323 | 2.682 |
lambda_(SparseLogisticRegression with 5 covariates,)[0, 0] | 2.389 | 0.674 | 1.754 | 3.141 | 0.457 | 0.381 | 2.224 | 2.02 | 3.968 |
lambda_(SparseLogisticRegression with 5 covariates,)[1, 0] | 4.016 | 0.559 | 3.909 | 4.377 | 0.298 | 0.255 | 2.289 | 7.613 | 2.915 |
lambda_(SparseLogisticRegression with 5 covariates,)[2, 0] | 0.185 | 0.079 | 0.176 | 0.217 | 0.011 | 0.008 | 4.494 | 8.693 | 2.887 |
lambda_(SparseLogisticRegression with 5 covariates,)[3, 0] | 1.085 | 1.601 | 0.315 | 1.049 | 0.711 | 0.535 | 2.261 | 2.008 | 3.069 |
lambda_(SparseLogisticRegression with 5 covariates,)[4, 0] | 0.439 | 0.314 | 0.149 | 0.696 | 0.209 | 0.173 | 2.284 | 2.008 | 2.941 |
The $\hat{R}$ values are extremely far from 1. Further, the ess_bulk and ess_tail values for our sampled random variables are extremely small. As a result, these inference results would be unusable for any real application.
In comparison, NUTS seems to have developed a much more complete representation of the posterior surface. Now that we have validated the model on a synthetic dataset, we will look at a real-world example.
German-numeric data
In the above example, we used a synthetic dataset in which the outcome depended on only two covariates, simply to make visualizing the inferences more intuitive. Let's retry our sparse logistic regression on a real-world dataset, the German credit dataset.
The German credit dataset is a collection of 1,000 data points. Each data point represents a person who borrows from a bank and is classified as either a good or a bad credit risk according to the bank. Each data point contains 24 numeric covariates that may or may not be useful predictors of credit risk. Example covariates include age, sex, and savings. The response variable is either 1, indicating good credit, or 2, indicating bad credit. You can read more about this dataset, and in particular what each of the 24 covariate columns represents, in the documentation.
input_data = np.genfromtxt(
"https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data-numeric"
)
Let's scale the covariates to range between -1 and 1 and add a constant intercept column. This yields a total of 25 covariates. We'll also map the response variable to [0, 1].
X_all = torch.from_numpy(
np.hstack(
[
np.ones((input_data.shape[0], 1)),
sklearn.preprocessing.minmax_scale(
input_data[:, :-1],
feature_range=(-1, 1),
),
]
)
).float()
Y_all = torch.from_numpy(sklearn.preprocessing.minmax_scale(input_data[:, -1:])).float()
Now, we'll split into training and test data.
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(
X_all,
Y_all,
)
The rest of the setup is the same as the tutorial with synthetic data.
First, we'll instantiate our model.
model = SparseLogisticRegression(X_train, X_test, Y_test)
Next, we'll bind our observations.
observations = {model.y(): Y_train}
Now, we're ready to fit the model.
german_queries = [
model.lambda_(),
model.beta(),
model.log_prob_test(),
model.y_probs(),
]
german_samples = bm.GlobalNoUTurnSampler(nnc_compile=nnc_compile).infer(
queries=german_queries,
observations=observations,
num_adaptive_samples=num_adaptive_samples,
num_samples=num_samples,
num_chains=num_chains,
)
german_lambda_marginal = (
german_samples[model.lambda_()].flatten(start_dim=0, end_dim=1).detach()
)
german_beta_marginal = (
german_samples[model.beta()].flatten(start_dim=0, end_dim=1).detach()
)
german_log_prob_test_results = (
german_samples[model.log_prob_test()].flatten(start_dim=0, end_dim=1).detach()
)
This model is too high-dimensional to plot the same way that we did with the previous examples. Let us look at the log density of the samples generated by MCMC. Sometimes the chains can get stuck making little progress, which can be seen from this plot. We observe that NUTS is able to draw samples quite efficiently from across the posterior distribution (compare with Metropolis-Hastings above).
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
german_y = german_log_prob_test_results[:1000]
german_x = np.arange(len(german_y))
german_cds = ColumnDataSource({"x": german_x.tolist(), "y": german_y.tolist()})
german_tips = [("Log probability", "@y{0.000}"), ("Iteration", "@x")]
german_log_prob_plot = plots.line_plot(
plot_sources=[german_cds],
tooltips=[german_tips],
figure_kwargs={
"title": "Log probability for German credit risk data",
"plot_width": 800,
"x_axis_label": "Iteration",
"y_axis_label": "Log probability",
},
plot_kwargs={"line_alpha": 0.5},
)
show(german_log_prob_plot)
Let us also look at the diagnostics to ensure that the MCMC chains have converged to the posterior distribution. We note that the $\hat{R}$ values are all close to 1 and the effective sample sizes look reasonable.
filtered_samples = {
k: v for k, v in german_samples.items() if k in {model.beta(), model.lambda_()}
}
az_data = MonteCarloSamples(filtered_samples).to_inference_data()
german_summary_df = az.summary(az_data, round_to=3)
Markdown(german_summary_df.to_markdown())
Parameter | mean | sd | hdi_5.5% | hdi_94.5% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
---|---|---|---|---|---|---|---|---|---|
beta(SparseLogisticRegression with 25 covariates,)[0, 0] | -0.218 | 0.356 | -0.756 | 0.23 | 0.012 | 0.008 | 953.99 | 1267.42 | 1.003 |
beta(SparseLogisticRegression with 25 covariates,)[1, 0] | -0.782 | 0.12 | -0.965 | -0.588 | 0.003 | 0.002 | 1616.33 | 1357.34 | 1.001 |
beta(SparseLogisticRegression with 25 covariates,)[2, 0] | 1.347 | 0.299 | 0.867 | 1.819 | 0.007 | 0.005 | 1915.53 | 1591.96 | 1.002 |
beta(SparseLogisticRegression with 25 covariates,)[3, 0] | -0.475 | 0.205 | -0.799 | -0.156 | 0.006 | 0.005 | 984.341 | 782.178 | 1.001 |
beta(SparseLogisticRegression with 25 covariates,)[4, 0] | 0.057 | 0.189 | -0.189 | 0.386 | 0.004 | 0.003 | 1863.57 | 1583.56 | 1.001 |
beta(SparseLogisticRegression with 25 covariates,)[5, 0] | -0.291 | 0.144 | -0.537 | -0.07 | 0.004 | 0.003 | 1077.02 | 591.582 | 1.003 |
beta(SparseLogisticRegression with 25 covariates,)[6, 0] | -0.348 | 0.179 | -0.587 | 0.006 | 0.007 | 0.005 | 785.977 | 456.259 | 1 |
beta(SparseLogisticRegression with 25 covariates,)[7, 0] | -0.123 | 0.158 | -0.358 | 0.106 | 0.004 | 0.003 | 1294.97 | 1653.24 | 1.001 |
beta(SparseLogisticRegression with 25 covariates,)[8, 0] | -0.011 | 0.083 | -0.155 | 0.108 | 0.002 | 0.001 | 1682.91 | 1639.8 | 1 |
beta(SparseLogisticRegression with 25 covariates,)[9, 0] | 0.183 | 0.154 | -0.036 | 0.405 | 0.005 | 0.003 | 894.553 | 1343.35 | 1.001 |
beta(SparseLogisticRegression with 25 covariates,)[10, 0] | -0.068 | 0.147 | -0.348 | 0.106 | 0.004 | 0.003 | 1349.61 | 1360.98 | 1 |
beta(SparseLogisticRegression with 25 covariates,)[11, 0] | -0.214 | 0.128 | -0.391 | 0.011 | 0.004 | 0.003 | 1046.93 | 507.06 | 1.005 |
beta(SparseLogisticRegression with 25 covariates,)[12, 0] | 0.111 | 0.174 | -0.114 | 0.391 | 0.004 | 0.003 | 1545.6 | 1841.1 | 1.001 |
beta(SparseLogisticRegression with 25 covariates,)[13, 0] | 0.013 | 0.079 | -0.107 | 0.145 | 0.002 | 0.001 | 1694.56 | 1724.96 | 1.002 |
beta(SparseLogisticRegression with 25 covariates,)[14, 0] | -0.088 | 0.089 | -0.233 | 0.034 | 0.003 | 0.002 | 1096.82 | 1181.23 | 1.002 |
beta(SparseLogisticRegression with 25 covariates,)[15, 0] | -0.275 | 0.294 | -0.727 | 0.073 | 0.009 | 0.007 | 832.849 | 1529.49 | 1.007 |
beta(SparseLogisticRegression with 25 covariates,)[16, 0] | 0.243 | 0.123 | 0.047 | 0.446 | 0.005 | 0.003 | 686.886 | 199.985 | 1.003 |
beta(SparseLogisticRegression with 25 covariates,)[17, 0] | -0.31 | 0.202 | -0.594 | 0.018 | 0.007 | 0.005 | 717.282 | 510.955 | 1.002 |
beta(SparseLogisticRegression with 25 covariates,)[18, 0] | 0.221 | 0.212 | -0.043 | 0.56 | 0.007 | 0.005 | 629.229 | 1192.16 | 1.009 |
beta(SparseLogisticRegression with 25 covariates,)[19, 0] | 0.263 | 0.285 | -0.104 | 0.689 | 0.009 | 0.006 | 781.196 | 1412.12 | 1.008 |
beta(SparseLogisticRegression with 25 covariates,)[20, 0] | 0.151 | 0.138 | -0.056 | 0.359 | 0.004 | 0.003 | 1081.01 | 1508.05 | 1 |
beta(SparseLogisticRegression with 25 covariates,)[21, 0] | -0.062 | 0.108 | -0.253 | 0.082 | 0.003 | 0.002 | 929.626 | 1422.19 | 1.002 |
beta(SparseLogisticRegression with 25 covariates,)[22, 0] | 0.015 | 0.14 | -0.192 | 0.249 | 0.004 | 0.003 | 1312.97 | 809.934 | 1.003 |
beta(SparseLogisticRegression with 25 covariates,)[23, 0] | 0.021 | 0.086 | -0.117 | 0.153 | 0.002 | 0.001 | 1606.22 | 1499.37 | 1.001 |
beta(SparseLogisticRegression with 25 covariates,)[24, 0] | -0.016 | 0.072 | -0.131 | 0.1 | 0.002 | 0.002 | 886.206 | 968.057 | 1.001 |
lambda_(SparseLogisticRegression with 25 covariates,)[0, 0] | 2.101 | 2.943 | 0 | 4.795 | 0.098 | 0.07 | 1039.31 | 1018.09 | 1.002 |
lambda_(SparseLogisticRegression with 25 covariates,)[1, 0] | 5.307 | 4.308 | 1.269 | 9.595 | 0.158 | 0.112 | 726.059 | 940.254 | 1 |
lambda_(SparseLogisticRegression with 25 covariates,)[2, 0] | 8.785 | 7.265 | 1.756 | 16.159 | 0.258 | 0.183 | 627.759 | 1031.12 | 1.004 |
lambda_(SparseLogisticRegression with 25 covariates,)[3, 0] | 3.632 | 4.056 | 0.276 | 7.088 | 0.131 | 0.093 | 695.994 | 859.273 | 1.006 |
lambda_(SparseLogisticRegression with 25 covariates,)[4, 0] | 1.294 | 1.845 | 0.001 | 2.795 | 0.044 | 0.033 | 1544.61 | 963.183 | 1.003 |
lambda_(SparseLogisticRegression with 25 covariates,)[5, 0] | 2.592 | 3.226 | 0.005 | 4.682 | 0.12 | 0.085 | 768.199 | 695.669 | 1 |
lambda_(SparseLogisticRegression with 25 covariates,)[6, 0] | 2.836 | 3.357 | 0.003 | 5.179 | 0.117 | 0.083 | 603.47 | 590.562 | 0.999 |
lambda_(SparseLogisticRegression with 25 covariates,)[7, 0] | 1.415 | 2.102 | 0.005 | 2.845 | 0.06 | 0.046 | 912.985 | 1057.83 | 1.006 |
lambda_(SparseLogisticRegression with 25 covariates,)[8, 0] | 0.951 | 1.472 | 0.002 | 1.942 | 0.072 | 0.051 | 842.951 | 239.897 | 1.003 |
lambda_(SparseLogisticRegression with 25 covariates,)[9, 0] | 1.732 | 2.288 | 0.001 | 3.345 | 0.07 | 0.049 | 692.473 | 735.36 | 1.003 |
lambda_(SparseLogisticRegression with 25 covariates,)[10, 0] | 1.321 | 2.268 | 0.001 | 2.605 | 0.066 | 0.047 | 1168.43 | 1075.63 | 1.001 |
lambda_(SparseLogisticRegression with 25 covariates,)[11, 0] | 1.797 | 1.908 | 0.015 | 3.392 | 0.06 | 0.042 | 667.351 | 630.821 | 1.002 |
lambda_(SparseLogisticRegression with 25 covariates,)[12, 0] | 1.417 | 2.074 | 0.003 | 2.941 | 0.059 | 0.045 | 1363.8 | 1021.44 | 1.005 |
lambda_(SparseLogisticRegression with 25 covariates,)[13, 0] | 0.91 | 1.238 | 0 | 1.85 | 0.033 | 0.023 | 1213.41 | 759.393 | 1.005 |
lambda_(SparseLogisticRegression with 25 covariates,)[14, 0] | 1.161 | 1.607 | 0.001 | 2.331 | 0.047 | 0.033 | 912.611 | 700.183 | 1.001 |
lambda_(SparseLogisticRegression with 25 covariates,)[15, 0] | 2.245 | 2.497 | 0.001 | 5.206 | 0.072 | 0.051 | 844.69 | 724.233 | 1.007 |
lambda_(SparseLogisticRegression with 25 covariates,)[16, 0] | 2.209 | 2.714 | 0.023 | 4.023 | 0.079 | 0.056 | 952.933 | 671.074 | 1.002 |
lambda_(SparseLogisticRegression with 25 covariates,)[17, 0] | 2.508 | 2.787 | 0.003 | 4.857 | 0.103 | 0.073 | 544.242 | 342.13 | 1.002 |
lambda_(SparseLogisticRegression with 25 covariates,)[18, 0] | 2.11 | 2.697 | 0.007 | 4.187 | 0.079 | 0.056 | 653.669 | 699.592 | 1.008 |
lambda_(SparseLogisticRegression with 25 covariates,)[19, 0] | 2.559 | 4.116 | 0.001 | 4.916 | 0.418 | 0.383 | 279.064 | 212.112 | 1.014 |
lambda_(SparseLogisticRegression with 25 covariates,)[20, 0] | 1.568 | 1.894 | 0.002 | 3.097 | 0.06 | 0.042 | 661.574 | 711.995 | 1.002 |
lambda_(SparseLogisticRegression with 25 covariates,)[21, 0] | 1.053 | 1.329 | 0.002 | 2.09 | 0.036 | 0.025 | 1098.9 | 798.229 | 1.002 |
lambda_(SparseLogisticRegression with 25 covariates,)[22, 0] | 1.18 | 1.706 | 0.003 | 2.394 | 0.056 | 0.043 | 1126.2 | 1024.51 | 1.003 |
lambda_(SparseLogisticRegression with 25 covariates,)[23, 0] | 0.917 | 1.438 | 0 | 1.87 | 0.042 | 0.032 | 958.948 | 792.467 | 1.003 |
lambda_(SparseLogisticRegression with 25 covariates,)[24, 0] | 0.86 | 1.369 | 0.002 | 1.75 | 0.044 | 0.031 | 1157.46 | 1089.54 | 1.002 |
To check for sparsity, let us plot the histogram of the marginals for a selection of coefficients ($\beta$) in the model. We note that many of the parameters have their posterior probability mass concentrated at 0, resulting in a sparse model, as desired.
# Required for visualizing in Colab.
output_notebook(hide_banner=True)
german_betas = german_samples[model.beta()].squeeze(-1).reshape(-1, 25)
german_plot_sources = []
german_labels = []
german_tooltips = []
for i in range(25):
    b = german_betas[:, i].numpy()
    support, density = az.stats.density_utils.kde(b)
    density /= density.max()
    cds = ColumnDataSource({"x": support.tolist(), "y": density.tolist()})
    label = f"β{i}"
    # The hover tooltip reads from the "x" column, which holds the KDE support.
    tips = [(label, "@x{0.000}")]
    german_plot_sources.append(cds)
    german_labels.append(label)
    german_tooltips.append(tips)
german_betas_density_plot = plots.line_plot(
plot_sources=german_plot_sources,
labels=german_labels,
tooltips=german_tooltips,
plot_kwargs={"line_width": 1, "line_alpha": 0.4},
    figure_kwargs={
        "title": "Histogram of some sample model coefficients",
        "x_axis_label": "β",
        "plot_height": 700,
    },
)
german_betas_density_plot.yaxis.visible = False
german_betas_density_plot.legend.click_policy = "mute"
show(german_betas_density_plot)
Finally, let us evaluate the mean predictions from the model using the ROC (Receiver Operating Characteristic) plot. The area under the curve (AUC) is 0.83, which shows that the model achieves good classification accuracy. This merely evaluates the mean forecast from the model, but as we saw earlier, we have a full distribution of probability scores for each test data point.
y_true = Y_test.squeeze(-1).numpy()
y_pred_mean = german_samples[model.y_probs()].squeeze(-1).mean([0, 1]).numpy()
false_positive_rate, true_positive_rate, _ = sklearn.metrics.roc_curve(
    y_true,
    y_pred_mean,
)
roc_score = sklearn.metrics.roc_auc_score(y_true, y_pred_mean)
roc_cds = ColumnDataSource(
{
"x": false_positive_rate.tolist(),
"y": true_positive_rate.tolist(),
}
)
roc_tips = [("TPR", "@y{0.000}"), ("FPR", "@x{0.000}")]
roc_plot = plots.line_plot(
plot_sources=[roc_cds],
tooltips=[roc_tips],
labels=[f"Classifier (AUC = {roc_score:.2f})"],
    figure_kwargs={
        "title": "Receiver Operating Characteristic",
        "x_axis_label": "False positive rate (positive label: 1)",
        "y_axis_label": "True positive rate (positive label: 1)",
    },
plot_kwargs={
"line_width": 3,
"line_alpha": 0.7,
"hover_line_color": "orange",
"hover_line_alpha": 1,
},
)
roc_plot.legend.location = "bottom_right"
show(roc_plot)
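As an aside, the RocCurveDisplay imported at the top offers a one-line matplotlib alternative to building the Bokeh plot by hand; a short sketch (assuming scikit-learn >= 1.0, where from_predictions is available):

# Equivalent ROC curve rendered with scikit-learn's built-in display helper.
RocCurveDisplay.from_predictions(y_true, y_pred_mean)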
References
- Carvalho CM, Polson NG, Scott JG. Handling sparsity via the horseshoe. In: van Dyk D, Welling M, editors. Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics; 2009. pp. 73β80. Available from http://proceedings.mlr.press/v5/carvalho09a/carvalho09a.pdf.
- Piironen J, Vehtari A. Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics. 2017;11(2) 5018β5051. doi: 10.1214/17-EJS1337SI.