# Logistic regression

## Tutorial: Bayesian Logistic Regression​

The purpose of this tutorial is to show how to build a simple Bayesian model to deduce the line which separates two categories of points.

## Problem​

A logistic model is a statistical model where we have a collection of things that can be divided into two categories — pictures of cats or dogs, patients who are immune or susceptible, students who pass or fail, and so on. An assumption of the model is: the probability that a given thing is in a category can be computed by taking a linear combination of characteristics of the thing.

The problem we seek to solve in this tutorial is: when given a set of examples of each category, can we infer the boundary between the categories?

## Prerequisites​

We will be using the following packages within this tutorial.

Let's code this in Bean Machine! Import the Bean Machine library and some fundamental PyTorch classes.

# Install Bean Machine in Colab if using Colab.import sysif "google.colab" in sys.modules and "beanmachine" not in sys.modules:    !pip install beanmachine
import osimport warningsimport arviz as azimport beanmachine.ppl as bmimport torchimport torch.distributions as distfrom beanmachine.ppl.inference.monte_carlo_samples import MonteCarloSamplesfrom beanmachine.tutorials.utils import plotsfrom bokeh.io import output_notebookfrom bokeh.models import ColumnDataSource, MultiLine, Spanfrom bokeh.plotting import showfrom IPython.display import Markdown

The next cell includes convenient configuration settings to improve the notebook presentation as well as setting a manual seed for reproducibility.

# Eliminate excess UserWarnings from Python.warnings.filterwarnings("ignore")# Plotting settingsaz.rcParams["plot.backend"] = "bokeh"az.rcParams["stats.hdi_prob"] = 0.89# Manual seedbm.seed(0)# Other settings for the notebook.smoke_test = "SANDCASTLE_NEXUS" in os.environ or "CI" in os.environ

## Example: orange and blue points​

For the purposes of this tutorial our "things" will be two-dimensional points between $(-10,-10)$ and $(10,10)$. Our two categories will be represented by $0.0$, which we'll render as blue, and $1.0$, which we'll render as orange.

The probability that a given point $(x,y)$ will be orange is

$P((x,y)\text{ is orange})=\Large{\frac{1}{1+e^{-b_0-b_1x-b_2y}}}$

for some coefficients $b_0,b_1,b_2$.

That equation is seen to be a linear combination of properties of the point if we express the probability as log odds:

$\text{logit}(P((x,y)\text{ is orange}))=b_0+b_1x+b_2y$

The boundary between the two categories is the set of points where the probability is $0.5$ that the point is orange; the log odds of $0.5$ is $0$, so the equation of the set of $(x, )$ points separating the categories is a straight line:

$b_0+b_1x+b_2y=0$

Or, expressed as slope and intercept:

$y=-\frac{b_1}{b_2}x-\frac{b_0}{b_2}$

Our goal therefore is to infer possible values for $b_0,b_1,b_2$ from which we can compute the line separating the categories.

## Creating sample data​

We start by creating a data set to illustrate the model. In order to make computations faster and easier, we will express the entire data set as a single tensor representing 200 points. Each row of the tensor will be in the form $\begin{bmatrix}1.0&x&y\end{bmatrix}$. That way we can compute the linear combination by matrix-multiplying each row by $\begin{bmatrix}b_0\\b_1\\b_2\end{bmatrix}$ to obtain the log odds that the point is orange: $b_0+b_1x+b_2y$.

For our synthetic dataset, we will assume the following parameters.

N = 200low = -10.0high = 10.0uniform = dist.Uniform(    low=torch.tensor([1.0, low, low]),    high=torch.tensor([1.0, high, high]),)points = torch.tensor([uniform.sample().tolist() for i in range(N)]).view(N, 3)
# Required for visualizing in Colab.output_notebook(hide_banner=True)x = points[:, 1]y = points[:, 2]cds = ColumnDataSource({"x": x.tolist(), "y": y.tolist()})tips = [("y", "@y{0.000}"), ("x", "@x{0.000}")]synthetic_data_plot = plots.scatter_plot(    plot_sources=cds,    tooltips=tips,    figure_kwargs={        "title": "Synthetic data",        "x_axis_label": "x",        "y_axis_label": "y",    },    plot_kwargs={"fill_color": "black"},)show(synthetic_data_plot)

We now assign points to categories $0.0$ (blue) and $1.0$ (orange).

For this example we will assign categories to points using $\begin{bmatrix}-2.0\\0.3\\-0.5\end{bmatrix}$ as our coefficients, which makes the line separating the categories $y=0.6x-4$.

true_coefficients = torch.tensor([-2.0, 0.3, -0.5]).view(3, 1)true_slope = -float(true_coefficients / true_coefficients)true_intercept = -float(true_coefficients / true_coefficients)def log_odds(point):    return point.view(1, 3).mm(true_coefficients)observed_categories = torch.tensor(    [dist.Bernoulli(logits=log_odds(point)).sample() for point in points])

## Data visualization methods​

It is useful to have helper methods to visualize the synthetic data set and the line which separates the two categories.

def categorize_points(points, categories):    orange_x = []    orange_y = []    blue_x = []    blue_y = []    for point, category in zip(points, categories):        if category == 1:            orange_x.append(float(point))            orange_y.append(float(point))        else:            blue_x.append(float(point))            blue_y.append(float(point))    return {        "orange": {"x": orange_x, "y": orange_y, "label": ["orange"] * len(orange_x)},        "blue": {"x": blue_x, "y": blue_y, "label": ["blue"] * len(blue_x)},    }def plot_line(slope, intercept, high=10, low=-10):    if intercept > high or intercept < low:        return    xs = [low, high]    ys = [slope * low + intercept, slope * high + intercept]    if ys > high:        xs = (high - intercept) / slope        ys = high    elif ys < low:        xs = (low - intercept) / slope        ys = low    if ys > high:        xs = (high - intercept) / slope        ys = high    elif ys < low:        xs = (low - intercept) / slope        ys = low    return xs, ys
# Required for visualizing in Colab.output_notebook(hide_banner=True)points_with_categories = categorize_points(points, observed_categories)orange_cds = ColumnDataSource(    {        "x": points_with_categories["orange"]["x"],        "y": points_with_categories["orange"]["y"],        "label": points_with_categories["orange"]["label"],    })orange_tips = [("Category", "@label"), ("y", "@y{0.000}"), ("x", "@x{0.000}")]blue_cds = ColumnDataSource(    {        "x": points_with_categories["blue"]["x"],        "y": points_with_categories["blue"]["y"],        "label": points_with_categories["blue"]["label"],    })blue_tips = [("Category", "@label"), ("y", "@y{0.000}"), ("x", "@x{0.000}")]synthetic_data_with_categories_plot = plots.scatter_plot(    plot_sources=[orange_cds, blue_cds],    tooltips=[orange_tips, blue_tips],    figure_kwargs={        "title": "Synthetic data with categories",        "x_axis_label": "x",        "y_axis_label": "y",    },    legend_items=["Category orange", "Category blue"],    plot_kwargs={"fill_color": "label"},)# Add the separating line.x, y = plot_line(true_slope, true_intercept)synthetic_data_with_categories_plot.line(    x=x,    y=y,    legend_label="Separating line",    line_color="black",    line_width=3,    line_alpha=1,)show(synthetic_data_with_categories_plot)

## Model​

We can now start building our model.

The first thing we need is a prior distribution for our three coefficients.

We have no reason to believe that the coefficients will be either positive or negative, so we should choose a prior distribution that is centered on zero. We use a matrix multiplication to compute the linear combination, and therefore make the prior a column vector of samples from a normal distribution:

@bm.random_variabledef coefficients():    mean = torch.zeros(3, 1)    sigma = torch.ones(3, 1)    return dist.Normal(mean, sigma)

Our model for categories is now straightforward: each category is chosen by matrix-multiplying the point by the prior distribution of coefficients, and we get a set of categories from the Bernoulli distribution; either $0.0$ (blue) or $1.0$ (orange):

@bm.random_variabledef categories():    return dist.Bernoulli(logits=points.mm(coefficients()))

## Inference​

We can now infer the posterior distribution of the coefficients given the observations:

num_samples = 2 if smoke_test else 2000num_adaptive_samples = 0 if smoke_test else num_samples // 2num_chains = 1 if smoke_test else 4observations = {categories(): observed_categories.view(N, 1)}queries = [coefficients()]mc = bm.GlobalNoUTurnSampler()
%%timesamples = mc.infer(    queries=queries,    observations=observations,    num_samples=num_samples,    num_chains=num_chains,    num_adaptive_samples=num_adaptive_samples,)
Out:
Samples collected:   0%|          | 0/3000 [00:00<?, ?it/s]Samples collected:   0%|          | 0/3000 [00:00<?, ?it/s]Samples collected:   0%|          | 0/3000 [00:00<?, ?it/s]Samples collected:   0%|          | 0/3000 [00:00<?, ?it/s]CPU times: user 33.2 s, sys: 107 ms, total: 33.3 sWall time: 33.2 s
sampled_coefficients = samples.get_chain()[coefficients()]

The slopes and intercepts are computed from the sampled coefficients:

slopes = [-float(s / s) for s in sampled_coefficients if float(s) != 0.0]intercepts = [-float(s / s) for s in sampled_coefficients if float(s) != 0.0]

## Posterior results​

A histogram of the inferred slopes should cluster near the true value, marked in red. A slight deviation is to be expected since there is noise in our data.

# Required for visualizing in Colab.output_notebook(hide_banner=True)slope_hist_plot = plots.histogram_plot(slopes)# Add a line showing the true slope.span = Span(    location=true_slope,    dimension="height",    line_color="red",    line_width=3,)slope_hist_plot.add_layout(span)show(slope_hist_plot)

And similarly for the intercepts. Our model actually predicts a higher median intercept to capture the orange points above the slope in our ground truth.

# Required for visualizing in Colab.output_notebook(hide_banner=True)intercept_hist_plot = plots.histogram_plot(intercepts)# Add a line showing the true slope.span = Span(    location=true_intercept,    dimension="height",    line_color="red",    line_width=3,)intercept_hist_plot.add_layout(span)show(intercept_hist_plot)

Inference typically finds a reasonable distribution of possible lines that separate these two categories.

## Diagnostics​

In addition to visualizing the lines directly, we can also use the diagnostics capabilities of Arviz to examine the diagnostics of our variables:

filtered_samples = {k: v for k, v in samples.items() if k == coefficients()}az_data = MonteCarloSamples(filtered_samples).to_inference_data()summary_df = az.summary(az_data, round_to=3).to_markdown()Markdown(summary_df)
meansdhdi_5.5%hdi_94.5%mcse_meanmcse_sdess_bulkess_tailr_hat
coefficients()[0,0]-1.7740.294-2.245-1.3130.0050.0043561.25042.121.001
coefficients()[1,0]0.2620.0480.1850.3360.0010.0013543.114425.91
coefficients()[2,0]-0.4520.064-0.548-0.3420.0010.0013137.524031.131.001

As you can see, the average and median (50%) values inferred are reasonably close to the true values but there is a large spread. Remember that any constant multiple of the true values would produce the same line; if we got $-4.0$, $0.6$ and $-1.2$ then the slope and intercept would be the same and only the amount of "mixing" near the line would change, so we can expect that this inference problem may have some fairly large variance.

The n_eff column is the effective sample size, which indicates how correlated the posterior samples are to each other; higher is better. This effective sample size is a little low.

Notice that we generated the data by separating points along a line with slope $0.6$ and intercept $-4.0$, but were we to observe the specific set of 200 points we classified without knowing the true parameters ahead of time, we would deduce that the separating line had a slightly lower slope than $0.6$ and an intercept around $-3.0$.

To more clearly illustrate the accuracy of the inference, we can take a random selection of the inferred lines and plot them on the data set:

# Required for visualizing in Colab.output_notebook(hide_banner=True)# Replicate the original data separating plot above.randomly_selected_lines_plot = plots.scatter_plot(    plot_sources=[orange_cds, blue_cds],    tooltips=[orange_tips, blue_tips],    figure_kwargs={        "title": "Synthetic data with categories",        "x_axis_label": "x",        "y_axis_label": "y",    },    legend_items=["Category orange", "Category blue"],    plot_kwargs={"fill_color": "label"},)# Add randomly selected sampled separating lines.num_lines = 25sampled_indices = torch.randint(0, len(slopes), (num_lines,)).tolist()xs = []ys = []for sampled_index in sampled_indices:    sampled_slope = slopes[sampled_index]    sampled_intercept = intercepts[sampled_index]    x, y = plot_line(sampled_slope, sampled_intercept)    xs.append(x)    ys.append(y)cds = ColumnDataSource({"xs": xs, "ys": ys})glyph = MultiLine(xs="xs", ys="ys", line_color="magenta", line_alpha=0.2)randomly_selected_lines_plot.add_glyph(cds, glyph)# Add the separating line.x, y = plot_line(true_slope, true_intercept)randomly_selected_lines_plot.line(    x=x,    y=y,    legend_label="Separating line",    line_color="black",    line_width=3,    line_alpha=1,)show(randomly_selected_lines_plot)