# Tutorial: Maximum likelihood and maximum a posteriori point estimation

This tutorial demonstrates how maximum likelihood estimation (MLE) and maximum a posteriori (MAP) inference problems can be specified and solved using Bean Machine's variational inference. MLE is treated as a special case of MAP where the prior is uninformative, and MAP is treated as a special case of VI point estimation (i.e. variational inference with Delta guide distributions).
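In symbols, the MAP estimate maximizes the log joint density, and an uninformative (flat) prior $p(\theta) \propto 1$ reduces MAP to MLE:

$$\hat\theta_{\text{MAP}} = \arg\max_\theta \big[\log p(y \mid \theta) + \log p(\theta)\big], \qquad \hat\theta_{\text{MLE}} = \arg\max_\theta \log p(y \mid \theta).$$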

## Prerequisites

```python
# Install Bean Machine in Colab if using Colab.
import sys

if "google.colab" in sys.modules and "beanmachine" not in sys.modules:
    !pip install beanmachine
```
```python
import beanmachine.ppl as bm
import matplotlib.pyplot as plt
import torch
import torch.distributions as dist
from beanmachine.ppl.distributions import Flat
from beanmachine.ppl.inference.vi import MAP
```
```python
import os

# Plotting settings
plt.rc("figure", figsize=[8, 6])
plt.rc("font", size=14)
plt.rc("lines", linewidth=2.5)

# Manual seed
bm.seed(11)
torch.manual_seed(11)

# Other settings for the notebook.
smoke_test = "SANDCASTLE_NEXUS" in os.environ or "CI" in os.environ
```

## Data​

Consider data following the OLS model $Y \sim N([X; 1]^\top \beta_{\text{true}}, 1.0)$.

```python
N = 900
true_beta = torch.tensor([2.0, 5.0])


@bm.random_variable
def X():
    return dist.Normal(0, 1).expand((N,1))


def beta():
    return true_beta


@bm.random_variable
def Y():
    return dist.Normal((torch.cat([X(), torch.ones((N,1))], dim=1) @ beta()).squeeze(), 1.0)


data = bm.simulate([X(), Y()], num_samples=1).get_chain(0)
data_X, data_Y = data[X()], data[Y()]

xs = torch.linspace(-3, 3, steps=100)

fig, ax = plt.subplots()
ax.scatter(data_X, data_Y, label='data')
ax.legend()
```
## Maximum Likelihood Estimation (MLE)

By placing a Flat (uninformative) prior over $\beta$, the prior contributes only a constant to the log density, so the posterior is proportional to the likelihood. Therefore, we can use MAP to search for the MLE estimate.
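Concretely, for this Gaussian model the negative log posterior under a flat prior is, up to an additive constant, the least-squares objective (writing $X$ for the design matrix with an appended column of ones), so its minimizer is the OLS solution:

$$-\log p(\beta \mid X, y) = \tfrac{1}{2}\,\|y - X\beta\|^2 + \text{const}.$$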

```python
@bm.random_variable
def beta():
    return Flat(shape=(2,1))


@bm.random_variable
def Y():
    return dist.Normal((torch.cat([X(), torch.ones((N,1))], dim=1) @ beta()).squeeze(), 1.0)


v_world = MAP(
    queries=[beta()],
    observations={
        X(): data_X,
        Y(): data_Y,
    },
).infer(num_steps=1500)

with v_world:
    print(beta())
```
Out:
```
tensor([[1.9999],
        [4.9858]], requires_grad=True)
```
```python
with v_world, torch.no_grad():
    ax.plot(xs, torch.cat([xs.unsqueeze(1), torch.ones((len(xs),1))], dim=1) @ beta(), color='yellow', label='MLE')
ax.legend()
fig
```

MAP performs gradient descent on a quadratic potential centered at the standard OLS estimator, which is known to be given by the Moore-Penrose pseudo-inverse: $$\hat\beta_{OLS} = X^\dagger y = (X^\top X)^{-1} X^\top y$$

```python
beta_ols = torch.linalg.pinv(torch.cat([data_X, torch.ones((N,1))], dim=1)) @ data_Y
print(beta_ols)
```
Out:
```
tensor([2.0001, 5.0015])
```
```python
ax.plot(xs, torch.cat([xs.unsqueeze(1), torch.ones((len(xs),1))], dim=1) @ beta_ols, color='red', label='OLS')
ax.legend()
fig
```

## MAP Inference

To incorporate prior beliefs, such as $\beta \sim N(0, \sigma^2)$ in this example, the Flat uninformative prior is replaced with a prior distribution that concentrates most of its probability mass near zero ($\sigma = 0.1$ is small, i.e. the precision is high). This prior yields a posterior whose mode "shrinks" estimates away from the MLE and back towards the prior mean (in this case, both regression coefficients $\beta_i$ are shrunk towards $0$).

```python
sigma = 0.1


@bm.random_variable
def beta():
    return dist.Normal(0, sigma).expand((2,1))


@bm.random_variable
def Y():
    return dist.Normal((torch.cat([X(), torch.ones((N,1))], dim=1) @ beta()).squeeze(), 1.0)


v_world = MAP(
    queries=[beta()],
    observations={
        X(): data_X,
        Y(): data_Y,
    },
).infer(num_steps=1000)

with v_world:
    print(beta())
```
Out:
```
tensor([[1.7862],
        [4.4997]], requires_grad=True)
```
```python
with v_world, torch.no_grad():
    ax.plot(xs, torch.cat([xs.unsqueeze(1), torch.ones((len(xs),1))], dim=1) @ beta(), color='cyan', label='MAP')
ax.legend()
fig
```

It turns out some MAP estimates correspond to frequentist regularization techniques. In the case of linear regression, MAP inference with a Normal prior is equivalent to Tikhonov or $L_2$ (ridge) regularization with regularization parameter $\lambda = \sigma^{-2}$, the inverse of the prior variance: $$\hat{\beta}_{L_2,\lambda} = (X^\top X + \sigma^{-2} I)^{-1} X^\top y$$
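To see the equivalence, note that the negative log posterior is exactly the ridge objective, and setting its gradient to zero yields the closed form above:

$$-\log p(\beta \mid X, y) = \tfrac{1}{2}\|y - X\beta\|^2 + \tfrac{1}{2\sigma^2}\|\beta\|^2 + \text{const}, \qquad -X^\top (y - X\beta) + \sigma^{-2}\beta = 0 \;\Rightarrow\; \hat\beta = (X^\top X + \sigma^{-2} I)^{-1} X^\top y.$$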

```python
X_full = torch.cat([data_X, torch.ones((N,1))], dim=1)
beta_l2 = (torch.linalg.inv(X_full.T @ X_full + sigma**-2 * torch.eye(2)) @ X_full.T @ data_Y)
print(beta_l2)
```
Out:
```
tensor([1.7862, 4.4998])
```
```python
ax.plot(xs, torch.cat([xs.unsqueeze(1), torch.ones((len(xs),1))], dim=1) @ beta_l2, color='magenta', label='L2')
ax.legend()
fig
```
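As an optional illustration (not part of the original tutorial), sweeping the prior scale in the closed-form $L_2$ solution shows how the estimate moves from heavy shrinkage towards $0$ to the OLS solution as $\sigma$ grows, since $\lambda = \sigma^{-2}$; this reuses `X_full` and `data_Y` defined above.

```python
# Illustrative sketch: as sigma grows, lambda = sigma**-2 shrinks and the
# ridge/MAP estimate approaches the OLS solution.
for s in [0.01, 0.1, 1.0, 10.0]:
    b = torch.linalg.inv(X_full.T @ X_full + s ** -2 * torch.eye(2)) @ X_full.T @ data_Y
    print(f"sigma={s:5.2f}  beta={b.tolist()}")
```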