Tutorial: Probabilistic PCA
This tutorial demonstrates the Probabilistic Principal Component Analysis (PPCA, Tipping & Bishop 1998) dimensionality reduction technique using Bean Machine's variational inference (VI) implementation.
Given $N$ pieces of $D$-dimensional data $x_n \in \mathbb{R}^D$, we are interested in finding a $K$-dimensional subspace (represented by the columns of a matrix $W \in \mathbb{R}^{D \times K}$) which summarizes the variation in the data.
Prerequisites
# Install Bean Machine in Colab if using Colab.
import sys
if "google.colab" in sys.modules and "beanmachine" not in sys.modules:
!pip install beanmachine
import beanmachine.ppl as bm
import matplotlib.pyplot as plt
import torch
import torch.distributions as dist
from beanmachine.ppl.inference.vi import ADVI, MAP, VariationalInfer
import os
# Plotting settings
plt.rc("figure", figsize=[8, 6])
plt.rc("font", size=14)
plt.rc("lines", linewidth=2.5)
# Manual seed
bm.seed(11)
torch.manual_seed(11)
# Other settings for the notebook.
smoke_test = "SANDCASTLE_NEXUS" in os.environ or "CI" in os.environ
Generative model and example data
The PPCA generative model is defined by
- Latent variables $z_n \sim \mathcal{N}(0, I_K)$ for $n = 1, \dots, N$
- Principal axes $W_{dk} \sim \mathcal{N}(0, 2^2)$ for $d = 1, \dots, D$, $k = 1, \dots, K$
- Data likelihood $x_n \mid z_n, W \sim \mathcal{N}(W z_n, \sigma^2 I_D)$
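For reference, the joint density implied by this model factorizes as

$$
p(x, z, W) = p(W) \prod_{n=1}^{N} p(z_n)\, p(x_n \mid z_n, W),
$$

which is the quantity the inference methods below operate on: MAP maximizes it in $(z, W)$, while variational inference uses it to form the ELBO.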
In the following cell, we use Bean Machine's `simulate()` API to generate example data from this generative model.
N = 5000 # number of data points
D = 2 # data dimensionality
K = 1 # latent dimensionality
sigma = 1. # observation noise scale
@bm.random_variable
def w():
return dist.Normal(0., 2.).expand((K,D))
@bm.random_variable
def z():
return dist.Normal(0., 1.).expand((N,K))
@bm.random_variable
def x():
return dist.Normal(z() @ w(), sigma)
train_data = bm.simulate(queries=[w(), z(), x()], num_samples=1).get_chain(0)
print(f"True principal axes:\n{train_data[w()]}")
Since $D = 2$, we can also visualize the data in a scatterplot. We also show the true $1$-dimensional (because $K = 1$) principal component used to generate the example data.
fig, ax = plt.subplots()
ax.scatter(train_data[x()][:,0], train_data[x()][:,1], color='blue', alpha=0.1)
ax.arrow(0, 0, *train_data[w()].flatten(), linewidth=2, color='red', head_width=0.3, label='True PC1')
ax.axis([-10, 10, -10, 10])
ax.legend()
To recover classical PCA, marginalize out the latent variables $z_n$ to find $x_n \sim \mathcal{N}(0, WW^\top + \sigma^2 I_D)$, and notice that as the observation noise $\sigma \to 0$ the maximum likelihood estimate for $W$ is the classical PCA objective: the best (in $L_2$-norm sense) rank-$K$ approximation to the empirical covariance $\frac{1}{N} \sum_{n} x_n x_n^\top$.
This can be solved through singular value decomposition (SVD), and in the following cell we compute and visualize the classical PCA estimate for this example.
ax.arrow(0, 0, *train_data[x()].svd().V[:,0], linewidth=2, color='yellow', head_width=0.3, label='PCA_SVD(k=1)')
ax.legend()
fig
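As an optional numeric check (a small sketch reusing the quantities above), the SVD direction should be nearly parallel to the true axis, so their cosine similarity should be close to $\pm 1$ (the sign of a principal direction is arbitrary).
# Optional check: cosine similarity between the SVD direction and the true axis.
pc_svd = train_data[x()].svd().V[:, 0]  # unit-norm leading right singular vector
pc_true = train_data[w()].flatten()     # true (unnormalized) principal axes
cos_sim = torch.dot(pc_svd, pc_true) / pc_true.norm()
print(f"Cosine similarity with the true axis: {cos_sim.item():.3f}")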
Variational Model
Classical PCA does not utilize the prior distribution $p(W)$. In this synthetic setting, where the prior is known to be part of the data generating process, this is a missed opportunity!
To incorporate the prior in a Bayesian framework, we consider PPCA as a Bayesian inference problem with target posterior $p(z, W \mid x)$.
Maximum A Posteriori (MAP) Inference
Here we use maximum a posteriori (MAP) inference to find the values of the latent variables $z$ and $W$ which maximize the posterior $p(z, W \mid x)$. MAP inference can be a good choice if you need a Bayesian treatment which incorporates prior distributions but do not need uncertainty estimates.
It generalizes across different choices of prior distributions, and it returns a point estimate which trades off between the prior $p(z, W)$ and the likelihood $p(x \mid z, W)$.
As a gradient descent algorithm, it is slower than classical PCA (which can utilize optimized linear algebra libraries for the SVD).
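Concretely, for this model MAP solves

$$
\hat{z}, \hat{W} = \arg\max_{z, W}\; \log p(x \mid z, W) + \log p(z) + \log p(W),
$$

i.e. it maximizes the log-joint rather than just the log-likelihood.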
v_world = MAP(
queries=[w(), z()],
observations={x(): train_data[x()]},
).infer(
num_steps=1000,
)
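We can also read off the resulting point estimate directly and compare it with the true principal axes printed earlier; it should point along the same direction, possibly with the sign flipped.
# Inspect the MAP point estimate of the principal axes (sign may be flipped).
w_map = v_world.get_guide_distribution(w()).mean.detach()
print(f"MAP estimate of the principal axes:\n{w_map}")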
ax.arrow(0, 0, *v_world.get_guide_distribution(w()).mean.flatten().detach(), linewidth=2, color='green', head_width=0.3, label='PCA_MAP(k=1)')
ax.legend()
fig
Automatic Differentiation Variational Inference (ADVI)
If posterior uncertainty estimates are required and a mean-field Gaussian factorization of the posterior approximation is acceptable, then ADVI can be an appropriate choice.
As a pedagogical exercise, we first show how ADVI can be expressed using Bean Machine's `VariationalInfer` API. This is a lower-level API which is useful when control over variational parameter initialization and non-trivial factorizations of guide distributions are required.
q_z_loc = bm.param(lambda: torch.zeros((N,K)))
q_z_scale = bm.param(lambda: torch.zeros((N,K)))
q_w_loc = bm.param(lambda: torch.zeros((K,D)))
q_w_scale = bm.param(lambda: torch.zeros((K,D)))
@bm.random_variable
def q_z():
return dist.Normal(q_z_loc(), torch.nn.functional.softplus(q_z_scale()))
@bm.random_variable
def q_w():
return dist.Normal(q_w_loc(), torch.nn.functional.softplus(q_w_scale()))
v_world = VariationalInfer(
queries_to_guides={
z(): q_z(),
w(): q_w(),
},
observations={
x(): train_data[x()],
},
).infer(
num_steps=1000,
)
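Since the guides were defined as `Normal` distributions, the fitted guide's mean and standard deviation give the approximate posterior location and scale; for example, for the principal axes $W$:
# Inspect the fitted mean-field guide for the principal axes.
q_w_fitted = v_world.get_guide_distribution(w())
print(f"q(w) mean:\n{q_w_fitted.mean.detach()}")
print(f"q(w) stddev:\n{q_w_fitted.stddev.detach()}")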
As ADVI yields a distributional estimate over the principal components $W$, below we plot 100 samples from the ADVI variational approximation $q(W)$.
fig, ax = plt.subplots()
ax.scatter(train_data[x()][:,0], train_data[x()][:,1], color='blue', alpha=0.1)
ax.arrow(0, 0, *train_data[w()].flatten(), linewidth=2, color='red', head_width=0.3, label='True PC1')
ax.axis([-10, 10, -10, 10])
for _ in range(100):
ax.arrow(0, 0, *2*v_world.get_guide_distribution(w()).sample().flatten().detach(), linewidth=2, color='yellow', head_width=0.3, alpha=0.05, label='PCA ADVI' if _ == 0 else None)
ax.axis([-10, 10, -10, 10])
ax.legend()
In practice, we recommend the `AutoGuide` API for convenience when mean-field assumptions such as those used for ADVI are appropriate. The following cell shows how to automatically define and use ADVI guides for variational inference.
v_world = ADVI(
queries=[w(), z()],
observations={x(): train_data[x()]},
).infer(
num_steps=1000,
)
Visualization
We visualize the variational guide distribution $q(W)$ obtained for the principal component $W$. It should be aligned (up to scaling and sign) with the true principal component.
fig, ax = plt.subplots()
ax.scatter(train_data[x()][:,0], train_data[x()][:,1], color='blue', alpha=0.05)
ax.arrow(0, 0, *train_data[w()].flatten(), linewidth=2, color='red', head_width=0.3, label='True PC1')
for _ in range(100):
ax.arrow(0, 0, *2*v_world.get_guide_distribution(w()).sample().flatten().detach(), linewidth=2, color='yellow', head_width=0.5, alpha=0.05, label='PCA ADVI' if _ == 0 else None)
ax.axis([-10, 10, -10, 10])
ax.legend()
As another check, we verify that the variational approximations generate data similar to the original sample from the generative model.
fig, ax = plt.subplots()
ax.scatter(train_data[x()][:,0], train_data[x()][:,1], color='blue', alpha=0.05, label='true data')
with torch.no_grad():
x_gen = dist.Normal(
loc=v_world.get_guide_distribution(z()).sample() @ v_world.get_guide_distribution(w()).sample(),
scale=sigma,
).sample()
ax.scatter(x_gen[:,0], x_gen[:,1], color='red', alpha=0.05, label='generated')
ax.axis([-10, 10, -10, 10])
ax.legend()
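For a rough quantitative comparison (a small sketch assuming a PyTorch version that provides `torch.cov`), we can also compare empirical covariance matrices: under the PPCA model the data covariance is $WW^\top + \sigma^2 I$, so the two should be close.
# Compare empirical covariances of observed vs. generated data (torch.cov expects variables in rows).
with torch.no_grad():
    print(f"Empirical covariance of observed data:\n{torch.cov(train_data[x()].T)}")
    print(f"Empirical covariance of generated data:\n{torch.cov(x_gen.T)}")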