Hidden Markov Model
Tutorial: Hidden Markov modelβ
This tutorial demonstrates modeling and running inference on a hidden Markov model (HMM) in Bean Machine. The flexibility of this model allows us to demonstrate some of the great unique features of Bean Machine, such as block inference, compositional inference, and separation of data from the model.
Problemβ
HMMs are a class of probabilistic models which are popular for doing inference on discrete-time stochastic processes. In general, Markov models are used to study a sequence of random variables, , where the sequence is "memoryless" such that the distribution of depends only on the value of ; any sequence which is memoryless is said to satisfy the Markov property. One reason Markov models are popular is because this flexible framework can be used to model many time-evolving processes, such as words in a sentence, the position of a robot, or the weather.
An HMM is a Markov model in which observations are modeled as being noisy. While we are interested in doing inference on each , we are actually observing variables which can depend on the values of in a variety of ways. In specific settings, HMMs can be very tractable, and lend themselves towards provably-optimal algorithms such as Kalman Filtering. Here, we illustrate how to do general inference with MCMC as applicable to general HMMs. The single-site algorithms underlying Bean Machine enable inference algorithms which scale favorably with the size of the HMM.
Prerequisitesβ
Let's code this in Bean Machine! Import the Bean Machine library, some fundamental PyTorch classes, and optionally typing for our code.
# Install Bean Machine in Colab if using Colab.
import sys
if "google.colab" in sys.modules and "beanmachine" not in sys.modules:
!pip install beanmachine
import logging
import os
import warnings
import beanmachine.ppl as bm
import matplotlib.pyplot as plt
import torch
import torch.distributions as dist
The next cell includes convenient configuration settings to improve the notebook presentation as well as setting a manual seed for reproducibility.
# Eliminate excess inference logging from Bean Machine, except for progress bars.
logging.getLogger("beanmachine").setLevel(50)
warnings.filterwarnings("ignore")
# Manual seed
bm.seed(111)
# Other settings for the notebook.
smoke_test = "SANDCASTLE_NEXUS" in os.environ or "CI" in os.environ
Modelβ
We model the hidden state as being a discrete-time Markov chain with states and transition matrix (a.k.a. kernel) . We model time steps of this chain with variables , and use variables to model observable emissions of each with Gaussian noise.
Formally, the transition and emission probabilities are as follows, for :
Accordingly, priors can be assigned as follows, for :
Finally, assume that the value of is known:
So the model is set by choosing the prior through the values of: . ( stands for concentration).
We can implement this model in Bean Machine by defining random variable objects with the
@bm.random_variable
decorator. These functions behave differently than ordinary Python
functions.
@bm.random_variable
functions:- They must return PyTorch
Distribution
objects. - Though they return distributions, callees actually receive samples from the distribution. The machinery for obtaining samples from distributions is handled internally by Bean Machine.
- Inference runs the model through many iterations. During a particular inference iteration, a distinct random variable will correspond to exactly one sampled value: calls to the same random variable function with the same arguments will receive the same sampled value within one inference iteration. This makes it easy for multiple components of your model to refer to the same logical random variable.
- Consequently, to define distinct random variables that correspond to different sampled values during a particular inference iteration, an effective practice is to add a dummy "indexing" parameter to the function. Distinct random variables can be referred to with different values for this index.
- Please see the documentation for more information about this decorator.
Note also that, compared to the statistical notation above, our implementation uses 0-indexing instead of 1-indexing.
class HiddenMarkovModel:
def __init__(
self,
N: int,
K: int,
concentration: float,
mu_loc: float,
mu_scale: float,
sigma_shape: float,
sigma_rate: float,
) -> None:
self.N = N
self.K = K
self.concentration = concentration
self.mu_loc = mu_loc
self.mu_scale = mu_scale
self.sigma_shape = sigma_shape
self.sigma_rate = sigma_rate
@bm.random_variable
def Theta(self, k):
return dist.Dirichlet(torch.ones(self.K) * self.concentration / self.K)
@bm.random_variable
def Mu(self, k):
return dist.Normal(self.mu_loc, self.mu_scale)
@bm.random_variable
def Sigma(self, k):
return dist.Gamma(self.sigma_shape, self.sigma_rate)
@bm.random_variable
def X(self, n: int):
if n == 0:
return dist.Categorical(torch.tensor([1.0] + [0.0] * (self.K - 1)))
else:
return dist.Categorical(self.Theta(self.X(n - 1).item()))
@bm.random_variable
def Y(self, n: int):
return dist.Normal(self.Mu(self.X(n).item()), self.Sigma(self.X(n).item()))
Dataβ
First, we will generate random observations and choose the priors.
def generate_chain_observations(theta, mus, sigmas, N):
theta_distbns = {j: dist.Categorical(vector) for j, vector in enumerate(theta)}
hidden = [0]
while len(hidden) < N:
hidden.append(theta_distbns[hidden[-1]].sample().item())
def observe(k):
return dist.Normal(mus[k], sigmas[k]).sample().item()
return hidden, list(map(observe, hidden))
concentration = 1.1
mu_loc = 0.0
mu_scale = 5.0
sigma_shape = 0.5
sigma_rate = 1.0
N = 15
K = 2
thetas_obs = dist.Dirichlet(torch.ones(K) * concentration / K).sample((K,))
mus_obs = dist.Normal(mu_loc, mu_scale).sample((K,))
sigmas_obs = dist.Gamma(sigma_shape, sigma_rate).sample((K,))
xs_obs, ys_obs = generate_chain_observations(thetas_obs, mus_obs, sigmas_obs, N)
x_obs = torch.tensor(xs_obs)
y_obs = torch.tensor(ys_obs)
# Initialize model
model = HiddenMarkovModel(
N,
K,
concentration,
mu_loc,
mu_scale,
sigma_shape,
sigma_rate,
)
Inferenceβ
Inference is the process of combining model with data to obtain insights, in the form of probability distributions over values of interest. Bean Machine offers a powerful and general inference framework to enable fitting arbitrary models to data.
Our model makes use of both continuous and discrete random variables. We'll want to make
use of different inference strategies for each. In particular, we would like to take
advantage of gradient information for the continuous random variables. To facilitate
this, Bean Machine provides the CompositionalInference
class.
CompositionalInference
is a powerful, flexible class for configuring inference in a
variety of ways. By default, CompositionalInference
will select an inference method
for each random variable that is appropriate based on its support. The HMM presented in
this tutorial has a number of different random variables that we're interested in
learning about, and we will customize our own. The proposers for each family of random
variables are summarized in the table below:
Random variable | Support | Inference method |
---|---|---|
Uniform Metropolis Hastings | ||
Newtonian Monte Carlo (real space proposer) | ||
Newtonian Monte Carlo (half space proposer) |
You can learn more about compositional inference in our framework topics.
Normally, this is all you would have to do! However, the HMM model has meaningful
structure that we would like to consider when configuring our inference strategy. In
particular, the hidden state of each time step, , is highly correlated with the
state . Thus, we would also like to jointly propose new values for all β it
is very likely that the value of a hidden state becomes invalid and needs to be
recomputed after is updated. Similarly, we would like to update the location
and the scale of the hidden states jointly as well. In order to update
these variables jointly, we can configure CompositionalInference
to "block" the random
variables together.
CompositionalInference
accepts a dictionary that maps families of random variable to
the corresponding algorithm, which allow you to override the default inference method
for a particular subset of nodes, or group some of them into a block. To define a block,
we simply need to pass CompositionalInference
a tuple containing all random variable
families that we want to propose jointly as a key. In our case, since we don't want to
override the default inference method, we can use ...
instead of providing an
inference class for the block.
compositional = bm.CompositionalInference(
{(model.X): ..., (model.Sigma, model.Mu): bm.SingleSiteNewtonianMonteCarlo()}
)
You may notice that we are using what we referred to as "random variable families" such
as model.X
as keys, which are essentially functions that generates the random
variables, instead of using instantiated random variable like model.X(0)
and
model.X(1)
. This is because often times the number of random variables is not known
ahead of time until an inference starts with some data (you can even have an unbounded
number of nodes for some models). By using random variable families in the config, we no
longer need to explicitly spell out all instances of the random variables and group them
in a huge tuple.
The next step is to define the queries and observations. For this particular run, we're interested in inferring , , and .
queries = (
[model.X(n) for n in range(1, model.N)]
+ [model.Mu(k) for k in range(model.K)]
+ [model.Sigma(k) for k in range(model.K)]
)
observations = {
model.X(0): torch.tensor(0.0),
**{model.Y(n): y_obs[n] for n in range(model.N)},
**{model.Theta(k): thetas_obs[k] for k in range(model.K)},
}
Running inference consists of a few arguments:
Name | Usage |
---|---|
queries | List of @bm.random_variable targets to fit posterior distributions for. |
observations | A dictionary of observations, as built above. |
num_samples | Number of Monte Carlo samples to approximate the posterior distributions for the variables in queries . |
num_chains | Number of separate inference runs to use. Multiple chains can help verify that inference ran correctly. |
num_samples = 400 if not smoke_test else 1
num_chains = 1
samples = compositional.infer(
queries=queries,
observations=observations,
num_samples=num_samples,
num_chains=num_chains,
)
samples = samples.get_chain(0)
# X(0) is observed and is not part of query
x_samples = torch.stack(
[torch.zeros(num_samples)] + [samples[model.X(n)] for n in range(1, model.N)], dim=1
)
mu_samples = torch.stack([samples[model.Mu(k)] for k in range(model.K)]).T
sigma_samples = torch.stack([samples[model.Sigma(k)] for k in range(model.K)]).T
Visualizationβ
We will look at the values of the samples collected for and . We will take the mean of samples taken over the last 10% of the chain, and compare these to our synthetic data.
tail_len = num_samples // 10
xs_tail_plot = x_samples[-tail_len:].mean(0)
plt.scatter(
range(len(xs_tail_plot)),
xs_tail_plot,
alpha=0.5,
label="Inferred",
c="b",
)
plt.scatter(range(len(xs_obs)), xs_obs, alpha=0.5, label="Ground truth", c="r")
plt.yticks(range(K))
plt.title("Values of X_1 ... X_N")
plt.xlabel("n")
plt.ylabel("Value of X_n")
plt.legend()
plt.show()
mus_tail_plot = mu_samples[-tail_len:].mean(0)
plt.scatter(range(K), mus_tail_plot, alpha=0.5, label="Inferred", c="b")
plt.scatter(range(K), mus_obs, alpha=0.5, label="Ground truth", c="r")
plt.xticks(range(K))
plt.title("Values of mu_1 ... mu_K")
plt.xlabel("k")
plt.ylabel("Value of mu_k")
plt.legend()
plt.show()
These plots indicate that inference seems to be recovering hidden states well, and is computing reasonably accurate values for .
Posterior likelihood checksβ
One way to evaluate posterior samples is computing likelihoods of our posterior samples, and comparing these to the likelihood of the underlying synthetic data. Formally, we can compute the joint likelihood of posterior samples with the observations used to generate the samples. And similarly, we can compute the joint likelihood of the observations with the underlying synthetic data which was generated at the same time as the observations.
def log_likelihood(xs, ys, thetas, mus, sigmas, N):
"""Returns the log likelihood of the HMM model conditioned on the data"""
result = 0
# transition probabilities
for n in range(1, N):
result += torch.log(thetas[xs[n - 1], xs[n]])
# emission probabilities
for n in range(N):
result += dist.Normal(mus[xs[n]], sigmas[xs[n]]).log_prob(ys[n])
return result
# computes the log likelihood of the HMM model per iteration
ppcs = [
log_likelihood(x, y_obs, thetas_obs, mu, sigma, N)
for x, mu, sigma in zip(x_samples.int(), mu_samples, sigma_samples)
]
plt.figure(figsize=(12, 6))
plt.plot(ppcs, label="Sample", c="g")
# plotting the ground truth for reference
plt.plot(
[log_likelihood(x_obs, y_obs, thetas_obs, mus_obs, sigmas_obs, N)] * num_samples,
label="Grond truth",
c="r",
)
plt.ylabel("Log likelihood")
plt.legend()
plt.show()
From the above plot, inference appears to be doing a good job of fitting the random variables given the observed data. Inference appears to converge with a log likelihood scores near to those generated by the ground truth parameters.