Hidden Markov Model

Tutorial: Hidden Markov model

This tutorial demonstrates modeling and running inference on a hidden Markov model (HMM) in Bean Machine. The flexibility of this model allows us to demonstrate some of the great unique features of Bean Machine, such as block inference, compositional inference, and separation of data from the model.

Problem

HMMs are a class of probabilistic models which are popular for doing inference on discrete-time stochastic processes. In general, Markov models are used to study a sequence of random variables, $X_1,\ldots,X_N$ , where the sequence is "memoryless" such that the distribution of $X_{n}$ depends only on the value of $X_{n-1}$ ; any sequence which is memoryless is said to satisfy the Markov property. One reason Markov models are popular is because this flexible framework can be used to model many time-evolving processes, such as words in a sentence, the position of a robot, or the weather.

An HMM is a Markov model in which observations are modeled as being noisy. While we are interested in doing inference on each $X_n$ , we are actually observing variables $Y_1,\ldots,Y_N$ which can depend on the values of $X_1,\ldots,X_N$ in a variety of ways. In specific settings, HMMs can be very tractable, and lend themselves towards provably-optimal algorithms such as Kalman Filtering. Here, we illustrate how to do general inference with MCMC as applicable to general HMMs. The single-site algorithms underlying Bean Machine enable inference algorithms which scale favorably with the size of the HMM.

Prerequisites

Let's code this in Bean Machine! Import the Bean Machine library, some fundamental PyTorch classes, and optionally typing for our code.

# Install Bean Machine in Colab if using Colab.
import sys


if "google.colab" in sys.modules and "beanmachine" not in sys.modules:
    !pip install beanmachine

import logging
import os
import warnings

import beanmachine.ppl as bm
import matplotlib.pyplot as plt
import torch
import torch.distributions as dist

The next cell includes convenient configuration settings to improve the notebook presentation as well as setting a manual seed for reproducibility.

# Eliminate excess inference logging from Bean Machine, except for progress bars.
logging.getLogger("beanmachine").setLevel(50)
warnings.filterwarnings("ignore")

# Manual seed
bm.seed(111)

# Other settings for the notebook.
smoke_test = "SANDCASTLE_NEXUS" in os.environ or "CI" in os.environ

Model

We model the hidden state $X$ as being a discrete-time Markov chain with $K$ states and transition matrix (a.k.a. kernel) $\Theta$ . We model $N$ time steps of this chain with variables $X_1,\ldots,X_N$ , and use variables $Y_1,\ldots,Y_N$ to model observable emissions of each $X$ with Gaussian noise.

Formally, the transition and emission probabilities are as follows, for $n\in1,\ldots,N$ :

$X_{n+1}\mid X_n\sim\text{Categorical}(\Theta[X_n])$
$Y_n\mid X_n\sim\text{Normal}(\mu[X_n],\sigma[X_n])$

Accordingly, priors can be assigned as follows, for $k\in1,\ldots,K$ :

$\mu[k]\sim\text{Normal}(\mu_\text{loc},\mu_\text{scale})$
$\sigma[k]\sim\text{Gamma}(\sigma_\text{shape},\sigma_\text{rate})$
$\Theta[k]\sim\text{Dirichlet}([\frac{c}{K},\ldots,\frac{c}{K}])$

Finally, assume that the value of $X_1$ is known:

$X_1=0$

So the model is set by choosing the prior through the values of: $\mu_\text{loc},\mu_\text{scale},\sigma_\text{shape},\sigma_\text{rate},c$ . ( $c$ stands for concentration).

We can implement this model in Bean Machine by defining random variable objects with the @bm.random_variable decorator. These functions behave differently than ordinary Python functions.

Semantics for @bm.random_variable functions:

They must return PyTorch Distribution objects.
Though they return distributions, callees actually receive samples from the distribution. The machinery for obtaining samples from distributions is handled internally by Bean Machine.
Inference runs the model through many iterations. During a particular inference iteration, a distinct random variable will correspond to exactly one sampled value: calls to the same random variable function with the same arguments will receive the same sampled value within one inference iteration. This makes it easy for multiple components of your model to refer to the same logical random variable.
Consequently, to define distinct random variables that correspond to different sampled values during a particular inference iteration, an effective practice is to add a dummy "indexing" parameter to the function. Distinct random variables can be referred to with different values for this index.
Please see the documentation for more information about this decorator.

Note also that, compared to the statistical notation above, our implementation uses 0-indexing instead of 1-indexing.

class HiddenMarkovModel:
    def __init__(
        self,
        N: int,
        K: int,
        concentration: float,
        mu_loc: float,
        mu_scale: float,
        sigma_shape: float,
        sigma_rate: float,
    ) -> None:
        self.N = N
        self.K = K

        self.concentration = concentration
        self.mu_loc = mu_loc
        self.mu_scale = mu_scale
        self.sigma_shape = sigma_shape
        self.sigma_rate = sigma_rate

    @bm.random_variable
    def Theta(self, k):
        return dist.Dirichlet(torch.ones(self.K) * self.concentration / self.K)

    @bm.random_variable
    def Mu(self, k):
        return dist.Normal(self.mu_loc, self.mu_scale)

    @bm.random_variable
    def Sigma(self, k):
        return dist.Gamma(self.sigma_shape, self.sigma_rate)

    @bm.random_variable
    def X(self, n: int):
        if n == 0:
            return dist.Categorical(torch.tensor([1.0] + [0.0] * (self.K - 1)))
        else:
            return dist.Categorical(self.Theta(self.X(n - 1).item()))

    @bm.random_variable
    def Y(self, n: int):
        return dist.Normal(self.Mu(self.X(n).item()), self.Sigma(self.X(n).item()))

Data

First, we will generate random observations and choose the priors.

def generate_chain_observations(theta, mus, sigmas, N):
    theta_distbns = {j: dist.Categorical(vector) for j, vector in enumerate(theta)}
    hidden = [0]
    while len(hidden) < N:
        hidden.append(theta_distbns[hidden[-1]].sample().item())

    def observe(k):
        return dist.Normal(mus[k], sigmas[k]).sample().item()

    return hidden, list(map(observe, hidden))

concentration = 1.1
mu_loc = 0.0
mu_scale = 5.0
sigma_shape = 0.5
sigma_rate = 1.0

N = 15
K = 2

thetas_obs = dist.Dirichlet(torch.ones(K) * concentration / K).sample((K,))
mus_obs = dist.Normal(mu_loc, mu_scale).sample((K,))
sigmas_obs = dist.Gamma(sigma_shape, sigma_rate).sample((K,))


xs_obs, ys_obs = generate_chain_observations(thetas_obs, mus_obs, sigmas_obs, N)
x_obs = torch.tensor(xs_obs)
y_obs = torch.tensor(ys_obs)

# Initialize model
model = HiddenMarkovModel(
    N,
    K,
    concentration,
    mu_loc,
    mu_scale,
    sigma_shape,
    sigma_rate,
)

Inference

Inference is the process of combining model with data to obtain insights, in the form of probability distributions over values of interest. Bean Machine offers a powerful and general inference framework to enable fitting arbitrary models to data.

Our model makes use of both continuous and discrete random variables. We'll want to make use of different inference strategies for each. In particular, we would like to take advantage of gradient information for the continuous random variables. To facilitate this, Bean Machine provides the CompositionalInference class.

CompositionalInference is a powerful, flexible class for configuring inference in a variety of ways. By default, CompositionalInference will select an inference method for each random variable that is appropriate based on its support. The HMM presented in this tutorial has a number of different random variables that we're interested in learning about, and we will customize our own. The proposers for each family of random variables are summarized in the table below:

Random variable	Support	Inference method
$X$	$0,\ldots,K-1$	Uniform Metropolis Hastings
$\mu$	$(-\infty,\infty)$	Newtonian Monte Carlo (real space proposer)
$\sigma$	$[0,\infty)$	Newtonian Monte Carlo (half space proposer)

You can learn more about compositional inference in our framework topics.

Normally, this is all you would have to do! However, the HMM model has meaningful structure that we would like to consider when configuring our inference strategy. In particular, the hidden state of each time step, $X_n$ , is highly correlated with the state $X_{n-1}$ . Thus, we would also like to jointly propose new values for all $X$ — it is very likely that the value of a hidden state $X_n$ becomes invalid and needs to be recomputed after $X_{n-1}$ is updated. Similarly, we would like to update the location $\mu$ and the scale $\sigma$ of the hidden states jointly as well. In order to update these variables jointly, we can configure CompositionalInference to "block" the random variables together.

CompositionalInference accepts a dictionary that maps families of random variable to the corresponding algorithm, which allow you to override the default inference method for a particular subset of nodes, or group some of them into a block. To define a block, we simply need to pass CompositionalInference a tuple containing all random variable families that we want to propose jointly as a key. In our case, since we don't want to override the default inference method, we can use ... instead of providing an inference class for the block.

compositional = bm.CompositionalInference(
    {(model.X): ..., (model.Sigma, model.Mu): bm.SingleSiteNewtonianMonteCarlo()}
)

You may notice that we are using what we referred to as "random variable families" such as model.X as keys, which are essentially functions that generates the random variables, instead of using instantiated random variable like model.X(0) and model.X(1). This is because often times the number of random variables is not known ahead of time until an inference starts with some data (you can even have an unbounded number of nodes for some models). By using random variable families in the config, we no longer need to explicitly spell out all instances of the random variables and group them in a huge tuple.

The next step is to define the queries and observations. For this particular run, we're interested in inferring $X$ , $\mu$ , and $\sigma$ .

queries = (
    [model.X(n) for n in range(1, model.N)]
    + [model.Mu(k) for k in range(model.K)]
    + [model.Sigma(k) for k in range(model.K)]
)

observations = {
    model.X(0): torch.tensor(0.0),
    **{model.Y(n): y_obs[n] for n in range(model.N)},
    **{model.Theta(k): thetas_obs[k] for k in range(model.K)},
}

Running inference consists of a few arguments:

Name	Usage
`queries`	List of `@bm.random_variable` targets to fit posterior distributions for.
`observations`	A dictionary of observations, as built above.
`num_samples`	Number of Monte Carlo samples to approximate the posterior distributions for the variables in `queries`.
`num_chains`	Number of separate inference runs to use. Multiple chains can help verify that inference ran correctly.

num_samples = 400 if not smoke_test else 1
num_chains = 1

samples = compositional.infer(
    queries=queries,
    observations=observations,
    num_samples=num_samples,
    num_chains=num_chains,
)

samples = samples.get_chain(0)

Out:

Samples collected:   0%|          | 0/400 [00:00<?, ?it/s]

# X(0) is observed and is not part of query
x_samples = torch.stack(
    [torch.zeros(num_samples)] + [samples[model.X(n)] for n in range(1, model.N)], dim=1
)
mu_samples = torch.stack([samples[model.Mu(k)] for k in range(model.K)]).T
sigma_samples = torch.stack([samples[model.Sigma(k)] for k in range(model.K)]).T

Visualization

We will look at the values of the samples collected for $X$ and $\mu$ . We will take the mean of samples taken over the last 10% of the chain, and compare these to our synthetic data.

tail_len = num_samples // 10

xs_tail_plot = x_samples[-tail_len:].mean(0)
plt.scatter(
    range(len(xs_tail_plot)),
    xs_tail_plot,
    alpha=0.5,
    label="Inferred",
    c="b",
)
plt.scatter(range(len(xs_obs)), xs_obs, alpha=0.5, label="Ground truth", c="r")
plt.yticks(range(K))
plt.title("Values of X_1 ... X_N")
plt.xlabel("n")
plt.ylabel("Value of X_n")
plt.legend()
plt.show()

mus_tail_plot = mu_samples[-tail_len:].mean(0)
plt.scatter(range(K), mus_tail_plot, alpha=0.5, label="Inferred", c="b")
plt.scatter(range(K), mus_obs, alpha=0.5, label="Ground truth", c="r")
plt.xticks(range(K))
plt.title("Values of mu_1 ... mu_K")
plt.xlabel("k")
plt.ylabel("Value of mu_k")
plt.legend()
plt.show()

These plots indicate that inference seems to be recovering hidden states well, and is computing reasonably accurate values for $\mu$ .

Posterior likelihood checks

One way to evaluate posterior samples is computing likelihoods of our posterior samples, and comparing these to the likelihood of the underlying synthetic data. Formally, we can compute the joint likelihood of posterior samples with the observations used to generate the samples. And similarly, we can compute the joint likelihood of the observations with the underlying synthetic data which was generated at the same time as the observations.

def log_likelihood(xs, ys, thetas, mus, sigmas, N):
    """Returns the log likelihood of the HMM model conditioned on the data"""
    result = 0
    # transition probabilities
    for n in range(1, N):
        result += torch.log(thetas[xs[n - 1], xs[n]])
    # emission probabilities
    for n in range(N):
        result += dist.Normal(mus[xs[n]], sigmas[xs[n]]).log_prob(ys[n])
    return result


# computes the log likelihood of the HMM model per iteration
ppcs = [
    log_likelihood(x, y_obs, thetas_obs, mu, sigma, N)
    for x, mu, sigma in zip(x_samples.int(), mu_samples, sigma_samples)
]

plt.figure(figsize=(12, 6))
plt.plot(ppcs, label="Sample", c="g")
# plotting the ground truth for reference
plt.plot(
    [log_likelihood(x_obs, y_obs, thetas_obs, mus_obs, sigmas_obs, N)] * num_samples,
    label="Grond truth",
    c="r",
)
plt.ylabel("Log likelihood")
plt.legend()
plt.show()

From the above plot, inference appears to be doing a good job of fitting the random variables given the observed data. Inference appears to converge with a log likelihood scores near to those generated by the ground truth parameters.

Tutorial: Hidden Markov model​

Problem​

Prerequisites​

Model​

Data​

Inference​

Visualization​

Posterior likelihood checks​