
Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) is a sampling algorithm for differentiable random variables which uses Hamiltonian dynamics. By randomly drawing a momentum for the kinetic energy and treating the true posterior as the potential energy, HMC is able to simulate trajectories which explore the space. Intuitively, this can be viewed as starting with a marble at a point inside a bowl, flicking the marble in a random direction, and then following the marble as it rolls around. The position of the marble represents the sample, the flick represents the momentum, and the shape of the bowl in combination with the force of gravity represents our true posterior.

Hamiltonian dynamics

HMC applies Hamiltonian dynamics to explore the state space. This means that we think of the posterior surface as having some potential energy, proportional to the posterior's negative log likelihood at some particular set of parameter assignments. We use that potential energy, along with kinetic energy injected in the form of a random momentum "kick", to traverse the surface according to the laws of physics.

Below, we'll use $q$ to represent the current set of parameter assignments, and $p$ to represent a random momentum factor injected into the system by the inference method. The Hamiltonian physics alluded to above can be written as

$$H(q, p) = U(q) + K(p),$$

where $U$ and $K$ represent the potential and kinetic energy, respectively.

The potential energy represents the shape of the posterior distribution, and is defined by

$$U(q) = -\log[\pi(q)\,L(q \mid D)].$$

Here, $\pi(q)$ is the prior density at the parameter assignments $q$, and $L(q \mid D)$ is the likelihood of the model evaluated at $q$, conditioned on the observed dataset $D$.

The potential energy can be evaluated directly in Bean Machine as a function of the model's posterior probability at the current parameter values.
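As a rough illustration, the potential energy for a toy model with a single Normal parameter could be computed as follows; the potential_energy function and the model itself are hypothetical and only meant to mirror the formula above, not Bean Machine internals.

```python
import torch
import torch.distributions as dist

def potential_energy(q: torch.Tensor, data: torch.Tensor) -> torch.Tensor:
    """U(q) = -log[pi(q) * L(q | D)] for a toy model with one Normal parameter."""
    prior = dist.Normal(0.0, 1.0)        # pi(q): prior over the parameter
    likelihood = dist.Normal(q, 1.0)     # L(q | D): data assumed ~ Normal(q, 1)
    log_joint = prior.log_prob(q).sum() + likelihood.log_prob(data).sum()
    return -log_joint                    # potential energy is the negative log joint
```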

The kinetic energy is defined using the momentum variable $p$ as well as an inference-method-defined covariance matrix $\Sigma$,

$$K(p) = p^T \Sigma p / 2.$$

The values for $p$ and $\Sigma$ are provided by the framework. Bean Machine samples $p$ from a Normal distribution sized to match the model's parameters. Bean Machine can optionally estimate an appropriate constant value for $\Sigma$ by measuring correlations between parameters during an adaptation phase of inference.
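One common convention, sketched below under that assumption, is to draw the momentum from a zero-mean Normal whose precision matrix is $\Sigma$, so that the kinetic energy above is its negative log density up to a constant; the sample_momentum and kinetic_energy helpers are illustrative only.

```python
import torch

def sample_momentum(sigma: torch.Tensor) -> torch.Tensor:
    """Draw a random momentum kick, assuming p ~ Normal(0, Sigma^{-1})."""
    dim = sigma.shape[0]
    return torch.distributions.MultivariateNormal(
        torch.zeros(dim), precision_matrix=sigma
    ).sample()

def kinetic_energy(p: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """K(p) = p^T Sigma p / 2."""
    return 0.5 * p @ sigma @ p
```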

We can then simulate the trajectory using the following forms of Hamilton's equations:

$$\begin{aligned}
\frac{dq_i}{dt} &= [\Sigma p]_i \\
\frac{dp_i}{dt} &= -\frac{\partial U}{\partial q_i}
\end{aligned}$$

Bean Machine can compute $\frac{\partial U}{\partial q_i}$ using autograd.
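For intuition, the gradient of the potential energy can be obtained with PyTorch's autograd, reusing the hypothetical potential_energy sketch from above; this is not Bean Machine's internal code.

```python
import torch

def grad_potential(q: torch.Tensor, data: torch.Tensor) -> torch.Tensor:
    """Compute dU/dq with reverse-mode automatic differentiation."""
    q = q.detach().requires_grad_(True)
    u = potential_energy(q, data)
    (grad_q,) = torch.autograd.grad(u, q)
    return grad_q
```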

The goal of these equations is to determine where the new sample will come to rest after a framework-specified path length $\lambda$. Both the potential energy (from the posterior's likelihood) and the kinetic energy (from the injected "kick") will interact with the sampled value to influence how it travels over the posterior surface.

caution

Since we are computing gradients, all of the latent (non-observed) variables must be continuous. For discrete variables, use CompositionalInference or marginalize them out as in the Zero inflated count data tutorial.

Approximating Hamiltonian dynamics

Unfortunately, this system of differential equations cannot be solved analytically. Instead, Bean Machine discretizes the equations using a small time step $\epsilon$, and simulates how $q$ and $p$ influence each other in this discrete setting for a framework-specified path length $\lambda$. This discrete simulation is referred to as the "leapfrog" algorithm. Each simulated time step is referred to as a "leapfrog step".

At a high level, leapfrog works like this:

  1. Choose momentum ($p$) and covariance ($\Sigma$) values to use for the simulation.
  2. Simulate a small time step of the momentum's influence on the sample.
  3. Simulate a small time step of the potential energy's influence on the sample.
  4. Repeat for the framework-specified path length $\lambda$.

We'll represent the above description mathematically. For a leapfrog step at time $t$ of size $\epsilon$, we take a half-step for the momentum, then a full step for the position using the updated momentum, and finally another half-step for the momentum.

$$\begin{aligned}
p_i(t + \epsilon/2) &= p_i(t) - (\epsilon/2)\frac{\partial U}{\partial q_i}(q(t)) \\
q_i(t + \epsilon) &= q_i(t) + \epsilon\, [\Sigma p(t + \epsilon/2)]_i \\
p_i(t + \epsilon) &= p_i(t + \epsilon/2) - (\epsilon/2)\frac{\partial U}{\partial q_i}(q(t + \epsilon))
\end{aligned}$$

This process is repeated until we have traveled the framework-specified distance $\lambda$; the number of leapfrog steps is therefore $\lceil \lambda / \epsilon \rceil$. The value of $q$ after the last leapfrog step becomes the proposed sample.
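Putting the pieces together, a bare-bones version of the leapfrog simulation might look like the sketch below; it reuses the hypothetical sample_momentum and grad_potential helpers from earlier and is only meant to mirror the equations above, not Bean Machine's implementation.

```python
import math
import torch

def leapfrog_trajectory(q, data, sigma, step_size, path_length):
    """Propose a new sample by simulating ceil(path_length / step_size) leapfrog steps."""
    p = sample_momentum(sigma)                 # random "kick" for this trajectory
    q_new, p_new = q.clone(), p.clone()
    for _ in range(math.ceil(path_length / step_size)):
        p_new = p_new - (step_size / 2) * grad_potential(q_new, data)  # half-step for momentum
        q_new = q_new + step_size * (sigma @ p_new)                    # full step for position
        p_new = p_new - (step_size / 2) * grad_potential(q_new, data)  # half-step for momentum
    return q_new, p_new, p
```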

Due to the discretization, the resulting trajectory will contain numerical errors. To account for the error, a Metropolis acceptance probability will be used to determine whether to accept the proposed value for $q$.
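A minimal version of that acceptance check, again reusing the hypothetical potential_energy and kinetic_energy helpers above, could look like this:

```python
import torch

def accept_proposal(q, q_new, p, p_new, data, sigma) -> bool:
    """Correct for discretization error: accept with probability min(1, exp(H - H'))."""
    h_current = potential_energy(q, data) + kinetic_energy(p, sigma)
    h_proposed = potential_energy(q_new, data) + kinetic_energy(p_new, sigma)
    return bool(torch.rand(()) < torch.exp(h_current - h_proposed))
```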

HMC's performance is quite sensitive to the hyperparameters used. They are as follows:

  • Path length, $\lambda$. Because the samples should be minimally correlated, it is ideal to follow the trajectory for long path lengths. However, distributions may have periodic behavior, and long path lengths may waste computation. The ideal path length is the minimal path length where the starting and ending samples have low correlation.
  • Step size, $\epsilon$. If Hamilton's equations were followed exactly (not approximated), all samples would be accepted. However, error is introduced into the system when the equations are discretized. The larger the step size, the worse the final approximation will be; however, if the steps are too small, the number of steps needed, and thus the overall runtime of the algorithm, will increase.
  • Covariance (mass) matrix, $\Sigma$. If there are significant correlations between the variables and the covariance matrix is not properly tuned, we will need to take smaller steps to produce samples that are likely to be accepted.

The optimal acceptance rate for HMC, as derived in Neal (2011), is 0.65. It is desirable to tune these parameters so that they achieve this acceptance rate on average.

Adaptive Hamiltonian Monte Carlo

Due to the challenge of selecting good hyperparameters, Bean Machine provides extensions to HMC to help choose appropriate values.

One such extension is Adaptive Hamiltonian Monte Carlo. Adaptive HMC requires an adaptive phase, in which Bean Machine uses the HMC algorithm to generate samples while tuning HMC's step size and covariance matrix. Adaptive HMC provides two main improvements over HMC, outlined below.

Users do not have to specify a step size. During the adaptive phase, the step size is adjusted in order to achieve the optimal acceptance rate of 0.65 on average. If the acceptance rate is above optimal, then Bean Machine is being too careful and discretizing in steps that are too small; therefore, the step size should be increased. If the acceptance rate is too low, then the step size should be decreased. We follow the Robbins-Monro stochastic approximation method, where earlier iterations within the adaptive phase have a larger influence over the step size than later iterations.
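A stripped-down illustration of such a step size update is sketched below; the specific 1 / (iteration + 10) learning rate and the adapt_step_size name are assumptions made for the example, not Bean Machine's actual adaptation code.

```python
import math

def adapt_step_size(step_size: float, accept_prob: float, iteration: int,
                    target: float = 0.65) -> float:
    """Nudge the step size toward the target acceptance rate; the decaying
    learning rate gives early iterations more influence (Robbins-Monro style)."""
    learning_rate = 1.0 / (iteration + 10)
    # Acceptance above target => steps are overly cautious => grow the step size, and vice versa.
    return step_size * math.exp(learning_rate * (accept_prob - target))
```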

HMC can take different step sizes in different dimensions. During the adaptive phase, the momentum in each dimension is tuned based on the covariance of the previously accepted samples. The amount of momentum used is controlled by the covariance matrix, which allows HMC to move along different dimensions at different rates. Since the ideal covariance matrix is the true posterior covariance, Bean Machine approximates it during the adaptive phase using the empirical covariance of the samples.
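Conceptually, the estimate at the end of the adaptive phase is just the empirical covariance of the collected samples, as in this small sketch (the estimate_covariance helper is illustrative):

```python
import torch

def estimate_covariance(samples: torch.Tensor) -> torch.Tensor:
    """Empirical covariance of adaptive-phase samples (rows = draws, columns = dimensions)."""
    return torch.cov(samples.T)
```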

Once the adaptive phase ends, we no longer update our parameters, and the original HMC algorithm is used to generate new samples. Since the samples generated during the adaptive phase use untuned parameters, they may not be of the highest quality and are not returned by default.

Tuning the path length

While adaptive HMC can effectively tune the step size and covariance matrix, Bean Machine relies on a separate algorithm for tuning the path length $\lambda$. This algorithm is called the No-U-Turn Sampler, and it has its own documentation page, No-U-Turn Samplers.

Usage

In Bean Machine, inference using HMC can be specified as an inference method for all variables in the model:

bm.SingleSiteHamiltonianMonteCarlo(
    1.0,  # trajectory length
    initial_step_size=0.1,
    adapt_step_size=True,
    adapt_mass_matrix=True,
    target_accept_prob=0.8,
).infer(
    queries,
    observations,
    num_samples,
    num_chains,
    num_adaptive_samples=100,
)
caution

Make sure to set num_adaptive_samples when using adaptive HMC! If you forget to set num_adaptive_samples, no adaptation will occur.

The global variant for adaptive HMC, GlobalHamiltonianMonteCarlo, which proposes all of the variables in the model jointly, follows the same API:

bm.GlobalHamiltonianMonteCarlo(
    1.0,  # trajectory length
    initial_step_size=0.1,
    adapt_step_size=True,
    adapt_mass_matrix=True,
    target_accept_prob=0.8,
    nnc_compile=True,
).infer(
    queries,
    observations,
    num_samples,
    num_chains,
    num_adaptive_samples=100,
)
caution

Functorch's ahead-of-time (AOT) autograd compiler is used by default. If you are working with a non-static model or encounter unexpected errors, you may need to manually disable the nnc_compile flag.

These arguments allow us to decide whether to tune the step size (adapt_step_size) or the covariance matrix (adapt_mass_matrix). The target_accept_prob argument indicates the acceptance probability which should be targeted by the step size tuning algorithm. While the optimal value is 65.1%, higher values have been shown to be more robust. As a result, Bean Machine targets an acceptance rate of 0.8 by default.

The parameters to infer are described below:

| Name | Usage |
| --- | --- |
| queries | A List of @bm.random_variable targets to fit posterior distributions for. |
| observations | The Dict of observations. Each key is a random variable, and its value is the observed value for that random variable. |
| num_samples | Number of samples to build up distributions for the values listed in queries. |
| num_chains | Number of separate inference runs to use. Multiple chains can be used by diagnostics to verify inference ran correctly. |
| num_adaptive_samples | Number of warmup samples used to adapt the parameters. |
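
For concreteness, a minimal toy model that the calls above could be run against might look like the following; the model, data, and variable names are purely illustrative.

```python
import beanmachine.ppl as bm
import torch
import torch.distributions as dist

@bm.random_variable
def mu():
    # Latent mean with a standard Normal prior.
    return dist.Normal(0.0, 1.0)

@bm.random_variable
def x(i):
    # Each observation is Normally distributed around the latent mean.
    return dist.Normal(mu(), 1.0)

observations = {x(i): torch.tensor(v) for i, v in enumerate([0.8, 1.1, 0.9])}

samples = bm.GlobalHamiltonianMonteCarlo(1.0).infer(
    queries=[mu()],
    observations=observations,
    num_samples=500,
    num_chains=4,
    num_adaptive_samples=100,
)
```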

Neal, Radford M. "MCMC Using Hamiltonian Dynamics." Handbook of Markov Chain Monte Carlo 2.11 (2011): 2.

Robbins, Herbert, and Sutton Monro. "A Stochastic Approximation Method." Annals of Mathematical Statistics 22 (3): 400-407, September 1951. https://doi.org/10.1214/aoms/1177729586