
Newtonian Monte Carlo

Newtonian Monte Carlo (NMC) (Arora et al., 2020) is a second-order, gradient-based Markov chain Monte Carlo (MCMC) algorithm that uses first- and second-order gradients of the posterior density to propose a new value for a random variable.

Algorithm

To understand the idea behind NMC, we revisit two other common MCMC algorithms: Random Walk and the Metropolis-adjusted Langevin algorithm (MALA). In Random Walk, we propose the next parameter value $\theta \in \mathbb{R}^k$ for our model using a simple Normal distribution centered at the current value: $q(\cdot \mid \theta) = \mathcal{N}(\theta, \epsilon^2 I_k)$. This Normal distribution has covariance $\epsilon^2 I_k$, where $\epsilon$ is the user-specified step size and $I_k$ is the $k \times k$ identity matrix (Metropolis et al., 1953). In cases where the posterior density $\pi(\theta)$ is differentiable, we can use the MALA proposal distribution, $q(\cdot \mid \theta) = \mathcal{N}(\theta + \frac{\epsilon^2}{2}\nabla \log\pi(\theta),\ \epsilon^2 I_k)$ (Roberts & Tweedie, 1996). That is, MALA uses a Normal distribution whose mean is offset from the current value $\theta$ by $\frac{\epsilon^2}{2}$ times the first-order gradient of $\log\pi(\theta)$. In both Random Walk and MALA, we face the problem of finding the correct step size: a large step size may lead to ineffective exploration of the space and a low acceptance rate, while a small step size can lead to a high acceptance rate but slow exploration of the space.
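
To make the two baseline proposals concrete, here is a minimal PyTorch sketch (not Bean Machine internals) of the Random Walk and MALA proposal distributions for a hypothetical differentiable log-density log_pi; eps plays the role of the step size $\epsilon$.

import torch
from torch.distributions import MultivariateNormal

def log_pi(theta: torch.Tensor) -> torch.Tensor:
    # Hypothetical target for illustration: a standard Normal log-density (up to a constant).
    return -0.5 * (theta ** 2).sum()

def random_walk_proposal(theta: torch.Tensor, eps: float) -> MultivariateNormal:
    # q(. | theta) = N(theta, eps^2 I_k)
    k = theta.numel()
    return MultivariateNormal(theta, covariance_matrix=eps ** 2 * torch.eye(k))

def mala_proposal(theta: torch.Tensor, eps: float) -> MultivariateNormal:
    # q(. | theta) = N(theta + (eps^2 / 2) * grad log pi(theta), eps^2 I_k)
    theta = theta.detach().requires_grad_(True)
    grad = torch.autograd.grad(log_pi(theta), theta)[0]
    mean = theta.detach() + 0.5 * eps ** 2 * grad
    k = theta.numel()
    return MultivariateNormal(mean, covariance_matrix=eps ** 2 * torch.eye(k))

theta = torch.zeros(2)
next_value = mala_proposal(theta, eps=0.5).sample()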

In NMC, we follow the same intuition. However, NMC tries to overcome the problem of choosing a step size by using second-order gradient information about $\theta$ when choosing the parameters of the proposal distribution. Like Random Walk and MALA, NMC is free to use a Normal distribution as the proposal distribution. However, Bean Machine also offers alternative proposal distributions for the case where $\theta$ has constrained support. Several are available out of the box: the Real-Space proposer, the Half-Space proposer, and the Simplex proposer.

NMC Proposers

This section details the different proposers that NMC can use out-of-the-box, as well as the ways that second-order gradient information is incorporated into their parameter selections.

Real-Space Proposer

NMC uses a Multivariate Normal distribution to propose values in unconstrained space. Conceptually, NMC theorizes that the posterior is shaped like a Normal distribution. It uses first- and second-order gradient information from the current sample in order to build a proposal distribution whose shape matches that of this theoretical Normal distribution.

Mathematically, the proposal distribution $q$, as a function of the current sample location $\theta$, is:

$$q(\cdot \mid \theta) = \mathcal{N}\left(\theta - \nabla^2 \log\pi(\theta)^{-1}\,\nabla \log\pi(\theta),\ -\nabla^2 \log\pi(\theta)^{-1}\right)$$
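
As a concrete illustration, the following sketch (plain PyTorch with torch.autograd, not Bean Machine's internal proposer) builds this proposal for a hypothetical differentiable log-density log_pi: the mean is $\theta - \nabla^2\log\pi(\theta)^{-1}\nabla\log\pi(\theta)$ and the covariance is $-\nabla^2\log\pi(\theta)^{-1}$, which is assumed positive definite here.

import torch
from torch.distributions import MultivariateNormal
from torch.autograd.functional import hessian

def log_pi(theta: torch.Tensor) -> torch.Tensor:
    # Hypothetical target for illustration: a correlated Normal log-density (up to a constant).
    precision = torch.tensor([[2.0, 0.3], [0.3, 1.0]])
    return -0.5 * theta @ precision @ theta

def real_space_proposal(theta: torch.Tensor) -> MultivariateNormal:
    theta = theta.detach().requires_grad_(True)
    grad = torch.autograd.grad(log_pi(theta), theta)[0]             # first-order gradient
    neg_hess_inv = torch.inverse(-hessian(log_pi, theta.detach()))  # -H^{-1}
    mean = theta.detach() + neg_hess_inv @ grad                     # theta - H^{-1} grad
    return MultivariateNormal(mean, covariance_matrix=neg_hess_inv)

proposal = real_space_proposal(torch.tensor([1.0, -0.5]))

For this quadratic log-density, the Newton step lands exactly at the mode and $-H^{-1}$ equals the target covariance, which is the sense in which NMC "theorizes that the posterior is shaped like a Normal distribution."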

Adaptive Real-Space Proposer

The above proposal distribution works well when the posterior is close to a Normal distribution. However, when the posterior is quite dissimilar to a Normal distribution, this proposer may overfit to its Normal assumption. To alleviate this, we introduce a "learning rate" $\gamma$, whose purpose is to reduce the strength of that Normal assumption when choosing a mean for the proposal distribution. The learning rate must be fit during the adaptive phase of inference.

The proposal distribution with the learning rate is:

$$q(\cdot \mid \theta) = \mathcal{N}\left(\theta - \gamma\,\nabla^2 \log\pi(\theta)^{-1}\,\nabla \log\pi(\theta),\ -\nabla^2 \log\pi(\theta)^{-1}\right)$$

$\gamma \in [0, 1]$ is initially drawn from $\text{Beta}(a, b)$, and an appropriate distribution for $\gamma$ is jointly learned with the rest of the model during inference ("Riemann manifold MALA", Section 5 of Girolami & Calderhead, 2011).

This proposer works for:

  1. Random variables with unconstrained space support.
  2. Random variables with constrained space support that are transformed into unconstrained space.
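
A sketch of the adaptive variant, under the same assumptions as the previous sketch: the only change is that a learning rate $\gamma$ drawn from $\text{Beta}(a, b)$ scales the Newton step in the mean. The helper names and log_pi are illustrative, not Bean Machine's internal API.

import torch
from torch.distributions import Beta, MultivariateNormal
from torch.autograd.functional import hessian

def log_pi(theta: torch.Tensor) -> torch.Tensor:
    # Hypothetical target for illustration.
    return -0.5 * (theta ** 2).sum()

def adaptive_real_space_proposal(theta: torch.Tensor, a: float = 1.0, b: float = 1.0) -> MultivariateNormal:
    gamma = Beta(a, b).sample()                                     # learning rate in [0, 1]
    theta = theta.detach().requires_grad_(True)
    grad = torch.autograd.grad(log_pi(theta), theta)[0]
    neg_hess_inv = torch.inverse(-hessian(log_pi, theta.detach()))
    mean = theta.detach() + gamma * (neg_hess_inv @ grad)           # theta - gamma * H^{-1} grad
    return MultivariateNormal(mean, covariance_matrix=neg_hess_inv)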

Half-Space Proposer

For random variables with half-space support (for example, values constrained to the positive reals), NMC uses a $\text{Gamma}$ distribution as the base proposal distribution and fits its parameters using local first- and second-order gradient information.

Mathematically, it is:

$$q(\cdot \mid \theta) = \text{Gamma}\left(1 - \theta^2\,\nabla^2\log\pi(\theta),\ -\theta\,\nabla^2\log\pi(\theta) - \nabla \log\pi(\theta)\right)$$
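
A minimal sketch of this proposer for a scalar positive variable, assuming a differentiable scalar log-density log_pi (the half-Normal-style target below is only for illustration, and this is not Bean Machine's internal code): the Gamma shape is $1 - \theta^2\,\nabla^2\log\pi(\theta)$ and the rate is $-\theta\,\nabla^2\log\pi(\theta) - \nabla\log\pi(\theta)$.

import torch
from torch.distributions import Gamma

def log_pi(theta: torch.Tensor) -> torch.Tensor:
    # Hypothetical target for illustration: a half-Normal-style log-density on (0, inf), up to a constant.
    return -0.5 * theta ** 2

def half_space_proposal(theta: torch.Tensor) -> Gamma:
    theta = theta.detach().requires_grad_(True)
    grad = torch.autograd.grad(log_pi(theta), theta, create_graph=True)[0]  # d log pi / d theta
    hess = torch.autograd.grad(grad, theta)[0]                              # d^2 log pi / d theta^2
    shape = 1.0 - theta.detach() ** 2 * hess.detach()        # 1 - theta^2 * second derivative
    rate = -theta.detach() * hess.detach() - grad.detach()   # -theta * second derivative - first derivative
    return Gamma(concentration=shape, rate=rate)

proposal = half_space_proposal(torch.tensor(0.5))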

Simplex Proposer

For random variables with simplex-valued support, NMC uses a $\text{Dirichlet}$ distribution as the base proposal distribution and fits its parameters using local first- and second-order gradient information.

For a $k$-dimensional simplex, the proposal distribution is:

$$\begin{aligned} q(\cdot \mid \theta) &= \text{Dirichlet}(x; \alpha), \quad \alpha \in \mathbb{R}^k\\ \alpha_i &= 1 - \theta_i^2\left(\nabla^2_{ii}\log\pi(\theta) - \max_{j \neq i}\nabla^2_{ij}\log\pi(\theta)\right) \end{aligned}$$
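
A minimal sketch under the same assumptions (an illustrative log_pi, not Bean Machine's internal code): each concentration $\alpha_i$ combines the $i$-th diagonal entry of the Hessian of $\log\pi$ with the largest off-diagonal entry in its row. For the unnormalized Dirichlet target below, the resulting $\alpha$ recovers the target's concentration exactly.

import torch
from torch.distributions import Dirichlet
from torch.autograd.functional import hessian

def log_pi(theta: torch.Tensor) -> torch.Tensor:
    # Hypothetical target for illustration: an unnormalized Dirichlet(3, 2, 4) log-density.
    concentration = torch.tensor([3.0, 2.0, 4.0])
    return ((concentration - 1.0) * torch.log(theta)).sum()

def simplex_proposal(theta: torch.Tensor) -> Dirichlet:
    hess = hessian(log_pi, theta)              # k x k matrix of second-order gradients
    diag = torch.diagonal(hess)                # entries nabla^2_{ii} log pi(theta)
    masked = hess.clone()
    masked.fill_diagonal_(float("-inf"))       # exclude j == i from the row-wise max
    max_off_diag = masked.max(dim=1).values    # max_{j != i} nabla^2_{ij} log pi(theta)
    alpha = 1.0 - theta ** 2 * (diag - max_off_diag)
    return Dirichlet(alpha)

proposal = simplex_proposal(torch.tensor([0.2, 0.3, 0.5]))  # alpha comes out as (3, 2, 4)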

Details

Below are the NMC algorithm details.

$$
\begin{aligned}
&\textbf{Input: } \text{posterior density } \pi \text{ defined on } \theta_1 \ldots \theta_k\\
&\textbf{repeat}\\
&\quad \textbf{for } i = 1 \textbf{ to } k \textbf{ do}\\
&\quad\quad \text{Compute } \nabla \log\pi(\theta) \text{ and } \nabla^2\log\pi(\theta) \text{ w.r.t. } \theta_i\\
&\quad\quad \text{Compute } q_i(\cdot \mid \theta) \text{ as explained above}\\
&\quad\quad \text{Sample } \theta^* \sim q_i(\cdot \mid \theta)\\
&\quad\quad \text{Set } \theta^*_j = \theta_j \text{ for } j \neq i\\
&\quad\quad \text{Compute } \nabla \log\pi(\theta^*) \text{ and } \nabla^2\log\pi(\theta^*) \text{ w.r.t. } \theta^*_i\\
&\quad\quad \text{Compute } q_i(\cdot \mid \theta^*)\\
&\quad\quad \alpha = \frac{\pi(\theta^*)\, q_i(\theta \mid \theta^*)}{\pi(\theta)\, q_i(\theta^* \mid \theta)}\\
&\quad\quad u \sim \text{Uniform}(0, 1)\\
&\quad\quad \textbf{if } u < \alpha \textbf{ then}\\
&\quad\quad\quad \theta = \theta^*\\
&\quad\quad \textbf{end if}\\
&\quad \textbf{end for}\\
&\quad \text{Output sample } \theta\\
&\textbf{until } \text{desired number of samples}
\end{aligned}
$$
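
Putting the pieces together, here is a minimal sketch of one Metropolis-Hastings step with the (non-adaptive) real-space proposal, showing how the asymmetric acceptance ratio $\alpha$ uses both the forward and reverse proposals. For brevity it updates a single multivariate $\theta$ rather than looping over single sites; log_pi and the helper names are illustrative assumptions, not Bean Machine's implementation.

import torch
from torch.distributions import MultivariateNormal
from torch.autograd.functional import hessian

def log_pi(theta: torch.Tensor) -> torch.Tensor:
    # Hypothetical target for illustration.
    precision = torch.tensor([[2.0, 0.3], [0.3, 1.0]])
    return -0.5 * theta @ precision @ theta

def nmc_proposal(theta: torch.Tensor) -> MultivariateNormal:
    theta = theta.detach().requires_grad_(True)
    grad = torch.autograd.grad(log_pi(theta), theta)[0]
    neg_hess_inv = torch.inverse(-hessian(log_pi, theta.detach()))
    return MultivariateNormal(theta.detach() + neg_hess_inv @ grad, covariance_matrix=neg_hess_inv)

def nmc_step(theta: torch.Tensor) -> torch.Tensor:
    q_fwd = nmc_proposal(theta)
    theta_star = q_fwd.sample()
    q_rev = nmc_proposal(theta_star)
    # log alpha = log pi(theta*) + log q(theta | theta*) - log pi(theta) - log q(theta* | theta)
    log_alpha = log_pi(theta_star) + q_rev.log_prob(theta) - log_pi(theta) - q_fwd.log_prob(theta_star)
    if torch.rand(()).log() < log_alpha:
        return theta_star   # accept the proposed value
    return theta            # reject: keep the current value

theta = torch.zeros(2)
for _ in range(10):
    theta = nmc_step(theta)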

Below is the NMC algorithm with an adaptive phase added to compute the learning-rate parameters $a$ and $b$ for the $\text{Beta}$ distribution.

$$
\begin{aligned}
&\textbf{Input: } \text{posterior density } \pi \text{ defined on } \theta_1 \ldots \theta_k\\
&\textbf{Given: } a, b \text{ (default } 1, 1\text{)}\\
&\textbf{repeat}\\
&\quad \textbf{for } i = 1 \textbf{ to } k \textbf{ do}\\
&\quad\quad \text{Compute } \nabla \log\pi(\theta) \text{ and } \nabla^2\log\pi(\theta) \text{ w.r.t. } \theta_i\\
&\quad\quad \textbf{if } \text{still in the warm-up phase} \textbf{ then}\\
&\quad\quad\quad \text{Compute } a, b \text{ from the mean and variance of the warm-up samples using the method of moments}\\
&\quad\quad \textbf{end if}\\
&\quad\quad \text{Sample } \gamma \sim \text{Beta}(a, b)\\
&\quad\quad \text{Compute } q_i(\cdot \mid \theta) \text{ as explained above}\\
&\quad\quad \text{Sample } \theta^* \sim q_i(\cdot \mid \theta)\\
&\quad\quad \text{Set } \theta^*_j = \theta_j \text{ for } j \neq i\\
&\quad\quad \text{Compute } \nabla \log\pi(\theta^*) \text{ and } \nabla^2\log\pi(\theta^*) \text{ w.r.t. } \theta^*_i\\
&\quad\quad \text{Compute } q_i(\cdot \mid \theta^*)\\
&\quad\quad \alpha = \frac{\pi(\theta^*)\, q_i(\theta \mid \theta^*)}{\pi(\theta)\, q_i(\theta^* \mid \theta)}\\
&\quad\quad u \sim \text{Uniform}(0, 1)\\
&\quad\quad \textbf{if } u < \alpha \textbf{ then}\\
&\quad\quad\quad \theta = \theta^*\\
&\quad\quad \textbf{end if}\\
&\quad \textbf{end for}\\
&\quad \text{Output sample } \theta\\
&\textbf{until } \text{desired number of samples}
\end{aligned}
$$
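
As a concrete illustration of the warm-up update, the sketch below fits $\text{Beta}(a, b)$ to the mean and variance of a batch of warm-up values via the method of moments. The values and helper name are hypothetical; this is not Bean Machine's internal code, and exactly which warm-up statistics are collected is an implementation detail of the inference engine.

import torch

def beta_method_of_moments(warmup_values: torch.Tensor) -> tuple:
    # For Beta(a, b): mean = a / (a + b) and var = a*b / ((a + b)^2 (a + b + 1)).
    # Inverting those two equations gives the expressions below.
    mean = warmup_values.mean()
    var = warmup_values.var()
    common = mean * (1.0 - mean) / var - 1.0
    a = mean * common
    b = (1.0 - mean) * common
    return a, b

# Hypothetical warm-up values concentrated near 0.7 yield a Beta with most of its mass above 0.5.
a, b = beta_method_of_moments(torch.tensor([0.6, 0.7, 0.75, 0.8, 0.65]))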

Usage

The following code snippet illustrates how to use the inference method. Here, real_space_alpha represents $a$ and real_space_beta represents $b$ from the algorithm above.

import beanmachine.ppl as bm

samples = bm.SingleSiteNewtonianMonteCarlo(
    real_space_alpha=1.0,
    real_space_beta=5.0,
).infer(
    queries=...,
    observations=...,
    num_samples=...,
    num_chains=...,
)

To enable the adaptive phase and make use of the learning rate, set num_adaptive_samples during inference.

samples = bm.SingleSiteNewtonianMonteCarlo(
    real_space_alpha=1.0,
    real_space_beta=5.0,
).infer(
    queries=...,
    observations=...,
    num_samples=...,
    num_chains=...,
    num_adaptive_samples=num_warmup_samples,
)

Remember that for random variables with half-space or simplex support, SingleSiteNewtonianMonteCarlo uses the half-space or simplex proposer, respectively, by default.

The parameters to infer are described below:

| Name | Usage |
| --- | --- |
| queries | A List of @bm.random_variable targets to fit posterior distributions for. |
| observations | The Dict of observations. Each key is a random variable, and its value is the observed value for that random variable. |
| num_samples | Number of samples to build up distributions for the values listed in queries. |
| num_chains | Number of separate inference runs to use. Multiple chains can be used by diagnostics to verify inference ran correctly. |
| num_adaptive_samples | Number of warm-up samples used to adapt the proposal parameters. |

Arora, N. S., Tehrani, N. K., Shah, K. D., Tingley, M., Li, Y. L., Torabi, N., Noursi, D., Masouleh, S. A., Lippert, E., and Meijer, E. (2020). "Newtonian Monte Carlo: Single-site MCMC meets second-order gradient methods." arXiv:2001.05567.

Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. (1953). "Equation of state calculations by fast computing machines." J. Chem. Phys., 21(6): 1087–1092.

Roberts, G. O., and Tweedie, R. L. (1996). "Exponential convergence of Langevin distributions and their discrete approximations." Bernoulli, 2(4): 341–363.

Girolami, M., and Calderhead, B. (2011). "Riemann manifold Langevin and Hamiltonian Monte Carlo methods." Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2): 123–214. https://doi.org/10.1111/j.1467-9868.2010.00765.x.