Newtonian Monte Carlo (NMC) (Arora, et al., 2020) is a second-order gradient-based Markov chain Monte Carlo (MCMC) algorithm that uses the first- and second-order gradients to propose a new value for a random variable.
To understand the idea behind NMC, we revisit two other common MCMC algorithms, Random Walk and Metropolis-adjusted Langevin algorithm (MALA). In Random Walk, we propose the next parameter value for our model using a simple Normal distribution centered at the current value: . This Normal distribution has covariance where is the user-specified step size and is a identity matrix (Metropolis et al., 1953). In cases where the posterior density, , is differentiable, we can use the proposal distribution proposed by MALA which is (Robert & Tweedie, 1996). In MALA, we use a Normal distribution whose mean is away from the current value, , in the direction of the first-order gradient of the posterior density, . In both Random Walk and MALA, we have the problem of finding the correct step size. A large step size may lead to an ineffective exploration of the space and low acceptance rate. A small step size can lead to a high acceptance rate but slow exploration of the space.
In NMC, we follow the same intuition. However, with NMC, we try to overcome the problem of choosing the step size by using second-order gradient information for when choosing the parameters for the proposal distribution. Like Random Walk and MALA, NMC is free to use a Normal distribution as the proposal distribution. However, in Bean Machine, we can offer alternative proposal distributions in the case where has a constrained support. Bean Machine offers several out of the box: Real-Space proposer, Half-Space proposer, and Simplex space proposer.
This section details the different proposers that NMC can use out-of-the-box, as well as the ways that second-order gradient information is incorporated into their parameter selections.
NMC uses a Multivariate Normal distribution to propose values in unconstrained space. Conceptually, NMC theorizes that the posterior is shaped like a Normal distribution. It uses first- and second-order gradient information from the current sample in order to build a proposal distribution whose shape matches that of this theoretical Normal distribution.
Mathematically, the proposal distribution used as a function of the current sample location is:
Adaptive Real-Space Proposer
The above proposal distribution works well the closer that the posterior is to a Normal distribution. However, in cases when the posterior is quite dissimilar to a Normal distribution, this proposer may overfit to its Normal assumption. To alleviate this, we introduce a "learning rate" , whose purpose is to reduce the strength of that Normal assumption when choosing a mean for the proposal distribution. The learning rate must be fit during the adaptive phase of inference.
The proposal distribution with the learning rate is:
is initially drawn from , and an appropriate distribution for is jointly learned with the rest of the model during inference ("Riemann manifold MALA", Section 5 of Girolami & Calderhead, 2011).
This proposer works for:
- Random variables with unconstrained space support.
- Random variables with constrained space support that are transformed into unconstrained space.
For random variables constrained with half-space support, NMC uses as the base proposal distribution, and fits its parameters using local first- and second-order gradient information.
Mathematically, it is:
For random variables constrained with simplex-valued support, NMC uses the distribution as the base proposal distribution, and fits its parameters using local first- and second-order gradient information.
For a -dimensional simplex, the proposal distribution is:
Below are the NMC algorithm details.
Below is NMC algorithm with adaptive phase added to compute the learning rate parameters and for the distribution.
The following code snippet illustrates how to use the inference method. Here,
real_space_alpha represents and
real_space_beta represents from the algorithm above.
samples = bm.SingleSiteNewtonianMonteCarlo(
To enable the adaptive phase in order to use a learning rate, we can set
num_adaptive_samples during inference.
samples = bm.SingleSiteNewtonianMonteCarlo(
Remember, for random variables with half-space and simplex support,
SingleSiteNewtonianMonteCarlo by default uses the half-space and simplex proposer respectively.
The parameters to
infer are described below:
|Number of samples to build up distributions for the values listed in |
|Number of separate inference runs to use. Multiple chains can be used by diagnostics to verify inference ran correctly.|
|Number of warmup samples to adapt the parameters.|
Arora, N. S., Tehrani, N. K., Shah, K. D., Tingley, M., Li, Y. L., Torabi, N., Noursi, D., Masouleh, S. A., Lippert, E., and Meijer, E. (2020). Newtonian monte carlo: single-site mcmc meets second-order gradient methods. arXiv:2001.05567.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. (1953). “Equations of state calculations by fast computing machines.” J. Chem. Phys., 21(6): 1087–1092.
Robert, G., and Tweedie, R. 1996. Exponential convergence of Langevin diffusions and their discrete approximation. Bernoulli 2:341–363.
Girolami, M., and Calderhead, B. (2011). “Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods: Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73, no. 2 (March 2011): 123–214. https://doi.org/10.1111/j.1467-9868.2010.00765.x.