Newtonian Monte Carlo (NMC) (Arora et al., 2020) is a second-order gradient-based Markov chain Monte Carlo (MCMC) algorithm that uses first- and second-order gradients of the posterior density to propose a new value for a random variable.
## Algorithm
To understand the idea behind NMC, we revisit two other common MCMC algorithms: Random Walk and the Metropolis-adjusted Langevin algorithm (MALA). In Random Walk, we propose the next parameter value $\theta \in \mathbb{R}^k$ for our model using a simple Normal distribution centered at the current value: $q(\cdot \mid \theta) = \mathcal{N}(\theta, \epsilon^2 I_k)$. This Normal distribution has covariance $\epsilon^2 I_k$, where $\epsilon$ is the user-specified step size and $I_k$ is a $k \times k$ identity matrix (Metropolis et al., 1953). In cases where the posterior density $\pi(\theta)$ is differentiable, we can use the proposal distribution from MALA, $q(\cdot \mid \theta) = \mathcal{N}(\theta + \frac{\epsilon^2}{2} \nabla \log \pi(\theta), \epsilon^2 I_k)$ (Roberts & Tweedie, 1996). In MALA, we use a Normal distribution whose mean is $\frac{\epsilon^2}{2}$ away from the current value $\theta$ in the direction of the first-order gradient of the posterior density $\pi(\theta)$. In both Random Walk and MALA, we face the problem of choosing the correct step size: a large step size may lead to ineffective exploration of the space and a low acceptance rate, while a small step size can lead to a high acceptance rate but slow exploration of the space.
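To make the contrast concrete, here is a minimal sketch of how Random Walk and MALA each construct their proposal distribution. This is standalone PyTorch, not Bean Machine's API; `log_density` and the step size are placeholders for illustration.

```python
import torch
from torch.distributions import MultivariateNormal

def log_density(theta):
    # Placeholder unnormalized log posterior: a standard Normal in R^k.
    return -0.5 * theta.dot(theta)

def random_walk_proposal(theta, eps):
    # q(. | theta) = N(theta, eps^2 I_k): centered at the current value.
    k = theta.shape[0]
    return MultivariateNormal(theta, eps**2 * torch.eye(k))

def mala_proposal(theta, eps):
    # q(. | theta) = N(theta + (eps^2 / 2) grad log pi(theta), eps^2 I_k):
    # the mean is shifted along the first-order gradient.
    theta = theta.clone().requires_grad_(True)
    grad = torch.autograd.grad(log_density(theta), theta)[0]
    k = theta.shape[0]
    mean = theta.detach() + 0.5 * eps**2 * grad
    return MultivariateNormal(mean, eps**2 * torch.eye(k))

theta = torch.zeros(3)
next_value = mala_proposal(theta, eps=0.1).sample()
```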
NMC follows the same intuition. However, NMC tries to overcome the problem of choosing the step size by also using second-order gradient information about $\theta$ when choosing the parameters of the proposal distribution. Like Random Walk and MALA, NMC is free to use a Normal distribution as the proposal distribution. However, in Bean Machine, we can offer alternative proposal distributions in the case where $\theta$ has constrained support. Bean Machine offers several out of the box: a Real-Space proposer, a Half-Space proposer, and a Simplex proposer.
## NMC Proposers
This section details the different proposers that NMC can use out-of-the-box, as well as the ways that second-order gradient information is incorporated into their parameter selections.
### Real-Space Proposer
NMC uses a Multivariate Normal distribution to propose values in unconstrained space. Conceptually, NMC theorizes that the posterior is shaped like a Normal distribution. It uses first- and second-order gradient information from the current sample in order to build a proposal distribution whose shape matches that of this theoretical Normal distribution.
Mathematically, the proposal distribution $q$, as a function of the current sample location $\theta$, is:
$$q(\cdot \mid \theta) = \mathcal{N}\left(\theta - \nabla^2 \log \pi(\theta)^{-1} \, \nabla \log \pi(\theta),\; -\nabla^2 \log \pi(\theta)^{-1}\right)$$
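As a sketch, the mean and covariance of this proposal can be computed from the gradient and Hessian at the current value. This is standalone PyTorch rather than Bean Machine's internal proposer; `log_density` is a placeholder, and we assume the Hessian is negative definite so the covariance is valid.

```python
import torch
from torch.autograd.functional import hessian
from torch.distributions import MultivariateNormal

def real_space_proposal(log_density, theta):
    # First-order gradient of log pi at theta.
    theta = theta.clone().requires_grad_(True)
    grad = torch.autograd.grad(log_density(theta), theta)[0]
    # Second-order gradient (Hessian) of log pi at theta.
    hess = hessian(log_density, theta.detach())
    # Assumes hess is negative definite (log-concave around theta).
    neg_hess_inv = torch.linalg.inv(-hess)
    # Mean: full Newton step from theta; covariance: negative inverse Hessian.
    mean = theta.detach() + neg_hess_inv @ grad
    return MultivariateNormal(mean, covariance_matrix=neg_hess_inv)
```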
### Adaptive Real-Space Proposer

The above proposal distribution works well the closer the posterior is to a Normal distribution. However, when the posterior is quite dissimilar to a Normal distribution, this proposer may overfit to its Normal assumption. To alleviate this, we introduce a "learning rate" $\gamma$, whose purpose is to reduce the strength of that Normal assumption when choosing a mean for the proposal distribution. The learning rate is fit during the adaptive phase of inference.
The proposal distribution with the learning rate is:
$$q(\cdot \mid \theta) = \mathcal{N}\left(\theta - \gamma \, \nabla^2 \log \pi(\theta)^{-1} \, \nabla \log \pi(\theta),\; -\nabla^2 \log \pi(\theta)^{-1}\right)$$

Here, $\gamma \in [0, 1]$ is initially drawn from $\text{Beta}(a, b)$, and an appropriate distribution for $\gamma$ is learned jointly with the rest of the model during inference ("Riemann manifold MALA", Section 5 of Girolami & Calderhead, 2011).
This proposer works for:
- Random variables with unconstrained space support.
- Random variables with constrained space support that are transformed into unconstrained space.
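As a minimal sketch (again placeholder PyTorch, not Bean Machine's actual proposer implementation), the only change from the plain real-space proposer is that the Newton step toward the proposal mean is scaled by a draw $\gamma \sim \text{Beta}(a, b)$:

```python
import torch
from torch.autograd.functional import hessian
from torch.distributions import Beta, MultivariateNormal

def adaptive_real_space_proposal(log_density, theta, a=1.0, b=1.0):
    theta = theta.clone().requires_grad_(True)
    grad = torch.autograd.grad(log_density(theta), theta)[0]
    hess = hessian(log_density, theta.detach())
    neg_hess_inv = torch.linalg.inv(-hess)
    # gamma in [0, 1] scales the Newton step: gamma = 1 recovers the
    # plain real-space proposer, gamma = 0 stays at the current value.
    gamma = Beta(a, b).sample()
    mean = theta.detach() + gamma * (neg_hess_inv @ grad)
    return MultivariateNormal(mean, covariance_matrix=neg_hess_inv)
```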
### Half-Space Proposer
For random variables with half-space support (e.g., the positive reals), NMC uses the Gamma distribution as the base proposal distribution, and fits its parameters using local first- and second-order gradient information.
Mathematically, it is:
$$q(\cdot \mid \theta) = \text{Gamma}\left(1 - \theta^2 \, \nabla^2 \log \pi(\theta),\; -\theta \, \nabla^2 \log \pi(\theta) - \nabla \log \pi(\theta)\right)$$
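For a scalar $\theta > 0$, the shape and rate of this Gamma proposal can be read directly off the first and second derivatives of $\log \pi$. A hedged sketch with a placeholder `log_density`, assuming the derivatives yield positive shape and rate:

```python
import torch
from torch.distributions import Gamma

def half_space_proposal(log_density, theta):
    # theta is a positive scalar tensor.
    theta = theta.clone().requires_grad_(True)
    (grad,) = torch.autograd.grad(log_density(theta), theta, create_graph=True)
    (hess,) = torch.autograd.grad(grad, theta)
    # Shape: 1 - theta^2 * d^2/dtheta^2 log pi(theta)
    # Rate:  -theta * d^2/dtheta^2 log pi(theta) - d/dtheta log pi(theta)
    # Assumes both come out positive at this theta.
    shape = 1 - theta.detach() ** 2 * hess.detach()
    rate = -theta.detach() * hess.detach() - grad.detach()
    return Gamma(concentration=shape, rate=rate)
```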
### Simplex Proposer

For random variables with simplex-valued support, NMC uses the Dirichlet distribution as the base proposal distribution, and fits its parameters using local first- and second-order gradient information.
For a $k$-dimensional simplex, the proposal distribution is:
$$q(\cdot \mid \theta) = \text{Dirichlet}(x; \alpha), \quad \alpha \in \mathbb{R}^k$$

$$\alpha_i = 1 - \theta_i^2 \left( \nabla^2_{ii} \log \pi(\theta) - \max_{j \neq i} \nabla^2_{ij} \log \pi(\theta) \right)$$
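A sketch of computing the Dirichlet concentration parameters from the Hessian's diagonal and per-row off-diagonal maxima. This is placeholder code with an assumed `log_density`; Bean Machine's actual proposer also handles transforms and edge cases, and we assume each $\alpha_i$ comes out positive:

```python
import torch
from torch.autograd.functional import hessian
from torch.distributions import Dirichlet

def simplex_proposal(log_density, theta):
    # theta lies on the k-dimensional simplex (positive entries, sums to 1).
    hess = hessian(log_density, theta)
    diag = torch.diagonal(hess)
    # For each i, the largest off-diagonal entry in row i of the Hessian.
    off_diag = hess.clone()
    off_diag.fill_diagonal_(float("-inf"))
    max_off = off_diag.max(dim=1).values
    # alpha_i = 1 - theta_i^2 * (H_ii - max_{j != i} H_ij); assumed positive.
    alpha = 1 - theta**2 * (diag - max_off)
    return Dirichlet(alpha)
```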
## Details

Below are the details of the core NMC algorithm.
```
Input: posterior density π defined on θ_1 ... θ_k
Given: a, b (default 1, 1)
repeat
    for i = 1 to k do
        Compute ∇ log π(θ) and ∇² log π(θ) w.r.t. θ_i
        Compute q_i(· | θ) as explained above
        Sample θ* ∼ q_i(· | θ)
        Set θ*_j = θ_j for j ≠ i
        Compute ∇ log π(θ*) and ∇² log π(θ*) w.r.t. θ*_i
        Compute q_i(· | θ*)
        α = [π(θ*) q_i(θ | θ*)] / [π(θ) q_i(θ* | θ)]
        Sample u ∼ Uniform(0, 1)
        if u < α then
            θ = θ*
        end if
    end for
    Output sample θ
until desired number of samples
```
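The per-site accept/reject logic above is ordinary Metropolis-Hastings with an asymmetric proposal. Below is a minimal sketch of one single-site update in plain Python; `make_proposal` is a hypothetical helper standing in for whichever NMC proposer matches the site's support, not a Bean Machine API.

```python
import torch

def single_site_nmc_step(log_density, theta, i, make_proposal):
    # Build the forward proposal q_i(. | theta) at the current value.
    q_forward = make_proposal(log_density, theta, i)
    theta_star = theta.clone()
    theta_star[i] = q_forward.sample()
    # Build the reverse proposal q_i(. | theta*) at the proposed value.
    q_reverse = make_proposal(log_density, theta_star, i)
    # Metropolis-Hastings ratio, computed in log space:
    #   alpha = [pi(theta*) q_i(theta | theta*)] / [pi(theta) q_i(theta* | theta)]
    log_alpha = (
        log_density(theta_star) - log_density(theta)
        + q_reverse.log_prob(theta[i]) - q_forward.log_prob(theta_star[i])
    )
    if torch.rand(()).log() < log_alpha:
        return theta_star  # accept
    return theta           # reject
```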
Below is the NMC algorithm with an adaptive phase added to compute the learning-rate parameters $a$ and $b$ for the Beta distribution.
```
Input: posterior density π defined on θ_1 ... θ_k
Given: a, b (default 1, 1)
repeat
    for i = 1 to k do
        Compute ∇ log π(θ) and ∇² log π(θ) w.r.t. θ_i
        if still in warm-up phase then
            Compute a, b from the mean and variance of the warm-up
            samples using the method of moments
        end if
        Sample γ ∼ Beta(a, b)
        Compute q_i(· | θ) as explained above
        Sample θ* ∼ q_i(· | θ)
        Set θ*_j = θ_j for j ≠ i
        Compute ∇ log π(θ*) and ∇² log π(θ*) w.r.t. θ*_i
        Compute q_i(· | θ*)
        α = [π(θ*) q_i(θ | θ*)] / [π(θ) q_i(θ* | θ)]
        Sample u ∼ Uniform(0, 1)
        if u < α then
            θ = θ*
        end if
    end for
    Output sample θ
until desired number of samples
```
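The method-of-moments step inverts the Beta distribution's mean and variance formulas. A sketch with a hypothetical helper name (not Bean Machine's internal function), given the $\gamma$ values collected during warm-up:

```python
import torch

def beta_method_of_moments(gammas):
    # Fit Beta(a, b) to warm-up draws of gamma by matching moments:
    #   mean = a / (a + b),  var = a*b / ((a + b)^2 * (a + b + 1))
    m = gammas.mean()
    v = gammas.var()
    # common equals a + b; assumes v < m * (1 - m) so it is positive.
    common = m * (1 - m) / v - 1
    a = m * common
    b = (1 - m) * common
    return a, b
```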
The following code snippet illustrates how to use the inference method. Here, `real_space_alpha` represents $a$ and `real_space_beta` represents $b$ from the algorithm above.
```python
samples = bm.SingleSiteNewtonianMonteCarlo(
    real_space_alpha=1.0,
    real_space_beta=5.0,
).infer(
    queries=...,
    observations=...,
    num_samples=...,
    num_chains=...,
)
```
To enable the adaptive phase in order to use a learning rate, we can set `num_adaptive_samples` during inference.
```python
samples = bm.SingleSiteNewtonianMonteCarlo(
    real_space_alpha=1.0,
    real_space_beta=5.0,
).infer(
    queries=...,
    observations=...,
    num_samples=...,
    num_chains=...,
    num_adaptive_samples=num_warmup_samples,
)
```
Remember, for random variables with half-space and simplex support, `SingleSiteNewtonianMonteCarlo` by default uses the half-space and simplex proposers, respectively.
The parameters to `infer` are described below:
| Name | Usage |
| --- | --- |
| `queries` | A `List` of `@bm.random_variable` targets to fit posterior distributions for. |
| `observations` | The `Dict` of observations. Each key is a random variable, and its value is the observed value for that random variable. |
| `num_samples` | Number of samples to build up distributions for the values listed in `queries`. |
| `num_chains` | Number of separate inference runs to use. Multiple chains can be used by diagnostics to verify inference ran correctly. |
| `num_adaptive_samples` | Number of warm-up samples used to adapt the proposer's parameters. |
Arora, N. S., Tehrani, N. K., Shah, K. D., Tingley, M., Li, Y. L., Torabi, N., Noursi, D., Masouleh, S. A., Lippert, E., and Meijer, E. (2020). "Newtonian Monte Carlo: single-site MCMC meets second-order gradient methods." arXiv:2001.05567.

Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. (1953). "Equation of state calculations by fast computing machines." Journal of Chemical Physics, 21(6): 1087–1092.

Roberts, G. O., and Tweedie, R. L. (1996). "Exponential convergence of Langevin distributions and their discrete approximations." Bernoulli, 2(4): 341–363.

Girolami, M., and Calderhead, B. (2011). "Riemann manifold Langevin and Hamiltonian Monte Carlo methods." Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2): 123–214. https://doi.org/10.1111/j.1467-9868.2010.00765.x