Bayesian Structural Time Series
Tutorial: Bayesian Structural Time Series Model
This tutorial demonstrates modeling and running inference on various Bayesian Structural Time Series (STS) models. We demonstrate some great features of Bean Machine such as: modeling serially correlated variables, the NUTS sampler applied to a global pool of variables, and providing custom distributions for inference.
import sys
import os
if 'google.colab' in sys.modules and 'beanmachine' not in sys.modules:
!pip install beanmachine
smoke_test = ('SANDCASTLE_NEXUS' in os.environ or 'CI' in os.environ)
Problem
Bayesian STS is a general class of additive models for Time Series series data; that have an associated State Space denoted by; which provides a probabilistic model for the serial correlation observed in the time series. For this tutorial, we will consider the case when takes values in a continous state space (For a discussion of the discrete state space case, refer to this tutorial on Hidden Markov Models).
Bayesian STS models typically contain, but are not limited to, some additive mixture of Error (), Trend (), and Seasonality () components: $$ Where each component is controlled either by some recursive relation of the state space, $\alpha_i$ or external covariates (i.e. Regression) such as day-of-week effects, business logic, or categories of the data. When STS just contains just the Error, Trend, and Seasonality components, it is sometimes referred to as the ETS model. Possibly the simpliest version of STS is called the local level model which is where we will start our discussion after some prerequisites.
Prerequisites
Let's code this in Bean Machine! Import the Bean Machine library, some fundamental PyTorch classes, and optionally typing for our code.
import logging
from typing import List
import beanmachine.ppl as bm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.distributions as dist
from beanmachine.ppl.model import RVIdentifier
logging.getLogger("beanmachine").setLevel(50)
bm.seed(111)
Local Level Model
In this simple local level version of an STS, is the sum of our state variable plus Gaussian Noise, . All of the time dependency is modeled in the evolution of , which is equal to its previous time value plus Gaussian Noise, that is independent of all 's. This is sometimes referred to a random walk because at each step, noise is injected to the previous state to arrive at a new state. To summarize, the random variables in play at this point are:
Most STS formulations are expressed in the state space representation, which is a coupled set of equations known as the measurement and state equations. Using our notation from before, the local level state space representation is:
Name | Formula |
---|---|
Measurement Equation | |
State Equation |
This can be represented more compactly as:
To make our STS "Bayesian", we will assign priors to , , and our initial state :
We can implement this model in Bean Machine by defining random variable objects with the
@bm.random_variable
decorator. These functions behave differently than ordinary Python
functions.
Semantics for @bm.random_variable
functions:
- They must return PyTorch
Distribution
objects. - Though they return distributions, callees actually receive samples from the distribution. The machinery for obtaining samples from distributions is handled internally by Bean Machine.
- Inference runs the model through many iterations. During a particular inference iteration, a distinct random variable will correspond to exactly one sampled value: calls to the same random variable function with the same arguments will receive the same sampled value within one inference iteration. This makes it easy for multiple components of your model to refer to the same logical random variable.
- Consequently, to define distinct random variables that correspond to different sampled values during a particular inference iteration, an effective practice is to add a dummy "indexing" parameter to the function. Distinct random variables can be referred to with different values for this index.
- Please see the documentation for more information about this decorator.
Note also that, compared to the statistical notation above, our implementation uses 0-indexing instead of 1-indexing.
class LocalLevelModel:
def __init__(
self,
N: int,
epsilon_shape: float,
epsilon_rate: float,
eta_shape: float,
eta_rate: float,
a_1: float,
p_1: float,
) -> None:
self.N = N
self.epsilon_shape = epsilon_shape
self.epsilon_rate = epsilon_rate
self.eta_shape = eta_shape
self.eta_rate = eta_rate
self.a_1 = a_1
self.p_1 = p_1
@bm.random_variable
def SigmaEpsilon(self) -> RVIdentifier:
return dist.Gamma(self.epsilon_shape, self.epsilon_rate)
@bm.random_variable
def SigmaEta(self) -> RVIdentifier:
return dist.Gamma(self.eta_shape, self.eta_rate)
@bm.random_variable
def Alpha(self, n: int) -> RVIdentifier:
if n == 0:
return dist.Normal(self.a_1, self.p_1)
else:
return dist.Normal(self.Alpha(n - 1), self.SigmaEta())
@bm.random_variable
def Y(self, n: int) -> RVIdentifier:
return dist.Normal(self.Alpha(n), self.SigmaEpsilon())
Data​
iclaims
is a dataset containing the weekly initial claims for US unemployment benefits
against a few related google trend queries (unemploy, filling and job)from Jan 2010 -
June 2018. This dataset was curated
here and credit goes
towards Uber's Orbit package.
def load_iclaims(end_date='2018-06-24', transform=True):
"""Load iclaims dataset
Returns
-------
pd.DataFrame
Notes
-----
iclaims is a dataset containing the weekly initial claims for US unemployment benefits against a few related google
trend queries (unemploy, filling and job)from Jan 2010 - June 2018. This aims to mimick the dataset from the paper
Predicting the Present with Bayesian Structural Time Series by SCOTT and VARIAN (2014).
Number of claims are obtained from [Federal Reserve Bank of St. Louis] while google queries are obtained through
Google Trends API.
Note that dataset is transformed by natural log before fitting in order to be fitted as a multiplicative model.
https://fred.stlouisfed.org/series/ICNSA
https://trends.google.com/trends/?geo=US
https://finance.yahoo.com/
"""
url = 'https://raw.githubusercontent.com/uber/orbit/master/examples/data/iclaims_example.csv'
df = pd.read_csv(url, parse_dates=['week'])
df = df[df['week'] <= end_date]
# standardize the regressors by mean; equivalent to subtracting mean after np.log
regressors = ['trend.unemploy', 'trend.filling', 'trend.job', 'sp500', 'vix']
# convert to float
for col in regressors:
df[col] = df[col].astype(float)
if transform:
# log transfer
df[['claims'] + regressors] = df[['claims'] + regressors].apply(np.log)
# de-mean
df[regressors] = df[regressors] - df[regressors].apply(np.mean)
return df
y_obs = torch.tensor(load_iclaims().claims.values)[:40]
plt.plot(y_obs)
plt.show()
Inference​
Inference is the process of combining model with data to obtain insights, in the form of probability distributions over values of interest. Bean Machine offers a powerful and general inference framework to enable fitting a arbitrary models to data. First lets setup our priors and instantiate our model.
# Prior Specification
epsilon_shape = 0.5
epsilon_rate = 1.0
eta_shape = 0.5
eta_rate = 1.0
a_1 = y_obs.mean().item()
p_1 = 1.0
# Length of our time series
N = len(y_obs)
# Instantiate the model and pass our length and priors
model = LocalLevelModel(
N=N,
epsilon_shape=epsilon_shape,
epsilon_rate=epsilon_rate,
eta_shape=eta_shape,
eta_rate=eta_rate,
a_1=a_1,
p_1=p_1
)
Running inference consists of a few arguments:
Name | Usage |
---|---|
queries | A list of @bm.random_variable targets to fit posterior distributions for. |
observations | The Dict of observations we built up, above. |
num_samples | Number of samples to build up distributions for the values listed in queries . |
num_chains | Number of separate inference runs to use. Multiple chains can verify inference ran correctly. |
The next step is to define the queries and observations. For this particular run, we're interested in infering , , and .
queries = [model.Alpha(n) for n in range(model.N)] + [model.SigmaEpsilon(), model.SigmaEta()]
observations = {**{model.Y(n): y_obs[n] for n in range(model.N)}}
For this particular problem, we will use the GlobalNoUTurnSampler inference method. Bean Machine's GlobalNoUTurnSampler is the NUTS sampler you have probably used before with stan or pymc3. We have chosen to use the NUTS sampler here because it can be easily compared to other probabilistic tools.
num_samples = 400 if not smoke_test else 1
num_adaptive_samples = 50
all_samples = bm.GlobalNoUTurnSampler(max_tree_depth=10).infer(
queries,
observations,
num_samples,
num_chains=1,
num_adaptive_samples=num_adaptive_samples
)
samples = all_samples.get_chain(0)
alpha_samples = torch.stack([samples[model.Alpha(n)] for n in range(model.N)], dim=1)
Visualization​
We will look at the values of the samples collected for and . We will take the mean of samples taken over the last 10% of the chain, and compare these to our data.
tail_len = num_samples // 10
ppc = np.percentile(
alpha_samples[-tail_len:].numpy(),
q=[5, 50, 95],
axis=0
).T
plt.figure(figsize=(12,6))
plt.plot(y_obs, label='Observations', color='green')
plt.plot(ppc[:, 1], label=r'Median of $\alpha_t$', color='red')
plt.plot(ppc[:, 0], 'b--', label=r'$5th$ Percentile')
plt.plot(ppc[:, 2], 'b--', label=r'$95th$ Percentile')
plt.title(r"Observations versus $\alpha_t$")
plt.legend()
plt.show()
It appears inference has recovered the latent states very well, and is accurately representing when sampled.
Now we investigate chain mixing by looking at the last half of the sampling chain. This type of plot is called a trace plot and is useful to visualize where the sampler is proposing steps.
plt.figure(figsize=(12,6))
[plt.plot(alpha_samples[-num_samples//2:, i], label=f'Time index {i}') for i in [0, 10, 20, 30, 39]]
plt.legend()
plt.title(r"Trace Plots for $\alpha_t$")
plt.show()
We see a clear downward trend for successive time indices. This matches our intuition from the "Observations versus " plot as we see the original time series with a downward trend.
Finally we look at the uncertainty estimates as a function of time. These are the posterior samples for both and .
plt.figure(figsize=(12,6))
plt.plot(samples[model.SigmaEta()], label=r"State Uncertainty, $\sigma_\eta$")
plt.plot(samples[model.SigmaEpsilon()], label=r"Measurement Uncertainty, $\sigma_\epsilon$")
plt.title("Uncertainty")
plt.legend()
plt.show()
We see the final noise samples are quite small, reflecting the fact that the original data did not contain alot of noise.
Posterior Likelihood Checks​
One way to evaluate posterior samples is computing likelihoods of our posterior samples, and comparing these to the likelihood of the underlying data. Formally, we can compute the joint likelihood of posterior samples with the observations.
def log_likelihood(alpha_samples, sigma_epsilon, y_obs, N):
ll = 0
for n in range(N):
ll += dist.Normal(alpha_samples[n], sigma_epsilon).log_prob(y_obs[n])
return ll
# Iterate over the batches of samples and compute log likelihood
lls = [
log_likelihood(alpha_samples, sigma_epsilon_sample, y_obs, N)
for alpha_samples, sigma_epsilon_sample in zip(alpha_samples, samples[model.SigmaEpsilon()])
]
plt.figure(figsize=(12,6))
plt.plot(lls, label="Sample", c="g")
plt.title("Posterior Likelihood Check")
plt.ylabel("Log likelihood")
plt.xlabel("Iteration")
plt.legend()
plt.show()
From the above plot, inference appears to be fitting almost perfectly. The random variables given the observed data is exceptionally high.
Vectorized Model
Now lets try a vectorized model by using a custom Distribution
. We first extend
pytorch's torch.distributions.Distribution
into a vectorized version of a random walk.
The key idea behind this implementation is that a
Gaussian Random Walk
can also be expressed as a chain of iid Gaussian random variables that have a cumsum
operator applied to them (thus making the "walking" behavior). Now with this cumsum
operator, we will be able to sample entire chains in one step!
from torch.distributions import Distribution, constraints
class GaussianRandomWalk(Distribution):
support = constraints.real
def __init__(
self,
loc: torch.Tensor,
scale: torch.Tensor,
num_steps: int = 1,
validate_args=None,
) -> None:
assert isinstance(num_steps, int) and num_steps > 0, "`num_steps` argument should be an positive integer."
self.loc = loc
self.scale = scale
self.num_steps = num_steps
batch_shape, event_shape = scale.size(), torch.Size([num_steps])
super(GaussianRandomWalk, self).__init__(batch_shape, event_shape, validate_args=validate_args)
def sample(self, sample_shape: List = None):
if not sample_shape:
sample_shape = []
assert isinstance(sample_shape, list), "`sample_shape` argument should be a list."
shape = torch.Size(sample_shape) + self.batch_shape + self.event_shape
loc = torch.zeros(shape)
loc[..., 0] = self.loc
walks = dist.Normal(loc=loc, scale=1).sample()
return torch.cumsum(walks, axis=-1) * self.scale.unsqueeze(dim=-1)
def log_prob(self, value):
init_prob = dist.Normal(self.loc, self.scale).log_prob(value[..., 0])
scale = self.scale.unsqueeze(dim=-1)
step_probs = dist.Normal(value[..., :-1], scale).log_prob(value[..., 1:])
return init_prob + torch.sum(step_probs, axis=-1)
@property
def mean(self):
return torch.zeros(self.batch_shape + self.event_shape) + self.loc
@property
def variance(self):
return torch.broadcast_to(
self.scale.unsqueeze(dim=-1) ** 2 * torch.arange(1, self.num_steps + 1),
self.batch_shape + self.event_shape,
)
In our implementation above, we will now use the sample()
and log_prob()
functions
to seamlessly integrate into Bean Machine! Specifically, we replace the Alpha()
function with our vectorized distribution rather than recursively calling the univariate
function from before.
class VectorizedLocalLevelModel:
def __init__(
self, N: int, epsilon_shape: float, epsilon_rate: float, eta_shape: float, eta_rate: float, a_1: float
) -> None:
self.N = N
self.epsilon_shape = epsilon_shape
self.epsilon_rate = epsilon_rate
self.eta_shape = eta_shape
self.eta_rate = eta_rate
self.a_1 = a_1
@bm.random_variable
def SigmaEpsilon(self) -> RVIdentifier:
return dist.Gamma(self.epsilon_shape, self.epsilon_rate)
@bm.random_variable
def SigmaEta(self) -> RVIdentifier:
return dist.Gamma(self.eta_shape, self.eta_rate)
@bm.random_variable
def Alpha(self) -> RVIdentifier:
return GaussianRandomWalk(loc=self.a_1, scale=self.SigmaEta(), num_steps=self.N) # Vectorized!
@bm.random_variable
def Y(self) -> RVIdentifier:
return dist.Normal(self.Alpha(), self.SigmaEpsilon()) # Vectorized!
We keep the rest of the workflow to be very similar to the unvectorized model so that we can directly compare the results. We complete inference, visualize our samples, and then compute some diagnostics of the fitted values.
vec_model = VectorizedLocalLevelModel(
N=N,
epsilon_shape=epsilon_shape,
epsilon_rate=epsilon_rate,
eta_shape=eta_shape,
eta_rate=eta_rate,
a_1=a_1
)
queries = [vec_model.Alpha(), vec_model.SigmaEpsilon(), vec_model.SigmaEta()]
observations = {**{vec_model.Y(): y_obs}}
One difference is that we now fit 4 chains (serially) in order to better investigate some diagnostic metrics. You can read more about this in the and cells located below.
num_samples = 400 if not smoke_test else 1
vec_all_samples = bm.GlobalNoUTurnSampler(max_tree_depth=10).infer(
queries,
observations,
num_samples,
num_chains=4,
num_adaptive_samples=num_adaptive_samples,
)
The speedup is quite significant using a vectorized approach! One of the main benefits of using Bean Machine is being able to incorporate this type of fast code.
vec_samples = vec_all_samples.get_chain(0)
vec_alpha_samples = vec_samples[vec_model.Alpha()]
vec_alpha_samples.size()
Visualization​
tail_len = num_samples // 10
vec_ppc = np.percentile(
vec_alpha_samples[-tail_len:].numpy(),
q=[5, 50, 95],
axis=0
).T
plt.figure(figsize=(12,6))
plt.plot(y_obs, label='Observations', color='green')
plt.plot(vec_ppc[:, 1], label=r'$\alpha_t$', color='red')
plt.plot(vec_ppc[:, 0], 'b--', label=r'$5th$ Percentile')
plt.plot(vec_ppc[:, 2], 'b--', label=r'$95th$ Percentile')
plt.title(r"Observations versus $\alpha_t$")
plt.legend()
plt.show()
plt.figure(figsize=(12,6))
[plt.plot(vec_alpha_samples[-num_samples//2:, i], label=f'Time index {i}') for i in [0, 10, 20, 30, 38]]
plt.legend()
plt.title(r"Trace Plots for $\alpha_t$")
plt.show()
plt.plot(vec_samples[vec_model.SigmaEta()], label=r"State Uncertainty, $\sigma_\eta$")
plt.plot(vec_samples[vec_model.SigmaEpsilon()], label=r"Measurement Uncertainty, $\sigma_\epsilon$")
plt.legend()
plt.show()
Diagnostics​
Two important statistics to take note of are the (r_hat
) and effective
sample size (ess
) values in the below dataframe, see Vehtari et al.
import arviz as az
summary_df = az.summary(vec_all_samples.to_xarray(), round_to=3)
summary_df.T
SigmaEpsilon(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,) | Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[0] | Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[1] | Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[2] | Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[3] | Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[4] | Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[5] | Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[6] | Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[7] | Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[8] | ... | Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[31] | Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[32] | Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[33] | Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[34] | Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[35] | Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[36] | Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[37] | Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[38] | Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[39] | SigmaEta(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | 0.029 | 13.373 | 13.592 | 13.394 | 13.157 | 13.192 | 13.147 | 13.088 | 13.04 | 13.067 | ... | 12.954 | 12.911 | 12.863 | 12.856 | 12.847 | 12.755 | 12.846 | 12.829 | 12.832 | 0.113 |
sd | 0.013 | 0.032 | 0.044 | 0.03 | 0.032 | 0.03 | 0.028 | 0.029 | 0.027 | 0.029 | ... | 0.029 | 0.028 | 0.028 | 0.029 | 0.027 | 0.031 | 0.029 | 0.029 | 0.029 | 0.014 |
hdi_3% | 0.009 | 13.315 | 13.517 | 13.341 | 13.096 | 13.135 | 13.094 | 13.036 | 12.991 | 13.014 | ... | 12.901 | 12.849 | 12.807 | 12.799 | 12.793 | 12.697 | 12.787 | 12.777 | 12.782 | 0.09 |
hdi_97% | 0.052 | 13.429 | 13.658 | 13.454 | 13.214 | 13.246 | 13.201 | 13.148 | 13.095 | 13.123 | ... | 13.008 | 12.962 | 12.915 | 12.908 | 12.899 | 12.812 | 12.897 | 12.884 | 12.892 | 0.14 |
mcse_mean | 0.002 | 0.002 | 0.005 | 0.001 | 0.002 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | ... | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
mcse_sd | 0.002 | 0.001 | 0.003 | 0.001 | 0.002 | 0.001 | 0.001 | 0.001 | 0 | 0 | ... | 0 | 0 | 0.001 | 0 | 0 | 0.001 | 0.001 | 0.001 | 0 | 0.001 |
ess_bulk | 21.341 | 496.575 | 124.038 | 1102.35 | 248.995 | 1768.7 | 664.539 | 1714.35 | 1763.9 | 2192.1 | ... | 1735.31 | 2262.76 | 1488.57 | 2141.01 | 1634.28 | 684.478 | 1478.13 | 1645.62 | 2114.61 | 222.529 |
ess_tail | 51.428 | 278.26 | 138.976 | 335.601 | 225.552 | 315.877 | 600.174 | 835.802 | 732.362 | 511.996 | ... | 463.851 | 331.508 | 595.205 | 642.63 | 562.712 | 373.118 | 625.599 | 679.928 | 815.075 | 448.152 |
r_hat | 1.179 | 1.02 | 1.035 | 1.026 | 1.021 | 1.026 | 1.021 | 1.024 | 1.032 | 1.011 | ... | 1.023 | 1.042 | 1.024 | 1.022 | 1.022 | 1.012 | 1.043 | 1.017 | 1.027 | 1.018 |
Diagnostic​
is a diagnostic tool that measures the between- and within-chain variances. It is a test that indicates a lack of convergence by comparing the variance between multiple chains to the variance within each chain. If the parameters are successfully exploring the full space for each chain, then , since the between-chain and within-chain variance should be equal. is calculated as
where is the within-chain variance and is the posterior variance estimate for the pooled rank-traces. The take-away here is that converges towards 1 when each of the Markov chains approaches perfect adaptation to the true posterior distribution. We do not recommend using inference results if . More information about can be found in the Vehtari et al paper.
Effective sample size diagnostic​
MCMC samplers do not draw independent samples from the target distribution, which means that our samples are correlated. In an ideal situation all samples would be independent, but we do not have that luxury. We can, however, measure the number of effectively independent samples we draw, which is called the effective sample size. You can read more about how this value is calculated in the Vehtari et al paper, briefly it is a measure that combines information from the value with the autocorrelation estimates within the chains. There are many ways to estimate effective samples sizes, however, we will be using the method defined in the Vehtari et al paper.
The rule of thumb for ess_bulk
is for this value to be greater than 100 per chain on
average. Since we ran four chains, we need ess_bulk
to be greater than 400 for each
parameter. The ess_tail
is an estimate for effectively independent samples considering
the more extreme values of the posterior. This is not the number of samples that landed
in the tails of the posterior. It is a measure of the number of effectively independent
samples if we sampled the tails of the posterior. The rule of thumb for this value is
also to be greater than 100 per chain on average.
Heteroscedastic Model
In the previous sections, one core assumption made about the data generation process is that , meaning that all of the noise in our measurement equation was sampled the same way, from the same distribution. In the Statistics literature, this is known as homoscedastic noise. Instead, we could postulate a second latent random variable, that would capture some of the serial correlation in the noise over time (as opposed to the level). This is referred to as a heteroscedastic noise. That is, we could say . One simple choice for is the same choice we made for , that is: $Where the definition of $\\epsilon_t doesn't change (or its associated prior, $\sigma_\epsilon$). Now, our model looks like this:
So that as the process goes on, the noise can now "walk" just like our level from
before. The last detail to attend to is the function . Since the Normal
distribution's scale
argument lives in , we need to apply a
transformation to for the scale
argument to be valid. One nice candidate is
torch.nn.Softplus()
which is a smooth approximation to the Relu
function. We will use the @bm.functional
wrapper to accomplish this deterministic mapping, creating an intermediate random
variable called Scale
in the process.
class HeteroLocalLevelModel:
def __init__(
self,
N: int,
epsilon_shape: float,
epsilon_rate: float,
eta_shape: float,
eta_rate: float,
a_1: float,
) -> None:
self.N = N
self.epsilon_shape = epsilon_shape
self.epsilon_rate = epsilon_rate
self.eta_shape = eta_shape
self.eta_rate = eta_rate
self.a_1 = a_1
self.p_1 = p_1
@bm.random_variable
def SigmaEpsilon(self) -> RVIdentifier:
return dist.Gamma(self.epsilon_shape, self.epsilon_rate)
@bm.random_variable
def SigmaEta(self) -> RVIdentifier:
return dist.Gamma(self.eta_shape, self.eta_rate)
@bm.random_variable
def Alpha(self) -> RVIdentifier:
return GaussianRandomWalk(loc=self.a_1, scale=self.SigmaEta(), num_steps=self.N) # Vectorized!
@bm.random_variable
def Beta(self) -> RVIdentifier:
return GaussianRandomWalk(loc=torch.Tensor([0.0]), scale=self.SigmaEpsilon(), num_steps=self.N) # Vectorized!
@bm.functional
def Scale(self) -> RVIdentifier:
return torch.nn.Softplus()(self.Beta()) # Transform Beta to a valid scale argument
@bm.random_variable
def Y(self) -> RVIdentifier:
return dist.Normal(self.Alpha(), self.Scale()) # Vectorized!
hetero_model = HeteroLocalLevelModel(
N=N,
epsilon_shape=epsilon_shape,
epsilon_rate=epsilon_rate,
eta_shape=eta_shape,
eta_rate=eta_rate,
a_1=a_1
)
queries = [hetero_model.Alpha(), hetero_model.Scale()]
observations = {**{hetero_model.Y(): y_obs}}
num_samples = 400 if not smoke_test else 1
hetero_all_samples = bm.GlobalNoUTurnSampler().infer(
queries, observations, num_samples, num_chains=1, num_adaptive_samples=num_adaptive_samples * 3
)
hetero_samples = hetero_all_samples.get_chain(0)
hetero_alpha_samples = hetero_samples[hetero_model.Alpha()]
hetero_scale_samples = hetero_samples[hetero_model.Scale()]
Visualization​
One last note before visualization, we are now computing uncertainty of both our level (), our noise (), and our observations . These "types" of uncertainty can be broadly cast into two seperate categories: aleatoric uncertainty and epistimic uncertainty.
A quick read on the topic can be found here but the gist is that aleatoric is the data uncertainty, and measures the inherent randomness in the observed data. This type of uncertainty would be appropriate for forming prediction intervals for new data. Epistimic uncertainty on the other hand, measures the uncertainty in a parameter estimate. So here, we are estimating the conditional mean () and conditional variance (). This type of uncertainty is useful for forming credible intervals. We first explore aleatoric uncertainty and then see how it is impacted by also considering epistemic uncertainty.
# Point estimates of Alpha and Beta
loc = hetero_alpha_samples[-tail_len:].mean(0)
scale = hetero_scale_samples[-tail_len:].mean(0)
aleatoric = dist.Normal(loc, scale)
quantiles = aleatoric.icdf(torch.Tensor([0.05, 0.95]).unsqueeze(-1))
# Form prediction interval with the point estimates
plt.figure(figsize=(12, 6))
plt.plot(y_obs, label="Observations", color="green")
plt.plot(loc, "r--", label=r"Mean of $\alpha_t$")
plt.fill_between(torch.arange(N), quantiles[0, ...], quantiles[1, ...], alpha=0.5, label="Prediction Interval")
plt.title(r"Aleatoric Uncertainty: Observations versus $\alpha_t$")
plt.legend()
plt.show()
As seen from the plot, the mean of appears much smoother than before. The
reason is because by fitting both the location and scale of
, we enable a more flexible fit to the data
(contrast this from before when SigmaEpsilon()
was a single parameter that was
supposed to estimate the noise for the entire sequence!). For example, when the data
becomes more noisy, our model can simply widen without overcorrecting
which ensures smoother connections with later points. Thus, the model is able
to more accurately detect a latent trend while still learning an evolving
(heteroskedastic) noise level. This is an example of learning the aleatoric
uncertainty (inherent randomness), which we display with a prediction interval.
Now we show the other type of uncertainty, epistemic uncertainty.
epistemic_alpha = np.quantile(hetero_alpha_samples[-tail_len:].detach().numpy(), [0.05, 0.95], axis=0)
# CI for Beta
plt.figure(figsize=(16, 10))
plt.fill_between(
torch.arange(N),
np.quantile((hetero_alpha_samples[-tail_len:] - hetero_scale_samples[-tail_len:]).detach().numpy(), 0.05, 0),
np.quantile((hetero_alpha_samples[-tail_len:] + hetero_scale_samples[-tail_len:]).detach().numpy(), 0.95, 0),
alpha=0.5,
label=r"95% Credible Interval for $\beta_t \pm \alpha_t$"
)
# Beta samples
plt.plot(hetero_alpha_samples[-tail_len:].T - hetero_scale_samples[-tail_len:].T, 'r--', linewidth=0.1)
plt.plot(hetero_alpha_samples[-tail_len:].T + hetero_scale_samples[-tail_len:].T, 'r--', linewidth=0.1)
plt.plot(hetero_alpha_samples[-tail_len:].T[0] + hetero_scale_samples[-tail_len:].T[0], 'r--', label=r'Samples of $\beta_t \pm \alpha_t$', linewidth=0.1)
# CI for Alpha
plt.fill_between(
torch.arange(N),
epistemic_alpha[0, ...],
epistemic_alpha[1, ...],
alpha=0.5,
color='red',
label=r"95% Credible Interval for $\alpha_t$"
)
# Samples for Alpha
plt.plot(hetero_alpha_samples[-tail_len:].T, 'b--', linewidth=0.3)
plt.plot(hetero_alpha_samples[-tail_len:].T[0, ...], 'b--', label=r'Samples of $\alpha_t$', linewidth=0.3)
plt.plot(y_obs, label="Observations", color="green", linewidth=3)
plt.title(r"Epistemic Uncertainty: Observations versus $\alpha_t$")
plt.legend()
plt.show()
Before we showed how there was uncertainty in where the data will be observed next
(aleatoric) which we encoded with a distribution
made up of point estimates of parameters Alpha
and Beta
. But remember, these
parameters are themselves estimated from the data and have additional uncertainty from
the parameter estimation process! This is the epistemic uncertainty. Rather than
show the point estimates, above we show a 95% credible interval for both and
at each time index. Note that 's uncertainty is superimposed (added
and subtracted) on top of since it is measured in different units. It is now
clear that by integrating in the epistemic uncertainty into our parameter estimates
for the aleatoric uncertainty, the corresponding prediction intervals would be a great
deal wider.
In either case, by using Bean Machine we can handle epistemic and aleatoric uncertainty quite easily with only about 1-2 lines of extra code!