Skip to main content

Bayesian Structural Time Series

Tutorial: Bayesian Structural Time Series Model

This tutorial demonstrates modeling and running inference on various Bayesian Structural Time Series (STS) models. We demonstrate some great features of Bean Machine such as: modeling serially correlated variables, the NUTS sampler applied to a global pool of variables, and providing custom distributions for inference.

import sys
import os

if 'google.colab' in sys.modules and 'beanmachine' not in sys.modules:
!pip install beanmachine

smoke_test = ('SANDCASTLE_NEXUS' in os.environ or 'CI' in os.environ)

Problem

Bayesian STS is a general class of additive models for Time Series series data; y1,y2,...,yny_1, y_2, ..., y_n that have an associated State Space denoted by; α1,α2,...,αn\alpha_1, \alpha_2, ..., \alpha_n which provides a probabilistic model for the serial correlation observed in the time series. For this tutorial, we will consider the case when αi\alpha_i takes values in a continous state space (For a discussion of the discrete state space case, refer to this tutorial on Hidden Markov Models).

Bayesian STS models typically contain, but are not limited to, some additive mixture of Error (ϵ\epsilon), Trend (μ\mu), and Seasonality (γ\gamma) components: $yt=μt+γt+ϵty_t = \mu_t + \gamma_t + \epsilon_t$ Where each component is controlled either by some recursive relation of the state space, $\alpha_i$ or external covariates (i.e. Regression) such as day-of-week effects, business logic, or categories of the data. When STS just contains just the Error, Trend, and Seasonality components, it is sometimes referred to as the ETS model. Possibly the simpliest version of STS is called the local level model which is where we will start our discussion after some prerequisites.

Prerequisites

Let's code this in Bean Machine! Import the Bean Machine library, some fundamental PyTorch classes, and optionally typing for our code.

import logging
from typing import List

import beanmachine.ppl as bm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.distributions as dist
from beanmachine.ppl.model import RVIdentifier


logging.getLogger("beanmachine").setLevel(50)
bm.seed(111)

Local Level Model

In this simple local level version of an STS, yty_t is the sum of our state variable αt\alpha_t plus Gaussian Noise, ϵt\epsilon_t. All of the time dependency is modeled in the evolution of αt+1\alpha_{t+1}, which is equal to its previous time value αt\alpha_t plus Gaussian Noise, ηt\eta_t that is independent of all ϵi\epsilon_i's. This is sometimes referred to a random walk because at each step, noise is injected to the previous state to arrive at a new state. To summarize, the random variables in play at this point are:

  • ϵt∼iidNormal(0,σϵ2)\epsilon_t \stackrel{iid}{\sim} \text{Normal}(0, \sigma_\epsilon^2)
  • ηt∼iidNormal(0,ση2)\eta_t \stackrel{iid}{\sim} \text{Normal}(0, \sigma_\eta^2)
  • ηt⊥ϵi,∀i∈{1,...,n}\eta_t \perp \epsilon_i, \forall i \in \{1,..., n\}

Most STS formulations are expressed in the state space representation, which is a coupled set of equations known as the measurement and state equations. Using our notation from before, the local level state space representation is:

NameFormula
Measurement Equationyt=αt+ϵty_t = \alpha_t + \epsilon_t
State Equationαt+1=αt+ηt\alpha_{t+1} = \alpha_t + \eta_t

This can be represented more compactly as:

  • αt+1∣αt∼Normal(αt,ση2)\alpha_{t+1} | \alpha_t \sim \text{Normal}(\alpha_{t}, \sigma_\eta^2)
  • yt∣αt∼Normal(αt,σe2)y_t | \alpha_t \sim \text{Normal}(\alpha_t, \sigma_e^2)

To make our STS "Bayesian", we will assign priors to σϵ\sigma_\epsilon, ση\sigma_\eta, and our initial state α1\alpha_1:

  • σϵ∼Gamma(ϵshape,ϵrate)\sigma_\epsilon \sim \text{Gamma}(\epsilon_{\text{shape}}, \epsilon_{\text{rate}})
  • ση∼Gamma(ηshape,ηrate)\sigma_\eta \sim \text{Gamma}(\eta_{\text{shape}}, \eta_{\text{rate}})
  • α1∼Normal(a1,p12)\alpha_1 \sim \text{Normal}(a_1, p_1^2)

We can implement this model in Bean Machine by defining random variable objects with the @bm.random_variable decorator. These functions behave differently than ordinary Python functions.

Semantics for @bm.random_variable functions:

  • They must return PyTorch Distribution objects.
  • Though they return distributions, callees actually receive samples from the distribution. The machinery for obtaining samples from distributions is handled internally by Bean Machine.
  • Inference runs the model through many iterations. During a particular inference iteration, a distinct random variable will correspond to exactly one sampled value: calls to the same random variable function with the same arguments will receive the same sampled value within one inference iteration. This makes it easy for multiple components of your model to refer to the same logical random variable.
  • Consequently, to define distinct random variables that correspond to different sampled values during a particular inference iteration, an effective practice is to add a dummy "indexing" parameter to the function. Distinct random variables can be referred to with different values for this index.
  • Please see the documentation for more information about this decorator.

Note also that, compared to the statistical notation above, our implementation uses 0-indexing instead of 1-indexing.

class LocalLevelModel:
def __init__(
self,
N: int,
epsilon_shape: float,
epsilon_rate: float,
eta_shape: float,
eta_rate: float,
a_1: float,
p_1: float,
) -> None:
self.N = N
self.epsilon_shape = epsilon_shape
self.epsilon_rate = epsilon_rate
self.eta_shape = eta_shape
self.eta_rate = eta_rate
self.a_1 = a_1
self.p_1 = p_1

@bm.random_variable
def SigmaEpsilon(self) -> RVIdentifier:
return dist.Gamma(self.epsilon_shape, self.epsilon_rate)

@bm.random_variable
def SigmaEta(self) -> RVIdentifier:
return dist.Gamma(self.eta_shape, self.eta_rate)

@bm.random_variable
def Alpha(self, n: int) -> RVIdentifier:
if n == 0:
return dist.Normal(self.a_1, self.p_1)
else:
return dist.Normal(self.Alpha(n - 1), self.SigmaEta())

@bm.random_variable
def Y(self, n: int) -> RVIdentifier:
return dist.Normal(self.Alpha(n), self.SigmaEpsilon())

Data​

iclaims is a dataset containing the weekly initial claims for US unemployment benefits against a few related google trend queries (unemploy, filling and job)from Jan 2010 - June 2018. This dataset was curated here and credit goes towards Uber's Orbit package.

def load_iclaims(end_date='2018-06-24', transform=True):
"""Load iclaims dataset
Returns
-------
pd.DataFrame
Notes
-----
iclaims is a dataset containing the weekly initial claims for US unemployment benefits against a few related google
trend queries (unemploy, filling and job)from Jan 2010 - June 2018. This aims to mimick the dataset from the paper
Predicting the Present with Bayesian Structural Time Series by SCOTT and VARIAN (2014).
Number of claims are obtained from [Federal Reserve Bank of St. Louis] while google queries are obtained through
Google Trends API.
Note that dataset is transformed by natural log before fitting in order to be fitted as a multiplicative model.
https://fred.stlouisfed.org/series/ICNSA
https://trends.google.com/trends/?geo=US
https://finance.yahoo.com/
"""
url = 'https://raw.githubusercontent.com/uber/orbit/master/examples/data/iclaims_example.csv'
df = pd.read_csv(url, parse_dates=['week'])
df = df[df['week'] <= end_date]

# standardize the regressors by mean; equivalent to subtracting mean after np.log
regressors = ['trend.unemploy', 'trend.filling', 'trend.job', 'sp500', 'vix']

# convert to float
for col in regressors:
df[col] = df[col].astype(float)

if transform:
# log transfer
df[['claims'] + regressors] = df[['claims'] + regressors].apply(np.log)
# de-mean
df[regressors] = df[regressors] - df[regressors].apply(np.mean)

return df
y_obs = torch.tensor(load_iclaims().claims.values)[:40]
plt.plot(y_obs)
plt.show()

Inference​

Inference is the process of combining model with data to obtain insights, in the form of probability distributions over values of interest. Bean Machine offers a powerful and general inference framework to enable fitting a arbitrary models to data. First lets setup our priors and instantiate our model.

# Prior Specification
epsilon_shape = 0.5
epsilon_rate = 1.0
eta_shape = 0.5
eta_rate = 1.0
a_1 = y_obs.mean().item()
p_1 = 1.0

# Length of our time series
N = len(y_obs)
# Instantiate the model and pass our length and priors
model = LocalLevelModel(
N=N,
epsilon_shape=epsilon_shape,
epsilon_rate=epsilon_rate,
eta_shape=eta_shape,
eta_rate=eta_rate,
a_1=a_1,
p_1=p_1
)

Running inference consists of a few arguments:

NameUsage
queriesA list of @bm.random_variable targets to fit posterior distributions for.
observationsThe Dict of observations we built up, above.
num_samplesNumber of samples to build up distributions for the values listed in queries.
num_chainsNumber of separate inference runs to use. Multiple chains can verify inference ran correctly.

The next step is to define the queries and observations. For this particular run, we're interested in infering α\alpha, σϵ\sigma_\epsilon, and ση\sigma_\eta.

queries = [model.Alpha(n) for n in range(model.N)] + [model.SigmaEpsilon(), model.SigmaEta()]
observations = {**{model.Y(n): y_obs[n] for n in range(model.N)}}

For this particular problem, we will use the GlobalNoUTurnSampler inference method. Bean Machine's GlobalNoUTurnSampler is the NUTS sampler you have probably used before with stan or pymc3. We have chosen to use the NUTS sampler here because it can be easily compared to other probabilistic tools.

num_samples = 400 if not smoke_test else 1
num_adaptive_samples = 50
all_samples = bm.GlobalNoUTurnSampler(max_tree_depth=10).infer(
queries,
observations,
num_samples,
num_chains=1,
num_adaptive_samples=num_adaptive_samples
)

samples = all_samples.get_chain(0)
Out:

Samples collected: 0%| | 0/450 [00:00<?, ?it/s]

alpha_samples = torch.stack([samples[model.Alpha(n)] for n in range(model.N)], dim=1)

Visualization​

We will look at the values of the samples collected for α\alpha and σϵ\sigma_\epsilon. We will take the mean of samples taken over the last 10% of the chain, and compare these to our data.

tail_len = num_samples // 10
ppc = np.percentile(
alpha_samples[-tail_len:].numpy(),
q=[5, 50, 95],
axis=0
).T

plt.figure(figsize=(12,6))
plt.plot(y_obs, label='Observations', color='green')
plt.plot(ppc[:, 1], label=r'Median of $\alpha_t$', color='red')
plt.plot(ppc[:, 0], 'b--', label=r'$5th$ Percentile')
plt.plot(ppc[:, 2], 'b--', label=r'$95th$ Percentile')
plt.title(r"Observations versus $\alpha_t$")
plt.legend()
plt.show()

It appears inference has recovered the latent states α\alpha very well, and is accurately representing YY when sampled.

Now we investigate chain mixing by looking at the last half of the sampling chain. This type of plot is called a trace plot and is useful to visualize where the sampler is proposing steps.

plt.figure(figsize=(12,6))
[plt.plot(alpha_samples[-num_samples//2:, i], label=f'Time index {i}') for i in [0, 10, 20, 30, 39]]
plt.legend()
plt.title(r"Trace Plots for $\alpha_t$")
plt.show()

We see a clear downward trend for successive time indices. This matches our intuition from the "Observations versus αt\alpha_t" plot as we see the original time series with a downward trend.

Finally we look at the uncertainty estimates as a function of time. These are the posterior samples for both ση\sigma_\eta and σϵ\sigma_\epsilon.

plt.figure(figsize=(12,6))
plt.plot(samples[model.SigmaEta()], label=r"State Uncertainty, $\sigma_\eta$")
plt.plot(samples[model.SigmaEpsilon()], label=r"Measurement Uncertainty, $\sigma_\epsilon$")
plt.title("Uncertainty")
plt.legend()
plt.show()

We see the final noise samples are quite small, reflecting the fact that the original data did not contain alot of noise.

Posterior Likelihood Checks​

One way to evaluate posterior samples is computing likelihoods of our posterior samples, and comparing these to the likelihood of the underlying data. Formally, we can compute the joint likelihood of posterior samples with the observations.

def log_likelihood(alpha_samples, sigma_epsilon, y_obs, N):
ll = 0
for n in range(N):
ll += dist.Normal(alpha_samples[n], sigma_epsilon).log_prob(y_obs[n])
return ll


# Iterate over the batches of samples and compute log likelihood
lls = [
log_likelihood(alpha_samples, sigma_epsilon_sample, y_obs, N)
for alpha_samples, sigma_epsilon_sample in zip(alpha_samples, samples[model.SigmaEpsilon()])
]
plt.figure(figsize=(12,6))
plt.plot(lls, label="Sample", c="g")
plt.title("Posterior Likelihood Check")
plt.ylabel("Log likelihood")
plt.xlabel("Iteration")

plt.legend()
plt.show()

From the above plot, inference appears to be fitting almost perfectly. The random variables given the observed data is exceptionally high.

Vectorized Model

Now lets try a vectorized model by using a custom Distribution. We first extend pytorch's torch.distributions.Distribution into a vectorized version of a random walk. The key idea behind this implementation is that a Gaussian Random Walk can also be expressed as a chain of iid Gaussian random variables that have a cumsum operator applied to them (thus making the "walking" behavior). Now with this cumsum operator, we will be able to sample entire chains in one step!

from torch.distributions import Distribution, constraints


class GaussianRandomWalk(Distribution):
support = constraints.real

def __init__(
self,
loc: torch.Tensor,
scale: torch.Tensor,
num_steps: int = 1,
validate_args=None,
) -> None:
assert isinstance(num_steps, int) and num_steps > 0, "`num_steps` argument should be an positive integer."
self.loc = loc
self.scale = scale
self.num_steps = num_steps
batch_shape, event_shape = scale.size(), torch.Size([num_steps])
super(GaussianRandomWalk, self).__init__(batch_shape, event_shape, validate_args=validate_args)

def sample(self, sample_shape: List = None):
if not sample_shape:
sample_shape = []
assert isinstance(sample_shape, list), "`sample_shape` argument should be a list."
shape = torch.Size(sample_shape) + self.batch_shape + self.event_shape
loc = torch.zeros(shape)
loc[..., 0] = self.loc
walks = dist.Normal(loc=loc, scale=1).sample()
return torch.cumsum(walks, axis=-1) * self.scale.unsqueeze(dim=-1)

def log_prob(self, value):
init_prob = dist.Normal(self.loc, self.scale).log_prob(value[..., 0])
scale = self.scale.unsqueeze(dim=-1)
step_probs = dist.Normal(value[..., :-1], scale).log_prob(value[..., 1:])
return init_prob + torch.sum(step_probs, axis=-1)

@property
def mean(self):
return torch.zeros(self.batch_shape + self.event_shape) + self.loc

@property
def variance(self):
return torch.broadcast_to(
self.scale.unsqueeze(dim=-1) ** 2 * torch.arange(1, self.num_steps + 1),
self.batch_shape + self.event_shape,
)

In our implementation above, we will now use the sample() and log_prob() functions to seamlessly integrate into Bean Machine! Specifically, we replace the Alpha() function with our vectorized distribution rather than recursively calling the univariate function from before.

class VectorizedLocalLevelModel:
def __init__(
self, N: int, epsilon_shape: float, epsilon_rate: float, eta_shape: float, eta_rate: float, a_1: float
) -> None:
self.N = N
self.epsilon_shape = epsilon_shape
self.epsilon_rate = epsilon_rate
self.eta_shape = eta_shape
self.eta_rate = eta_rate
self.a_1 = a_1

@bm.random_variable
def SigmaEpsilon(self) -> RVIdentifier:
return dist.Gamma(self.epsilon_shape, self.epsilon_rate)

@bm.random_variable
def SigmaEta(self) -> RVIdentifier:
return dist.Gamma(self.eta_shape, self.eta_rate)

@bm.random_variable
def Alpha(self) -> RVIdentifier:
return GaussianRandomWalk(loc=self.a_1, scale=self.SigmaEta(), num_steps=self.N) # Vectorized!

@bm.random_variable
def Y(self) -> RVIdentifier:
return dist.Normal(self.Alpha(), self.SigmaEpsilon()) # Vectorized!

We keep the rest of the workflow to be very similar to the unvectorized model so that we can directly compare the results. We complete inference, visualize our samples, and then compute some diagnostics of the fitted values.

vec_model = VectorizedLocalLevelModel(
N=N,
epsilon_shape=epsilon_shape,
epsilon_rate=epsilon_rate,
eta_shape=eta_shape,
eta_rate=eta_rate,
a_1=a_1
)

queries = [vec_model.Alpha(), vec_model.SigmaEpsilon(), vec_model.SigmaEta()]
observations = {**{vec_model.Y(): y_obs}}

One difference is that we now fit 4 chains (serially) in order to better investigate some diagnostic metrics. You can read more about this in the R^\hat{R} and essess cells located below.

num_samples = 400 if not smoke_test else 1
vec_all_samples = bm.GlobalNoUTurnSampler(max_tree_depth=10).infer(
queries,
observations,
num_samples,
num_chains=4,
num_adaptive_samples=num_adaptive_samples,
)
Out:

Samples collected: 0%| | 0/450 [00:00<?, ?it/s]

Samples collected: 0%| | 0/450 [00:00<?, ?it/s]

Samples collected: 0%| | 0/450 [00:00<?, ?it/s]

Samples collected: 0%| | 0/450 [00:00<?, ?it/s]

The speedup is quite significant using a vectorized approach! One of the main benefits of using Bean Machine is being able to incorporate this type of fast code.

vec_samples = vec_all_samples.get_chain(0)
vec_alpha_samples = vec_samples[vec_model.Alpha()]
vec_alpha_samples.size()
Out:

torch.Size([400, 40])

Visualization​

tail_len = num_samples // 10

vec_ppc = np.percentile(
vec_alpha_samples[-tail_len:].numpy(),
q=[5, 50, 95],
axis=0
).T

plt.figure(figsize=(12,6))
plt.plot(y_obs, label='Observations', color='green')
plt.plot(vec_ppc[:, 1], label=r'$\alpha_t$', color='red')
plt.plot(vec_ppc[:, 0], 'b--', label=r'$5th$ Percentile')
plt.plot(vec_ppc[:, 2], 'b--', label=r'$95th$ Percentile')
plt.title(r"Observations versus $\alpha_t$")
plt.legend()
plt.show()

plt.figure(figsize=(12,6))
[plt.plot(vec_alpha_samples[-num_samples//2:, i], label=f'Time index {i}') for i in [0, 10, 20, 30, 38]]
plt.legend()
plt.title(r"Trace Plots for $\alpha_t$")
plt.show()

plt.plot(vec_samples[vec_model.SigmaEta()], label=r"State Uncertainty, $\sigma_\eta$")
plt.plot(vec_samples[vec_model.SigmaEpsilon()], label=r"Measurement Uncertainty, $\sigma_\epsilon$")
plt.legend()
plt.show()

Diagnostics​

Two important statistics to take note of are the R^\hat{R} (r_hat) and effective sample size (ess) values in the below dataframe, see Vehtari et al.

import arviz as az
summary_df = az.summary(vec_all_samples.to_xarray(), round_to=3)
summary_df.T
SigmaEpsilon(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[0]Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[1]Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[2]Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[3]Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[4]Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[5]Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[6]Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[7]Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[8]...Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[31]Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[32]Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[33]Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[34]Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[35]Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[36]Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[37]Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[38]Alpha(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)[39]SigmaEta(<main.VectorizedLocalLevelModel object at 0x7fee105f6f50>,)
mean0.02913.37313.59213.39413.15713.19213.14713.08813.0413.067...12.95412.91112.86312.85612.84712.75512.84612.82912.8320.113
sd0.0130.0320.0440.030.0320.030.0280.0290.0270.029...0.0290.0280.0280.0290.0270.0310.0290.0290.0290.014
hdi_3%0.00913.31513.51713.34113.09613.13513.09413.03612.99113.014...12.90112.84912.80712.79912.79312.69712.78712.77712.7820.09
hdi_97%0.05213.42913.65813.45413.21413.24613.20113.14813.09513.123...13.00812.96212.91512.90812.89912.81212.89712.88412.8920.14
mcse_mean0.0020.0020.0050.0010.0020.0010.0010.0010.0010.001...0.0010.0010.0010.0010.0010.0010.0010.0010.0010.001
mcse_sd0.0020.0010.0030.0010.0020.0010.0010.00100...000.001000.0010.0010.00100.001
ess_bulk21.341496.575124.0381102.35248.9951768.7664.5391714.351763.92192.1...1735.312262.761488.572141.011634.28684.4781478.131645.622114.61222.529
ess_tail51.428278.26138.976335.601225.552315.877600.174835.802732.362511.996...463.851331.508595.205642.63562.712373.118625.599679.928815.075448.152
r_hat1.1791.021.0351.0261.0211.0261.0211.0241.0321.011...1.0231.0421.0241.0221.0221.0121.0431.0171.0271.018

R^\hat{R} Diagnostic​

R^\hat{R} is a diagnostic tool that measures the between- and within-chain variances. It is a test that indicates a lack of convergence by comparing the variance between multiple chains to the variance within each chain. If the parameters are successfully exploring the full space for each chain, then R^≈1\hat{R}\approx 1, since the between-chain and within-chain variance should be equal. R^\hat{R} is calculated as

R^=V^W\hat{R}=\frac{\hat{V}}{W}

where WW is the within-chain variance and V^\hat{V} is the posterior variance estimate for the pooled rank-traces. The take-away here is that R^\hat{R} converges towards 1 when each of the Markov chains approaches perfect adaptation to the true posterior distribution. We do not recommend using inference results if R^>1.01\hat{R}>1.01. More information about R^\hat{R} can be found in the Vehtari et al paper.

Effective sample size essess diagnostic​

MCMC samplers do not draw independent samples from the target distribution, which means that our samples are correlated. In an ideal situation all samples would be independent, but we do not have that luxury. We can, however, measure the number of effectively independent samples we draw, which is called the effective sample size. You can read more about how this value is calculated in the Vehtari et al paper, briefly it is a measure that combines information from the R^\hat{R} value with the autocorrelation estimates within the chains. There are many ways to estimate effective samples sizes, however, we will be using the method defined in the Vehtari et al paper.

The rule of thumb for ess_bulk is for this value to be greater than 100 per chain on average. Since we ran four chains, we need ess_bulk to be greater than 400 for each parameter. The ess_tail is an estimate for effectively independent samples considering the more extreme values of the posterior. This is not the number of samples that landed in the tails of the posterior. It is a measure of the number of effectively independent samples if we sampled the tails of the posterior. The rule of thumb for this value is also to be greater than 100 per chain on average.

Heteroscedastic Model

In the previous sections, one core assumption made about the data generation process is that ϵt∼iidNormal(0,σϵ2)\epsilon_t \stackrel{iid}{\sim} \text{Normal}(0, \sigma_\epsilon^2), meaning that all of the noise in our measurement equation was sampled the same way, from the same distribution. In the Statistics literature, this is known as homoscedastic noise. Instead, we could postulate a second latent random variable, βt\beta_t that would capture some of the serial correlation in the noise over time (as opposed to the level). This is referred to as a heteroscedastic noise. That is, we could say β=g(βt−1)\beta = g(\beta_{t - 1}). One simple choice for g(⋅)g(\cdot) is the same choice we made for αt\alpha_t, that is: $βt+1=βt+ϵt\beta_{t + 1} = \beta_{t} + \epsilon_{t}Where the definition of $\\epsilon_t doesn't change (or its associated prior, $\sigma_\epsilon$). Now, our model looks like this:

  • αt+1∣αt∼Normal(αt,ση2)\alpha_{t+1} | \alpha_t \sim \text{Normal}(\alpha_{t}, \sigma_\eta^2)
  • βt+1∣βt∼Normal(βt,σϵ2)\beta_{t+1} | \beta_t \sim \text{Normal}(\beta_{t}, \sigma_\epsilon^2)
  • yt∣αt∼Normal(αt,f(βt2))y_t | \alpha_t \sim \text{Normal}(\alpha_t, f(\beta_t^2))

So that as the process goes on, the noise can now "walk" just like our level from before. The last detail to attend to is the function f(⋅)f(\cdot). Since the Normal distribution's scale argument lives in R+\mathbb R^+, we need to apply a transformation to βt\beta_t for the scale argument to be valid. One nice candidate is torch.nn.Softplus() which is a smooth approximation to the Relu function. We will use the @bm.functional wrapper to accomplish this deterministic mapping, creating an intermediate random variable called Scale in the process.

class HeteroLocalLevelModel:
def __init__(
self,
N: int,
epsilon_shape: float,
epsilon_rate: float,
eta_shape: float,
eta_rate: float,
a_1: float,
) -> None:
self.N = N
self.epsilon_shape = epsilon_shape
self.epsilon_rate = epsilon_rate
self.eta_shape = eta_shape
self.eta_rate = eta_rate
self.a_1 = a_1
self.p_1 = p_1

@bm.random_variable
def SigmaEpsilon(self) -> RVIdentifier:
return dist.Gamma(self.epsilon_shape, self.epsilon_rate)

@bm.random_variable
def SigmaEta(self) -> RVIdentifier:
return dist.Gamma(self.eta_shape, self.eta_rate)

@bm.random_variable
def Alpha(self) -> RVIdentifier:
return GaussianRandomWalk(loc=self.a_1, scale=self.SigmaEta(), num_steps=self.N) # Vectorized!

@bm.random_variable
def Beta(self) -> RVIdentifier:
return GaussianRandomWalk(loc=torch.Tensor([0.0]), scale=self.SigmaEpsilon(), num_steps=self.N) # Vectorized!

@bm.functional
def Scale(self) -> RVIdentifier:
return torch.nn.Softplus()(self.Beta()) # Transform Beta to a valid scale argument

@bm.random_variable
def Y(self) -> RVIdentifier:
return dist.Normal(self.Alpha(), self.Scale()) # Vectorized!
hetero_model = HeteroLocalLevelModel(
N=N,
epsilon_shape=epsilon_shape,
epsilon_rate=epsilon_rate,
eta_shape=eta_shape,
eta_rate=eta_rate,
a_1=a_1
)

queries = [hetero_model.Alpha(), hetero_model.Scale()]
observations = {**{hetero_model.Y(): y_obs}}
num_samples = 400 if not smoke_test else 1
hetero_all_samples = bm.GlobalNoUTurnSampler().infer(
queries, observations, num_samples, num_chains=1, num_adaptive_samples=num_adaptive_samples * 3
)
Out:

Samples collected: 0%| | 0/550 [00:00<?, ?it/s]

hetero_samples = hetero_all_samples.get_chain(0)
hetero_alpha_samples = hetero_samples[hetero_model.Alpha()]
hetero_scale_samples = hetero_samples[hetero_model.Scale()]

Visualization​

One last note before visualization, we are now computing uncertainty of both our level (αt\alpha_t), our noise (βt\beta_t), and our observations yty_t. These "types" of uncertainty can be broadly cast into two seperate categories: aleatoric uncertainty and epistimic uncertainty.

A quick read on the topic can be found here but the gist is that aleatoric is the data uncertainty, and measures the inherent randomness in the observed data. This type of uncertainty would be appropriate for forming prediction intervals for new data. Epistimic uncertainty on the other hand, measures the uncertainty in a parameter estimate. So here, we are estimating the conditional mean (αt\alpha_t) and conditional variance (f(βt)2f(\beta_t)^2). This type of uncertainty is useful for forming credible intervals. We first explore aleatoric uncertainty and then see how it is impacted by also considering epistemic uncertainty.

# Point estimates of Alpha and Beta
loc = hetero_alpha_samples[-tail_len:].mean(0)
scale = hetero_scale_samples[-tail_len:].mean(0)
aleatoric = dist.Normal(loc, scale)
quantiles = aleatoric.icdf(torch.Tensor([0.05, 0.95]).unsqueeze(-1))

# Form prediction interval with the point estimates
plt.figure(figsize=(12, 6))
plt.plot(y_obs, label="Observations", color="green")
plt.plot(loc, "r--", label=r"Mean of $\alpha_t$")
plt.fill_between(torch.arange(N), quantiles[0, ...], quantiles[1, ...], alpha=0.5, label="Prediction Interval")
plt.title(r"Aleatoric Uncertainty: Observations versus $\alpha_t$")
plt.legend()
plt.show()

As seen from the plot, the mean of αt\alpha_t appears much smoother than before. The reason is because by fitting both the location and scale of Normal(αt,f(βt))\text{Normal}(\alpha_t, f(\beta_t)), we enable a more flexible fit to the data (contrast this from before when SigmaEpsilon() was a single parameter that was supposed to estimate the noise for the entire sequence!). For example, when the data becomes more noisy, our model can simply widen βt\beta_t without overcorrecting αt\alpha_t which ensures smoother connections with later points. Thus, the model is able to more accurately detect a latent trend while still learning an evolving (heteroskedastic) noise level. This is an example of learning the aleatoric uncertainty (inherent randomness), which we display with a prediction interval.

Now we show the other type of uncertainty, epistemic uncertainty.

epistemic_alpha = np.quantile(hetero_alpha_samples[-tail_len:].detach().numpy(), [0.05, 0.95], axis=0)

# CI for Beta
plt.figure(figsize=(16, 10))
plt.fill_between(
torch.arange(N),
np.quantile((hetero_alpha_samples[-tail_len:] - hetero_scale_samples[-tail_len:]).detach().numpy(), 0.05, 0),
np.quantile((hetero_alpha_samples[-tail_len:] + hetero_scale_samples[-tail_len:]).detach().numpy(), 0.95, 0),
alpha=0.5,
label=r"95% Credible Interval for $\beta_t \pm \alpha_t$"
)
# Beta samples
plt.plot(hetero_alpha_samples[-tail_len:].T - hetero_scale_samples[-tail_len:].T, 'r--', linewidth=0.1)
plt.plot(hetero_alpha_samples[-tail_len:].T + hetero_scale_samples[-tail_len:].T, 'r--', linewidth=0.1)
plt.plot(hetero_alpha_samples[-tail_len:].T[0] + hetero_scale_samples[-tail_len:].T[0], 'r--', label=r'Samples of $\beta_t \pm \alpha_t$', linewidth=0.1)
# CI for Alpha
plt.fill_between(
torch.arange(N),
epistemic_alpha[0, ...],
epistemic_alpha[1, ...],
alpha=0.5,
color='red',
label=r"95% Credible Interval for $\alpha_t$"
)
# Samples for Alpha
plt.plot(hetero_alpha_samples[-tail_len:].T, 'b--', linewidth=0.3)
plt.plot(hetero_alpha_samples[-tail_len:].T[0, ...], 'b--', label=r'Samples of $\alpha_t$', linewidth=0.3)
plt.plot(y_obs, label="Observations", color="green", linewidth=3)
plt.title(r"Epistemic Uncertainty: Observations versus $\alpha_t$")
plt.legend()
plt.show()

Before we showed how there was uncertainty in where the data will be observed next (aleatoric) which we encoded with a distribution Normal(αt,βt)\text{Normal}(\alpha_t, \beta_t) made up of point estimates of parameters Alpha and Beta. But remember, these parameters are themselves estimated from the data and have additional uncertainty from the parameter estimation process! This is the epistemic uncertainty. Rather than show the point estimates, above we show a 95% credible interval for both αt\alpha_t and βt\beta_t at each time index. Note that βt\beta_t's uncertainty is superimposed (added and subtracted) on top of αt\alpha_t since it is measured in different units. It is now clear that by integrating in the epistemic uncertainty into our parameter estimates for the aleatoric uncertainty, the corresponding prediction intervals would be a great deal wider.

In either case, by using Bean Machine we can handle epistemic and aleatoric uncertainty quite easily with only about 1-2 lines of extra code!