SnakeByte[17] The Metropolis Algorithm

Frame Grab From the movie Metropolis 1927

Who told you to attack the machines, you fools? Without them you’ll all die!!

~ Grot, the Guardian of the Heart Machine

First, as always, Oh Dear Reader, i hope you are safe. There are many unsafe places in and around the world in this current time. Second, this blog is a SnakeByte[] based on something that i knew about but had no idea it was called this by this name.

Third, relative to this, i must confess, Oh, Dear Reader, i have a disease of the bibliomaniac kind. i have an obsession with books and reading. “They” say that belief comes first, followed by admission. There is a Japanese word that translates to having so many books you cannot possibly read them all. This word is tsundoku. From the website (if you click on the word):

“Tsundoku dates from the Meiji era, and derives from a combination of tsunde-oku (to let things pile up) and dokusho (to read books). It can also refer to the stacks themselves. Crucially, it doesn’t carry a pejorative connotation, being more akin to bookworm than an irredeemable slob.”

Thus, while perusing a math-related book site, i came across a monograph entitled “The Metropolis Algorithm: Theory and Examples” by C Douglas Howard [1].

i was intrigued, and because it was 5 bucks (Side note: i always try to buy used and loved books), i decided to throw it into the virtual shopping buggy.

Upon receiving said monograph, i sat down to read it, and i was amazed to find it was closely related to something I was very familiar with from decades ago. This finally brings us to the current SnakeByte[].

The Metropolis Algorithm is a method in computational statistics used to sample from complex probability distributions. It is a type of Markov Chain Monte Carlo (MCMC) algorithm (i had no idea), which relies on Markov Chains to generate a sequence of samples that can approximate a desired distribution, even when direct sampling is complex. Yes, let me say that again – i had no idea. Go ahead LazyWeb^TM laugh!

So let us start with how the Metropolis Algorithm and how it relates to Markov Chains. (Caveat Emptor: You will need to dig out those statistics books and a little linear algebra.)

Markov Chains Basics

A Markov Chain is a mathematical system that transitions from one state to another in a state space. It has the property that the next state depends only on the current state, not the sequence of states preceding it. This is called the Markov property. The algorithm was introduced by Metropolis et al. (1953) in a Statistical Physics context and was generalized by Hastings (1970). It was considered in the context of image analysis (Geman and Geman, 1984) and data augmentation (Tanner (I’m not related that i know of…) and Wong, 1987). However, its routine use in statistics (especially for Bayesian inference) did not take place until Gelfand and Smith (1990) popularised it. For modern discussions of MCMC, see e.g. Tierney (1994), Smith and Roberts (1993), Gilks et al. (1996), and Roberts and Rosenthal (1998b).

Ergo, the name Metropolis-Hastings algorithm. Once again, i had no idea.

Anyhow,

A Markov Chain can be described by a set of states $S$ and a transition matrix $P$ , where each element $P_{ij}$ represents the probability of transitioning from state $i$ to state $j$ .

Provide The Goal: Sampling from a Probability Distribution $\pi(x)$

In many applications (e.g., statistical mechanics, Bayesian inference, as mentioned), we are interested in sampling from a complex probability distribution $\pi(x)$ . This distribution might be difficult to sample from directly, but we can use a Markov Chain to create a sequence of samples that, after a certain period (called the burn-in period), will approximate $\pi(x)$ .

Ok Now: The Metropolis Algorithm

The Metropolis Algorithm is one of the simplest MCMC algorithms to generate samples from $\pi(x)$ . It works by constructing a Markov Chain whose stationary distribution is the desired probability distribution $\pi(x)$ . A stationary distribution is a probability distribution that remains the same over time in a Markov chain. Thus it can describe the long-term behavior of a chain, where the probabilities of being in each state do not change as time passes. (Whatever time is, i digress.)

The key steps of the algorithm are:

Initialization

Start with an initial guess $x_0$ , a point in the state space. This point can be chosen randomly or based on prior knowledge.

Proposal Step

From the current state $x_t$ , propose a new state $x^*$ using a proposal distribution $q(x^*|x_t)$ , which suggests a candidate for the next state. This proposal distribution can be symmetric (e.g., a normal distribution centered at $x_t$ ) or asymmetric.

Acceptance Probability

Calculate the acceptance probability $\alpha$ for moving from the current state $x_t$ to the proposed state $x^*$ :

$\alpha = \min \left(1, \frac{\pi(x^) q(x_t | x^)}{\pi(x_t) q(x^* | x_t)} \right)$

In the case where the proposal distribution is symmetric (i.e., $q(x^|x_t) = q(x_t|x^))$ , the formula simplifies to:

$\alpha = \min \left(1, \frac{\pi(x^*)}{\pi(x_t)} \right)$

Acceptance or Rejection

Generate a random number $u$ from a uniform distribution $U(0, 1)$
If $u \leq \alpha$ , accept the proposed state $x^*$ , i.e., $set x_{t+1} = x^*$ .
If $u > \alpha$ , reject the proposed state and remain at the current state, i.e., $set x_{t+1} = x_t$ .

Repeat

Repeat the proposal, acceptance, and rejection steps to generate a Markov Chain of samples.

Convergence and Stationary Distribution:

Over time, as more samples are generated, the Markov Chain converges to a stationary distribution. The stationary distribution is the target distribution $\pi(x)$ , meaning the samples generated by the algorithm will approximate $\pi(x)$ more closely as the number of iterations increases.

Applications:

The Metropolis Algorithm is widely used in various fields such as Bayesian statistics, physics (e.g., in the simulation of physical systems), machine learning, and finance. It is especially useful for high-dimensional problems where direct sampling is computationally expensive or impossible.

Key Features of the Metropolis Algorithm:

Simplicity: It’s easy to implement and doesn’t require knowledge of the normalization constant of $\pi(x)$ , which can be difficult to compute.
Flexibility: It works with a wide range of proposal distributions, allowing the algorithm to be adapted to different problem contexts.
Efficiency: While it can be computationally demanding, the algorithm can provide high-quality approximations to complex distributions with well-chosen proposals and sufficient iterations.

The Metropolis-Hastings Algorithm is a more general version that allows for non-symmetric proposal distributions, expanding the range of problems the algorithm can handle.

Now let us code it up:

i am going to assume the underlying distribution is Gaussian with a time-dependent mean $\mu_t$ , which changes slowly over time. We’ll use a simple time-series analytics setup to sample this distribution using the Metropolis Algorithm and plot the results. Note: When the target distribution is Gaussian (or close to Gaussian), the algorithm can converge more quickly to the true distribution because of the symmetric smooth nature of the normal distribution.

import numpy as np
import matplotlib.pyplot as plt

# Time-dependent mean function (example: sinusoidal pattern)
def mu_t(t):
    return 10 * np.sin(0.1 * t)

# Target distribution: Gaussian with time-varying mean mu_t and fixed variance
def target_distribution(x, t):
    mu = mu_t(t)
    sigma = 1.0  # Assume fixed variance for simplicity
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Metropolis Algorithm for time-series sampling
def metropolis_sampling(num_samples, initial_x, proposal_std, time_steps):
    samples = np.zeros(num_samples)
    samples[0] = initial_x

    # Iterate over the time steps
    for t in range(1, num_samples):
        # Propose a new state based on the current state
        x_current = samples[t - 1]
        x_proposed = np.random.normal(x_current, proposal_std)

        # Acceptance probability (Metropolis-Hastings step)
        acceptance_ratio = target_distribution(x_proposed, time_steps[t]) / target_distribution(x_current, time_steps[t])
        acceptance_probability = min(1, acceptance_ratio)

        # Accept or reject the proposed sample
        if np.random.rand() < acceptance_probability:
            samples[t] = x_proposed
        else:
            samples[t] = x_current

    return samples

# Parameters
num_samples = 10000  # Total number of samples to generate
initial_x = 0.0      # Initial state
proposal_std = 0.5   # Standard deviation for proposal distribution
time_steps = np.linspace(0, 1000, num_samples)  # Time steps for temporal evolution

# Run the Metropolis Algorithm
samples = metropolis_sampling(num_samples, initial_x, proposal_std, time_steps)

# Plot the time series of samples and the underlying mean function
plt.figure(figsize=(12, 6))

# Plot the samples over time
plt.plot(time_steps, samples, label='Metropolis Samples', alpha=0.7)

# Plot the underlying time-varying mean (true function)
plt.plot(time_steps, mu_t(time_steps), label='True Mean ', color='red', linewidth=2)

plt.title("Metropolis Algorithm Sampling with Time-Varying Gaussian Distribution")
plt.xlabel("Time")
plt.ylabel("Sample Value")
plt.legend()
plt.grid(True)
plt.show()

Output of Python Script Figure 1.0

Ok, What’s going on here?

For the Target Distribution:

The function $mu_t(t)$ defines a time-varying mean for the distribution. In this example, it follows a sinusoidal pattern.
The function target_distribution(x, t) models a Gaussian distribution with mean $\mu_t$ and a fixed variance (set to 1.0).

Metropolis Algorithm:

The metropolis_sampling function implements the Metropolis algorithm. It iterates over time, generating samples from the time-varying distribution. The acceptance probability is calculated using the target distribution at each time step.

Proposal Distribution:

A normal distribution centered around the current state with standard deviation proposal_std is used to propose new states.

Temporal Evolution:

The time steps are generated using np.linspace to simulate temporal evolution, which can be used in time-series analytics.

Plot The Results:

The results are plotted, showing the samples generated by the Metropolis algorithm as well as the true underlying mean function $\mu_t$ (in red).

The plot shows the Metropolis samples over time, which should cluster around the time-varying mean $\mu_t$ of the distribution. As time progresses, the samples follow the red curve (the true mean) as time moves on like and arrow in this case.

Now you are probably asking “Hey is there a more pythonic library way to to this?”. Oh Dear Reader i am glad you asked! Yes There Is A Python Library! AFAIC PyMC started it all. Most probably know it as PyMc3 (formerly known as…). There is a great writeup here: History of PyMc.

We are golden age of probabilistic programming.

~ Chris Fonnesbeck (creator of PyMC)

Lets convert it using PyMC. Steps to Conversion:

Define the probabilistic model using PyMC’s modeling syntax.
Specify the Gaussian likelihood with the time-varying mean $\mu_t$ .
Use PyMC’s built-in Metropolis sampler.
Visualize the results similarly to how we did earlier.

import pymc as pm
import numpy as np
import matplotlib.pyplot as plt

# Time-dependent mean function (example: sinusoidal pattern)
def mu_t(t):
    return 10 * np.sin(0.1 * t)

# Set random seed for reproducibility
np.random.seed(42)

# Number of time points and samples
num_samples = 10000
time_steps = np.linspace(0, 1000, num_samples)

# PyMC model definition
with pm.Model() as model:
    # Prior for the time-varying parameter (mean of Gaussian)
    mu_t_values = mu_t(time_steps)

    # Observational model: Normally distributed samples with time-varying mean and fixed variance
    sigma = 1.0  # Fixed variance
    x = pm.Normal('x', mu=mu_t_values, sigma=sigma, shape=num_samples)

    # Use the Metropolis sampler explicitly
    step = pm.Metropolis()

    # Run MCMC sampling with the Metropolis step
    samples_all = pm.sample(num_samples, tune=1000, step=step, chains=5, return_inferencedata=False)

# Extract one chain's worth of samples for plotting
samples = samples_all['x'][0]  # Taking only the first chain

# Plot the time series of samples and the underlying mean function
plt.figure(figsize=(12, 6))

# Plot the samples over time
plt.plot(time_steps, samples, label='PyMC Metropolis Samples', alpha=0.7)

# Plot the underlying time-varying mean (true function)
plt.plot(time_steps, mu_t(time_steps), label='True Mean ', color='red', linewidth=2)

plt.title("PyMC Metropolis Sampling with Time-Varying Gaussian Distribution")
plt.xlabel("Time")
plt.ylabel("Sample Value")
plt.legend()
plt.grid(True)
plt.show()

When you execute this code you will see the following status bar:

It will be a while. Go grab your favorite beverage and take a walk…..

Output of Python Script Figure 1.1

Key Differences from the Previous Code:

PyMC Model Usage Definition:
In PyMC, the model is defined using the pm.Model() context. The x variable is defined as a Normal distribution with the time-varying mean $\mu_t$ . Instead of manually implementing the acceptance probability, PyMC handles this automatically with the specified sampler.

Metropolis Sampler:
PyMC allows us to specify the sampling method. Here, we explicitly use the Metropolis algorithm with pm.Metropolis().

Samples Parameter:
We specify shape=num_samples in the pm.Normal() distribution to indicate that we want a series of samples for each time step.

Plotting:
The resulting plot will show the sampled values using the PyMC Metropolis algorithm compared with the true underlying mean, similar to the earlier approach. Now, samples has the same shape as time_steps (in this case, both with 10,000 elements), allowing you to plot the sample values correctly against the time points; otherwise, the x and y axes would not align.

NOTE: We used this library at one of our previous health startups with great success.

Optimizations herewith include several. There is a default setting in PyMC which is called NUTS.
No need to manually set the number of leapfrog steps. NUTS automatically determines the optimal number of steps for each iteration, preventing inefficient or divergent sampling. NUTS automatically stops the trajectory when it detects that the particle is about to turn back on itself (i.e., when the trajectory “U-turns”). A U-turn means that continuing to move in the same direction would result in redundant exploration of the space and inefficient sampling. When NUTS detects this, it terminates the trajectory early, preventing unnecessary steps. Also the acceptance rates on convergence are higher.

There are several references to this set of algorithms. It truly a case of both mathematical and computational elegance.

Of course you have to know what the name means. They say words have meanings. Then again one cannot know everything.

Until Then,

#iwishyouwater <- Of all places Alabama getting the memo From Helene 2024

𝕋𝕖𝕕 ℂ. 𝕋𝕒𝕟𝕟𝕖𝕣 𝕁𝕣. (@tctjr) / X

Music To Blog By: View From The Magicians Window, The Psychic Circle

References:

[1] The Metropolis Algorithm: Theory and Examples by C Douglas Howard

[2] The Metropolis-Hastings Algorithm: A note by Danielle Navarro

[3] Github code for Sample Based Inference by bashhwu

Entire Metropolis Movie For Your Viewing Pleasure. (AFAIC The most amazing Sci-Fi movie besides BladeRunner)

Share this:

Leave a Reply Cancel reply