Sampling Methods in AI
Sampling is how AI systems generate outputs from probability distributions. Whether it's a language model choosing the next word or a diffusion model creating an image, the sampling strategy determines what gets produced. Understanding these methods is key to controlling AI behavior.
Why Sampling Matters
AI models often produce probability distributions, not single answers:
Language model output for "The cat sat on the ___":
P("mat") = 0.35
P("floor") = 0.25
P("chair") = 0.15
P("couch") = 0.12
P("bed") = 0.08
P("table") = 0.05
How do we choose one word? That's where sampling comes in.
Deterministic Selection
Argmax (Greedy) Sampling
Always pick the highest probability option:
def argmax_sample(probabilities):
    # Return the index of the highest-probability option
    return max(range(len(probabilities)), key=lambda i: probabilities[i])
Pros:
- Deterministic, reproducible
- Good for factual answers
Cons:
- No variety
- Can get stuck in repetitive loops
- Misses good but not top-ranked options
Beam Search
Keep top-k candidates at each step:
Step 1: "The" → ["cat", "dog", "bird"] (top 3)
Step 2: "The cat" → ["sat", "ran", "slept"]
"The dog" → ["barked", "ran", "sat"]
"The bird" → ["flew", "sang", "sat"]
→ Keep top 3 across all paths
Step 3: Continue expanding top paths
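A minimal sketch of a single beam-search step. Here next_token_probs(seq) is a hypothetical stand-in for a model call that returns a token-to-probability dict for the sequence so far:
import math

def beam_search_step(beams, next_token_probs, beam_width=3):
    # beams: list of (token_list, cumulative_log_prob) pairs
    candidates = []
    for seq, score in beams:
        for token, prob in next_token_probs(seq).items():
            # Sum log-probabilities to avoid underflow on long sequences
            candidates.append((seq + [token], score + math.log(prob)))
    # Keep only the beam_width highest-scoring extensions across all paths
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_width]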
Pros:
- Explores multiple possibilities
- Better for translation, summarization
Cons:
- Computationally expensive (k times the work)
- Still deterministic
- Can produce generic outputs
Random Sampling
Pure Random Sampling
Sample according to the probability distribution:
import random

def random_sample(values, probabilities):
    r = random.random()  # uniform in [0, 1)
    cumulative = 0.0
    for value, prob in zip(values, probabilities):
        cumulative += prob
        if r <= cumulative:
            return value
    return values[-1]  # guard against floating-point rounding
Pros:
- Natural variety
- Proportional to model's beliefs
Cons:
- Can select low-probability (bad) options
- High variance in output quality
Temperature Sampling
Scale logits before softmax:
def temperature_sample(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    probs = softmax(scaled)
    return random_sample(range(len(probs)), probs)
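The softmax helper isn't defined above; a standard numerically stable version:
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]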
- T → 0: Approaches argmax (in practice, T = 0 is usually implemented as plain argmax)
- T = 1: Standard probabilities
- T > 1: Flatter distribution (more random)
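A quick numerical check using the distribution from the introduction (logits recovered as log-probabilities) and the softmax helper above:
probs = [0.35, 0.25, 0.15, 0.12, 0.08, 0.05]
logits = [math.log(p) for p in probs]
for t in (0.3, 1.0, 2.0):
    print(t, [round(q, 2) for q in softmax([l / t for l in logits])])
# t=0.3 sharpens the mass toward "mat"; t=2.0 flattens it toward uniform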
Truncated Sampling
Top-k Sampling
Only consider the k most likely options:
def top_k_sample(probabilities, k=10):
    # Get indices of the top k probabilities
    top_indices = sorted(range(len(probabilities)),
                         key=lambda i: probabilities[i],
                         reverse=True)[:k]
    # Renormalize over the top k
    top_probs = [probabilities[i] for i in top_indices]
    total = sum(top_probs)
    normalized = [p / total for p in top_probs]
    return random_sample(top_indices, normalized)
Pros:
- Eliminates clearly bad options
- Maintains diversity among good options
Cons:
- Fixed k may be wrong for different contexts
- Some contexts have few good options (need small k)
- Others have many (need large k)
Top-p (Nucleus) Sampling
Keep the smallest set of tokens whose probability sums to p:
def top_p_sample(probabilities, p=0.9):
    # Sort indices by probability, highest first
    sorted_indices = sorted(range(len(probabilities)),
                            key=lambda i: probabilities[i],
                            reverse=True)
    # Find the smallest prefix whose mass reaches p
    cumulative = 0.0
    cutoff_index = len(sorted_indices)  # fall back to the full set
    for i, idx in enumerate(sorted_indices):
        cumulative += probabilities[idx]
        if cumulative >= p:
            cutoff_index = i + 1
            break
    # Renormalize and sample from the nucleus
    nucleus_indices = sorted_indices[:cutoff_index]
    nucleus_probs = [probabilities[i] for i in nucleus_indices]
    total = sum(nucleus_probs)
    normalized = [pr / total for pr in nucleus_probs]
    return random_sample(nucleus_indices, normalized)
Pros:
- Adapts to distribution shape
- Narrow for confident predictions, wide for uncertain
Cons:
- Slightly more complex
- Need to tune p
Combined Strategies
Modern systems often combine methods:
def sample_with_controls(logits, temperature=0.7, top_p=0.9, top_k=50):
    # Apply temperature
    scaled = [l / temperature for l in logits]
    probs = softmax(scaled)
    # Apply top-k, then top-p (the helpers truncate and renormalize
    # the distribution, as in the functions above)
    if top_k:
        probs = apply_top_k(probs, top_k)
    if top_p:
        probs = apply_top_p(probs, top_p)
    return random_sample(range(len(probs)), probs)
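The apply_top_k and apply_top_p helpers aren't defined in the snippet; one way to write apply_top_p as a mask-and-renormalize step (apply_top_k is analogous):
def apply_top_p(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Zero out everything outside the nucleus, then renormalize
    masked = [q if i in kept else 0.0 for i, q in enumerate(probs)]
    total = sum(masked)
    return [q / total for q in masked]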
Sampling in Different AI Systems
Language Models (GPT, Claude)
User parameters: temperature, top_p, top_k
- Low temperature (0.0-0.3): Factual, focused responses
- Medium temperature (0.7-0.9): Natural conversation
- High temperature (1.0+): Creative, diverse responses
Image Generation (Diffusion Models)
Diffusion models sample by gradually denoising:
Start: Pure noise
Step 1: Sample slightly less noisy image
Step 2: Sample even less noisy image
...
Final: Clean image
Each step involves sampling from a learned noise distribution.
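A highly simplified DDPM-style loop illustrates the idea. This is a sketch of the standard DDPM update, not any particular library's API: predict_noise stands in for the trained denoising network, and betas is the noise schedule.
import numpy as np

def sample_ddpm(predict_noise, betas, shape, rng=np.random.default_rng()):
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # start from pure noise
    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t)
        # Mean of the learned reverse (denoising) distribution
        mean = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # Sample around that mean, except at the final step
        x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape) if t > 0 else mean
    return x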
Reinforcement Learning
Action selection often uses sampling:
ε-greedy:
if random.random() < epsilon:
    return random_action()  # Explore
else:
    return best_action()    # Exploit
Boltzmann (softmax) exploration:
probs = softmax([Q(s, a) / temperature for a in actions])
return sample(probs)
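Both rules in runnable form; q_values here is assumed to be a list of action-value estimates Q(s, a) for the current state:
import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

def boltzmann(q_values, temperature=1.0):
    # Sample an action index in proportion to exp(Q/T)
    prefs = [math.exp(q / temperature) for q in q_values]
    return random.choices(range(len(q_values)), weights=prefs)[0]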
Monte Carlo Methods
Sampling is fundamental to Monte Carlo estimation:
Estimating Expectations
def monte_carlo_expectation(distribution_sampler, function, n_samples=10000):
    """Estimate E[f(X)] by sampling."""
    samples = [distribution_sampler() for _ in range(n_samples)]
    return sum(function(x) for x in samples) / n_samples
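For example, E[X²] = 1 for a standard normal:
import random

estimate = monte_carlo_expectation(lambda: random.gauss(0, 1), lambda x: x * x)
print(estimate)  # ≈ 1.0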
Monte Carlo Integration
Estimate integrals by random sampling:
∫ₐᵇ f(x) dx ≈ ((b - a)/n) × Σ f(xᵢ), where xᵢ ~ Uniform(a, b)
Used in:
- Physics simulations
- Rendering (ray tracing)
- Financial modeling
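A minimal one-dimensional version; since ∫₀^π sin(x) dx = 2 exactly, the estimate is easy to check:
import math
import random

def mc_integrate(f, a, b, n=100_000):
    # Average f at uniform samples, scaled by the interval length
    return (b - a) * sum(f(random.uniform(a, b)) for _ in range(n)) / n

print(mc_integrate(math.sin, 0, math.pi))  # ≈ 2.0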
Importance Sampling
When sampling from one distribution to estimate properties of another:
E_P[f(X)] ≈ (1/n) × Σ f(xᵢ) × P(xᵢ) / Q(xᵢ)
where xᵢ ~ Q (proposal distribution)
Use case: When P is hard to sample from directly.
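A sketch with one concrete, illustrative pair of distributions: target P = Normal(4, 1), proposal Q = Normal(0, 2).
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def importance_estimate(f, n=200_000):
    total = 0.0
    for _ in range(n):
        x = random.gauss(0, 2)  # sample from the proposal Q
        w = normal_pdf(x, 4, 1) / normal_pdf(x, 0, 2)  # weight P(x)/Q(x)
        total += f(x) * w
    return total / n

print(importance_estimate(lambda x: x))  # ≈ 4.0, the mean under P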
MCMC (Markov Chain Monte Carlo)
For complex distributions that can't be sampled directly (e.g., known only up to a normalizing constant):
Metropolis-Hastings
def metropolis_hastings(target_pdf, proposal_fn, n_samples, initial_state):
    # Assumes a symmetric proposal (random walk), so the proposal
    # densities cancel in the acceptance ratio
    samples = []
    current = initial_state
    for _ in range(n_samples):
        # Propose a new state
        proposed = proposal_fn(current)
        # Accept with probability min(1, target(proposed) / target(current))
        accept_ratio = target_pdf(proposed) / target_pdf(current)
        if random.random() < accept_ratio:
            current = proposed
        samples.append(current)  # on rejection, repeat the current state
    return samples
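For instance, sampling from an unnormalized standard normal with a Gaussian random-walk proposal:
import math
import random

target = lambda x: math.exp(-x * x / 2)        # unnormalized Normal(0, 1)
proposal = lambda x: x + random.gauss(0, 0.5)  # symmetric random walk
samples = metropolis_hastings(target, proposal, 50_000, initial_state=0.0)
print(sum(samples) / len(samples))  # ≈ 0.0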
Applications
- Bayesian posterior sampling
- Statistical physics
- Generative modeling (early approaches)
Reparameterization Trick
For training with sampling (VAEs, etc.):
Problem: Can't backprop through random sampling
Solution: Reparameterize:
Instead of: z ~ Normal(μ, σ²)
Write as: z = μ + σ × ε, where ε ~ Normal(0, 1)
Now gradients flow through μ and σ.
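A minimal PyTorch illustration (this sketch assumes PyTorch is installed):
import torch

mu = torch.zeros(3, requires_grad=True)
log_sigma = torch.zeros(3, requires_grad=True)

eps = torch.randn(3)            # all randomness isolated in ε
z = mu + log_sigma.exp() * eps  # differentiable in μ and σ
loss = (z ** 2).sum()
loss.backward()
print(mu.grad, log_sigma.grad)  # gradients flow through the sample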
Summary
| Method | Use Case | Trade-off |
|---|---|---|
| Argmax | Factual answers | Deterministic, no variety |
| Beam search | Translation | Better quality, expensive |
| Random sampling | Creative tasks | High variety, variable quality |
| Temperature | Control randomness | Tune for task |
| Top-k | Limit to good options | Fixed cutoff |
| Top-p | Adaptive truncation | Adapts to context |
| MCMC | Complex distributions | Slow but general |
Next, we'll dive deeper into how LLMs specifically choose tokens.

