Sampling Methods in AI
Sampling is how AI systems generate outputs from probability distributions. Whether it's a language model choosing the next word or a diffusion model creating an image, the sampling strategy determines what gets produced. Understanding these methods is key to controlling AI behavior.
Why Sampling Matters
AI models often produce probability distributions, not single answers:
Language model output for "The cat sat on the ___":
P("mat") = 0.35
P("floor") = 0.25
P("chair") = 0.15
P("couch") = 0.12
P("bed") = 0.08
P("table") = 0.05
How do we choose one word? That's where sampling comes in.
Deterministic Selection
Argmax (Greedy) Sampling
Always pick the highest probability option:
def argmax_sample(probabilities):
    # Return the index of the highest-probability option
    return max(range(len(probabilities)), key=lambda i: probabilities[i])
Pros:
- Deterministic, reproducible
- Good for factual answers
Cons:
- No variety
- Can get stuck in repetitive loops
- Misses good but not top-ranked options
Beam Search
Keep top-k candidates at each step:
Step 1: "The" → ["cat", "dog", "bird"] (top 3)
Step 2: "The cat" → ["sat", "ran", "slept"]
"The dog" → ["barked", "ran", "sat"]
"The bird" → ["flew", "sang", "sat"]
→ Keep top 3 across all paths
Step 3: Continue expanding top paths
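A minimal sketch of a single beam-search step. Here next_token_probs(seq) is a hypothetical stand-in for a model call that returns a token-to-probability dict for the sequence so far:
import math

def beam_search_step(beams, next_token_probs, beam_width=3):
    # beams: list of (token_list, cumulative_log_prob) pairs
    candidates = []
    for seq, score in beams:
        for token, prob in next_token_probs(seq).items():
            # Sum log-probabilities to avoid underflow on long sequences
            candidates.append((seq + [token], score + math.log(prob)))
    # Keep only the beam_width highest-scoring extensions across all paths
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_width]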
Pros:
- Explores multiple possibilities
- Better for translation, summarization
Cons:
- Computationally expensive (k times the work)
- Still deterministic
- Can produce generic outputs
Random Sampling
Pure Random Sampling
Sample according to the probability distribution:
import random

def random_sample(values, probabilities):
    r = random.random()  # uniform in [0, 1)
    cumulative = 0.0
    for value, prob in zip(values, probabilities):
        cumulative += prob
        if r <= cumulative:
            return value
    return values[-1]  # guard against floating-point rounding
Pros:
- Natural variety
- Proportional to model's beliefs
Cons:
- Can select low-probability (bad) options
- High variance in output quality
Temperature Sampling
Scale logits before softmax:
def temperature_sample(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    probs = softmax(scaled)
    return random_sample(range(len(probs)), probs)
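The softmax helper isn't defined above; a standard numerically stable version:
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]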
- T → 0: Approaches argmax (in practice, T = 0 is usually implemented as plain argmax)
- T = 1: Standard probabilities
- T > 1: Flatter distribution (more random)
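A quick numerical check using the distribution from the introduction (logits recovered as log-probabilities) and the softmax helper above:
probs = [0.35, 0.25, 0.15, 0.12, 0.08, 0.05]
logits = [math.log(p) for p in probs]
for t in (0.3, 1.0, 2.0):
    print(t, [round(q, 2) for q in softmax([l / t for l in logits])])
# t=0.3 sharpens the mass toward "mat"; t=2.0 flattens it toward uniform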
Truncated Sampling
Top-k Sampling
Only consider the k most likely options:
def top_k_sample(probabilities, k=10):
    # Get indices of the top k probabilities
    top_indices = sorted(range(len(probabilities)),
                         key=lambda i: probabilities[i],
                         reverse=True)[:k]
    # Renormalize over the top k
    top_probs = [probabilities[i] for i in top_indices]
    total = sum(top_probs)
    normalized = [p / total for p in top_probs]
    return random_sample(top_indices, normalized)
Pros:
- Eliminates clearly bad options
- Maintains diversity among good options
Cons:
- Fixed k may be wrong for different contexts
- Some contexts have few good options (need small k)
- Others have many (need large k)
Top-p (Nucleus) Sampling
Keep the smallest set of tokens whose probability sums to p:
def top_p_sample(probabilities, p=0.9):
    # Sort indices by probability, highest first
    sorted_indices = sorted(range(len(probabilities)),
                            key=lambda i: probabilities[i],
                            reverse=True)
    # Find the smallest prefix whose mass reaches p
    cumulative = 0.0
    cutoff_index = len(sorted_indices)  # fall back to the full set
    for i, idx in enumerate(sorted_indices):
        cumulative += probabilities[idx]
        if cumulative >= p:
            cutoff_index = i + 1
            break
    # Renormalize and sample from the nucleus
    nucleus_indices = sorted_indices[:cutoff_index]
    nucleus_probs = [probabilities[i] for i in nucleus_indices]
    total = sum(nucleus_probs)
    normalized = [pr / total for pr in nucleus_probs]
    return random_sample(nucleus_indices, normalized)
Pros:
- Adapts to distribution shape
- Narrow for confident predictions, wide for uncertain
Cons:
- Slightly more complex
- Need to tune p
Combined Strategies
Modern systems often combine methods:
def sample_with_controls(logits, temperature=0.7, top_p=0.9, top_k=50):
    # Apply temperature
    scaled = [l / temperature for l in logits]
    probs = softmax(scaled)
    # Apply top-k, then top-p (the helpers truncate and renormalize
    # the distribution, as in the functions above)
    if top_k:
        probs = apply_top_k(probs, top_k)
    if top_p:
        probs = apply_top_p(probs, top_p)
    return random_sample(range(len(probs)), probs)
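The apply_top_k and apply_top_p helpers aren't defined in the snippet; one way to write apply_top_p as a mask-and-renormalize step (apply_top_k is analogous):
def apply_top_p(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Zero out everything outside the nucleus, then renormalize
    masked = [q if i in kept else 0.0 for i, q in enumerate(probs)]
    total = sum(masked)
    return [q / total for q in masked]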
Sampling in Different AI Systems
Language Models (GPT, Claude)
User parameters: temperature, top_p, top_k
- Low temperature (0.0-0.3): Factual, focused responses
- Medium temperature (0.7-0.9): Natural conversation
- High temperature (1.0+): Creative, diverse responses
Image Generation (Diffusion Models)
Diffusion models sample by gradually denoising:
Start: Pure noise
Step 1: Sample slightly less noisy image
Step 2: Sample even less noisy image
...
Final: Clean image
Each step involves sampling from a learned noise distribution.
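A highly simplified DDPM-style loop illustrates the idea. This is a sketch of the standard DDPM update, not any particular library's API: predict_noise stands in for the trained denoising network, and betas is the noise schedule.
import numpy as np

def sample_ddpm(predict_noise, betas, shape, rng=np.random.default_rng()):
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # start from pure noise
    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t)
        # Mean of the learned reverse (denoising) distribution
        mean = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # Sample around that mean, except at the final step
        x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape) if t > 0 else mean
    return x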
Reinforcement Learning
Action selection often uses sampling:
ε-greedy:
if random.random() < epsilon:
    return random_action()  # Explore
else:
    return best_action()    # Exploit
Boltzmann (softmax) exploration:
probs = softmax([Q(s, a) / temperature for a in actions])
return sample(probs)
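Both rules in runnable form; q_values here is assumed to be a list of action-value estimates Q(s, a) for the current state:
import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

def boltzmann(q_values, temperature=1.0):
    # Sample an action index in proportion to exp(Q/T)
    prefs = [math.exp(q / temperature) for q in q_values]
    return random.choices(range(len(q_values)), weights=prefs)[0]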
Monte Carlo Methods
Sampling is fundamental to Monte Carlo estimation:
Estimating Expectations
def monte_carlo_expectation(distribution_sampler, function, n_samples=10000):
    """Estimate E[f(X)] by sampling."""
    samples = [distribution_sampler() for _ in range(n_samples)]
    return sum(function(x) for x in samples) / n_samples
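For example, E[X²] = 1 for a standard normal:
import random

estimate = monte_carlo_expectation(lambda: random.gauss(0, 1), lambda x: x * x)
print(estimate)  # ≈ 1.0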
Monte Carlo Integration
Estimate integrals by random sampling:
∫ₐᵇ f(x) dx ≈ ((b - a)/n) × Σ f(xᵢ), where xᵢ ~ Uniform(a, b)
Used in:
- Physics simulations
- Rendering (ray tracing)
- Financial modeling
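A minimal one-dimensional version; since ∫₀^π sin(x) dx = 2 exactly, the estimate is easy to check:
import math
import random

def mc_integrate(f, a, b, n=100_000):
    # Average f at uniform samples, scaled by the interval length
    return (b - a) * sum(f(random.uniform(a, b)) for _ in range(n)) / n

print(mc_integrate(math.sin, 0, math.pi))  # ≈ 2.0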
Importance Sampling
When sampling from one distribution to estimate properties of another:
E_P[f(X)] ≈ (1/n) × Σ f(xᵢ) × P(xᵢ) / Q(xᵢ)
where xᵢ ~ Q (proposal distribution)
Use case: When P is hard to sample from directly.
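A sketch with one concrete, illustrative pair of distributions: target P = Normal(4, 1), proposal Q = Normal(0, 2).
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def importance_estimate(f, n=200_000):
    total = 0.0
    for _ in range(n):
        x = random.gauss(0, 2)  # sample from the proposal Q
        w = normal_pdf(x, 4, 1) / normal_pdf(x, 0, 2)  # weight P(x)/Q(x)
        total += f(x) * w
    return total / n

print(importance_estimate(lambda x: x))  # ≈ 4.0, the mean under P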
MCMC (Markov Chain Monte Carlo)
For complex distributions that can't be sampled directly (e.g., known only up to a normalizing constant):
Metropolis-Hastings
def metropolis_hastings(target_pdf, proposal_fn, n_samples, initial_state):
    # Assumes a symmetric proposal (random walk), so the proposal
    # densities cancel in the acceptance ratio
    samples = []
    current = initial_state
    for _ in range(n_samples):
        # Propose a new state
        proposed = proposal_fn(current)
        # Accept with probability min(1, target(proposed) / target(current))
        accept_ratio = target_pdf(proposed) / target_pdf(current)
        if random.random() < accept_ratio:
            current = proposed
        samples.append(current)  # on rejection, repeat the current state
    return samples
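For instance, sampling from an unnormalized standard normal with a Gaussian random-walk proposal:
import math
import random

target = lambda x: math.exp(-x * x / 2)        # unnormalized Normal(0, 1)
proposal = lambda x: x + random.gauss(0, 0.5)  # symmetric random walk
samples = metropolis_hastings(target, proposal, 50_000, initial_state=0.0)
print(sum(samples) / len(samples))  # ≈ 0.0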
Applications
- Bayesian posterior sampling
- Statistical physics
- Generative modeling (early approaches)
Reparameterization Trick
For training with sampling (VAEs, etc.):
Problem: Can't backprop through random sampling
Solution: Reparameterize:
Instead of: z ~ Normal(μ, σ²)
Write as: z = μ + σ × ε, where ε ~ Normal(0, 1)
Now gradients flow through μ and σ.
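A minimal PyTorch illustration (this sketch assumes PyTorch is installed):
import torch

mu = torch.zeros(3, requires_grad=True)
log_sigma = torch.zeros(3, requires_grad=True)

eps = torch.randn(3)            # all randomness isolated in ε
z = mu + log_sigma.exp() * eps  # differentiable in μ and σ
loss = (z ** 2).sum()
loss.backward()
print(mu.grad, log_sigma.grad)  # gradients flow through the sample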
Summary
| Method | Use Case | Trade-off |
|---|---|---|
| Argmax | Factual answers | Deterministic, no variety |
| Beam search | Translation | Better quality, expensive |
| Random sampling | Creative tasks | High variety, variable quality |
| Temperature | Control randomness | Tune for task |
| Top-k | Limit to good options | Fixed cutoff |
| Top-p | Adaptive truncation | Adapts to context |
| MCMC | Complex distributions | Slow but general |
Next, we'll dive deeper into how LLMs specifically choose tokens.

