How LLMs Pick Tokens
Large Language Models like GPT-4 and Claude generate text one token at a time. Each token is selected from a probability distribution over the entire vocabulary. Understanding this process reveals why AI responses can be both impressively coherent and occasionally unpredictable.
The Token Generation Process
Step 1: Process the Context
The model takes in all previous tokens (the prompt and any generated text):
Input: "The capital of France is"
Step 2: Compute Logits
The final layer produces a score (logit) for every token in the vocabulary:
Vocabulary size: ~50,000 tokens
Logits: [
"Paris": 8.2,
"Lyon": 3.1,
"the": -2.5,
"France": 1.0,
"<newline>": 0.5,
...50,000 more entries...
]
Step 3: Apply Softmax
Convert logits to probabilities (the sum in the denominator runs over all ~50,000 vocabulary entries):
P("Paris") = e^8.2 / Σe^logit ≈ 0.73
P("Lyon") = e^3.1 / Σe^logit ≈ 0.004
P("the") = e^-2.5 / Σe^logit ≈ 0.00002
...
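A minimal sketch of this step in plain Python (subtracting the max logit before exponentiating is a standard trick for numerical stability and does not change the result):

import math

def softmax(logits):
    # Subtract the max logit before exponentiating for numerical stability;
    # the resulting probabilities are unchanged.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Over just these three logits "Paris" gets ~0.99; over the full ~50,000-token
# vocabulary, the long tail soaks up enough mass to bring it down to ~0.73.
print(softmax([8.2, 3.1, -2.5]))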
Step 4: Sample
Select one token based on sampling parameters:
With temperature=0.7:
Token selected: "Paris"
With temperature=1.5:
Token selected: "Lyon" (less likely but possible)
Step 5: Repeat
Append the selected token and repeat:
"The capital of France is Paris"
→ Continue generating if needed
What is a Token?
Tokens are the fundamental units of text for LLMs:
"Hello, how are you?" → ["Hello", ",", " how", " are", " you", "?"]
"Cryptocurrency" → ["Cry", "pto", "currency"] (split into subwords; exact pieces vary by tokenizer)
"🎉" → often a single token, though some tokenizers split an emoji into several byte-level tokens
" " → might be 1-4 tokens depending on tokenizer
Common Tokenization
Most modern LLMs use Byte Pair Encoding (BPE):
- Starts with individual characters
- Merges frequent pairs into single tokens
- Common words become single tokens
- Rare words split into pieces
Token count affects:
- Context window usage
- Cost (many APIs charge per token)
- Generation speed
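To see tokenization in practice, a library such as tiktoken can count and inspect tokens (an illustrative choice; the tokenizer your model uses may split text differently):

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the BPE used by several OpenAI models

text = "Hello, how are you?"
token_ids = enc.encode(text)
print(len(token_ids))                        # token count, e.g. 6
print([enc.decode([t]) for t in token_ids])  # the individual token strings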
The Autoregressive Loop
LLMs generate autoregressively—each token depends on all previous ones:
Step 1: P(token₁ | prompt)
Step 2: P(token₂ | prompt, token₁)
Step 3: P(token₃ | prompt, token₁, token₂)
...
Key insight: The model doesn't "plan" an entire response in advance. It commits to each token before generating the next one.
This explains:
- Why models can contradict themselves mid-sentence
- Why they can get "stuck" in repetitive loops
- Why longer responses accumulate errors
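A sketch of that loop, assuming a hypothetical next_token_logits(context) helper for the forward pass and reusing the sample_with_temperature function shown in the next section:

END_OF_TEXT_ID = 50256   # hypothetical end-of-text token id

def generate(prompt_tokens, max_new_tokens, temperature=0.7):
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = next_token_logits(context)                    # hypothetical forward pass
        token = sample_with_temperature(logits, temperature)   # defined below
        context.append(token)                                  # committed: never revised
        if token == END_OF_TEXT_ID:
            break
    return context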
Temperature in Detail
Temperature scales logits before softmax:
import random

def sample_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]   # divide every logit by T
    probs = softmax(scaled)                      # softmax as sketched earlier
    return random.choices(range(len(probs)), weights=probs, k=1)[0]
Temperature = 0 (in practice, greedy decoding or a value approaching 0)
Logits: [8.2, 3.1, 1.0]
Scaled (T=0.01): [820, 310, 100]
Probs: [~1.0, ~0.0, ~0.0]
Always picks "Paris"
Temperature = 1 (Default)
Logits: [8.2, 3.1, 1.0]
Scaled (T=1): [8.2, 3.1, 1.0]
Probs: [0.73, 0.004, 0.0005] (over the full vocabulary)
Usually picks "Paris"; the remaining ~27% is spread across the rest of the vocabulary
Temperature = 2 (High)
Logits: [8.2, 3.1, 1.0]
Scaled (T=2): [4.1, 1.55, 0.5]
Probs: [~0.52, ~0.04, ~0.01] (the distribution flattens)
More variety: "Paris" is still favored, but unlikely tokens get a real chance
Top-p (Nucleus) Sampling in Practice
Many modern LLM APIs offer top-p (nucleus) sampling, often combined with temperature:
import random

def top_p_sample(probs, p=0.9):
    """Keep the most likely tokens until their cumulative probability reaches p."""
    sorted_tokens = sorted(enumerate(probs), key=lambda x: -x[1])
    cumulative = 0.0
    nucleus = []
    for token_id, prob in sorted_tokens:
        cumulative += prob
        nucleus.append((token_id, prob))
        if cumulative >= p:
            break
    # Renormalize the nucleus and sample from it
    total = sum(prob for _, prob in nucleus)
    token_ids = [token_id for token_id, _ in nucleus]
    weights = [prob / total for _, prob in nucleus]
    return random.choices(token_ids, weights=weights, k=1)[0]
Why top-p works well:
- Confident predictions: Small nucleus, focused output
- Uncertain predictions: Large nucleus, more variety
Repetition Penalties
LLMs can get stuck repeating themselves:
"The cat sat on the mat. The cat sat on the mat. The cat sat on..."
Frequency Penalty
Reduce a token's logit in proportion to how many times it has already appeared:
logit[token] -= frequency_penalty * count[token]
Presence Penalty
Subtract a flat penalty from any token that has appeared at least once:
logit[token] -= presence_penalty * (1 if token in context else 0)
Repetition Penalty (Multiplicative)
if token in context:
if logit[token] > 0:
logit[token] /= repetition_penalty
else:
logit[token] *= repetition_penalty
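A combined sketch, assuming logits is a list indexed by token id and context is the list of token ids generated so far (names and defaults are illustrative, not any particular API):

from collections import Counter

def apply_penalties(logits, context, frequency_penalty=0.0,
                    presence_penalty=0.0, repetition_penalty=1.0):
    counts = Counter(context)
    adjusted = list(logits)
    for token, count in counts.items():
        adjusted[token] -= frequency_penalty * count   # grows with repetitions
        adjusted[token] -= presence_penalty            # flat, once per seen token
        if adjusted[token] > 0:                        # multiplicative variant
            adjusted[token] /= repetition_penalty
        else:
            adjusted[token] *= repetition_penalty
    return adjusted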
Stop Sequences
Generation ends when specific tokens appear:
stop_sequences = ["<|endoftext|>", "\n\nHuman:", "```"]
output = ""
while True:
    token = generate_next_token()   # assumed to return the decoded token text
    output += token
    if any(output.endswith(seq) for seq in stop_sequences):
        break
Context Window and Attention
The context window limits how far back the model "remembers":
GPT-4: 8,192 or 32,768 tokens
Claude 3: 200,000 tokens
Gemini 1.5: 1,000,000 tokens
What happens at the limit?
- Older tokens are truncated
- Some models use sliding windows
- Retrieval-augmented generation (RAG) helps
Attention Pattern
For each new token, the model attends to all previous tokens:
Token 100 attends to: [Token 1, Token 2, ..., Token 99]
Token 101 attends to: [Token 1, Token 2, ..., Token 100]
Computation grows quadratically with context length.
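A minimal numpy sketch of causal attention (illustrative, not any particular model's implementation); the n×n score matrix is what makes the cost grow quadratically with the number of tokens n:

import numpy as np

def causal_attention(Q, K, V):
    # Q, K, V: (n_tokens, d) arrays, one row per token in the context
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                 # (n, n): every token vs. every token
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf                        # a token may not attend to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted mix of value vectors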
Special Tokens
LLMs use special tokens for structure:
<|system|>: Start of system prompt
<|user|>: Start of user message
<|assistant|>: Start of assistant response
<|endoftext|>: End of conversation
<|pad|>: Padding for batching
The model is trained to associate these tokens with specific roles and behaviors; the exact names vary from model to model.
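A sketch of how an application might assemble a prompt from such markers, using the token names listed above (real chat templates differ by model and are usually applied by the serving library):

def build_prompt(system, messages):
    # messages: list of (role, text) pairs, where role is "user" or "assistant"
    parts = [f"<|system|>{system}"]
    for role, text in messages:
        parts.append(f"<|{role}|>{text}")
    parts.append("<|assistant|>")   # cue the model to start its response
    return "".join(parts)

prompt = build_prompt("You are a helpful assistant.",
                      [("user", "What is the capital of France?")])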
Batched Generation
For efficiency, LLMs process multiple sequences together:
Batch of 4 prompts:
[
"What is the capital of France?",
"Explain quantum computing",
"Write a haiku about cats",
"Translate to Spanish: Hello"
]
All generate in parallel, but finish at different times.
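A rough sketch of batched decoding, assuming a hypothetical next_token_batch(contexts) that returns one token id per sequence; sequences that finish early are padded so the batch keeps a uniform shape:

def generate_batch(prompts, max_new_tokens, eos_id=0, pad_id=0):
    contexts = [list(p) for p in prompts]        # token ids per sequence
    finished = [False] * len(prompts)
    for _ in range(max_new_tokens):
        tokens = next_token_batch(contexts)      # hypothetical batched forward pass
        for i, tok in enumerate(tokens):
            if finished[i]:
                contexts[i].append(pad_id)       # keep all sequences the same length
            else:
                contexts[i].append(tok)
                finished[i] = (tok == eos_id)
        if all(finished):
            break
    return contexts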
Speculative Decoding
Speed up generation with a smaller model:
1. Small model generates k tokens quickly
2. Large model verifies all k in parallel
3. Accept the draft tokens up to the first one the large model rejects
4. Take the large model's token at that position and continue from there
Can be 2-3x faster with no quality loss.
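A simplified greedy-verification sketch (production speculative decoding uses a probabilistic accept/reject test that provably preserves the large model's output distribution; draft_next and target_argmax_batch are hypothetical helpers):

def speculative_step(context, k=4):
    # 1. Small model drafts k tokens quickly
    draft = []
    for _ in range(k):
        draft.append(draft_next(context + draft))      # hypothetical small model
    # 2. Large model scores all k positions in one parallel forward pass
    verified = target_argmax_batch(context, draft)     # hypothetical large model
    # 3. Accept draft tokens until the first disagreement
    accepted = []
    for d, v in zip(draft, verified):
        if d == v:
            accepted.append(d)
        else:
            accepted.append(v)                         # 4. take the large model's token
            break
    return context + accepted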
KV Cache
To avoid recomputing, LLMs cache key-value pairs:
First new token: compute K, V for every prompt token (the prefill pass) and cache them
Second token: reuse the cached K, V; compute K, V only for the newest token
Third token: reuse the cache again, appending one new K, V pair
...
This is why:
- First token is slow (processing whole prompt)
- Subsequent tokens are fast (incremental)
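A sketch of the caching pattern, with hypothetical compute_kv, compute_query, and attend helpers; the point is that each decode step appends one K, V pair instead of recomputing all of them:

k_cache, v_cache = [], []

def decode_step(new_token):
    # Compute K and V only for the newest token and append them to the cache
    k, v = compute_kv(new_token)         # hypothetical per-token projections
    k_cache.append(k)
    v_cache.append(v)
    # The new token's query attends over every cached position
    q = compute_query(new_token)         # hypothetical
    return attend(q, k_cache, v_cache)   # hypothetical attention call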
Logit Bias
Manually adjust specific tokens' logits before sampling (real APIs usually key the bias by token id rather than by the token's text):
# Increase probability of "Python"
logit_bias = {"Python": +5}
# Decrease probability of "Java"
logit_bias = {"Java": -10}
Use cases:
- Force specific formatting
- Avoid certain words
- Guide toward topics
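A sketch of applying a bias just before sampling, with the bias keyed by token id as real APIs do (values here are illustrative; APIs typically clamp biases to roughly -100..100):

def apply_logit_bias(logits, logit_bias):
    # logit_bias maps token id -> additive adjustment to that token's logit
    adjusted = list(logits)
    for token_id, bias in logit_bias.items():
        adjusted[token_id] += bias
    return adjusted

# e.g. strongly discourage token id 1234, mildly encourage token id 567
# biased = apply_logit_bias(logits, {1234: -10, 567: +5})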
Summary
| Stage | What Happens |
|---|---|
| Input | Tokenize prompt |
| Forward pass | Compute logits for all vocab tokens |
| Temperature | Scale logits to control randomness |
| Top-p/Top-k | Truncate to likely tokens |
| Repetition penalty | Discourage repeats |
| Sampling | Pick one token |
| Stop check | End if stop sequence hit |
| Append | Add token to context |
| Repeat | Generate next token |
Understanding this process helps you:
- Choose better generation parameters
- Debug unexpected outputs
- Understand model limitations
- Build better prompts
Next, we'll learn how to evaluate model performance with precision, recall, and other metrics.

