How LLMs Pick Tokens
Large Language Models like GPT-4 and Claude generate text one token at a time. Each token is selected from a probability distribution over the entire vocabulary. Understanding this process reveals why AI responses can be both impressively coherent and occasionally unpredictable.
The Token Generation Process
Step 1: Process the Context
The model takes in all previous tokens (the prompt and any generated text):
Input: "The capital of France is"
Step 2: Compute Logits
The final layer produces a score (logit) for every token in the vocabulary:
Vocabulary size: ~50,000 tokens
Logits: [
"Paris": 8.2,
"Lyon": 3.1,
"the": -2.5,
"France": 1.0,
"<newline>": 0.5,
...50,000 more entries...
]
Step 3: Apply Softmax
Convert logits to probabilities (the sum in the denominator runs over all ~50,000 vocabulary entries):
P("Paris") = e^8.2 / Σe^logit ≈ 0.73
P("Lyon") = e^3.1 / Σe^logit ≈ 0.004
P("the") = e^-2.5 / Σe^logit ≈ 0.00002
...
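A minimal sketch of this step in plain Python (subtracting the max logit before exponentiating is a standard trick for numerical stability and does not change the result):

import math

def softmax(logits):
    # Subtract the max logit before exponentiating for numerical stability;
    # the resulting probabilities are unchanged.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Over just these three logits "Paris" gets ~0.99; over the full ~50,000-token
# vocabulary, the long tail soaks up enough mass to bring it down to ~0.73.
print(softmax([8.2, 3.1, -2.5]))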
Step 4: Sample
Select one token based on sampling parameters:
With temperature=0.7:
Token selected: "Paris"
With temperature=1.5:
Token selected: "Lyon" (less likely but possible)
Step 5: Repeat
Append the selected token and repeat:
"The capital of France is Paris"
→ Continue generating if needed
What is a Token?
Tokens are the fundamental units of text for LLMs:
"Hello, how are you?" → ["Hello", ",", " how", " are", " you", "?"]
"Cryptocurrency" → ["Cry", "pto", "currency"] (split into subwords; exact pieces vary by tokenizer)
"🎉" → often a single token, though some tokenizers split an emoji into several byte-level tokens
" " → might be 1-4 tokens depending on tokenizer
Common Tokenization
Most modern LLMs use Byte Pair Encoding (BPE):
- Starts with individual characters
- Merges frequent pairs into single tokens
- Common words become single tokens
- Rare words split into pieces
Token count affects:
- Context window usage
- Cost (many APIs charge per token)
- Generation speed
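To see tokenization in practice, a library such as tiktoken can count and inspect tokens (an illustrative choice; the tokenizer your model uses may split text differently):

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the BPE used by several OpenAI models

text = "Hello, how are you?"
token_ids = enc.encode(text)
print(len(token_ids))                        # token count, e.g. 6
print([enc.decode([t]) for t in token_ids])  # the individual token strings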
The Autoregressive Loop
LLMs generate autoregressively—each token depends on all previous ones:
Step 1: P(token₁ | prompt)
Step 2: P(token₂ | prompt, token₁)
Step 3: P(token₃ | prompt, token₁, token₂)
...
Key insight: The model doesn't "plan" an entire response in advance. It commits to each token before generating the next one.
This explains:
- Why models can contradict themselves mid-sentence
- Why they can get "stuck" in repetitive loops
- Why longer responses accumulate errors
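A sketch of that loop, assuming a hypothetical next_token_logits(context) helper for the forward pass and reusing the sample_with_temperature function shown in the next section:

END_OF_TEXT_ID = 50256   # hypothetical end-of-text token id

def generate(prompt_tokens, max_new_tokens, temperature=0.7):
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = next_token_logits(context)                    # hypothetical forward pass
        token = sample_with_temperature(logits, temperature)   # defined below
        context.append(token)                                  # committed: never revised
        if token == END_OF_TEXT_ID:
            break
    return context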
Temperature in Detail
Temperature scales logits before softmax:
import random

def sample_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]   # divide every logit by T
    probs = softmax(scaled)                      # softmax as sketched earlier
    return random.choices(range(len(probs)), weights=probs, k=1)[0]
Temperature = 0 (in practice, greedy decoding or a value approaching 0)
Logits: [8.2, 3.1, 1.0]
Scaled (T=0.01): [820, 310, 100]
Probs: [~1.0, ~0.0, ~0.0]
Always picks "Paris"
Temperature = 1 (Default)
Logits: [8.2, 3.1, 1.0]
Scaled (T=1): [8.2, 3.1, 1.0]
Probs: [0.73, 0.004, 0.0005] (over the full vocabulary)
Usually picks "Paris"; the remaining ~27% is spread across the rest of the vocabulary
Temperature = 2 (High)
Logits: [8.2, 3.1, 1.0]
Scaled (T=2): [4.1, 1.55, 0.5]
Probs: [~0.52, ~0.04, ~0.01] (the distribution flattens)
More variety: "Paris" is still favored, but unlikely tokens get a real chance
Top-p (Nucleus) Sampling in Practice
Many modern LLM APIs offer top-p (nucleus) sampling, often combined with temperature:
import random

def top_p_sample(probs, p=0.9):
    """Keep the most likely tokens until their cumulative probability reaches p."""
    sorted_tokens = sorted(enumerate(probs), key=lambda x: -x[1])
    cumulative = 0.0
    nucleus = []
    for token_id, prob in sorted_tokens:
        cumulative += prob
        nucleus.append((token_id, prob))
        if cumulative >= p:
            break
    # Renormalize the nucleus and sample from it
    total = sum(prob for _, prob in nucleus)
    token_ids = [token_id for token_id, _ in nucleus]
    weights = [prob / total for _, prob in nucleus]
    return random.choices(token_ids, weights=weights, k=1)[0]
Why top-p works well:
- Confident predictions: Small nucleus, focused output
- Uncertain predictions: Large nucleus, more variety
Repetition Penalties
LLMs can get stuck repeating themselves:
"The cat sat on the mat. The cat sat on the mat. The cat sat on..."
Frequency Penalty
Reduce a token's logit in proportion to how many times it has already appeared:
logit[token] -= frequency_penalty * count[token]
Presence Penalty
Subtract a flat penalty from any token that has appeared at least once:
logit[token] -= presence_penalty * (1 if token in context else 0)
Repetition Penalty (Multiplicative)
if token in context:
if logit[token] > 0:
logit[token] /= repetition_penalty
else:
logit[token] *= repetition_penalty
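A combined sketch, assuming logits is a list indexed by token id and context is the list of token ids generated so far (names and defaults are illustrative, not any particular API):

from collections import Counter

def apply_penalties(logits, context, frequency_penalty=0.0,
                    presence_penalty=0.0, repetition_penalty=1.0):
    counts = Counter(context)
    adjusted = list(logits)
    for token, count in counts.items():
        adjusted[token] -= frequency_penalty * count   # grows with repetitions
        adjusted[token] -= presence_penalty            # flat, once per seen token
        if adjusted[token] > 0:                        # multiplicative variant
            adjusted[token] /= repetition_penalty
        else:
            adjusted[token] *= repetition_penalty
    return adjusted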
Stop Sequences
Generation ends when specific tokens appear:
stop_sequences = ["<|endoftext|>", "\n\nHuman:", "```"]
output = ""
while True:
    token = generate_next_token()   # assumed to return the decoded token text
    output += token
    if any(output.endswith(seq) for seq in stop_sequences):
        break
Context Window and Attention
The context window limits how far back the model "remembers":
GPT-4: 8,192 or 32,768 tokens
Claude 3: 200,000 tokens
Gemini 1.5: 1,000,000 tokens
What happens at the limit?
- Older tokens are truncated
- Some models use sliding windows
- Retrieval-augmented generation (RAG) helps
Attention Pattern
For each new token, the model attends to all previous tokens:
Token 100 attends to: [Token 1, Token 2, ..., Token 99]
Token 101 attends to: [Token 1, Token 2, ..., Token 100]
Computation grows quadratically with context length.
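A minimal numpy sketch of causal attention (illustrative, not any particular model's implementation); the n×n score matrix is what makes the cost grow quadratically with the number of tokens n:

import numpy as np

def causal_attention(Q, K, V):
    # Q, K, V: (n_tokens, d) arrays, one row per token in the context
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                 # (n, n): every token vs. every token
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf                        # a token may not attend to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted mix of value vectors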
Special Tokens
LLMs use special tokens for structure:
<|system|>: Start of system prompt
<|user|>: Start of user message
<|assistant|>: Start of assistant response
<|endoftext|>: End of conversation
<|pad|>: Padding for batching
The model is trained to associate these tokens with specific roles and behaviors; the exact names vary from model to model.
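A sketch of how an application might assemble a prompt from such markers, using the token names listed above (real chat templates differ by model and are usually applied by the serving library):

def build_prompt(system, messages):
    # messages: list of (role, text) pairs, where role is "user" or "assistant"
    parts = [f"<|system|>{system}"]
    for role, text in messages:
        parts.append(f"<|{role}|>{text}")
    parts.append("<|assistant|>")   # cue the model to start its response
    return "".join(parts)

prompt = build_prompt("You are a helpful assistant.",
                      [("user", "What is the capital of France?")])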
Batched Generation
For efficiency, LLMs process multiple sequences together:
Batch of 4 prompts:
[
"What is the capital of France?",
"Explain quantum computing",
"Write a haiku about cats",
"Translate to Spanish: Hello"
]
All generate in parallel, but finish at different times.
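A rough sketch of batched decoding, assuming a hypothetical next_token_batch(contexts) that returns one token id per sequence; sequences that finish early are padded so the batch keeps a uniform shape:

def generate_batch(prompts, max_new_tokens, eos_id=0, pad_id=0):
    contexts = [list(p) for p in prompts]        # token ids per sequence
    finished = [False] * len(prompts)
    for _ in range(max_new_tokens):
        tokens = next_token_batch(contexts)      # hypothetical batched forward pass
        for i, tok in enumerate(tokens):
            if finished[i]:
                contexts[i].append(pad_id)       # keep all sequences the same length
            else:
                contexts[i].append(tok)
                finished[i] = (tok == eos_id)
        if all(finished):
            break
    return contexts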
Speculative Decoding
Speed up generation with a smaller model:
1. Small model generates k tokens quickly
2. Large model verifies all k in parallel
3. Accept the draft tokens up to the first one the large model rejects
4. Take the large model's token at that position and continue from there
Can be 2-3x faster with no quality loss.
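A simplified greedy-verification sketch (production speculative decoding uses a probabilistic accept/reject test that provably preserves the large model's output distribution; draft_next and target_argmax_batch are hypothetical helpers):

def speculative_step(context, k=4):
    # 1. Small model drafts k tokens quickly
    draft = []
    for _ in range(k):
        draft.append(draft_next(context + draft))      # hypothetical small model
    # 2. Large model scores all k positions in one parallel forward pass
    verified = target_argmax_batch(context, draft)     # hypothetical large model
    # 3. Accept draft tokens until the first disagreement
    accepted = []
    for d, v in zip(draft, verified):
        if d == v:
            accepted.append(d)
        else:
            accepted.append(v)                         # 4. take the large model's token
            break
    return context + accepted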
KV Cache
To avoid recomputing, LLMs cache key-value pairs:
First new token: compute K, V for every prompt token (the prefill pass) and cache them
Second token: reuse the cached K, V; compute K, V only for the newest token
Third token: reuse the cache again, appending one new K, V pair
...
This is why:
- First token is slow (processing whole prompt)
- Subsequent tokens are fast (incremental)
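A sketch of the caching pattern, with hypothetical compute_kv, compute_query, and attend helpers; the point is that each decode step appends one K, V pair instead of recomputing all of them:

k_cache, v_cache = [], []

def decode_step(new_token):
    # Compute K and V only for the newest token and append them to the cache
    k, v = compute_kv(new_token)         # hypothetical per-token projections
    k_cache.append(k)
    v_cache.append(v)
    # The new token's query attends over every cached position
    q = compute_query(new_token)         # hypothetical
    return attend(q, k_cache, v_cache)   # hypothetical attention call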
Logit Bias
Manually adjust specific tokens' logits before sampling (real APIs usually key the bias by token id rather than by the token's text):
# Increase probability of "Python"
logit_bias = {"Python": +5}
# Decrease probability of "Java"
logit_bias = {"Java": -10}
Use cases:
- Force specific formatting
- Avoid certain words
- Guide toward topics
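A sketch of applying a bias just before sampling, with the bias keyed by token id as real APIs do (values here are illustrative; APIs typically clamp biases to roughly -100..100):

def apply_logit_bias(logits, logit_bias):
    # logit_bias maps token id -> additive adjustment to that token's logit
    adjusted = list(logits)
    for token_id, bias in logit_bias.items():
        adjusted[token_id] += bias
    return adjusted

# e.g. strongly discourage token id 1234, mildly encourage token id 567
# biased = apply_logit_bias(logits, {1234: -10, 567: +5})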
Summary
| Stage | What Happens |
|---|---|
| Input | Tokenize prompt |
| Forward pass | Compute logits for all vocab tokens |
| Temperature | Scale logits to control randomness |
| Top-p/Top-k | Truncate to likely tokens |
| Repetition penalty | Discourage repeats |
| Sampling | Pick one token |
| Stop check | End if stop sequence hit |
| Append | Add token to context |
| Repeat | Generate next token |
Understanding this process helps you:
- Choose better generation parameters
- Debug unexpected outputs
- Understand model limitations
- Build better prompts
Next, we'll learn how to evaluate model performance with precision, recall, and other metrics.

