Softmax and Temperature in AI
When a neural network makes predictions, its raw outputs (called logits) can be any real numbers. The softmax function transforms these into a probability distribution. Temperature then controls how "sharp" or "flat" that distribution is. These concepts are fundamental to how modern AI systems generate outputs.
The Problem Softmax Solves
A neural network's final layer produces logits—raw scores that can be any real number:
Logits: [2.5, 1.0, -0.5]   (cat, dog, bird)
These aren't probabilities because:
- They don't sum to 1
- Some are negative
- They could be any magnitude
We need to convert them to a valid probability distribution.
The Softmax Function
Softmax converts logits into probabilities:
softmax(zᵢ) = e^(zᵢ) / Σⱼ e^(zⱼ)
For each logit:
- Exponentiate it (makes everything positive)
- Divide by the sum of all exponentiated values (makes them sum to 1)
Step-by-Step Example
Logits: [2.5, 1.0, -0.5]
Step 1: Exponentiate
e^2.5 = 12.18
e^1.0 = 2.72
e^-0.5 = 0.61
Step 2: Sum
12.18 + 2.72 + 0.61 = 15.51
Step 3: Normalize
P(cat) = 12.18 / 15.51 = 0.785
P(dog) = 2.72 / 15.51 = 0.175
P(bird) = 0.61 / 15.51 = 0.039
Result: [0.785, 0.175, 0.039] — a valid probability distribution!
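As a quick check, the whole computation fits in a few lines of Python. This is a minimal sketch written from scratch (the softmax helper below is not a library function):

    import math

    def softmax(logits):
        # Exponentiate each logit, then divide by the sum of the exponentials.
        exp_values = [math.exp(z) for z in logits]
        total = sum(exp_values)
        return [e / total for e in exp_values]

    print(softmax([2.5, 1.0, -0.5]))
    # ~[0.785, 0.175, 0.039] -- matches the worked example above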
Properties of Softmax
- Preserves order: Higher logits → higher probabilities
- Output range: Each value is in (0, 1)
- Sums to 1: Always produces valid distribution
- Smooth: Small changes in logits → small changes in probabilities
- Amplifies differences: Because of the exponential, larger gaps between logits produce disproportionately larger gaps between probabilities
Temperature
Temperature (T) scales the logits before softmax:
softmax(zᵢ / T) = e^(zᵢ/T) / Σⱼ e^(zⱼ/T)
Temperature controls the "sharpness" of the distribution.
High Temperature (T > 1): Flatter Distribution
Logits: [2.5, 1.0, -0.5], T = 2.0
Scaled logits: [1.25, 0.5, -0.25]
Result: [0.59, 0.28, 0.13] ← More uniform
- Probabilities become more similar
- More randomness in sampling
- Model appears "less confident"
- Outputs are more diverse/creative
Low Temperature (T < 1): Sharper Distribution
Logits: [2.5, 1.0, -0.5], T = 0.5
Scaled logits: [5.0, 2.0, -1.0]
Result: [0.95, 0.047, 0.002] ← Very peaked
- One option dominates
- Less randomness in sampling
- Model appears "more confident"
- Outputs are more focused/deterministic
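To see both cases in code, here is a minimal sketch of a temperature-scaled softmax (the function name softmax_with_temperature is our own, not a standard API):

    import math

    def softmax_with_temperature(logits, T=1.0):
        # Divide every logit by T, then apply the ordinary softmax.
        scaled = [z / T for z in logits]
        exp_values = [math.exp(z) for z in scaled]
        total = sum(exp_values)
        return [e / total for e in exp_values]

    logits = [2.5, 1.0, -0.5]
    print(softmax_with_temperature(logits, T=2.0))  # ~[0.59, 0.28, 0.13]   (flatter)
    print(softmax_with_temperature(logits, T=1.0))  # ~[0.79, 0.18, 0.04]   (standard softmax)
    print(softmax_with_temperature(logits, T=0.5))  # ~[0.95, 0.047, 0.002] (sharper)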
Temperature = 0 (Limit Case)
As T → 0, softmax approaches argmax:
- The highest logit gets probability ~1
- All others get probability ~0
- Completely deterministic
Temperature = ∞ (Limit Case)
As T → ∞, softmax approaches uniform distribution:
- All options become equally likely
- Maximum randomness
- Ignores model's preferences
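Both limits are easy to see numerically by pushing T toward the extremes, reusing softmax_with_temperature from the sketch above (T = 0 itself would divide by zero, so the limit is approached, not reached):

    logits = [2.5, 1.0, -0.5]
    print(softmax_with_temperature(logits, T=0.01))  # ~[1.0, 0.0, 0.0]    -> effectively argmax
    print(softmax_with_temperature(logits, T=100))   # ~[0.34, 0.33, 0.33] -> nearly uniform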
Visualization
[Figure: temperature effect on the distribution. Bar charts of the cat/dog/bird probabilities at T = 0.5 (sharp), T = 1.0 (normal), and T = 2.0 (flat); lower temperatures pile the probability onto "cat", while higher temperatures flatten the bars toward uniform.]
Temperature in Language Models
When ChatGPT or other LLMs generate text, they sample each next token from a softmax distribution, and temperature controls how adventurous that sampling is:
Low Temperature (0.0 - 0.3): Focused
Prompt: "The capital of France is"
Output: "Paris" (deterministic, factual)
Best for:
- Factual questions
- Code generation
- Structured outputs
Medium Temperature (0.7 - 0.9): Balanced
Prompt: "Write a haiku about coding"
Output: Varies, but coherent and relevant
Best for:
- General conversation
- Creative but guided tasks
- Most use cases
High Temperature (1.0 - 2.0): Creative
Prompt: "Write a haiku about coding"
Output: Might be more unusual, surprising, or occasionally nonsensical
Best for:
- Brainstorming
- Creative writing exploration
- Generating diverse options
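To make the sampling behaviour concrete, the sketch below draws a "next token" from a temperature-scaled softmax. The token list and logits are made up for illustration; real models work over vocabularies of tens of thousands of tokens, but the mechanism is the same:

    import math
    import random

    def sample_token(tokens, logits, T):
        # Temperature-scaled softmax, then one random draw from the distribution.
        exp_values = [math.exp(z / T) for z in logits]
        total = sum(exp_values)
        probs = [e / total for e in exp_values]
        return random.choices(tokens, weights=probs, k=1)[0]

    tokens = ["Paris", "Lyon", "London", "banana"]  # hypothetical next-token candidates
    logits = [6.0, 2.0, 1.0, -2.0]                  # hypothetical model scores

    print([sample_token(tokens, logits, T=0.2) for _ in range(5)])  # essentially always "Paris"
    print([sample_token(tokens, logits, T=2.0) for _ in range(5)])  # usually "Paris", but alternatives show up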
Softmax in Different Contexts
Classification
For classification, we typically use T = 1 and take argmax:
Logits → Softmax → Argmax → Predicted class
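One detail worth noting: softmax preserves the order of the logits, so the argmax of the probabilities is the same class as the argmax of the raw logits, and the softmax step only matters when you want the scores themselves. A minimal sketch:

    def predict_class(logits, labels):
        # Softmax is monotonic, so argmax over logits == argmax over probabilities.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return labels[best]

    print(predict_class([2.5, 1.0, -0.5], ["cat", "dog", "bird"]))  # "cat"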
Knowledge Distillation
When training a smaller model to mimic a larger one, higher temperature reveals more information:
Teacher's softmax outputs (T = 4): [0.6, 0.25, 0.15]
This "soft" distribution contains more signal than:
Hard labels: [1, 0, 0]
The student learns not just that "cat" is right, but also that "dog" is more plausible than "bird."
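A small sketch of how those soft targets arise. The teacher logits below are invented, chosen so that T = 4 reproduces the [0.6, 0.25, 0.15] distribution quoted above:

    import math

    def soft_targets(teacher_logits, T):
        # Teacher probabilities from a temperature-scaled softmax ("soft" labels).
        exp_values = [math.exp(z / T) for z in teacher_logits]
        total = sum(exp_values)
        return [e / total for e in exp_values]

    teacher_logits = [5.0, 1.5, -0.5]          # illustrative teacher scores for cat/dog/bird
    print(soft_targets(teacher_logits, T=1))   # ~[0.97, 0.03, 0.004] -- close to a hard label
    print(soft_targets(teacher_logits, T=4))   # ~[0.60, 0.25, 0.15]  -- the dog-vs-bird gap is visible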
Reinforcement Learning
In RL, temperature controls exploration vs. exploitation:
- High T: Explore more actions
- Low T: Exploit known good actions
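One common form of this is a softmax (Boltzmann) policy over estimated action values, sketched here with made-up Q-values:

    import math
    import random

    def softmax_policy(q_values, T):
        # Sample an action with probability proportional to exp(Q / T).
        exp_values = [math.exp(q / T) for q in q_values]
        total = sum(exp_values)
        probs = [e / total for e in exp_values]
        return random.choices(range(len(q_values)), weights=probs, k=1)[0]

    q_values = [1.0, 0.8, 0.1]              # illustrative action-value estimates
    print(softmax_policy(q_values, T=5.0))  # high T: all three actions roughly equally likely (explore)
    print(softmax_policy(q_values, T=0.1))  # low T: mostly action 0 (exploit)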
Numerical Stability
Raw softmax can overflow (e^1000 = ∞) or underflow (e^-1000 ≈ 0).
The Log-Sum-Exp Trick
Subtract the maximum logit before exponentiating:
    import math

    def stable_softmax(logits):
        # Shifting by the max doesn't change the result (the shift cancels in
        # the ratio), but it keeps every exponent <= 0, so nothing overflows.
        max_logit = max(logits)
        shifted = [z - max_logit for z in logits]
        exp_values = [math.exp(z) for z in shifted]
        total = sum(exp_values)
        return [e / total for e in exp_values]
This gives the same result but avoids numerical issues.
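Continuing with stable_softmax as defined above, a quick check with large logits shows why the shift matters:

    print(stable_softmax([1000.0, 1001.0, 1002.0]))
    # ~[0.09, 0.245, 0.665] -- the unshifted version would fail, since math.exp(1000) overflows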
Softmax Alternatives
Sparsemax
Produces sparse outputs (some probabilities exactly 0):
- Useful when you want the model to "ignore" some options
- More interpretable
Gumbel-Softmax
Enables backpropagation through discrete sampling:
- Used in training when you need to sample but also backpropagate
- Essential for some generative models
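A minimal sketch of a Gumbel-softmax sample: add Gumbel(0, 1) noise to the logits, then apply a temperature-scaled softmax. In practice this runs inside an autodiff framework so gradients can flow through the soft sample; the plain-Python version below only shows the forward computation:

    import math
    import random

    def gumbel_softmax_sample(logits, tau=1.0):
        # Gumbel(0, 1) noise: g = -log(-log(u)) with u ~ Uniform(0, 1).
        gumbel = [-math.log(-math.log(random.random())) for _ in logits]
        noisy = [(z + g) / tau for z, g in zip(logits, gumbel)]
        exp_values = [math.exp(v) for v in noisy]
        total = sum(exp_values)
        return [e / total for e in exp_values]

    print(gumbel_softmax_sample([2.5, 1.0, -0.5], tau=0.5))
    # Close to one-hot; which class "wins" varies from call to call.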
Top-k and Top-p (Nucleus) Sampling
Instead of using all probabilities:
Top-k: Only consider the k highest probabilities
Original: [0.7, 0.15, 0.08, 0.05, 0.02]
Top-3: [0.7, 0.15, 0.08] → renormalize → [0.753, 0.161, 0.086]
Top-p (nucleus): Keep smallest set that covers probability p
Original: [0.7, 0.15, 0.08, 0.05, 0.02]
Top-p (p=0.9): [0.7, 0.15, 0.08] → renormalize
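A small sketch of both filters, applied to the example distribution above (the function names are our own):

    def top_k(probs, k):
        # Keep the k largest probabilities and renormalize them to sum to 1.
        kept = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        total = sum(probs[i] for i in kept)
        return {i: probs[i] / total for i in kept}

    def top_p(probs, p):
        # Keep the smallest high-probability prefix whose mass reaches p, then renormalize.
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= p:
                break
        total = sum(probs[i] for i in kept)
        return {i: probs[i] / total for i in kept}

    probs = [0.7, 0.15, 0.08, 0.05, 0.02]
    print(top_k(probs, 3))    # ~{0: 0.753, 1: 0.161, 2: 0.086}
    print(top_p(probs, 0.9))  # same three entries here, since 0.7 + 0.15 + 0.08 = 0.93 >= 0.9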
Common Temperature Values
| Application | Typical Temperature | Reasoning |
|---|---|---|
| Classification | 1.0 | Standard training |
| Code generation | 0.0 - 0.2 | Precision matters |
| Conversation | 0.7 - 0.9 | Balance of coherence and variety |
| Creative writing | 1.0 - 1.5 | More diversity |
| Brainstorming | 1.5 - 2.0 | Maximum creativity |
| Knowledge distillation | 3.0 - 20.0 | Reveal teacher's uncertainty |
Summary
- Softmax converts arbitrary numbers into a probability distribution
- Temperature controls how peaked or flat the distribution is
- Low temperature → more deterministic, focused outputs
- High temperature → more random, diverse outputs
- Language models expose temperature as a key generation parameter
- The log-sum-exp trick ensures numerical stability
- Alternatives like top-k and top-p provide additional control
This completes Module 3! Next, we'll explore expected value and variance—how we measure the "average" and "spread" of probability distributions.

