Softmax and Temperature in AI
When a neural network makes predictions, its raw outputs (called logits) can be any real numbers. The softmax function transforms these into a probability distribution. Temperature then controls how "sharp" or "flat" that distribution is. These concepts are fundamental to how modern AI systems generate outputs.
The Problem Softmax Solves
A neural network's final layer produces logits, raw scores that can be any real number:
Logits: [2.5, 1.0, -0.5]   (cat, dog, bird)
These aren't probabilities because:
- They don't sum to 1
- Some are negative
- They could be any magnitude
We need to convert them to a valid probability distribution.
The Softmax Function
Softmax converts logits into probabilities:
softmax(zᵢ) = e^(zᵢ) / Σⱼ e^(zⱼ)
For each logit:
- Exponentiate it (makes everything positive)
- Divide by the sum of all exponentiated values (makes them sum to 1)
Step-by-Step Example
Logits: [2.5, 1.0, -0.5]
Step 1: Exponentiate
e^2.5 = 12.18
e^1.0 = 2.72
e^-0.5 = 0.61
Step 2: Sum
12.18 + 2.72 + 0.61 = 15.51
Step 3: Normalize
P(cat) = 12.18 / 15.51 = 0.785
P(dog) = 2.72 / 15.51 = 0.175
P(bird) = 0.61 / 15.51 = 0.039
Result: [0.785, 0.175, 0.039] → a valid probability distribution!
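The same computation as a minimal Python sketch (the function name is just for illustration):

import math

def softmax(logits):
    # Exponentiate each logit, then normalize by the sum.
    exp_values = [math.exp(z) for z in logits]
    total = sum(exp_values)
    return [e / total for e in exp_values]

print(softmax([2.5, 1.0, -0.5]))  # approximately [0.785, 0.175, 0.039]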
Properties of Softmax
- Preserves order: Higher logits → higher probabilities
- Output range: Each value is in (0, 1)
- Sums to 1: Always produces valid distribution
- Smooth: Small changes in logits → small changes in probabilities
- Amplifies differences: Larger gaps between logits become more pronounced
Temperature
Temperature (T) scales the logits before softmax:
softmax(zᵢ / T) = e^(zᵢ/T) / Σⱼ e^(zⱼ/T)
Temperature controls the "sharpness" of the distribution.
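As a minimal sketch, dividing every logit by T before the usual softmax implements this directly (the function name is illustrative):

import math

def softmax_with_temperature(logits, temperature=1.0):
    # Scale the logits by 1/T, then apply the standard softmax.
    scaled = [z / temperature for z in logits]
    exp_values = [math.exp(z) for z in scaled]
    total = sum(exp_values)
    return [e / total for e in exp_values]

print(softmax_with_temperature([2.5, 1.0, -0.5], 2.0))  # flatter: ~[0.59, 0.28, 0.13]
print(softmax_with_temperature([2.5, 1.0, -0.5], 0.5))  # sharper: ~[0.950, 0.047, 0.002]

The two calls reproduce the high- and low-temperature examples below.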
High Temperature (T > 1): Flatter Distribution
Logits: [2.5, 1.0, -0.5], T = 2.0
Scaled logits: [1.25, 0.5, -0.25]
Result: [0.59, 0.28, 0.13] → More uniform
- Probabilities become more similar
- More randomness in sampling
- Model appears "less confident"
- Outputs are more diverse/creative
Low Temperature (T < 1): Sharper Distribution
Logits: [2.5, 1.0, -0.5], T = 0.5
Scaled logits: [5.0, 2.0, -1.0]
Result: [0.950, 0.047, 0.002] → Very peaked
- One option dominates
- Less randomness in sampling
- Model appears "more confident"
- Outputs are more focused/deterministic
Temperature = 0 (Limit Case)
As T → 0, softmax approaches argmax:
- The highest logit gets probability ~1
- All others get probability ~0
- Completely deterministic
Temperature = ∞ (Limit Case)
As T → ∞, softmax approaches a uniform distribution:
- All options become equally likely
- Maximum randomness
- Ignores model's preferences
Visualization
[Figure: Temperature Effect on Distribution. Bar charts of the same distribution over cat, dog, and bird at three temperatures: T = 0.5 (sharp), T = 1.0 (normal), T = 2.0 (flat).]
Temperature in Language Models
When ChatGPT or other LLMs generate text, they use temperature to control creativity:
Low Temperature (0.0 - 0.3): Focused
Prompt: "The capital of France is"
Output: "Paris" (deterministic, factual)
Best for:
- Factual questions
- Code generation
- Structured outputs
Medium Temperature (0.7 - 0.9): Balanced
Prompt: "Write a haiku about coding"
Output: Varies, but coherent and relevant
Best for:
- General conversation
- Creative but guided tasks
- Most use cases
High Temperature (1.0 - 2.0): Creative
Prompt: "Write a haiku about coding"
Output: Might be more unusual, surprising, or occasionally nonsensical
Best for:
- Brainstorming
- Creative writing exploration
- Generating diverse options
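As a rough sketch of how the ranges above act at generation time, here is a toy next-token sampler; the vocabulary, logits, and function name are made up for illustration:

import math
import random

def sample_next_token(logits, temperature):
    # Temperature-scaled softmax over the logits, then one random draw.
    scaled = [z / temperature for z in logits]
    max_z = max(scaled)  # subtract the max for numerical stability
    exp_values = [math.exp(z - max_z) for z in scaled]
    total = sum(exp_values)
    probs = [e / total for e in exp_values]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

vocab = ["Paris", "Lyon", "a city", "the moon"]  # toy vocabulary
logits = [6.0, 2.0, 1.0, -3.0]                   # toy next-token logits

for t in (0.2, 0.8, 1.5):
    samples = [vocab[sample_next_token(logits, t)] for _ in range(10)]
    print(t, samples)

At T = 0.2 nearly every draw is "Paris"; at T = 1.5 the other options start to appear.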
Softmax in Different Contexts
Classification
For classification, we typically use T = 1 and take argmax:
Logits → Softmax → Argmax → Predicted class
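A tiny sketch of that pipeline (the function name is illustrative); because softmax preserves order, the argmax of the probabilities is the same as the argmax of the raw logits:

import math

def predict_class(logits):
    # Softmax at T = 1, then take the index with the highest probability.
    exp_values = [math.exp(z) for z in logits]
    total = sum(exp_values)
    probs = [e / total for e in exp_values]
    return max(range(len(probs)), key=lambda i: probs[i])

print(predict_class([2.5, 1.0, -0.5]))  # 0, i.e. "cat"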
Knowledge Distillation
When training a smaller model to mimic a larger one, higher temperature reveals more information:
Teacher's softmax outputs (T = 4): [0.6, 0.25, 0.15]
This "soft" distribution contains more signal than:
Hard labels: [1, 0, 0]
The student learns not just that "cat" is right, but also that "dog" is more plausible than "bird."
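A minimal sketch of that idea (not a full training loop; function names are illustrative). The T² factor is the usual convention for keeping the soft-target loss on a scale comparable to the hard-label loss:

import math

def soften(logits, temperature):
    # Temperature-scaled softmax, applied to both teacher and student logits.
    scaled = [z / temperature for z in logits]
    exp_values = [math.exp(z) for z in scaled]
    total = sum(exp_values)
    return [e / total for e in exp_values]

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    # Cross-entropy between the teacher's soft targets and the student's
    # softened predictions, scaled by T^2.
    p_teacher = soften(teacher_logits, temperature)
    p_student = soften(student_logits, temperature)
    cross_entropy = -sum(p * math.log(q) for p, q in zip(p_teacher, p_student))
    return temperature ** 2 * cross_entropy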
Reinforcement Learning
In RL, temperature controls exploration vs. exploitation (see the sketch after this list):
- High T: Explore more actions
- Low T: Exploit known good actions
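For example, a minimal sketch of softmax (Boltzmann) action selection over estimated action values; the Q-values here are made up:

import math
import random

def boltzmann_action(q_values, temperature):
    # Higher T flattens the distribution (more exploration);
    # lower T concentrates it on the best-known action (more exploitation).
    scaled = [q / temperature for q in q_values]
    max_q = max(scaled)
    exp_values = [math.exp(q - max_q) for q in scaled]
    total = sum(exp_values)
    probs = [e / total for e in exp_values]
    return random.choices(range(len(q_values)), weights=probs, k=1)[0]

q_values = [1.2, 1.0, 0.3]  # toy action-value estimates
print(boltzmann_action(q_values, temperature=2.0))  # explores more
print(boltzmann_action(q_values, temperature=0.1))  # almost always action 0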
Numerical Stability
Raw softmax can overflow (e^1000 = ∞) or underflow (e^-1000 ≈ 0).
The Log-Sum-Exp Trick
Subtract the maximum logit before exponentiating:
import math

def stable_softmax(logits):
    # Shift by the max logit so the largest exponent is e^0 = 1.
    max_logit = max(logits)
    shifted = [z - max_logit for z in logits]
    exp_values = [math.exp(z) for z in shifted]
    total = sum(exp_values)
    return [e / total for e in exp_values]
This gives the same result but avoids numerical issues.
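For example, with the function above, stable_softmax([1000.0, 999.0, 998.0]) returns roughly [0.665, 0.245, 0.090], whereas calling math.exp(1000.0) directly raises an OverflowError.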
Softmax Alternatives
Sparsemax
Produces sparse outputs (some probabilities exactly 0), as in the sketch after this list:
- Useful when you want the model to "ignore" some options
- More interpretable
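A rough Python sketch of this idea, following the simplex-projection formulation of Martins and Astudillo (2016); the function name is illustrative:

def sparsemax(logits):
    # Euclidean projection of the logits onto the probability simplex;
    # some outputs can come out exactly 0.
    z_sorted = sorted(logits, reverse=True)
    cumulative, tau = 0.0, 0.0
    for j, z_j in enumerate(z_sorted, start=1):
        cumulative += z_j
        if 1 + j * z_j > cumulative:  # z_j is still in the support
            tau = (cumulative - 1) / j
    return [max(z - tau, 0.0) for z in logits]

print(sparsemax([2.5, 2.0, -0.5]))  # [0.75, 0.25, 0.0]; "bird" is dropped entirely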
Gumbel-Softmax
Enables backpropagation through discrete sampling (sketch after this list):
- Used in training when you need to sample but also backpropagate
- Essential for some generative models
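A minimal sketch of the Gumbel-softmax sampling step (straight-through and temperature-annealing details are omitted; names are illustrative):

import math
import random

def gumbel_softmax_sample(logits, temperature):
    # Add Gumbel noise to the logits, then apply a temperature-scaled softmax.
    # In an autodiff framework this same computation stays differentiable in the logits.
    eps = 1e-20  # avoids log(0)
    gumbel = [-math.log(-math.log(random.random() + eps)) for _ in logits]
    noisy = [(z + g) / temperature for z, g in zip(logits, gumbel)]
    max_z = max(noisy)
    exp_values = [math.exp(z - max_z) for z in noisy]
    total = sum(exp_values)
    return [e / total for e in exp_values]

print(gumbel_softmax_sample([2.5, 1.0, -0.5], temperature=0.5))  # close to one-hot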
Top-k and Top-p (Nucleus) Sampling
Instead of using all probabilities:
Top-k: Only consider the k highest probabilities
Original: [0.7, 0.15, 0.08, 0.05, 0.02]
Top-3: [0.7, 0.15, 0.08] → renormalize → [0.753, 0.161, 0.086]
Top-p (nucleus): Keep smallest set that covers probability p
Original: [0.7, 0.15, 0.08, 0.05, 0.02]
Top-p (p=0.9): [0.7, 0.15, 0.08] → renormalize
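A hedged sketch of both filters applied to an already-computed probability vector (function names are illustrative; real decoders typically filter the logits before the softmax):

def top_k_filter(probs, k):
    # Keep the k largest probabilities, zero out the rest, renormalize.
    threshold = sorted(probs, reverse=True)[k - 1]
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    # Keep the smallest set of highest-probability options whose mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [0.0] * len(probs), 0.0
    for i in order:
        kept[i] = probs[i]
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(kept)
    return [q / total for q in kept]

probs = [0.7, 0.15, 0.08, 0.05, 0.02]
print(top_k_filter(probs, 3))    # ~[0.753, 0.161, 0.086, 0.0, 0.0]
print(top_p_filter(probs, 0.9))  # the same three options survive here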
Common Temperature Values
| Application | Typical Temperature | Reasoning |
|---|---|---|
| Classification | 1.0 | Standard training |
| Code generation | 0.0 - 0.2 | Precision matters |
| Conversation | 0.7 - 0.9 | Balance of coherence and variety |
| Creative writing | 1.0 - 1.5 | More diversity |
| Brainstorming | 1.5 - 2.0 | Maximum creativity |
| Knowledge distillation | 3.0 - 20.0 | Reveal teacher's uncertainty |
Summary
- Softmax converts arbitrary numbers into a probability distribution
- Temperature controls how peaked or flat the distribution is
- Low temperature → more deterministic, focused outputs
- High temperature → more random, diverse outputs
- Language models expose temperature as a key generation parameter
- The log-sum-exp trick ensures numerical stability
- Alternatives like top-k and top-p provide additional control
This completes Module 3! Next, we'll explore expected value and variance: how we measure the "average" and "spread" of probability distributions.