Softmax and Temperature in AI
When a neural network makes predictions, its raw outputs (called logits) can be any real numbers. The softmax function transforms these into a probability distribution. Temperature then controls how "sharp" or "flat" that distribution is. These concepts are fundamental to how modern AI systems generate outputs.
The Problem Softmax Solves
A neural network's final layer produces logits, raw scores that can be any real number:
Logits: [2.5, 1.0, -0.5]   (cat, dog, bird)
These aren't probabilities because:
- They don't sum to 1
- Some are negative
- They could be any magnitude
We need to convert them to a valid probability distribution.
The Softmax Function
Softmax converts logits into probabilities:
softmax(zᵢ) = e^(zᵢ) / Σⱼ e^(zⱼ)
For each logit:
- Exponentiate it (makes everything positive)
- Divide by the sum of all exponentiated values (makes them sum to 1)
Step-by-Step Example
Logits: [2.5, 1.0, -0.5]
Step 1: Exponentiate
e^2.5 = 12.18
e^1.0 = 2.72
e^-0.5 = 0.61
Step 2: Sum
12.18 + 2.72 + 0.61 = 15.51
Step 3: Normalize
P(cat) = 12.18 / 15.51 = 0.785
P(dog) = 2.72 / 15.51 = 0.175
P(bird) = 0.61 / 15.51 = 0.039
Result: [0.785, 0.175, 0.039] → a valid probability distribution!
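The same computation as a minimal Python sketch (the function name is just for illustration):

import math

def softmax(logits):
    # Exponentiate each logit, then normalize by the sum.
    exp_values = [math.exp(z) for z in logits]
    total = sum(exp_values)
    return [e / total for e in exp_values]

print(softmax([2.5, 1.0, -0.5]))  # approximately [0.785, 0.175, 0.039]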
Properties of Softmax
- Preserves order: Higher logits → higher probabilities
- Output range: Each value is in (0, 1)
- Sums to 1: Always produces valid distribution
- Smooth: Small changes in logits → small changes in probabilities
- Amplifies differences: Larger gaps between logits become more pronounced
Temperature
Temperature (T) scales the logits before softmax:
softmax(zᵢ / T) = e^(zᵢ/T) / Σⱼ e^(zⱼ/T)
Temperature controls the "sharpness" of the distribution.
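As a minimal sketch, dividing every logit by T before the usual softmax implements this directly (the function name is illustrative):

import math

def softmax_with_temperature(logits, temperature=1.0):
    # Scale the logits by 1/T, then apply the standard softmax.
    scaled = [z / temperature for z in logits]
    exp_values = [math.exp(z) for z in scaled]
    total = sum(exp_values)
    return [e / total for e in exp_values]

print(softmax_with_temperature([2.5, 1.0, -0.5], 2.0))  # flatter: ~[0.59, 0.28, 0.13]
print(softmax_with_temperature([2.5, 1.0, -0.5], 0.5))  # sharper: ~[0.950, 0.047, 0.002]

The two calls reproduce the high- and low-temperature examples below.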
High Temperature (T > 1): Flatter Distribution
Logits: [2.5, 1.0, -0.5], T = 2.0
Scaled logits: [1.25, 0.5, -0.25]
Result: [0.59, 0.28, 0.13] → More uniform
- Probabilities become more similar
- More randomness in sampling
- Model appears "less confident"
- Outputs are more diverse/creative
Low Temperature (T < 1): Sharper Distribution
Logits: [2.5, 1.0, -0.5], T = 0.5
Scaled logits: [5.0, 2.0, -1.0]
Result: [0.950, 0.047, 0.002] → Very peaked
- One option dominates
- Less randomness in sampling
- Model appears "more confident"
- Outputs are more focused/deterministic
Temperature = 0 (Limit Case)
As T → 0, softmax approaches argmax:
- The highest logit gets probability ~1
- All others get probability ~0
- Completely deterministic
Temperature = ∞ (Limit Case)
As T → ∞, softmax approaches a uniform distribution:
- All options become equally likely
- Maximum randomness
- Ignores model's preferences
Visualization
[Figure: Temperature Effect on Distribution. Bar charts of the same distribution over cat, dog, and bird at three temperatures: T = 0.5 (sharp), T = 1.0 (normal), T = 2.0 (flat).]
Temperature in Language Models
When ChatGPT or other LLMs generate text, they use temperature to control creativity:
Low Temperature (0.0 - 0.3): Focused
Prompt: "The capital of France is"
Output: "Paris" (deterministic, factual)
Best for:
- Factual questions
- Code generation
- Structured outputs
Medium Temperature (0.7 - 0.9): Balanced
Prompt: "Write a haiku about coding"
Output: Varies, but coherent and relevant
Best for:
- General conversation
- Creative but guided tasks
- Most use cases
High Temperature (1.0 - 2.0): Creative
Prompt: "Write a haiku about coding"
Output: Might be more unusual, surprising, or occasionally nonsensical
Best for:
- Brainstorming
- Creative writing exploration
- Generating diverse options
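As a rough sketch of how the ranges above act at generation time, here is a toy next-token sampler; the vocabulary, logits, and function name are made up for illustration:

import math
import random

def sample_next_token(logits, temperature):
    # Temperature-scaled softmax over the logits, then one random draw.
    scaled = [z / temperature for z in logits]
    max_z = max(scaled)  # subtract the max for numerical stability
    exp_values = [math.exp(z - max_z) for z in scaled]
    total = sum(exp_values)
    probs = [e / total for e in exp_values]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

vocab = ["Paris", "Lyon", "a city", "the moon"]  # toy vocabulary
logits = [6.0, 2.0, 1.0, -3.0]                   # toy next-token logits

for t in (0.2, 0.8, 1.5):
    samples = [vocab[sample_next_token(logits, t)] for _ in range(10)]
    print(t, samples)

At T = 0.2 nearly every draw is "Paris"; at T = 1.5 the other options start to appear.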
Softmax in Different Contexts
Classification
For classification, we typically use T = 1 and take argmax:
Logits → Softmax → Argmax → Predicted class
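A tiny sketch of that pipeline (the function name is illustrative); because softmax preserves order, the argmax of the probabilities is the same as the argmax of the raw logits:

import math

def predict_class(logits):
    # Softmax at T = 1, then take the index with the highest probability.
    exp_values = [math.exp(z) for z in logits]
    total = sum(exp_values)
    probs = [e / total for e in exp_values]
    return max(range(len(probs)), key=lambda i: probs[i])

print(predict_class([2.5, 1.0, -0.5]))  # 0, i.e. "cat"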
Knowledge Distillation
When training a smaller model to mimic a larger one, higher temperature reveals more information:
Teacher's softmax outputs (T = 4): [0.6, 0.25, 0.15]
This "soft" distribution contains more signal than:
Hard labels: [1, 0, 0]
The student learns not just that "cat" is right, but also that "dog" is more plausible than "bird."
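A minimal sketch of that idea (not a full training loop; function names are illustrative). The T² factor is the usual convention for keeping the soft-target loss on a scale comparable to the hard-label loss:

import math

def soften(logits, temperature):
    # Temperature-scaled softmax, applied to both teacher and student logits.
    scaled = [z / temperature for z in logits]
    exp_values = [math.exp(z) for z in scaled]
    total = sum(exp_values)
    return [e / total for e in exp_values]

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    # Cross-entropy between the teacher's soft targets and the student's
    # softened predictions, scaled by T^2.
    p_teacher = soften(teacher_logits, temperature)
    p_student = soften(student_logits, temperature)
    cross_entropy = -sum(p * math.log(q) for p, q in zip(p_teacher, p_student))
    return temperature ** 2 * cross_entropy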
Reinforcement Learning
In RL, temperature controls exploration vs. exploitation (see the sketch after this list):
- High T: Explore more actions
- Low T: Exploit known good actions
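For example, a minimal sketch of softmax (Boltzmann) action selection over estimated action values; the Q-values here are made up:

import math
import random

def boltzmann_action(q_values, temperature):
    # Higher T flattens the distribution (more exploration);
    # lower T concentrates it on the best-known action (more exploitation).
    scaled = [q / temperature for q in q_values]
    max_q = max(scaled)
    exp_values = [math.exp(q - max_q) for q in scaled]
    total = sum(exp_values)
    probs = [e / total for e in exp_values]
    return random.choices(range(len(q_values)), weights=probs, k=1)[0]

q_values = [1.2, 1.0, 0.3]  # toy action-value estimates
print(boltzmann_action(q_values, temperature=2.0))  # explores more
print(boltzmann_action(q_values, temperature=0.1))  # almost always action 0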
Numerical Stability
Raw softmax can overflow (e^1000 = ∞) or underflow (e^-1000 ≈ 0).
The Log-Sum-Exp Trick
Subtract the maximum logit before exponentiating:
import math

def stable_softmax(logits):
    # Shift by the max logit so the largest exponent is e^0 = 1.
    max_logit = max(logits)
    shifted = [z - max_logit for z in logits]
    exp_values = [math.exp(z) for z in shifted]
    total = sum(exp_values)
    return [e / total for e in exp_values]
This gives the same result but avoids numerical issues.
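For example, with the function above, stable_softmax([1000.0, 999.0, 998.0]) returns roughly [0.665, 0.245, 0.090], whereas calling math.exp(1000.0) directly raises an OverflowError.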
Softmax Alternatives
Sparsemax
Produces sparse outputs (some probabilities exactly 0), as in the sketch after this list:
- Useful when you want the model to "ignore" some options
- More interpretable
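A rough Python sketch of this idea, following the simplex-projection formulation of Martins and Astudillo (2016); the function name is illustrative:

def sparsemax(logits):
    # Euclidean projection of the logits onto the probability simplex;
    # some outputs can come out exactly 0.
    z_sorted = sorted(logits, reverse=True)
    cumulative, tau = 0.0, 0.0
    for j, z_j in enumerate(z_sorted, start=1):
        cumulative += z_j
        if 1 + j * z_j > cumulative:  # z_j is still in the support
            tau = (cumulative - 1) / j
    return [max(z - tau, 0.0) for z in logits]

print(sparsemax([2.5, 2.0, -0.5]))  # [0.75, 0.25, 0.0]; "bird" is dropped entirely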
Gumbel-Softmax
Enables backpropagation through discrete sampling (sketch after this list):
- Used in training when you need to sample but also backpropagate
- Essential for some generative models
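A minimal sketch of the Gumbel-softmax sampling step (straight-through and temperature-annealing details are omitted; names are illustrative):

import math
import random

def gumbel_softmax_sample(logits, temperature):
    # Add Gumbel noise to the logits, then apply a temperature-scaled softmax.
    # In an autodiff framework this same computation stays differentiable in the logits.
    eps = 1e-20  # avoids log(0)
    gumbel = [-math.log(-math.log(random.random() + eps)) for _ in logits]
    noisy = [(z + g) / temperature for z, g in zip(logits, gumbel)]
    max_z = max(noisy)
    exp_values = [math.exp(z - max_z) for z in noisy]
    total = sum(exp_values)
    return [e / total for e in exp_values]

print(gumbel_softmax_sample([2.5, 1.0, -0.5], temperature=0.5))  # close to one-hot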
Top-k and Top-p (Nucleus) Sampling
Instead of using all probabilities:
Top-k: Only consider the k highest probabilities
Original: [0.7, 0.15, 0.08, 0.05, 0.02]
Top-3: [0.7, 0.15, 0.08] → renormalize → [0.753, 0.161, 0.086]
Top-p (nucleus): Keep smallest set that covers probability p
Original: [0.7, 0.15, 0.08, 0.05, 0.02]
Top-p (p=0.9): [0.7, 0.15, 0.08] → renormalize
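A hedged sketch of both filters applied to an already-computed probability vector (function names are illustrative; real decoders typically filter the logits before the softmax):

def top_k_filter(probs, k):
    # Keep the k largest probabilities, zero out the rest, renormalize.
    threshold = sorted(probs, reverse=True)[k - 1]
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    # Keep the smallest set of highest-probability options whose mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [0.0] * len(probs), 0.0
    for i in order:
        kept[i] = probs[i]
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(kept)
    return [q / total for q in kept]

probs = [0.7, 0.15, 0.08, 0.05, 0.02]
print(top_k_filter(probs, 3))    # ~[0.753, 0.161, 0.086, 0.0, 0.0]
print(top_p_filter(probs, 0.9))  # the same three options survive here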
Common Temperature Values
| Application | Typical Temperature | Reasoning |
|---|---|---|
| Classification | 1.0 | Standard training |
| Code generation | 0.0 - 0.2 | Precision matters |
| Conversation | 0.7 - 0.9 | Balance of coherence and variety |
| Creative writing | 1.0 - 1.5 | More diversity |
| Brainstorming | 1.5 - 2.0 | Maximum creativity |
| Knowledge distillation | 3.0 - 20.0 | Reveal teacher's uncertainty |
Summary
- Softmax converts arbitrary numbers into a probability distribution
- Temperature controls how peaked or flat the distribution is
- Low temperature → more deterministic, focused outputs
- High temperature → more random, diverse outputs
- Language models expose temperature as a key generation parameter
- The log-sum-exp trick ensures numerical stability
- Alternatives like top-k and top-p provide additional control
This completes Module 3! Next, we'll explore expected value and variance: how we measure the "average" and "spread" of probability distributions.