The Sigmoid Function: Squashing Outputs for Classification

Master the sigmoid function — how it works, its mathematical properties, its role in logistic regression and neural networks, and why it’s fundamental to classification.

The sigmoid function σ(z) = 1 / (1 + e^(−z)) is a mathematical function that maps any real number to a value between 0 and 1, making it ideal for converting raw model scores into probabilities. Its S-shaped curve smoothly transitions from near-0 for very negative inputs to near-1 for very positive inputs, crossing exactly 0.5 at z=0. In machine learning, sigmoid is used as the output activation in binary classification (logistic regression and neural networks), and its derivative σ'(z) = σ(z)(1 − σ(z)) makes backpropagation computationally elegant. Understanding sigmoid deeply means understanding why logistic regression produces valid probabilities and how neural network classification works.

Introduction: The Function That Turns Scores Into Probabilities

Imagine a judge scoring contestants in a competition. The raw scores might range from −50 to +200, but you need to turn them into probabilities of winning — numbers between 0 and 1. Simply dividing by 200 doesn’t work if scores go negative. Clipping to [0,1] creates discontinuities. What you need is a smooth, principled transformation that maps the entire real number line to the interval (0, 1).

That transformation is the sigmoid function. Given any real number — positive, negative, large, small — sigmoid returns a number strictly between 0 and 1. The function increases monotonically from near-0 to near-1, creating the characteristic S-shaped curve that gives it its name (sigma is the Greek letter S). At the input value of exactly 0, it returns exactly 0.5 — perfect uncertainty between two classes.

Sigmoid is everywhere in machine learning. It is the output activation of logistic regression, the gate mechanism in LSTM cells, an optional activation in neural network hidden layers, the foundation of the binary cross-entropy loss function, and the function whose properties you must understand to reason about vanishing gradients. It connects probability theory, information theory, and neural computation in a single elegant formula.

This comprehensive guide explores the sigmoid function from every angle. You’ll learn the formula and its derivation, every important mathematical property, geometric intuition, the derivative and why it matters for learning, numerical stability considerations, comparison with related functions (tanh, softmax, ReLU), real-world applications, and complete Python implementations with rich visualizations.

The Formula and Its Components

The Definition

Plaintext
σ(z) = 1 / (1 + e^(−z))

Where:
  z = any real number (−∞ to +∞)
  e = Euler's number ≈ 2.71828
  σ = lowercase sigma (Greek letter S)
  σ(z) ∈ (0, 1) — strictly between 0 and 1

Breaking Down the Formula

Each part of the formula has a role:

The exponential e^(−z):

Plaintext
z = −6: e^(−(−6)) = e^6  ≈ 403.4   (large positive)
z =  0: e^(−0)   = e^0  =   1.0   (exactly 1)
z = +6: e^(−6)          ≈   0.002 (near zero)

As z → −∞: e^(−z) → +∞
As z → +∞: e^(−z) → 0

Adding 1: (1 + e^(−z)):

Plaintext
z = −6: 1 + 403.4 = 404.4   (large)
z =  0: 1 + 1.0   = 2.0     (exactly 2)
z = +6: 1 + 0.002 = 1.002   (just above 1)

Taking the reciprocal: 1 / (1 + e^(−z)):

Plaintext
z = −6: 1 / 404.4  ≈ 0.0025  (near 0)
z =  0: 1 / 2.0    = 0.5000  (exactly 0.5)
z = +6: 1 / 1.002  ≈ 0.9975  (near 1)

Result always in (0, 1) — a valid probability!
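
The same three steps can be traced directly in NumPy. This is just a sketch of the arithmetic above, reusing z = −6, 0, +6:

Python
import numpy as np

for z in (-6.0, 0.0, 6.0):
    exp_term    = np.exp(-z)          # Step 1: e^(−z)
    denominator = 1.0 + exp_term      # Step 2: 1 + e^(−z)
    sigma       = 1.0 / denominator   # Step 3: reciprocal, a value in (0, 1)
    print(f"z={z:+.0f}: e^(−z)={exp_term:8.3f}  1+e^(−z)={denominator:8.3f}  σ(z)={sigma:.4f}")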

Why It’s Called “Sigmoid”

Sigma (σ) is the 18th letter of the Greek alphabet and the Greek counterpart of the Latin letter S. The function’s graph is S-shaped, giving it the name. The logistic function (another name for sigmoid) was first described by Pierre-François Verhulst in 1838 to model population growth — populations grow slowly, then rapidly, then level off as resources run out. This S-curve appears throughout nature and mathematics.

Mathematical Properties

Property 1: Range Is Strictly (0, 1)

Plaintext
For all z ∈ (−∞, +∞):
  0 < σ(z) < 1

σ(z) never equals exactly 0 or 1:
  As z → −∞: σ(z) → 0  (approaches but never reaches)
  As z → +∞: σ(z) → 1  (approaches but never reaches)

This makes σ(z) a valid probability for any finite input.

Property 2: σ(0) = 0.5 — Perfect Uncertainty

Plaintext
σ(0) = 1 / (1 + e^0) = 1 / (1 + 1) = 1/2 = 0.5

Interpretation:
  When z = 0, the model has no preference between classes.
  P(class 1) = P(class 0) = 0.5 — maximum uncertainty.

The decision boundary in logistic regression is exactly z = 0,
i.e., where σ(z) = 0.5.

Property 3: Symmetry Around (0, 0.5)

Plaintext
1 − σ(z) = σ(−z)

Proof:
  1 − σ(z) = 1 − 1/(1+e^(−z))
            = e^(−z) / (1+e^(−z))
            = 1 / (1+e^z)
            = σ(−z)  ✓

Implication:
  σ(3) = 1 − σ(−3)
  σ(5) = 1 − σ(−5)
  "Confidence of class 1 at z" equals "Confidence of class 0 at −z"

  P(class 1 | z) = P(class 0 | −z)
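
A quick numerical spot-check of the identity (a minimal NumPy sketch):

Python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.5, 3.0])
print(np.allclose(1 - sigmoid(z), sigmoid(-z)))   # True: 1 − σ(z) = σ(−z)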

Property 4: Monotonically Increasing

Plaintext
If z₁ < z₂, then σ(z₁) < σ(z₂)

Consequence:
  Larger z → higher probability
  The ranking of examples by z equals
  the ranking by σ(z) — crucial for ROC-AUC calculations

  ROC-AUC computed on z (raw scores) equals
  ROC-AUC computed on σ(z) (probabilities)
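
Because sigmoid is strictly increasing, any purely rank-based metric is unchanged by applying it. A small sketch of this, assuming scikit-learn is available (the scores and labels below are synthetic, for illustration only):

Python
import numpy as np
from sklearn.metrics import roc_auc_score

rng   = np.random.default_rng(0)
z     = rng.normal(size=200)                                   # raw scores
y     = (z + rng.normal(scale=2.0, size=200) > 0).astype(int)  # noisy labels
probs = 1 / (1 + np.exp(-z))                                   # σ(z)

print(roc_auc_score(y, z))       # AUC on raw scores
print(roc_auc_score(y, probs))   # identical AUC on probabilities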

Property 5: The Elegant Derivative

The sigmoid derivative is one of the most beautiful in mathematics:

Plaintext
dσ/dz = σ(z) × (1 − σ(z))

Proof:
  σ(z)   = (1 + e^(−z))^(−1)

  dσ/dz  = −(1 + e^(−z))^(−2) × (−e^(−z))
          = e^(−z) / (1 + e^(−z))²
          = [1/(1+e^(−z))] × [e^(−z)/(1+e^(−z))]
          = σ(z) × [1 − 1/(1+e^(−z))]
          = σ(z) × (1 − σ(z))  ✓

Beauty: The derivative is expressed entirely in terms of σ(z) itself.
Once you've computed σ(z) during the forward pass,
the derivative is free — just multiply σ(z) × (1 − σ(z)).

Derivative values:

Plaintext
z = −6: σ'(−6) = 0.0025 × 0.9975 ≈ 0.0025  (nearly flat)
z = −3: σ'(−3) = 0.0474 × 0.9526 ≈ 0.045   (shallow)
z =  0: σ'(0)  = 0.5    × 0.5    = 0.25    (maximum slope)
z = +3: σ'(3)  = 0.9526 × 0.0474 ≈ 0.045   (shallow)
z = +6: σ'(6)  = 0.9975 × 0.0025 ≈ 0.0025  (nearly flat)

Maximum derivative at z=0: σ'(0) = 0.25
The sigmoid is steepest at its midpoint.
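
The closed form is easy to sanity-check against a centered finite difference; a minimal sketch:

Python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-6, 6, 25)
h = 1e-5
numeric  = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # centered difference
analytic = sigmoid(z) * (1 - sigmoid(z))                 # σ(z)(1 − σ(z))
print(np.max(np.abs(numeric - analytic)))                # tiny: the two agree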

Property 6: Saturation in Tails

Plaintext
For |z| >> 0, σ(z) is pinned near 0 or 1 and barely changes:
  σ(3)  = 0.9526    (0.047 from 1)
  σ(6)  = 0.9975    (0.0025 from 1)
  σ(10) = 0.9999546 (0.00005 from 1)

In saturated regions, σ'(z) ≈ 0.
Gradients vanish → backpropagation barely updates the earlier layers.

This is the vanishing gradient problem for deep networks,
and the reason ReLU replaced sigmoid in hidden layers.
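
To see how quickly this compounds with depth, multiply the per-layer bound through a stack of sigmoid activations. A rough back-of-the-envelope sketch that ignores the weights and uses only the fact that σ'(z) ≤ 0.25:

Python
# Each sigmoid layer scales the backpropagated gradient by at most 0.25.
for n_layers in (1, 5, 10, 20):
    print(f"{n_layers:2d} layers: gradient factor ≤ {0.25 ** n_layers:.2e}")
# 20 layers: at most about 9e-13; the earliest layers receive essentially no signal.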

Complete Python Implementation and Visualization

Python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

# ── Core sigmoid functions ─────────────────────────────────────

def sigmoid(z):
    """Standard sigmoid — may overflow for large negative z."""
    return 1 / (1 + np.exp(-z))

def sigmoid_stable(z):
    """
    Numerically stable sigmoid using two-branch computation.
    The selected branch never overflows: exp(-z) is used only where z >= 0,
    and exp(z) only where z < 0.

    Note: np.where still evaluates both branches on the full array, so a
    harmless overflow warning can appear for extreme inputs even though the
    returned values are correct. scipy.special.expit avoids this entirely.
    """
    return np.where(
        z >= 0,
        1 / (1 + np.exp(-z)),          # Safe for z >= 0
        np.exp(z) / (1 + np.exp(z))    # Safe for z < 0
    )

def sigmoid_derivative(z):
    """dσ/dz = σ(z)(1 − σ(z))."""
    s = sigmoid_stable(z)
    return s * (1 - s)

def sigmoid_inverse(p):
    """
    Logit function — inverse of sigmoid.
    Maps probability p ∈ (0,1) back to real number z.
    logit(p) = log(p / (1−p))
    """
    p = np.clip(p, 1e-15, 1 - 1e-15)
    return np.log(p / (1 - p))


# ── Key values table ───────────────────────────────────────────
z_vals = np.array([-6, -4, -2, -1, 0, 1, 2, 4, 6])

print("┌───────┬──────────┬──────────┐")
print("│   z   │  σ(z)    │  σ'(z)   │")
print("├───────┼──────────┼──────────┤")
for z in z_vals:
    s  = sigmoid_stable(z)
    ds = sigmoid_derivative(z)
    print(f"│ {z:+5.1f}{s:.6f}{ds:.6f} │")
print("└───────┴──────────┴──────────┘")

Output:

Plaintext
┌───────┬──────────┬──────────┐
│   z   │  σ(z)    │  σ'(z)   │
├───────┼──────────┼──────────┤
│  -6.0 │  0.002473│  0.002467│
│  -4.0 │  0.017986│  0.017663│
│  -2.0 │  0.119203│  0.104994│
│  -1.0 │  0.268941│  0.196612│
│  +0.0 │  0.500000│  0.250000│
│  +1.0 │  0.731059│  0.196612│
│  +2.0 │  0.880797│  0.104994│
│  +4.0 │  0.982014│  0.017663│
│  +6.0 │  0.997527│  0.002467│
└───────┴──────────┴──────────┘

Comprehensive Visualization

Python
z = np.linspace(-8, 8, 500)

fig = plt.figure(figsize=(16, 10))
gs  = GridSpec(2, 3, figure=fig, hspace=0.4, wspace=0.35)

# ── Panel 1: Sigmoid curve ────────────────────────────────────
ax1 = fig.add_subplot(gs[0, 0])
ax1.plot(z, sigmoid_stable(z), 'steelblue', linewidth=2.5)
ax1.axhline(0.5, color='gray', linestyle='--', alpha=0.7,
            linewidth=1, label='σ(0) = 0.5')
ax1.axhline(0.0, color='lightgray', linestyle='-', alpha=0.5, linewidth=0.8)
ax1.axhline(1.0, color='lightgray', linestyle='-', alpha=0.5, linewidth=0.8)
ax1.axvline(0.0, color='gray', linestyle='--', alpha=0.7, linewidth=1)
ax1.scatter([0], [0.5], color='red', s=60, zorder=5)
ax1.set_title('σ(z) = 1/(1+e⁻ᶻ)', fontsize=11)
ax1.set_xlabel('z')
ax1.set_ylabel('σ(z)')
ax1.set_ylim(-0.05, 1.05)
ax1.legend(fontsize=9)
ax1.grid(True, alpha=0.3)
ax1.annotate('σ(0)=0.5\n(decision\nboundary)',
             xy=(0, 0.5), xytext=(2, 0.3),
             arrowprops=dict(arrowstyle='->', color='red'),
             fontsize=8, color='red')

# ── Panel 2: Derivative ────────────────────────────────────────
ax2 = fig.add_subplot(gs[0, 1])
ax2.plot(z, sigmoid_derivative(z), 'coral', linewidth=2.5,
         label="σ'(z) = σ(z)(1−σ(z))")
ax2.scatter([0], [0.25], color='red', s=60, zorder=5,
            label='Max at z=0: σ\'(0)=0.25')
ax2.set_title("Sigmoid Derivative σ'(z)", fontsize=11)
ax2.set_xlabel('z')
ax2.set_ylabel("σ'(z)")
ax2.legend(fontsize=9)
ax2.grid(True, alpha=0.3)
ax2.fill_between(z, sigmoid_derivative(z), alpha=0.2, color='coral')

# ── Panel 3: Symmetry property ────────────────────────────────
ax3 = fig.add_subplot(gs[0, 2])
s_pos = sigmoid_stable(z[z >= 0])
s_neg = 1 - sigmoid_stable(-z[z >= 0])
ax3.plot(z[z >= 0], s_pos, 'steelblue', linewidth=2.5, label='σ(z)')
ax3.plot(z[z >= 0], s_neg, 'coral',     linewidth=2.5,
         linestyle='--', label='1−σ(−z)')
ax3.set_title('Symmetry: σ(z) = 1 − σ(−z)', fontsize=11)
ax3.set_xlabel('z ≥ 0')
ax3.set_ylabel('Value')
ax3.legend(fontsize=9)
ax3.grid(True, alpha=0.3)

# ── Panel 4: Saturation zones ─────────────────────────────────
ax4 = fig.add_subplot(gs[1, 0])
s    = sigmoid_stable(z)
ds   = sigmoid_derivative(z)
ax4.plot(z, s,  'steelblue', linewidth=2,   label='σ(z)')
ax4.plot(z, ds, 'coral',     linewidth=1.5, label="σ'(z)")
ax4.fill_between(z, ds, alpha=0.3, color='coral')
ax4.fill_between(z[z < -4], s[z < -4], alpha=0.15, color='gray',
                 label='Saturation zones\n(vanishing gradient)')
ax4.fill_between(z[z >  4], s[z >  4], alpha=0.15, color='gray')
ax4.axvspan(-8, -4, alpha=0.07, color='red')
ax4.axvspan(4,   8, alpha=0.07, color='red')
ax4.set_title('Saturation Zones\n(gradient ≈ 0)', fontsize=11)
ax4.set_xlabel('z')
ax4.legend(fontsize=8)
ax4.grid(True, alpha=0.3)

# ── Panel 5: Logit (inverse sigmoid) ─────────────────────────
ax5 = fig.add_subplot(gs[1, 1])
p   = np.linspace(0.001, 0.999, 500)
ax5.plot(p, sigmoid_inverse(p), 'seagreen', linewidth=2.5)
ax5.axhline(0, color='gray', linestyle='--', alpha=0.7)
ax5.axvline(0.5, color='gray', linestyle='--', alpha=0.7,
            label='logit(0.5) = 0')
ax5.scatter([0.5], [0.0], color='red', s=60, zorder=5)
ax5.set_title('Logit(p) = σ⁻¹(p) = log(p/(1−p))', fontsize=11)
ax5.set_xlabel('Probability p')
ax5.set_ylabel('z = logit(p)')
ax5.set_ylim(-6, 6)
ax5.legend(fontsize=9)
ax5.grid(True, alpha=0.3)

# ── Panel 6: Comparing activation functions ───────────────────
ax6 = fig.add_subplot(gs[1, 2])
relu  = np.maximum(0, z)
tanh  = np.tanh(z)

ax6.plot(z, sigmoid_stable(z), 'steelblue', lw=2, label='Sigmoid σ(z)')
ax6.plot(z, tanh,              'coral',     lw=2, label='Tanh(z)')
ax6.plot(z, relu/6,            'seagreen',  lw=2, label='ReLU(z)/6')
ax6.axhline(0, color='gray', linewidth=0.8)
ax6.set_title('Sigmoid vs. Tanh vs. ReLU\n(ReLU scaled for visibility)',
              fontsize=10)
ax6.set_xlabel('z')
ax6.set_ylabel('Activation')
ax6.legend(fontsize=9)
ax6.grid(True, alpha=0.3)

plt.suptitle('The Sigmoid Function: Complete Visual Reference',
             fontsize=14, fontweight='bold')
plt.show()

The Logit Function: Sigmoid’s Inverse

Definition

Plaintext
logit(p) = σ⁻¹(p) = log(p / (1−p))

The logit maps probabilities back to real numbers:
  p = 0.01 → logit = log(0.01/0.99) = −4.60
  p = 0.25 → logit = log(0.25/0.75) = −1.10
  p = 0.50 → logit = log(0.50/0.50) = 0.00
  p = 0.75 → logit = log(0.75/0.25) = +1.10
  p = 0.99 → logit = log(0.99/0.01) = +4.60

Log-Odds Interpretation

Plaintext
logit(p) = log(p / (1−p)) = log(odds)

Odds: p/(1−p) = "How many times more likely is class 1 vs. class 0?"
  p = 0.75: odds = 3:1 ("three times more likely")
  p = 0.50: odds = 1:1 ("equally likely")

In logistic regression:
  logit(P(y=1|x)) = w₁x₁ + w₂x₂ + ... + b

  Interpretation of w₁:
  "Each unit increase in x₁ increases the log-odds by w₁"
  "Each unit increase multiplies the odds by e^(w₁)"

  Example: w₁ = 0.5 for age in churn prediction
  Each additional year of age multiplies churn odds by e^0.5 ≈ 1.65
  → 65% higher odds per year of age
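
The arithmetic behind that (hypothetical) churn coefficient takes only a few lines:

Python
import numpy as np

w_age = 0.5                          # hypothetical coefficient for age
odds_multiplier = np.exp(w_age)      # e^0.5 ≈ 1.65
print(f"Each extra year multiplies the odds by {odds_multiplier:.2f} "
      f"(about {(odds_multiplier - 1) * 100:.0f}% higher odds)")

# Round trip: probability → log-odds → probability
p = 0.75
z = np.log(p / (1 - p))              # logit(0.75) ≈ +1.10
print(z, 1 / (1 + np.exp(-z)))       # 1.0986..., 0.75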

Sigmoid in Neural Networks

As Output Layer Activation (Binary Classification)

Python
import torch
import torch.nn as nn

class BinaryClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),                    # Hidden: ReLU (not sigmoid!)
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Linear(hidden_dim // 2, 1),
            nn.Sigmoid()                  # Output: Sigmoid for probability
        )

    def forward(self, x):
        return self.network(x).squeeze()

# BCELoss = Binary Cross-Entropy Loss
# Used with sigmoid output
criterion = nn.BCELoss()

# Alternative: BCEWithLogitsLoss (more numerically stable)
# Combines sigmoid + BCE in one numerically stable operation
criterion_stable = nn.BCEWithLogitsLoss()
# → Use this in practice: skip Sigmoid layer, use raw logits

Why Modern Networks Use BCEWithLogitsLoss

Python
# STANDARD (less stable):
output = sigmoid(raw_logit)        # σ(z)
loss   = -[y*log(output) + (1-y)*log(1-output)]
# Problem: log(sigmoid(z)) can underflow for large z

# BETTER (numerically stable):
# BCEWithLogitsLoss combines both in one operation:
# loss = max(z, 0) - z*y + log(1 + exp(-|z|))
# Avoids overflow/underflow automatically

# PyTorch example (input_dim and hidden_dim assumed defined as above):
model_logits = nn.Sequential(
    nn.Linear(input_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, 1)
    # NO sigmoid here — output raw logits
)
criterion = nn.BCEWithLogitsLoss()
logits = model_logits(x).squeeze()   # x: feature batch → raw scores (logits)
loss = criterion(logits, y_float)    # y_float: targets in {0., 1.}; sigmoid applied internally, stably

In LSTM Gates

Sigmoid appears as a gating mechanism in Long Short-Term Memory networks:

Plaintext
LSTM gates use sigmoid for binary-like decisions:
  Forget gate: f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
    → Values near 0: "forget this information"
    → Values near 1: "keep this information"

  Input gate: i_t = σ(W_i·[h_{t-1}, x_t] + b_i)
    → Controls how much new information to write

  Output gate: o_t = σ(W_o·[h_{t-1}, x_t] + b_o)
    → Controls what to expose from memory cell

Sigmoid is perfect for gates: output in (0,1) = how open/closed
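
A minimal NumPy sketch of a single forget gate, with made-up dimensions (hidden size 4, input size 3), just to show the shapes and the (0, 1) gating values; this is not a full LSTM:

Python
import numpy as np

rng = np.random.default_rng(1)
hidden, inputs = 4, 3                          # hypothetical sizes
W_f = rng.normal(size=(hidden, hidden + inputs))
b_f = np.zeros(hidden)

h_prev = rng.normal(size=hidden)               # previous hidden state h_{t-1}
x_t    = rng.normal(size=inputs)               # current input x_t
c_prev = rng.normal(size=hidden)               # previous cell state c_{t-1}

z_f = W_f @ np.concatenate([h_prev, x_t]) + b_f
f_t = 1 / (1 + np.exp(-z_f))                   # forget gate, each entry in (0, 1)
print(f_t)                                     # near 0 → forget, near 1 → keep
print(f_t * c_prev)                            # gate applied elementwise to the cell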

Numerical Stability: A Critical Practical Concern

The Overflow Problem

Python
# Naive sigmoid can fail:
z_large_neg = -1000.0
naive = 1 / (1 + np.exp(-z_large_neg))  # np.exp(1000) → overflow!
# RuntimeWarning: overflow encountered in exp

# Fix 1: IEEE arithmetic actually rescues this particular case
np.exp(1000)      # Returns inf (with an overflow RuntimeWarning)
1 / (1 + np.inf)  # Returns 0.0, the mathematically correct sigmoid value
# The result is right only because inf propagates sensibly through the
# division; the warning still signals that a proper fix is needed.

# Fix 2: Use scipy.special.expit (production-ready sigmoid)
from scipy.special import expit
safe_sigmoid = expit(np.array([-1000, -6, 0, 6, 1000]))
print(safe_sigmoid)
# [0.  0.00247 0.5  0.99753 1.]

# Fix 3: Two-branch implementation (our sigmoid_stable above)
def sigmoid_stable(z):
    return np.where(
        z >= 0,
        1 / (1 + np.exp(-z)),
        np.exp(z) / (1 + np.exp(z))
    )

The Log-of-Sigmoid Problem

Computing log(σ(z)) directly can underflow:

Python
def log_sigmoid(z):
    """
    Numerically stable log(σ(z)).
    Uses log-sum-exp trick.

    Direct: log(1/(1+e^(−z))) = −log(1+e^(−z))
    For very negative z, e^(−z) overflows before the log can be taken.

    Stable version: −log(1+e^(−z)) = −softplus(−z)
    """
    return -np.logaddexp(0, -z)   # log(1/(1+e^{-z})) stably

def log_one_minus_sigmoid(z):
    """Numerically stable log(1 − σ(z)) = log(σ(−z))."""
    return -np.logaddexp(0, z)

# Test
z_test = np.array([-100, -10, 0, 10, 100])
print("log(σ(z)) comparison:")
for z_val in z_test:
    direct   = np.log(sigmoid_stable(z_val) + 1e-300)
    stable   = log_sigmoid(z_val)
    print(f"  z={z_val:5.0f}: direct={direct:.4f}  stable={stable:.4f}")

Binary Cross-Entropy: Stable Implementation

Python
def binary_cross_entropy_stable(y_true, z):
    """
    Numerically stable BCE from raw logits z (before sigmoid).

    Standard: −[y·log(σ(z)) + (1−y)·log(1−σ(z))]
    Stable:   max(z,0) − z·y + log(1 + e^(−|z|))

    This avoids computing σ(z) explicitly,
    preventing overflow in exp.
    """
    return np.maximum(z, 0) - z * y_true + np.log1p(np.exp(-np.abs(z)))

# Verify equivalence
z_sample = np.array([2.0, -1.5, 0.5, -3.0])
y_sample = np.array([1.0,  0.0, 1.0,  0.0])

bce_naive  = -( y_sample * np.log(sigmoid_stable(z_sample) + 1e-15)
               + (1-y_sample) * np.log(1 - sigmoid_stable(z_sample) + 1e-15))
bce_stable = binary_cross_entropy_stable(y_sample, z_sample)

print("BCE comparison (should be identical):")
print(f"  Naive:  {bce_naive.round(6)}")
print(f"  Stable: {bce_stable.round(6)}")

Sigmoid vs. Related Functions

Sigmoid vs. Tanh

Plaintext
Sigmoid:  σ(z) = 1/(1+e^(−z)),     range (0,1),  centered at 0.5
Tanh:   tanh(z) = (e^z−e^(−z))/(e^z+e^(−z)), range (−1,1), centered at 0

Relationship: tanh(z) = 2σ(2z) − 1

Properties compared:
  Sigmoid:  0 to 1         (useful for probabilities)
  Tanh:    −1 to 1         (zero-centered — better for hidden layers)

  Sigmoid: not zero-centered → can cause zig-zagging gradient updates
  Tanh:    zero-centered   → more stable gradient updates

  Both saturate in tails → vanishing gradient problem
  Both superseded by ReLU in hidden layers of deep networks
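
The relationship tanh(z) = 2σ(2z) − 1 is easy to verify numerically (a quick sketch):

Python
import numpy as np

z = np.linspace(-6, 6, 100)
sigma_2z = 1 / (1 + np.exp(-2 * z))               # σ(2z)
print(np.allclose(np.tanh(z), 2 * sigma_2z - 1))  # True
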
Python
z = np.linspace(-6, 6, 400)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Functions
ax1.plot(z, sigmoid_stable(z), 'steelblue', lw=2.5, label='Sigmoid (0,1)')
ax1.plot(z, np.tanh(z),        'coral',     lw=2.5, label='Tanh (−1,1)')
ax1.axhline(0, color='gray', linewidth=0.8, linestyle='-')
ax1.set_title('Sigmoid vs. Tanh: Output Range')
ax1.set_xlabel('z')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Derivatives
ax2.plot(z, sigmoid_derivative(z),       'steelblue', lw=2.5,
         label="σ'(z) — max 0.25")
ax2.plot(z, 1 - np.tanh(z)**2,           'coral',     lw=2.5,
         label="tanh'(z) — max 1.0")
ax2.set_title('Sigmoid vs. Tanh: Derivatives')
ax2.set_xlabel('z')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Sigmoid vs. Softmax

Plaintext
Sigmoid: For binary classification (2 classes)
  Output: Single probability P(class 1)
  P(class 0) = 1 − σ(z) automatically

Softmax: For multi-class classification (K > 2 classes)
  softmax(z)_k = exp(z_k) / Σⱼ exp(z_j)
  Outputs K probabilities that sum to 1

Relationship:
  Softmax with K=2 reduces to sigmoid:
  softmax([z, 0])₁ = exp(z)/(exp(z)+1) = σ(z)

Use sigmoid:  Binary output (single neuron)
Use softmax:  Multi-class output (K neurons)
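
The K = 2 reduction can be checked numerically; a short sketch with an arbitrary score:

Python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))            # shift for numerical stability
    return e / e.sum()

z = 1.7                                   # arbitrary raw score
print(softmax(np.array([z, 0.0]))[0])     # ≈ 0.8455
print(1 / (1 + np.exp(-z)))               # same value: σ(z)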

Sigmoid vs. ReLU (for Hidden Layers)

Plaintext
Sigmoid:
  Range: (0, 1)
  Saturates: Yes — vanishing gradient problem
  Not zero-centered: Causes zig-zag gradient updates
  Computationally: exp() is expensive
  Used in: Output layers (binary), LSTM gates

ReLU:
  Range: [0, ∞)
  Saturates: Only for z < 0
  Not zero-centered: But sparse, helps computation
  Computationally: Very cheap (max(0,z))
  Used in: Hidden layers of almost all modern networks

Verdict: ReLU for hidden layers, Sigmoid for binary output layers

The Sigmoid in Probability Theory

Connection to the Bernoulli Distribution

Plaintext
The Bernoulli distribution models binary outcomes:
  P(y=1) = p
  P(y=0) = 1−p

Logistic regression models:
  p = P(y=1|x) = σ(wᵀx + b)

This means:
  y | x ~ Bernoulli(σ(wᵀx + b))

In the generalized linear model (GLM) framework, the logit (sigmoid's inverse)
is the canonical link function for the Bernoulli distribution;
the sigmoid is the corresponding inverse-link (mean) function.
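
A small simulation of this generative story, y | x ~ Bernoulli(σ(wᵀx + b)), with made-up parameters:

Python
import numpy as np

rng  = np.random.default_rng(42)
w, b = np.array([1.5, -2.0]), 0.3          # hypothetical parameters
X    = rng.normal(size=(1000, 2))          # features

p = 1 / (1 + np.exp(-(X @ w + b)))         # σ(wᵀx + b) for each example
y = rng.binomial(1, p)                     # y | x ~ Bernoulli(p)

print(y.mean(), p.mean())                  # empirical positive rate ≈ mean probability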

Connection to Information Theory

Plaintext
The binary cross-entropy loss:
  H(y, ŷ) = −[y·log(ŷ) + (1−y)·log(1−ŷ)]

Is the cross-entropy between the true distribution (y)
and the predicted distribution (ŷ = σ(z)).

Minimizing cross-entropy = Maximum likelihood estimation
of the Bernoulli distribution parameters.

This gives logistic regression its statistical foundation:
  Gradient descent on BCE = MLE for Bernoulli model
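
One consequence of this pairing: the gradient of BCE with respect to the raw logit collapses to σ(z) − y, which is part of why sigmoid plus cross-entropy trains so cleanly. A finite-difference sketch of that identity:

Python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bce(y, z):
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y, h = 0.8, 1.0, 1e-6
numeric  = (bce(y, z + h) - bce(y, z - h)) / (2 * h)
analytic = sigmoid(z) - y                  # dBCE/dz = σ(z) − y
print(numeric, analytic)                   # both ≈ −0.31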

Practical Reference: When to Use Sigmoid

Use Sigmoid When:

Plaintext
✓ Binary classification output layer
  (Predict probability of one of two classes)

✓ LSTM/GRU gates
  (Control information flow: 0=closed, 1=open)

✓ Output layer for multi-label classification
  (Multiple independent binary decisions simultaneously)
  e.g., "Is image: [cat? dog? bird?]" — each independent

✓ When output must be a valid probability in (0,1)

✓ Attention mechanisms (sometimes)
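
For the multi-label case flagged above, a minimal PyTorch sketch (the label count and dimensions are hypothetical); BCEWithLogitsLoss applies an independent sigmoid to each label:

Python
import torch
import torch.nn as nn

n_features, n_labels = 16, 3                     # hypothetical: [cat, dog, bird]
model     = nn.Linear(n_features, n_labels)      # one logit per independent label
criterion = nn.BCEWithLogitsLoss()               # per-label sigmoid applied internally

x = torch.randn(8, n_features)                   # batch of 8 examples
y = torch.randint(0, 2, (8, n_labels)).float()   # multi-hot targets

loss  = criterion(model(x), y)
probs = torch.sigmoid(model(x))                  # per-label probabilities in (0, 1)
print(loss.item(), probs.shape)                  # scalar loss, torch.Size([8, 3])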

Avoid Sigmoid When:

Plaintext
✗ Hidden layers of deep networks
  → Use ReLU, Leaky ReLU, or GELU instead
  → Sigmoid's vanishing gradient will slow/stop training

✗ Multi-class output (mutually exclusive classes)
  → Use Softmax instead (probabilities must sum to 1)

✗ When computing log(σ(z)) directly
  → Use log-sigmoid stable implementation

✗ When z values are very large/small
  → Ensure numerical stability implementation

Comparison Table: Sigmoid vs. Related Activation Functions

Property           | Sigmoid          | Tanh                      | ReLU              | Softmax
-------------------|------------------|---------------------------|-------------------|-------------------
Formula            | 1/(1+e^(−z))     | (e^z−e^(−z))/(e^z+e^(−z)) | max(0, z)         | e^(zₖ)/Σⱼ e^(zⱼ)
Output range       | (0, 1)           | (−1, 1)                   | [0, ∞)            | (0, 1), sums to 1
Zero-centered      | No               | Yes                       | No                | N/A
Saturates          | Yes (both tails) | Yes (both tails)          | Yes (z < 0 only)  | No
Max derivative     | 0.25 at z = 0    | 1.0 at z = 0              | 1 (for z > 0)     | Varies
Vanishing gradient | Severe           | Moderate                  | Mild (dead ReLU)  | No
Compute cost       | Moderate (exp)   | Moderate (exp)            | Very cheap        | Moderate
Use: hidden layers | Avoid            | Avoid                     | Default           | No
Use: binary output | Standard         | No                        | No                | No
Use: multi-class   | No               | No                        | No                | Standard
Use: LSTM gates    | Standard         | Standard                  | No                | No

Conclusion: Simple Formula, Profound Impact

The sigmoid function is one of the most important functions in all of machine learning. A single formula — 1/(1+e^(−z)) — performs a transformation so useful that it appears at the heart of logistic regression, in every binary classification neural network output layer, in every LSTM gate, and in the foundation of the binary cross-entropy loss.

Its mathematical elegance is real, not incidental:

The range (0,1) makes it a natural probability interpreter — any linear score, no matter how large or small, becomes a valid probability with no additional constraints.

The symmetry σ(z) = 1−σ(−z) ensures that the model treats both classes consistently — the probability of class 1 at score z equals the probability of class 0 at score −z.

The derivative σ'(z) = σ(z)(1−σ(z)) is computationally free once σ(z) is computed in the forward pass, making backpropagation through sigmoid neurons efficient.

The logit inverse connects logistic regression to log-odds, giving coefficients an interpretable meaning in terms of how features affect the relative likelihood of outcomes.

The vanishing gradient in the tails is the function’s primary weakness — the reason ReLU replaced it in hidden layers. Understanding this limitation is just as important as understanding the function’s strengths.

Wherever in machine learning a number needs to become a probability, wherever a gate needs to open or close smoothly, wherever a binary decision needs a continuous differentiable foundation — sigmoid is the answer. That is why, despite being one of the oldest activation functions in the field, it remains irreplaceable at the output of every binary classifier built today.
