Deep Learning Fundamentals · A Complete Reference

Activation Functions:
The Decision Makers
of Neural Networks

A detailed, function-by-function breakdown of every major activation function — with real-world analogies, mathematical derivations, and concrete worked examples.

Every neuron in a deep neural network performs two operations: a weighted sum of its inputs, and then a transformation of that sum. That second step — the transformation — is the job of the activation function. It sounds simple. But the choice of activation function is one of the most consequential decisions in the architecture of a neural network.

Without activation functions, a neural network — no matter how many layers it has — is nothing more than a linear regression model. Every layer would simply multiply its input by a matrix, and the composition of a thousand linear transformations is still just one linear transformation. The non-linearity introduced by activation functions is what allows networks to approximate any function, learn curved decision boundaries, and ultimately recognize faces, translate languages, and generate text.

This article covers seven activation functions in depth: where they come from, why they work, where they fail, and how to choose between them. Each is explained with a real-world analogy, a worked numerical example, and the mathematical derivation of its gradient — the quantity that makes learning possible.

01
The Original · Binary Threshold

The Step Function

The light switch that started it all.

The step function — also called the Heaviside function — is the oldest activation function in the history of neural networks. It was introduced as part of the McCulloch-Pitts neuron model in 1943, inspired by the biological neuron's behavior: either it fires or it doesn't. This binary, all-or-nothing model was the first attempt to mathematically capture how real neurons work.

Mathematical Definition
f(x) = 1 if x ≥ 0
f(x) = 0 if x < 0
f'(x) = 0 everywhere (undefined at x = 0)
Real-World Analogy

Think of a light switch. You flip it up — the light is fully on. You flip it down — the light is fully off. There is no "dimming." There is no degree of onness. The switch doesn't know how far above the threshold you pushed it; it only knows that you did. This is precisely what the step function does: it transforms any weighted sum above zero into a full "1" output and anything below into a flat "0".

How It Works

Imagine a neuron that receives three inputs: x₁ = 0.8 (how bright is the light), x₂ = 0.3 (how close is the object), x₃ = -0.5 (is there noise?). The weights are w₁ = 0.6, w₂ = 0.4, w₃ = 0.2, and the bias b = -0.3. The weighted sum z is:

Worked Example · Step Function

Inputs: x₁=0.8, x₂=0.3, x₃=−0.5 | Weights: w₁=0.6, w₂=0.4, w₃=0.2 | Bias: b=−0.3

z = (0.8×0.6) + (0.3×0.4) + (−0.5×0.2) + (−0.3)

z = 0.48 + 0.12 − 0.10 − 0.30 = 0.20

Since z = 0.20 ≥ 0 → f(z) = 1. The neuron fires.

If the bias were −0.5 instead: z = 0.00 → still fires (the z ≥ 0 rule).

If the bias were −0.6: z = −0.10 → f(z) = 0. The neuron doesn't fire.
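The worked example above can be sketched in plain Python (the `step` and `neuron` helpers are illustrative, not library functions):

```python
def step(z):
    """Heaviside step: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def neuron(xs, ws, b):
    """Weighted sum plus bias, followed by the step activation."""
    z = sum(x * w for x, w in zip(xs, ws)) + b
    return z, step(z)

xs, ws = [0.8, 0.3, -0.5], [0.6, 0.4, 0.2]
print(neuron(xs, ws, -0.3))  # z ~ 0.20 -> fires (1)
print(neuron(xs, ws, -0.6))  # z ~ -0.10 -> silent (0)
```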

The Fatal Flaw: Zero Gradient

The step function's output is either 0 or 1 — everywhere. This means its derivative is 0 everywhere (and undefined at x=0). In deep learning, we train networks using backpropagation: we compute how much each weight contributed to the error, and nudge the weights in the right direction. That computation requires the gradient of the activation function.

If the gradient is always zero, backpropagation has nothing to work with. No matter how wrong the network's output is, the gradient signal cannot travel backward through a step function. The weights never update. The network never learns. This isn't a minor inconvenience — it's a complete breakdown of the learning algorithm.

"The step function introduced the idea of threshold-based neural firing. But it gave us a network that could represent decisions yet could never learn to make better ones."

On the history of activation functions in connectionism

Historical Significance

Despite its unusability in modern deep learning, the step function was genuinely important. It established the idea that a neuron's output could be a nonlinear function of its inputs. It gave rise to the perceptron algorithm in the 1950s and directly motivated the search for differentiable alternatives — which led to the sigmoid function a few decades later. Every activation function that follows in this article is, in some sense, a response to the step function's limitations.

Property | Value | Notes
Output range | {0, 1} | Binary only
Differentiable? | No | Fatal for backprop
Vanishing gradient? | Gradient is always 0 | Cannot train
Zero-centered? | No (outputs 0 or 1) | Biased updates
Use today? | Never in hidden layers | Only for concept illustration
02
The Classic · Probability Gate

Sigmoid

The smooth S-curve that made backpropagation possible.

The sigmoid function emerged as the natural differentiable alternative to the step function. If we want a neuron that behaves like an on/off switch but can also be trained by backpropagation, we need something that looks like a smooth, continuous version of the step function — and that's exactly what the sigmoid delivers. Its characteristic S-shaped curve transitions smoothly from 0 to 1, with the steepest slope at x=0 and gentle saturation at both extremes.

Mathematical Definition
σ(x) = 1 / (1 + e⁻ˣ)
σ'(x) = σ(x) · (1 − σ(x))
Maximum gradient = 0.25, at x = 0
Real-World Analogy

Picture a hospital triage nurse deciding whether to escalate a patient to the ICU. A patient with no symptoms at all (x → −∞) has near-zero probability of needing escalation. A patient in cardiac arrest (x → +∞) has near-certainty. But for the vast middle ground — the ambiguous cases — the nurse weighs dozens of signals and returns a probability. Small changes in these middle-ground cases dramatically change the output; extreme cases hardly change at all. The sigmoid encodes exactly this intuition: sensitivity in the uncertain middle, certainty at the extremes.

The Elegant Gradient

One of the reasons the sigmoid became so popular is that its derivative has a beautiful closed-form expression. If you already know σ(x), computing σ'(x) is free — you just multiply σ(x) by (1 − σ(x)). This made backpropagation implementations clean and efficient in an era when computational resources were precious.

Worked Example · Sigmoid

Consider a binary classifier predicting whether an email is spam. The last layer outputs a raw score z = 2.1.

σ(2.1) = 1 / (1 + e⁻²·¹) = 1 / (1 + 0.1225) = 1 / 1.1225 ≈ 0.891

The model outputs 89.1% probability of spam. If the true label is "spam" (y=1), the gradient of the loss with respect to the pre-activation score is simply σ(z) − y = 0.891 − 1 = −0.109.

Now for a neutral email, z = −0.4:

σ(−0.4) = 1 / (1 + e⁰·⁴) = 1 / (1 + 1.492) ≈ 0.401

40.1% probability of spam — correctly uncertain.
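These values are easy to reproduce in plain Python (the `sigmoid` helper is illustrative, not a library call):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # sigma'(z) = sigma(z) * (1 - sigma(z))

p = sigmoid(2.1)
print(p)              # ~0.891 -> "89.1% probability of spam"
print(p - 1.0)        # loss gradient wrt the logit when y = 1, ~ -0.109
print(sigmoid(-0.4))  # ~0.401, correctly uncertain
```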

The Vanishing Gradient Problem

The sigmoid's maximum gradient is 0.25 — reached only at x=0. For |x| > 2, the gradient is already below 0.11. For |x| > 4, it's below 0.02. In a deep network, the backward pass multiplies these gradients together across layers. In a 10-layer network where every neuron is saturated at |x| ≈ 4, the gradient reaching the first layer is on the order of (0.02)¹⁰ ≈ 10⁻¹⁷. The weights in early layers effectively stop updating — they are frozen by mathematics, not design.

This phenomenon, called the vanishing gradient problem, was a primary reason deep networks were considered untrainable through the 1990s and early 2000s. It was only substantially addressed with the arrival of ReLU and, later, residual connections.
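A minimal sketch of this geometric shrinkage, assuming 10 layers all saturated at z = 4:

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# Backprop multiplies one activation gradient per layer. If every
# neuron is saturated at z = 4, each layer scales the signal by ~0.018.
signal = 1.0
for _ in range(10):
    signal *= sigmoid_grad(4.0)

print(signal)  # on the order of 1e-18: effectively no gradient remains
```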

Non-Zero-Centered Outputs

A subtler problem: sigmoid outputs are always positive (between 0 and 1). When these outputs feed into the next layer as inputs, all gradients for the next layer's weights will have the same sign. This forces weight updates to be either all positive or all negative simultaneously — a "zig-zag" dynamic during optimization that slows convergence. Tanh addresses exactly this issue by centering its outputs around 0.

Property | Value | Notes
Output range | (0, 1) | Probability interpretation
Differentiable? | Yes, everywhere | Smooth gradient
Max gradient | 0.25 at x=0 | Vanishing in deep nets
Zero-centered? | No (0 to 1) | Zig-zag updates
Best use | Output layer only | Binary classification
03
The Centered Sigmoid · Zero-Mean

Tanh

Sigmoid's smarter sibling — centered at zero, stronger gradients.

The hyperbolic tangent function — tanh — emerged as a direct improvement over sigmoid by solving one of its main problems: non-zero-centered outputs. While sigmoid outputs values between 0 and 1, tanh outputs values between -1 and 1, with the mean of the distribution centered at 0. This seemingly small change has significant consequences for how efficiently a network can learn.

Mathematically, tanh is actually a scaled and shifted version of sigmoid: tanh(x) = 2·σ(2x) − 1. The shape is identical — an S-curve — but stretched vertically to span from −1 to +1 instead of 0 to 1. The maximum gradient of tanh is 1.0 (four times higher than sigmoid's 0.25), meaning gradient information can travel further back through the network before vanishing.

Mathematical Definition
tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
tanh'(x) = 1 − tanh²(x)
Maximum gradient = 1.0, at x = 0
Real-World Analogy

Imagine a film critic rating movies on a scale from −1 (complete disaster) to +1 (masterpiece), with 0 meaning "perfectly average." A critic who has seen 10,000 films rarely goes to extremes — a film has to be extraordinarily bad or extraordinarily good to move them far from center. Most films cluster around the middle. This is exactly how tanh behaves: it expresses strong opinions only when the evidence is overwhelming; otherwise, it returns a nuanced, centered signal.

Why Zero-Centering Matters

Consider what happens when a layer's outputs feed into the next layer's weights. If every input to a weight is positive (as with sigmoid), then the gradient with respect to that weight is either always positive or always negative (depending on the upstream gradient). This means all weights in a given neuron update in the same direction simultaneously. The optimizer can't decrease some weights while increasing others — it has to do a series of zig-zagging steps. Zero-centered inputs from tanh allow weight updates to have mixed signs, enabling more direct paths through the loss landscape.

Worked Example · Tanh in a Sentiment Classifier

A sentiment analysis network processes the phrase "the movie was not bad." The hidden layer neuron receives z = −1.5 (a moderately negative signal before tanh):

tanh(−1.5) = (e⁻¹·⁵ − e¹·⁵) / (e⁻¹·⁵ + e¹·⁵) = (0.223 − 4.482) / (0.223 + 4.482) ≈ −0.905

The gradient for backprop: tanh'(−1.5) = 1 − (−0.905)² = 1 − 0.819 = 0.181

Now for a weakly positive signal z = 0.3:

tanh(0.3) ≈ 0.291 and tanh'(0.3) = 1 − 0.291² ≈ 0.915

Notice how the gradient 0.915 is more than 3.5× larger than the sigmoid's maximum gradient of 0.25. This is the practical advantage in shallow networks.
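These numbers can be verified with Python's built-in `math.tanh`:

```python
import math

def tanh_grad(z):
    t = math.tanh(z)
    return 1.0 - t * t  # tanh'(z) = 1 - tanh^2(z)

print(math.tanh(-1.5))  # ~ -0.905: strong negative sentiment signal
print(tanh_grad(-1.5))  # ~ 0.181: gradient is shrinking near saturation
print(math.tanh(0.3))   # ~ 0.291: weak positive signal
print(tanh_grad(0.3))   # ~ 0.915: large gradient near the center
```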

Still Suffers at the Extremes

For all its advantages over sigmoid, tanh shares the same fundamental problem: saturation at the extremes. When |x| is large, tanh(x) approaches ±1 and the gradient approaches 0. In a deep network, neurons that consistently receive large inputs will saturate, and their gradients will vanish. The problem is less severe than sigmoid (because the gradient peak is 1.0 vs 0.25), but it's not solved.

This is why tanh remained the preferred choice for shallow networks (2–4 layers) through the 2000s, but couldn't unlock the performance of very deep architectures. That breakthrough had to wait for ReLU.

Where Tanh Still Dominates

Even today, tanh remains the standard activation inside LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) cells. The reason is specific to recurrent networks: the values flowing through the recurrent state need to stay bounded and centered, or they explode or vanish over time steps. Tanh's bounded output range (−1, 1) provides natural stability that ReLU cannot — ReLU's unbounded positive output would allow recurrent states to grow without limit.

Property | Value | Notes
Output range | (−1, 1) | Zero-centered
Max gradient | 1.0 at x=0 | 4× stronger than sigmoid
Vanishing gradient? | Yes, at extremes | Same issue as sigmoid
Best use | RNNs, LSTMs, shallow nets | Industry standard for RNN
04
The Workhorse · Deep Learning Revolution

ReLU

The function that unlocked truly deep networks.

When Krizhevsky, Sutskever, and Hinton published AlexNet in 2012 and won the ImageNet competition by a margin that shocked the field, one of their key design choices was deceptively simple: they used Rectified Linear Units everywhere instead of sigmoid or tanh. ReLU is, mathematically speaking, the simplest possible nonlinear function — it sets all negative values to zero and passes positive values through unchanged. And yet this trivial-looking function transformed what was possible in deep learning.

Mathematical Definition
f(x) = max(0, x)
f'(x) = 1 if x > 0
f'(x) = 0 if x ≤ 0
Real-World Analogy

Think of a sales commission structure at a startup. If a salesperson makes zero or negative profit for the company this month, they earn nothing extra — zero commission. But for every dollar of profit above zero, they earn a proportional cut. There's no cap on how much they can earn for extraordinary performance — the reward scales linearly with results. This is ReLU: no reward for negative performance, proportional reward for positive performance, unlimited upside. Simple, fair, and computationally trivial to evaluate.

Why ReLU Works So Well

Three properties explain ReLU's dominance. First, it solves the vanishing gradient problem for positive inputs: the gradient is exactly 1 for all x > 0. No matter how deep the network, the gradient signal travels back through active ReLU neurons with no attenuation whatsoever. This allowed networks with 10, 50, and eventually hundreds of layers to be trained effectively for the first time.

Second, ReLU induces sparsity. In any given forward pass, roughly half of the neurons in a ReLU network output exactly zero — they are "off." This sparse representation has desirable properties: it reduces the effective model complexity, makes computations faster, and has been argued to mirror how the brain encodes information (most neurons are silent at any given moment).

Third, ReLU is computationally trivial. Evaluating max(0, x) requires one comparison and one branch. There is no exponentiation, no division. This makes ReLU orders of magnitude faster to compute than sigmoid or tanh, which matters enormously when you're applying it millions of times per second during training.

Worked Example · ReLU in an Image Classifier

A convolutional neural network detects edges in an image. After a convolution, a filter produces the following activations for a 1×5 slice: [−2.1, 0.8, 3.4, −0.3, 1.2]

After ReLU: [0, 0.8, 3.4, 0, 1.2]

Two neurons are "dead" in this pass. The network only propagates information from positions 2, 3, and 5 — where the edge signal was strong and positive. The 3.4 remains fully intact; no sigmoid squashing here.

Gradient during backprop for each position:

[0, 1, 1, 0, 1] — only positions 2, 3, 5 receive gradient. Positions 1 and 4 are blocked.
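The same 1×5 slice can be run through a minimal list-based ReLU (real frameworks vectorize this, but the logic is identical):

```python
def relu(xs):
    """Forward pass: clamp negatives to zero."""
    return [max(0.0, x) for x in xs]

def relu_grad_mask(xs):
    """Backward pass: gradient 1 only where the pre-activation was positive."""
    return [1.0 if x > 0 else 0.0 for x in xs]

acts = [-2.1, 0.8, 3.4, -0.3, 1.2]
print(relu(acts))            # [0.0, 0.8, 3.4, 0.0, 1.2]
print(relu_grad_mask(acts))  # [0.0, 1.0, 1.0, 0.0, 1.0]
```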

The Dying ReLU Problem

ReLU's most significant weakness is the "dying ReLU" phenomenon. A neuron "dies" when it gets stuck in a state where it always outputs 0. This happens when the weights become arranged such that the neuron's pre-activation value is negative for every input in the training dataset. In this state, the gradient is always 0, the weights never update, and the neuron is permanently frozen — it contributes nothing to the network for the rest of training.

In practice, 10–20% of neurons can die in poorly initialized or poorly configured networks. A large learning rate is a common culprit: a single aggressive gradient update can push many neurons into the permanently-off region. Proper weight initialization (He initialization is designed specifically for ReLU) and careful learning rate tuning mitigate this, but don't eliminate it. This vulnerability directly motivated Leaky ReLU and its variants.

"Using ReLUs instead of sigmoid functions is probably the single most important practical improvement to training deep networks. Not because of elegance — max(0,x) is about as inelegant as it gets — but because it just works."

Common sentiment in the deep learning community post-AlexNet

He Initialization: ReLU's Partner

Because ReLU zeros out negative inputs, it changes the effective variance of activations as you go deeper. Without correction, the variance either explodes or collapses as layers pile up. He initialization (Kaiming He, 2015) sets the initial weight variance to 2/n (where n is the number of incoming connections), exactly compensating for ReLU's zeroing. Using sigmoid-style initialization (1/n) with ReLU networks is a common mistake that degrades performance significantly.
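The 2/n rule can be sanity-checked in plain Python (the `he_init` helper is illustrative; frameworks ship this as a Kaiming/He initializer):

```python
import math
import random

def he_init(n_in, n_out, seed=0):
    """Sample a weight matrix with variance 2/n_in (He initialization)."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]

W = he_init(512, 256)
flat = [w for row in W for w in row]
var = sum(w * w for w in flat) / len(flat)
print(var, 2.0 / 512)  # empirical variance closely matches the 2/n target
```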

Property | Value | Notes
Output range | [0, ∞) | Unbounded positive
Gradient (positive) | Always 1 | No vanishing gradient
Gradient (negative) | Always 0 | Dying ReLU risk
Computation | Trivial | Fastest activation
Zero-centered? | No (non-negative) | Slight optimization inefficiency
Best use | Hidden layers, CNNs | Default for most tasks
05
The Survivor · Fixing Dead Neurons

Leaky ReLU

ReLU with a lifeline for negative inputs.

Leaky ReLU was proposed in 2013 as a direct response to the dying neuron problem. The idea is almost embarrassingly simple: instead of clamping all negative values to exactly zero, allow a small, non-zero slope for negative inputs. A neuron that receives a negative input doesn't go silent — it passes along a tiny, attenuated signal. This tiny signal keeps the gradient alive. The neuron remains "on" in a minimal sense, and backpropagation can still update its weights.

Mathematical Definition
f(x) = x if x > 0
f(x) = αx if x ≤ 0 (α typically = 0.01)
f'(x) = 1 if x > 0
f'(x) = α if x ≤ 0
Real-World Analogy

Imagine a consultant who normally earns a full day rate when they have active client work. In months with no active projects, they don't earn zero — they do small retainer work, write articles, or take training courses at a small fixed rate. They never go completely dark. They keep their skills current. When a new project arrives, they're ready to re-engage at full capacity. The small "leaky" stipend is exactly α — usually 0.01 times the normal rate — just enough to stay in the game.

The α Hyperparameter

The slope α for negative inputs is a hyperparameter. The standard value is 0.01, chosen to be small enough not to fundamentally change the function's behavior for positive inputs, but large enough to keep gradients alive. Some variants treat α as a learnable parameter (Parametric ReLU, or PReLU, proposed by Kaiming He in 2015), allowing the network to discover the optimal negative slope for each neuron independently.

Worked Example · Comparing ReLU vs Leaky ReLU

Suppose a neuron consistently receives z = −0.8 across all training examples in a batch. With α = 0.01:

Standard ReLU: f(−0.8) = 0. Gradient = 0. The weight update is: Δw = 0 × (upstream_gradient). Dead neuron.

Leaky ReLU: f(−0.8) = 0.01 × −0.8 = −0.008. Gradient = 0.01.

If the upstream gradient is −0.5: Δw = 0.01 × (−0.5) = −0.005. Small, but non-zero. The neuron nudges its weights in the right direction and can recover.

Over 1000 gradient steps: ReLU accumulates Δw = 0 (perpetually dead). Leaky ReLU accumulates Δw = −5, potentially enough to push the neuron back into positive territory and restore full activity.
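The recovery argument can be checked numerically; this sketch assumes the same z = −0.8, α = 0.01, and upstream gradient of −0.5 as above:

```python
def leaky_relu_grad(z, alpha=0.01):
    """Gradient of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1.0 if z > 0 else alpha

z, upstream = -0.8, -0.5

# Standard ReLU: gradient is 0 for z < 0, so the update is always 0.
relu_update = 0.0 * upstream

# Leaky ReLU: a small but non-zero gradient keeps the weight moving.
leaky_update = leaky_relu_grad(z) * upstream  # 0.01 * -0.5 = -0.005

total = sum(leaky_update for _ in range(1000))
print(relu_update, leaky_update, total)  # ReLU stays at 0; Leaky accumulates ~ -5
```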

ELU and SELU: Taking It Further

The Exponential Linear Unit (ELU) takes the leaking idea further by using an exponential curve for negative inputs: f(x) = α(eˣ − 1) for x ≤ 0. This produces a smooth transition at x=0 (no kink) and saturates at −α for very negative inputs. The Scaled ELU (SELU) adds a self-normalizing property: under certain conditions, SELU networks automatically maintain mean-zero and unit-variance activations through all layers, eliminating the need for batch normalization entirely.

In practice, Leaky ReLU with α=0.01 remains the most widely used variant because it's simple, computationally identical to ReLU, and doesn't introduce the exponential computation that ELU requires.

Property | Value | Notes
Output range | (−∞, ∞) | Fully unbounded
Gradient (positive) | 1 | Same as ReLU
Gradient (negative) | α (e.g., 0.01) | Neurons stay alive
Dying neurons? | No | Core improvement over ReLU
Computation | Same as ReLU | No overhead
Hyperparameter | α to tune | Usually just use 0.01
06
The Distributor · Multi-Class Output

Softmax

Turning raw scores into a proper probability distribution.

Softmax is fundamentally different from all other activation functions in this article. Every other function operates on a single scalar value and transforms it independently. Softmax operates on an entire vector: it takes a vector of raw scores (called logits) and transforms them into a probability distribution — a set of non-negative values that sum to exactly 1. This property makes it the natural choice for the output layer of any multi-class classification problem.

Mathematical Definition
softmax(xᵢ) = exp(xᵢ) / Σⱼ exp(xⱼ)
For all i: output ∈ (0, 1), and Σᵢ output = 1
Jacobian: ∂yᵢ/∂xⱼ = yᵢ(δᵢⱼ − yⱼ), where δᵢⱼ is the Kronecker delta
Real-World Analogy

Imagine an election with five candidates. Each candidate has a raw "popularity score": some positive, some negative, some wildly different in scale. An election consultant needs to convert these messy scores into a percentage breakdown — a proper probability distribution over who will win. Softmax is exactly this conversion process: it exponentiates every score to make them all positive, then divides each by the total. The most popular candidate gets the largest slice. The least popular gets the smallest. Every candidate gets something. And the percentages sum to exactly 100%.

The Role of the Exponential

The choice of the exponential function in softmax isn't arbitrary — it has two important properties. First, exponentiation makes all values positive, regardless of whether the logits are negative. Second, it amplifies differences: a score of 3.0 vs a score of 2.0 leads to a ratio of e³/e² = e ≈ 2.72 between the numerators, not 1.5. This means softmax has a "winner-takes-more" behavior: the class with the highest score gets disproportionately amplified, making the output distribution sharper and more decisive.

Worked Example · Softmax for 4-Class Classifier

An image classifier outputs logits for 4 classes: Cat=2.1, Dog=1.5, Bird=−0.3, Fish=0.8

Step 1 — Exponentiate: e²·¹=8.17, e¹·⁵=4.48, e⁻⁰·³=0.74, e⁰·⁸=2.23

Step 2 — Sum: 8.17 + 4.48 + 0.74 + 2.23 = 15.62

Step 3 — Normalize:

Cat: 8.17 / 15.62 = 0.523 (52.3%)

Dog: 4.48 / 15.62 = 0.287 (28.7%)

Bird: 0.74 / 15.62 = 0.047 (4.7%)

Fish: 2.23 / 15.62 = 0.143 (14.3%)

Sum: 0.523 + 0.287 + 0.047 + 0.143 = 1.000
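The three steps above map directly to a few lines of Python (a naive softmax, fine for small logits):

```python
import math

def softmax(logits):
    exps = [math.exp(x) for x in logits]   # Step 1: exponentiate
    total = sum(exps)                      # Step 2: sum
    return [e / total for e in exps]       # Step 3: normalize

probs = softmax([2.1, 1.5, -0.3, 0.8])    # Cat, Dog, Bird, Fish
print([round(p, 3) for p in probs])       # [0.523, 0.287, 0.047, 0.143]
```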

Softmax + Cross-Entropy: The Standard Pairing

Softmax is almost never used alone — it's always paired with cross-entropy loss. The cross-entropy loss for a correct class c is: L = −log(softmax(x)_c). When you compute the gradient of this combined loss with respect to the logits, you get an extraordinarily clean result: writing p = softmax(x), the gradient is ∂L/∂xᵢ = pᵢ − 1 for the correct class, and pᵢ for every other class. This elegant gradient is one reason the softmax + cross-entropy combination is so widely used — the math works out beautifully.

Numerical Stability: The Log-Sum-Exp Trick

A practical caveat: naively computing exp(x) for large x overflows floating-point arithmetic. exp(1000) is infinity in IEEE 754 float32. The standard solution is to subtract the maximum logit before exponentiating: compute exp(xᵢ − max(x)) instead of exp(xᵢ). This doesn't change the output (the normalization cancels the shift), but ensures that at least one logit exponentiates to exp(0) = 1, keeping all values in a numerically stable range. All major deep learning frameworks implement this automatically.
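A minimal sketch of the max-subtraction trick in plain Python (illustrative, not a framework implementation):

```python
import math

def softmax_stable(logits):
    # Subtract the max logit before exponentiating: math.exp(1000.0)
    # overflows, but exp(x - max) is always <= exp(0) = 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Naive softmax would overflow on these logits; the shifted version is fine,
# and the normalization cancels the shift, so the output is unchanged.
print(softmax_stable([1000.0, 999.0, 998.0]))
```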

Property | Value | Notes
Input | A vector of logits | Not a scalar function
Output | Probability distribution | Sums to 1
Winner-takes-more | Yes | Sharp distributions for large gaps
Used in | Output layer only | Multi-class classification
Pair with | Cross-entropy loss | Clean, stable gradient
07
The Modern Default · LLM-Era Activation

GELU

The probabilistic gating function powering GPT, BERT, and beyond.

Introduced by Hendrycks and Gimpel in 2016, the Gaussian Error Linear Unit quietly became the most important activation function of the transformer era. While ReLU makes a hard, binary decision about each input (positive → pass through, negative → block), GELU makes a probabilistic decision: it weights each input by the probability that a Gaussian random variable would be less than that input. The result is a smooth, non-monotonic function that combines the sparsity benefits of ReLU with the smooth probabilistic intuition of sigmoid.

Mathematical Definition
GELU(x) = x · Φ(x) = x · (1/2)[1 + erf(x / √2)]
Fast approximation: GELU(x) ≈ 0.5x · (1 + tanh(√(2/π) · (x + 0.044715x³)))
where Φ(x) is the Gaussian CDF and erf is the error function
Real-World Analogy

Imagine a high-frequency trading algorithm deciding how much of an incoming signal to act on. For a very strong positive signal, the algorithm trusts it almost completely — the probability that such a strong signal is noise is near zero, so it acts on nearly 100% of it. For a very strong negative signal, it also rejects it almost completely. But for ambiguous signals near zero — signals that could plausibly be noise — the algorithm acts on them proportionally to how likely they are to be genuine. A signal at x=0.5 might only be acted on 60% of the way. This probability-weighted approach is exactly GELU: multiply the input by the probability that it's a meaningful signal.

Why GELU Became the Default for Transformers

The original BERT paper (2018) used GELU and reported improvements over ReLU and tanh. GPT-2 (2019) adopted GELU. GPT-3, LLaMA, PaLM, and nearly every transformer architecture since have followed suit. The empirical evidence is clear: on language tasks especially, GELU consistently outperforms ReLU and its variants.

The theoretical intuition is that language is probabilistic in nature. A token's "activation" in a neural network represents evidence for some feature. The question "how much should I activate on this evidence?" is naturally answered probabilistically — stronger evidence earns proportionally stronger activation, with no hard cutoff. GELU's soft gating matches this intuition in a way that ReLU's hard threshold does not.

Worked Example · GELU vs ReLU on Identical Inputs

Suppose a transformer attention layer output has pre-activation values: [−1.0, −0.3, 0.0, 0.5, 2.0]

ReLU output: [0, 0, 0, 0.5, 2.0]

GELU output:

GELU(−1.0) ≈ −1.0 × Φ(−1.0) = −1.0 × 0.159 ≈ −0.159

GELU(−0.3) ≈ −0.3 × Φ(−0.3) = −0.3 × 0.382 ≈ −0.115

GELU(0.0) = 0.0 × 0.5 = 0.000

GELU(0.5) ≈ 0.5 × Φ(0.5) = 0.5 × 0.691 ≈ 0.346

GELU(2.0) ≈ 2.0 × Φ(2.0) = 2.0 × 0.977 ≈ 1.954

GELU output: [−0.159, −0.115, 0.000, 0.346, 1.954]

Key difference: GELU passes small negative values with attenuated negative output. ReLU kills all negatives. GELU also attenuates positive values near zero — 0.5 becomes 0.346, not a full 0.5.
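Both the exact form and the tanh approximation are easy to check in plain Python with `math.erf` (helper names are illustrative):

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF expressed via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The widely used tanh approximation from the definition above
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

for x in [-1.0, -0.3, 0.0, 0.5, 2.0]:
    print(x, round(gelu_exact(x), 3), round(gelu_tanh(x), 3))
```

The two versions agree to within a few thousandths across this range, which is why the cheaper tanh form is common in practice.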

The Non-Monotonic Property

GELU is non-monotonic for negative inputs: it dips to a minimum of approximately −0.17 at around x ≈ −0.75, then rises back toward 0 as x becomes more negative. Past the minimum, a more negative input actually produces a less negative output — which seems counterintuitive but is mathematically meaningful. This non-monotonicity has been theorized to help the model represent more complex feature interactions.

SwiGLU: GELU's Successor in Modern LLMs

The most recent large language models — LLaMA 2, Mistral, Gemma, and others — use a variant called SwiGLU (Swish-Gated Linear Unit). SwiGLU combines a gating mechanism with the Swish activation (a close relative of GELU) to create a two-stream architecture in the feedforward blocks of transformers. SwiGLU has empirically outperformed GELU on large-scale language modeling, and has become the new default in frontier models.

Property | Value | Notes
Output range | (−∞, ∞) | Unbounded but soft-clipped
Smooth? | Yes, infinitely | No kink at x=0, unlike ReLU
Non-monotonic? | Yes, near x=0 | More expressive
Computation | Slower than ReLU | Needs erf or tanh approx
Standard in | BERT, GPT, LLaMA | Transformer default

Quick Reference: When to Use Each

Binary decision · output layer
Sigmoid
Binary classification output. Gives probability between 0 and 1. Never in hidden layers.
RNNs · LSTMs · Shallow nets
Tanh
Zero-centered, bounded outputs. Standard inside LSTM and GRU recurrent cells.
CNNs · Feedforward nets
ReLU
The fast, reliable default. Use He initialization. Watch for dead neurons with high LR.
Dying neurons detected
Leaky ReLU
Drop-in ReLU replacement. Keeps neurons alive with α=0.01 negative slope.
Multi-class output layer
Softmax
Converts logit vectors to probability distributions. Always pair with cross-entropy.
Transformers · LLMs · BERT
GELU
Modern default for transformer feedforward layers. Probabilistic gating, smooth gradient.
Frontier LLMs (LLaMA, Mistral)
SwiGLU
The GELU successor. Gated linear unit structure in transformer FFN blocks.