Deep Learning Fundamentals · A Complete Reference

Activation Functions:
The Decision Makers
of Neural Networks

A detailed, function-by-function breakdown of every major activation function — with real-world analogies, mathematical derivations, and concrete worked examples.

Every neuron in a deep neural network performs two operations: a weighted sum of its inputs, and then a transformation of that sum. That second step — the transformation — is the job of the activation function. It sounds simple. But the choice of activation function is one of the most consequential decisions in the architecture of a neural network.

Without activation functions, a neural network — no matter how many layers it has — is nothing more than a linear regression model. Every layer would simply multiply its input by a matrix, and the composition of a thousand linear transformations is still just one linear transformation. The non-linearity introduced by activation functions is what allows networks to approximate any function, learn curved decision boundaries, and ultimately recognize faces, translate languages, and generate text.

This article covers seven activation functions in depth: where they come from, why they work, where they fail, and how to choose between them. Each is explained with a real-world analogy, a worked numerical example, and the mathematical derivation of its gradient — the quantity that makes learning possible.

01
The Original · Binary Threshold

The Step Function

The light switch that started it all.

The step function — also called the Heaviside function — is the oldest activation function in the history of neural networks. It was introduced as part of the McCulloch-Pitts neuron model in 1943, inspired by the biological neuron's behavior: either it fires or it doesn't. This binary, all-or-nothing model was the first attempt to mathematically capture how real neurons work.

Mathematical Definition
f(x) = 1 if x ≥ 0
f(x) = 0 if x < 0
f'(x) = 0 everywhere (undefined at x = 0)
Real-World Analogy

Think of a light switch. You flip it up — the light is fully on. You flip it down — the light is fully off. There is no "dimming." There is no degree of onness. The switch doesn't know how far above the threshold you pushed it; it only knows that you did. This is precisely what the step function does: it transforms any weighted sum above zero into a full "1" output and anything below into a flat "0".

How It Works

Imagine a neuron that receives three inputs: x₁ = 0.8 (how bright is the light), x₂ = 0.3 (how close is the object), x₃ = -0.5 (is there noise?). The weights are w₁ = 0.6, w₂ = 0.4, w₃ = 0.2, and the bias b = -0.3. The weighted sum z is:

Worked Example · Step Function

Inputs: x₁=0.8, x₂=0.3, x₃=−0.5 | Weights: w₁=0.6, w₂=0.4, w₃=0.2 | Bias: b=−0.3

z = (0.8×0.6) + (0.3×0.4) + (−0.5×0.2) + (−0.3)

z = 0.48 + 0.12 − 0.10 − 0.30 = 0.20

Since z = 0.20 ≥ 0 → f(z) = 1. The neuron fires.

If the bias were −0.5 instead: z = 0.00 → still fires (the z ≥ 0 rule).

If the bias were −0.6: z = −0.10 → f(z) = 0. The neuron doesn't fire.
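The worked example above can be sketched in plain Python (the `step` and `neuron` helpers are illustrative, not library functions):

```python
def step(z):
    """Heaviside step: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def neuron(xs, ws, b):
    """Weighted sum plus bias, followed by the step activation."""
    z = sum(x * w for x, w in zip(xs, ws)) + b
    return z, step(z)

xs, ws = [0.8, 0.3, -0.5], [0.6, 0.4, 0.2]
print(neuron(xs, ws, -0.3))  # z ~ 0.20 -> fires (1)
print(neuron(xs, ws, -0.6))  # z ~ -0.10 -> silent (0)
```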

The Fatal Flaw: Zero Gradient

The step function's output is either 0 or 1 — everywhere. This means its derivative is 0 everywhere (and undefined at x=0). In deep learning, we train networks using backpropagation: we compute how much each weight contributed to the error, and nudge the weights in the right direction. That computation requires the gradient of the activation function.

If the gradient is always zero, backpropagation has nothing to work with. No matter how wrong the network's output is, the gradient signal cannot travel backward through a step function. The weights never update. The network never learns. This isn't a minor inconvenience — it's a complete breakdown of the learning algorithm.

"The step function introduced the idea of threshold-based neural firing. But it gave us a network that could represent decisions yet could never learn to make better ones."

On the history of activation functions in connectionism

Historical Significance

Despite its unusability in modern deep learning, the step function was genuinely important. It established the idea that a neuron's output could be a nonlinear function of its inputs. It gave rise to the perceptron algorithm in the 1950s and directly motivated the search for differentiable alternatives — which led to the sigmoid function a few decades later. Every activation function that follows in this article is, in some sense, a response to the step function's limitations.

Property | Value | Notes
Output range | {0, 1} | Binary only
Differentiable? | No | Fatal for backprop
Vanishing gradient? | Gradient is always 0 | Cannot train
Zero-centered? | No (outputs 0 or 1) | Biased updates
Use today? | Never in hidden layers | Only for concept illustration
02
The Classic · Probability Gate

Sigmoid

The smooth S-curve that made backpropagation possible.

The sigmoid function emerged as the natural differentiable alternative to the step function. If we want a neuron that behaves like an on/off switch but can also be trained by backpropagation, we need something that looks like a smooth, continuous version of the step function — and that's exactly what the sigmoid delivers. Its characteristic S-shaped curve transitions smoothly from 0 to 1, with the steepest slope at x=0 and gentle saturation at both extremes.

Mathematical Definition
σ(x) = 1 / (1 + e⁻ˣ)
σ'(x) = σ(x) · (1 − σ(x))
Maximum gradient = 0.25, at x = 0
Real-World Analogy

Picture a hospital triage nurse deciding whether to escalate a patient to the ICU. A patient with no symptoms at all (x → −∞) has near-zero probability of needing escalation. A patient in cardiac arrest (x → +∞) has near-certainty. But for the vast middle ground — the ambiguous cases — the nurse weighs dozens of signals and returns a probability. Small changes in these middle-ground cases dramatically change the output; extreme cases hardly change at all. The sigmoid encodes exactly this intuition: sensitivity in the uncertain middle, certainty at the extremes.

The Elegant Gradient

One of the reasons the sigmoid became so popular is that its derivative has a beautiful closed-form expression. If you already know σ(x), computing σ'(x) is free — you just multiply σ(x) by (1 − σ(x)). This made backpropagation implementations clean and efficient in an era when computational resources were precious.

Worked Example · Sigmoid

Consider a binary classifier predicting whether an email is spam. The last layer outputs a raw score z = 2.1.

σ(2.1) = 1 / (1 + e⁻²·¹) = 1 / (1 + 0.1225) = 1 / 1.1225 ≈ 0.891

The model outputs 89.1% probability of spam. If the true label is "spam" (y=1), the gradient of the loss with respect to the pre-activation score is simply σ(z) − y = 0.891 − 1 = −0.109.

Now for a neutral email, z = −0.4:

σ(−0.4) = 1 / (1 + e⁰·⁴) = 1 / (1 + 1.492) ≈ 0.401

40.1% probability of spam — correctly uncertain.
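These values are easy to reproduce in plain Python (the `sigmoid` helper is illustrative, not a library call):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # sigma'(z) = sigma(z) * (1 - sigma(z))

p = sigmoid(2.1)
print(p)              # ~0.891 -> "89.1% probability of spam"
print(p - 1.0)        # loss gradient wrt the logit when y = 1, ~ -0.109
print(sigmoid(-0.4))  # ~0.401, correctly uncertain
```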

The Vanishing Gradient Problem

The sigmoid's maximum gradient is 0.25 — reached only at x=0. For |x| > 2, the gradient is already below 0.11. For |x| > 4, it's below 0.02. In a deep network, the backward pass multiplies these gradients together across layers. In a 10-layer network where every neuron is saturated at |x| ≈ 4, the gradient reaching the first layer is on the order of (0.02)¹⁰ ≈ 10⁻¹⁷. The weights in early layers effectively stop updating — they are frozen by mathematics, not design.

This phenomenon, called the vanishing gradient problem, was a primary reason deep networks were considered untrainable through the 1990s and early 2000s. It was only substantially addressed with the arrival of ReLU and, later, residual connections.
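A minimal sketch of this geometric shrinkage, assuming 10 layers all saturated at z = 4:

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# Backprop multiplies one activation gradient per layer. If every
# neuron is saturated at z = 4, each layer scales the signal by ~0.018.
signal = 1.0
for _ in range(10):
    signal *= sigmoid_grad(4.0)

print(signal)  # on the order of 1e-18: effectively no gradient remains
```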

Non-Zero-Centered Outputs

A subtler problem: sigmoid outputs are always positive (between 0 and 1). When these outputs feed into the next layer as inputs, all gradients for the next layer's weights will have the same sign. This forces weight updates to be either all positive or all negative simultaneously — a "zig-zag" dynamic during optimization that slows convergence. Tanh addresses exactly this issue by centering its outputs around 0.

Property | Value | Notes
Output range | (0, 1) | Probability interpretation
Differentiable? | Yes, everywhere | Smooth gradient
Max gradient | 0.25 at x=0 | Vanishing in deep nets
Zero-centered? | No (0 to 1) | Zig-zag updates
Best use | Output layer only | Binary classification
03
The Centered Sigmoid · Zero-Mean

Tanh

Sigmoid's smarter sibling — centered at zero, stronger gradients.

The hyperbolic tangent function — tanh — emerged as a direct improvement over sigmoid by solving one of its main problems: non-zero-centered outputs. While sigmoid outputs values between 0 and 1, tanh outputs values between -1 and 1, with the mean of the distribution centered at 0. This seemingly small change has significant consequences for how efficiently a network can learn.

Mathematically, tanh is actually a scaled and shifted version of sigmoid: tanh(x) = 2·σ(2x) − 1. The shape is identical — an S-curve — but stretched vertically to span from −1 to +1 instead of 0 to 1. The maximum gradient of tanh is 1.0 (four times higher than sigmoid's 0.25), meaning gradient information can travel further back through the network before vanishing.

Mathematical Definition
tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
tanh'(x) = 1 − tanh²(x)
Maximum gradient = 1.0, at x = 0
Real-World Analogy

Imagine a film critic rating movies on a scale from −1 (complete disaster) to +1 (masterpiece), with 0 meaning "perfectly average." A critic who has seen 10,000 films rarely goes to extremes — a film has to be extraordinarily bad or extraordinarily good to move them far from center. Most films cluster around the middle. This is exactly how tanh behaves: it expresses strong opinions only when the evidence is overwhelming; otherwise, it returns a nuanced, centered signal.

Why Zero-Centering Matters

Consider what happens when a layer's outputs feed into the next layer's weights. If every input to a weight is positive (as with sigmoid), then the gradient with respect to that weight is either always positive or always negative (depending on the upstream gradient). This means all weights in a given neuron update in the same direction simultaneously. The optimizer can't decrease some weights while increasing others — it has to do a series of zig-zagging steps. Zero-centered inputs from tanh allow weight updates to have mixed signs, enabling more direct paths through the loss landscape.

Worked Example · Tanh in a Sentiment Classifier

A sentiment analysis network processes the phrase "the movie was not bad." The hidden layer neuron receives z = −1.5 (a moderately negative signal before tanh):

tanh(−1.5) = (e⁻¹·⁵ − e¹·⁵) / (e⁻¹·⁵ + e¹·⁵) = (0.223 − 4.482) / (0.223 + 4.482) ≈ −0.905

The gradient for backprop: tanh'(−1.5) = 1 − (−0.905)² = 1 − 0.819 = 0.181

Now for a weakly positive signal z = 0.3:

tanh(0.3) ≈ 0.291 and tanh'(0.3) = 1 − 0.291² ≈ 0.915

Notice how the gradient 0.915 is more than 3.5× larger than the sigmoid's maximum gradient of 0.25. This is the practical advantage in shallow networks.
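These numbers can be verified with Python's built-in `math.tanh`:

```python
import math

def tanh_grad(z):
    t = math.tanh(z)
    return 1.0 - t * t  # tanh'(z) = 1 - tanh^2(z)

print(math.tanh(-1.5))  # ~ -0.905: strong negative sentiment signal
print(tanh_grad(-1.5))  # ~ 0.181: gradient is shrinking near saturation
print(math.tanh(0.3))   # ~ 0.291: weak positive signal
print(tanh_grad(0.3))   # ~ 0.915: large gradient near the center
```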

Still Suffers at the Extremes

For all its advantages over sigmoid, tanh shares the same fundamental problem: saturation at the extremes. When |x| is large, tanh(x) approaches ±1 and the gradient approaches 0. In a deep network, neurons that consistently receive large inputs will saturate, and their gradients will vanish. The problem is less severe than sigmoid (because the gradient peak is 1.0 vs 0.25), but it's not solved.

This is why tanh remained the preferred choice for shallow networks (2–4 layers) through the 2000s, but couldn't unlock the performance of very deep architectures. That breakthrough had to wait for ReLU.

Where Tanh Still Dominates

Even today, tanh remains the standard activation inside LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) cells. The reason is specific to recurrent networks: the values flowing through the recurrent state need to stay bounded and centered, or they explode or vanish over time steps. Tanh's bounded output range (−1, 1) provides natural stability that ReLU cannot — ReLU's unbounded positive output would allow recurrent states to grow without limit.

Property | Value | Notes
Output range | (−1, 1) | Zero-centered
Max gradient | 1.0 at x=0 | 4× stronger than sigmoid
Vanishing gradient? | Yes, at extremes | Same issue as sigmoid
Best use | RNNs, LSTMs, shallow nets | Industry standard for RNN
04
The Workhorse · Deep Learning Revolution

ReLU

The function that unlocked truly deep networks.

When Krizhevsky, Sutskever, and Hinton published AlexNet in 2012 and won the ImageNet competition by a margin that shocked the field, one of their key design choices was deceptively simple: they used Rectified Linear Units everywhere instead of sigmoid or tanh. ReLU is, mathematically speaking, the simplest possible nonlinear function — it sets all negative values to zero and passes positive values through unchanged. And yet this trivial-looking function transformed what was possible in deep learning.

Mathematical Definition
f(x) = max(0, x)
f'(x) = 1 if x > 0
f'(x) = 0 if x ≤ 0
Real-World Analogy

Think of a sales commission structure at a startup. If a salesperson makes zero or negative profit for the company this month, they earn nothing extra — zero commission. But for every dollar of profit above zero, they earn a proportional cut. There's no cap on how much they can earn for extraordinary performance — the reward scales linearly with results. This is ReLU: no reward for negative performance, proportional reward for positive performance, unlimited upside. Simple, fair, and computationally trivial to evaluate.

Why ReLU Works So Well

Three properties explain ReLU's dominance. First, it solves the vanishing gradient problem for positive inputs: the gradient is exactly 1 for all x > 0. No matter how deep the network, the gradient signal travels back through active ReLU neurons with no attenuation whatsoever. This allowed networks with 10, 50, and eventually hundreds of layers to be trained effectively for the first time.

Second, ReLU induces sparsity. In any given forward pass, roughly half of the neurons in a ReLU network output exactly zero — they are "off." This sparse representation has desirable properties: it reduces the effective model complexity, makes computations faster, and has been argued to mirror how the brain encodes information (most neurons are silent at any given moment).

Third, ReLU is computationally trivial. Evaluating max(0, x) requires one comparison and one branch. There is no exponentiation, no division. This makes ReLU orders of magnitude faster to compute than sigmoid or tanh, which matters enormously when you're applying it millions of times per second during training.

Worked Example · ReLU in an Image Classifier

A convolutional neural network detects edges in an image. After a convolution, a filter produces the following activations for a 1×5 slice: [−2.1, 0.8, 3.4, −0.3, 1.2]

After ReLU: [0, 0.8, 3.4, 0, 1.2]

Two neurons are "dead" in this pass. The network only propagates information from positions 2, 3, and 5 — where the edge signal was strong and positive. The 3.4 remains fully intact; no sigmoid squashing here.

Gradient during backprop for each position:

[0, 1, 1, 0, 1] — only positions 2, 3, 5 receive gradient. Positions 1 and 4 are blocked.
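The same 1×5 slice can be run through a minimal list-based ReLU (real frameworks vectorize this, but the logic is identical):

```python
def relu(xs):
    """Forward pass: clamp negatives to zero."""
    return [max(0.0, x) for x in xs]

def relu_grad_mask(xs):
    """Backward pass: gradient 1 only where the pre-activation was positive."""
    return [1.0 if x > 0 else 0.0 for x in xs]

acts = [-2.1, 0.8, 3.4, -0.3, 1.2]
print(relu(acts))            # [0.0, 0.8, 3.4, 0.0, 1.2]
print(relu_grad_mask(acts))  # [0.0, 1.0, 1.0, 0.0, 1.0]
```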

The Dying ReLU Problem

ReLU's most significant weakness is the "dying ReLU" phenomenon. A neuron "dies" when it gets stuck in a state where it always outputs 0. This happens when the weights become arranged such that the neuron's pre-activation value is negative for every input in the training dataset. In this state, the gradient is always 0, the weights never update, and the neuron is permanently frozen — it contributes nothing to the network for the rest of training.

In practice, 10–20% of neurons can die in poorly initialized or poorly configured networks. A large learning rate is a common culprit: a single aggressive gradient update can push many neurons into the permanently-off region. Proper weight initialization (He initialization is designed specifically for ReLU) and careful learning rate tuning mitigate this, but don't eliminate it. This vulnerability directly motivated Leaky ReLU and its variants.

"Using ReLUs instead of sigmoid functions is probably the single most important practical improvement to training deep networks. Not because of elegance — max(0,x) is about as inelegant as it gets — but because it just works."

Common sentiment in the deep learning community post-AlexNet

He Initialization: ReLU's Partner

Because ReLU zeros out negative inputs, it changes the effective variance of activations as you go deeper. Without correction, the variance either explodes or collapses as layers pile up. He initialization (Kaiming He, 2015) sets the initial weight variance to 2/n (where n is the number of incoming connections), exactly compensating for ReLU's zeroing. Using sigmoid-style initialization (1/n) with ReLU networks is a common mistake that degrades performance significantly.
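The 2/n rule can be sanity-checked in plain Python (the `he_init` helper is illustrative; frameworks ship this as a Kaiming/He initializer):

```python
import math
import random

def he_init(n_in, n_out, seed=0):
    """Sample a weight matrix with variance 2/n_in (He initialization)."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]

W = he_init(512, 256)
flat = [w for row in W for w in row]
var = sum(w * w for w in flat) / len(flat)
print(var, 2.0 / 512)  # empirical variance closely matches the 2/n target
```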

Property | Value | Notes
Output range | [0, ∞) | Unbounded positive
Gradient (positive) | Always 1 | No vanishing gradient
Gradient (negative) | Always 0 | Dying ReLU risk
Computation | Trivial | Fastest activation
Zero-centered? | No (non-negative) | Slight optimization inefficiency
Best use | Hidden layers, CNNs | Default for most tasks
05
The Survivor · Fixing Dead Neurons

Leaky ReLU

ReLU with a lifeline for negative inputs.

Leaky ReLU was proposed in 2013 as a direct response to the dying neuron problem. The idea is almost embarrassingly simple: instead of clamping all negative values to exactly zero, allow a small, non-zero slope for negative inputs. A neuron that receives a negative input doesn't go silent — it passes along a tiny, attenuated signal. This tiny signal keeps the gradient alive. The neuron remains "on" in a minimal sense, and backpropagation can still update its weights.

Mathematical Definition
f(x) = x if x > 0
f(x) = αx if x ≤ 0 (α typically = 0.01)
f'(x) = 1 if x > 0
f'(x) = α if x ≤ 0
Real-World Analogy

Imagine a consultant who normally earns a full day rate when they have active client work. In months with no active projects, they don't earn zero — they do small retainer work, write articles, or take training courses at a small fixed rate. They never go completely dark. They keep their skills current. When a new project arrives, they're ready to re-engage at full capacity. The small "leaky" stipend is exactly α — usually 0.01 times the normal rate — just enough to stay in the game.

The α Hyperparameter

The slope α for negative inputs is a hyperparameter. The standard value is 0.01, chosen to be small enough not to fundamentally change the function's behavior for positive inputs, but large enough to keep gradients alive. Some variants treat α as a learnable parameter (Parametric ReLU, or PReLU, proposed by Kaiming He in 2015), allowing the network to discover the optimal negative slope for each neuron independently.

Worked Example · Comparing ReLU vs Leaky ReLU

Suppose a neuron consistently receives z = −0.8 across all training examples in a batch. With α = 0.01:

Standard ReLU: f(−0.8) = 0. Gradient = 0. The weight update is: Δw = 0 × (upstream_gradient). Dead neuron.

Leaky ReLU: f(−0.8) = 0.01 × −0.8 = −0.008. Gradient = 0.01.

If the upstream gradient is −0.5: Δw = 0.01 × (−0.5) = −0.005. Small, but non-zero. The neuron nudges its weights in the right direction and can recover.

Over 1000 gradient steps: ReLU accumulates Δw = 0 (perpetually dead). Leaky ReLU accumulates Δw = −5, potentially enough to push the neuron back into positive territory and restore full activity.
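The recovery argument can be checked numerically; this sketch assumes the same z = −0.8, α = 0.01, and upstream gradient of −0.5 as above:

```python
def leaky_relu_grad(z, alpha=0.01):
    """Gradient of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1.0 if z > 0 else alpha

z, upstream = -0.8, -0.5

# Standard ReLU: gradient is 0 for z < 0, so the update is always 0.
relu_update = 0.0 * upstream

# Leaky ReLU: a small but non-zero gradient keeps the weight moving.
leaky_update = leaky_relu_grad(z) * upstream  # 0.01 * -0.5 = -0.005

total = sum(leaky_update for _ in range(1000))
print(relu_update, leaky_update, total)  # ReLU stays at 0; Leaky accumulates ~ -5
```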

ELU and SELU: Taking It Further

The Exponential Linear Unit (ELU) takes the leaking idea further by using an exponential curve for negative inputs: f(x) = α(eˣ − 1) for x ≤ 0. This produces a smooth transition at x=0 (no kink) and saturates at −α for very negative inputs. The Scaled ELU (SELU) adds a self-normalizing property: under certain conditions, SELU networks automatically maintain mean-zero and unit-variance activations through all layers, eliminating the need for batch normalization entirely.

In practice, Leaky ReLU with α=0.01 remains the most widely used variant because it's simple, computationally identical to ReLU, and doesn't introduce the exponential computation that ELU requires.

Property | Value | Notes
Output range | (−∞, ∞) | Fully unbounded
Gradient (positive) | 1 | Same as ReLU
Gradient (negative) | α (e.g., 0.01) | Neurons stay alive
Dying neurons? | No | Core improvement over ReLU
Computation | Same as ReLU | No overhead
Hyperparameter | α to tune | Usually just use 0.01
06
The Distributor · Multi-Class Output

Softmax

Turning raw scores into a proper probability distribution.

Softmax is fundamentally different from all other activation functions in this article. Every other function operates on a single scalar value and transforms it independently. Softmax operates on an entire vector: it takes a vector of raw scores (called logits) and transforms them into a probability distribution — a set of non-negative values that sum to exactly 1. This property makes it the natural choice for the output layer of any multi-class classification problem.

Mathematical Definition
softmax(xᵢ) = exp(xᵢ) / Σⱼ exp(xⱼ)
For all i: output ∈ (0, 1), and Σᵢ output = 1
Jacobian: ∂yᵢ/∂xⱼ = yᵢ(δᵢⱼ − yⱼ), where δᵢⱼ is the Kronecker delta
Real-World Analogy

Imagine an election with five candidates. Each candidate has a raw "popularity score": some positive, some negative, some wildly different in scale. An election consultant needs to convert these messy scores into a percentage breakdown — a proper probability distribution over who will win. Softmax is exactly this conversion process: it exponentiates every score to make them all positive, then divides each by the total. The most popular candidate gets the largest slice. The least popular gets the smallest. Every candidate gets something. And the percentages sum to exactly 100%.

The Role of the Exponential

The choice of the exponential function in softmax isn't arbitrary — it has two important properties. First, exponentiation makes all values positive, regardless of whether the logits are negative. Second, it amplifies differences: a score of 3.0 vs a score of 2.0 leads to a ratio of e³/e² = e ≈ 2.72 between the numerators, not 1.5. This means softmax has a "winner-takes-more" behavior: the class with the highest score gets disproportionately amplified, making the output distribution sharper and more decisive.

Worked Example · Softmax for 4-Class Classifier

An image classifier outputs logits for 4 classes: Cat=2.1, Dog=1.5, Bird=−0.3, Fish=0.8

Step 1 — Exponentiate: e²·¹=8.17, e¹·⁵=4.48, e⁻⁰·³=0.74, e⁰·⁸=2.23

Step 2 — Sum: 8.17 + 4.48 + 0.74 + 2.23 = 15.62

Step 3 — Normalize:

Cat: 8.17 / 15.62 = 0.523 (52.3%)

Dog: 4.48 / 15.62 = 0.287 (28.7%)

Bird: 0.74 / 15.62 = 0.047 (4.7%)

Fish: 2.23 / 15.62 = 0.143 (14.3%)

Sum: 0.523 + 0.287 + 0.047 + 0.143 = 1.000
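The three steps above map directly to a few lines of Python (a naive softmax, fine for small logits):

```python
import math

def softmax(logits):
    exps = [math.exp(x) for x in logits]   # Step 1: exponentiate
    total = sum(exps)                      # Step 2: sum
    return [e / total for e in exps]       # Step 3: normalize

probs = softmax([2.1, 1.5, -0.3, 0.8])    # Cat, Dog, Bird, Fish
print([round(p, 3) for p in probs])       # [0.523, 0.287, 0.047, 0.143]
```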

Softmax + Cross-Entropy: The Standard Pairing

Softmax is almost never used alone — it's always paired with cross-entropy loss. The cross-entropy loss for a correct class c is: L = −log(softmax(x)_c). When you compute the gradient of this combined loss with respect to the logits, you get an extraordinarily clean result: writing p = softmax(x), the gradient is ∂L/∂xᵢ = pᵢ − 1 for the correct class, and pᵢ for every other class. This elegant gradient is one reason the softmax + cross-entropy combination is so widely used — the math works out beautifully.

Numerical Stability: The Log-Sum-Exp Trick

A practical caveat: naively computing exp(x) for large x overflows floating-point arithmetic. exp(1000) is infinity in IEEE 754 float32. The standard solution is to subtract the maximum logit before exponentiating: compute exp(xᵢ − max(x)) instead of exp(xᵢ). This doesn't change the output (the normalization cancels the shift), but ensures that at least one logit exponentiates to exp(0) = 1, keeping all values in a numerically stable range. All major deep learning frameworks implement this automatically.
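A minimal sketch of the max-subtraction trick in plain Python (illustrative, not a framework implementation):

```python
import math

def softmax_stable(logits):
    # Subtract the max logit before exponentiating: math.exp(1000.0)
    # overflows, but exp(x - max) is always <= exp(0) = 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Naive softmax would overflow on these logits; the shifted version is fine,
# and the normalization cancels the shift, so the output is unchanged.
print(softmax_stable([1000.0, 999.0, 998.0]))
```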

Property | Value | Notes
Input | A vector of logits | Not a scalar function
Output | Probability distribution | Sums to 1
Winner-takes-more | Yes | Sharp distributions for large gaps
Used in | Output layer only | Multi-class classification
Pair with | Cross-entropy loss | Clean, stable gradient
07
The Modern Default · LLM-Era Activation

GELU

The probabilistic gating function powering GPT, BERT, and beyond.

Introduced by Hendrycks and Gimpel in 2016, the Gaussian Error Linear Unit quietly became the most important activation function of the transformer era. While ReLU makes a hard, binary decision about each input (positive → pass through, negative → block), GELU makes a probabilistic decision: it weights each input by the probability that a Gaussian random variable would be less than that input. The result is a smooth, non-monotonic function that combines the sparsity benefits of ReLU with the smooth probabilistic intuition of sigmoid.

Mathematical Definition
GELU(x) = x · Φ(x) = x · (1/2)[1 + erf(x / √2)]
Fast approximation: GELU(x) ≈ 0.5x · (1 + tanh(√(2/π) · (x + 0.044715x³)))
where Φ(x) is the Gaussian CDF and erf is the error function
Real-World Analogy

Imagine a high-frequency trading algorithm deciding how much of an incoming signal to act on. For a very strong positive signal, the algorithm trusts it almost completely — the probability that such a strong signal is noise is near zero, so it acts on nearly 100% of it. For a very strong negative signal, it also rejects it almost completely. But for ambiguous signals near zero — signals that could plausibly be noise — the algorithm acts on them proportionally to how likely they are to be genuine. A signal at x=0.5 might only be acted on 60% of the way. This probability-weighted approach is exactly GELU: multiply the input by the probability that it's a meaningful signal.

Why GELU Became the Default for Transformers

The original BERT paper (2018) used GELU and reported improvements over ReLU and tanh. GPT-2 (2019) adopted GELU. GPT-3, LLaMA, PaLM, and nearly every transformer architecture since have followed suit. The empirical evidence is clear: on language tasks especially, GELU consistently outperforms ReLU and its variants.

The theoretical intuition is that language is probabilistic in nature. A token's "activation" in a neural network represents evidence for some feature. The question "how much should I activate on this evidence?" is naturally answered probabilistically — stronger evidence earns proportionally stronger activation, with no hard cutoff. GELU's soft gating matches this intuition in a way that ReLU's hard threshold does not.

Worked Example · GELU vs ReLU on Identical Inputs

Suppose a transformer attention layer output has pre-activation values: [−1.0, −0.3, 0.0, 0.5, 2.0]

ReLU output: [0, 0, 0, 0.5, 2.0]

GELU output:

GELU(−1.0) ≈ −1.0 × Φ(−1.0) = −1.0 × 0.159 ≈ −0.159

GELU(−0.3) ≈ −0.3 × Φ(−0.3) = −0.3 × 0.382 ≈ −0.115

GELU(0.0) = 0.0 × 0.5 = 0.000

GELU(0.5) ≈ 0.5 × Φ(0.5) = 0.5 × 0.691 ≈ 0.346

GELU(2.0) ≈ 2.0 × Φ(2.0) = 2.0 × 0.977 ≈ 1.954

GELU output: [−0.159, −0.115, 0.000, 0.346, 1.954]

Key difference: GELU passes small negative values with attenuated negative output. ReLU kills all negatives. GELU also attenuates positive values near zero — 0.5 becomes 0.346, not a full 0.5.
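Both the exact form and the tanh approximation are easy to check in plain Python with `math.erf` (helper names are illustrative):

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF expressed via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The widely used tanh approximation from the definition above
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

for x in [-1.0, -0.3, 0.0, 0.5, 2.0]:
    print(x, round(gelu_exact(x), 3), round(gelu_tanh(x), 3))
```

The two versions agree to within a few thousandths across this range, which is why the cheaper tanh form is common in practice.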

The Non-Monotonic Property

GELU is non-monotonic for negative inputs: it dips to a minimum of approximately −0.17 at around x ≈ −0.75, then rises back toward 0 as x becomes more negative. Past the minimum, a more negative input actually produces a less negative output — which seems counterintuitive but is mathematically meaningful. This non-monotonicity has been theorized to help the model represent more complex feature interactions.

SwiGLU: GELU's Successor in Modern LLMs

The most recent large language models — LLaMA 2, Mistral, Gemma, and others — use a variant called SwiGLU (Swish-Gated Linear Unit). SwiGLU combines a gating mechanism with the Swish activation (a close relative of GELU) to create a two-stream architecture in the feedforward blocks of transformers. SwiGLU has empirically outperformed GELU on large-scale language modeling, and has become the new default in frontier models.

Property | Value | Notes
Output range | (−∞, ∞) | Unbounded but soft-clipped
Smooth? | Yes, infinitely | No kink at x=0, unlike ReLU
Non-monotonic? | Yes, near x=0 | More expressive
Computation | Slower than ReLU | Needs erf or tanh approx
Standard in | BERT, GPT, LLaMA | Transformer default

Quick Reference: When to Use Each

Binary decision · output layer
Sigmoid
Binary classification output. Gives probability between 0 and 1. Never in hidden layers.
RNNs · LSTMs · Shallow nets
Tanh
Zero-centered, bounded outputs. Standard inside LSTM and GRU recurrent cells.
CNNs · Feedforward nets
ReLU
The fast, reliable default. Use He initialization. Watch for dead neurons with high LR.
Dying neurons detected
Leaky ReLU
Drop-in ReLU replacement. Keeps neurons alive with α=0.01 negative slope.
Multi-class output layer
Softmax
Converts logit vectors to probability distributions. Always pair with cross-entropy.
Transformers · LLMs · BERT
GELU
Modern default for transformer feedforward layers. Probabilistic gating, smooth gradient.
Frontier LLMs (LLaMA, Mistral)
SwiGLU
The GELU successor. Gated linear unit structure in transformer FFN blocks.