Every neuron in a deep neural network performs two operations: a weighted sum of its inputs, and then a transformation of that sum. That second step — the transformation — is the job of the activation function. It sounds simple. But the choice of activation function is one of the most consequential decisions in the architecture of a neural network.
Without activation functions, a neural network — no matter how many layers it has — is nothing more than a linear regression model. Every layer would simply multiply its input by a matrix, and the composition of a thousand linear transformations is still just one linear transformation. The non-linearity introduced by activation functions is what allows networks to approximate any function, learn curved decision boundaries, and ultimately recognize faces, translate languages, and generate text.
This article covers seven activation functions in depth: where they come from, why they work, where they fail, and how to choose between them. Each is explained with a real-world analogy, a worked numerical example, and the mathematical derivation of its gradient — the quantity that makes learning possible.
The Step Function

The step function — also called the Heaviside function — is the oldest activation function in the history of neural networks. It was introduced as part of the McCulloch-Pitts neuron model in 1943, inspired by the biological neuron's behavior: either it fires or it doesn't. This binary, all-or-nothing model was the first attempt to mathematically capture how real neurons work.
Real-World Analogy
Think of a light switch. You flip it up — the light is fully on. You flip it down — the light is fully off. There is no "dimming." There is no degree of onness. The switch doesn't know how far above the threshold you pushed it; it only knows that you did. This is precisely what the step function does: it transforms any weighted sum above zero into a full "1" output and anything below into a flat "0".
How It Works
Imagine a neuron that receives three inputs: x₁ = 0.8 (how bright is the light), x₂ = 0.3 (how close is the object), x₃ = -0.5 (is there noise?). The weights are w₁ = 0.6, w₂ = 0.4, w₃ = 0.2, and the bias b = -0.3. The weighted sum z is:
Inputs: x₁=0.8, x₂=0.3, x₃=−0.5 | Weights: w₁=0.6, w₂=0.4, w₃=0.2 | Bias: b=−0.3
z = (0.8×0.6) + (0.3×0.4) + (−0.5×0.2) + (−0.3)
z = 0.48 + 0.12 − 0.10 − 0.30 = 0.20
Since z = 0.20 ≥ 0 → f(z) = 1. The neuron fires.
If the bias were −0.5 instead: z = 0.00 → still fires (the z ≥ 0 rule).
If the bias were −0.6: z = −0.10 → f(z) = 0. The neuron doesn't fire.
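The arithmetic above takes only a few lines to reproduce. A minimal sketch (the `neuron` helper is ours, purely for illustration):

```python
def step(z):
    """Heaviside step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def neuron(x, w, b):
    """Weighted sum plus bias, then the step activation."""
    z = round(sum(xi * wi for xi, wi in zip(x, w)) + b, 9)  # round off float noise
    return z, step(z)

x = [0.8, 0.3, -0.5]
w = [0.6, 0.4, 0.2]
print(neuron(x, w, -0.3))  # (0.2, 1)  -> fires
print(neuron(x, w, -0.5))  # (0.0, 1)  -> still fires, z >= 0
print(neuron(x, w, -0.6))  # (-0.1, 0) -> silent
```

The rounding step only exists to make the z = 0.00 boundary case behave exactly as in the text despite floating-point noise.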
The Fatal Flaw: Zero Gradient
The step function's output is either 0 or 1 — everywhere. This means its derivative is 0 everywhere (and undefined at x=0). In deep learning, we train networks using backpropagation: we compute how much each weight contributed to the error, and nudge the weights in the right direction. That computation requires the gradient of the activation function.
If the gradient is always zero, backpropagation has nothing to work with. No matter how wrong the network's output is, the gradient signal cannot travel backward through a step function. The weights never update. The network never learns. This isn't a minor inconvenience — it's a complete breakdown of the learning algorithm.
"The step function introduced the idea of threshold-based neural firing. But it gave us a network that could represent decisions yet could never learn to make better ones."
On the history of activation functions in connectionism
Historical Significance
Despite its unusability in modern deep learning, the step function was genuinely important. It established the idea that a neuron's output could be a nonlinear function of its inputs. It gave rise to the perceptron algorithm in the 1950s and directly motivated the search for differentiable alternatives — which led to the sigmoid function a few decades later. Every activation function that follows in this article is, in some sense, a response to the step function's limitations.
| Property | Value | Notes |
|---|---|---|
| Output range | {0, 1} | Binary only |
| Differentiable? | No | FATAL for backprop |
| Vanishing gradient? | Gradient is always 0 | Cannot train |
| Zero-centered? | No (outputs 0 or 1) | Biased updates |
| Use today? | Never in hidden layers | Only for concept illustration |
The Sigmoid Function

The sigmoid function emerged as the natural differentiable alternative to the step function. If we want a neuron that behaves like an on/off switch but can also be trained by backpropagation, we need something that looks like a smooth, continuous version of the step function — and that's exactly what the sigmoid delivers. Its characteristic S-shaped curve transitions smoothly from 0 to 1, with the steepest slope at x=0 and gentle saturation at both extremes.
Real-World Analogy
Picture a hospital triage nurse deciding whether to escalate a patient to the ICU. A patient with no symptoms at all (x → −∞) has near-zero probability of needing escalation. A patient in cardiac arrest (x → +∞) has near-certainty. But for the vast middle ground — the ambiguous cases — the nurse weighs dozens of signals and returns a probability. Small changes in these middle-ground cases dramatically change the output; extreme cases hardly change at all. The sigmoid encodes exactly this intuition: sensitivity in the uncertain middle, certainty at the extremes.
The Elegant Gradient
One of the reasons the sigmoid became so popular is that its derivative has a beautiful closed-form expression. If you already know σ(x), computing σ'(x) is free — you just multiply σ(x) by (1 − σ(x)). This made backpropagation implementations clean and efficient in an era when computational resources were precious.
Consider a binary classifier predicting whether an email is spam. The last layer outputs a raw score z = 2.1.
σ(2.1) = 1 / (1 + e⁻²·¹) = 1 / (1 + 0.1225) = 1 / 1.1225 ≈ 0.891
The model outputs 89.1% probability of spam. If the true label is "spam" (y=1), the gradient of the loss with respect to the pre-activation score is simply σ(z) − y = 0.891 − 1 = −0.109.
Now for a neutral email, z = −0.4:
σ(−0.4) = 1 / (1 + e⁰·⁴) = 1 / (1 + 1.492) ≈ 0.401
40.1% probability of spam — correctly uncertain.
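Both spam scores, and the "free" gradient, are easy to verify in code — a minimal sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # sigma'(x) = sigma(x) * (1 - sigma(x)): reuses the forward value

print(round(sigmoid(2.1), 3))   # 0.891 -> confident "spam"
print(round(sigmoid(-0.4), 3))  # 0.401 -> correctly uncertain
print(sigmoid(2.1) - 1)         # loss gradient sigma(z) - y for true label y = 1
```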
The Vanishing Gradient Problem
The sigmoid's maximum gradient is 0.25, reached only at x=0. For |x| > 3, the gradient is already below 0.05; by |x| = 6 it has fallen below 0.003. In a deep network, the backward pass multiplies these gradients together across layers. In a 10-layer network where every neuron is deeply saturated, a per-layer gradient of 0.002 compounds to (0.002)¹⁰ ≈ 10⁻²⁷ by the time it reaches the first layer. The weights in early layers effectively stop updating: they are frozen by mathematics, not design.
This phenomenon, called the vanishing gradient problem, was a primary reason deep networks were considered untrainable through the 1990s and early 2000s. It was only addressed when ReLU arrived and residual connections were introduced.
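The compounding effect is easy to demonstrate. This sketch multiplies a single per-layer gradient value across ten layers, assuming every neuron sits at the same pre-activation value z:

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# What survives a 10-layer backward pass at a few pre-activation values.
for z in [0.0, 3.0, 6.0]:
    g = sigmoid_grad(z)
    print(f"z={z}: per-layer gradient {g:.4f}, after 10 layers {g**10:.1e}")
```

Even the best case (z = 0, gradient 0.25) shrinks to roughly 10⁻⁶ after ten layers; the saturated case lands near 10⁻²⁶.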
Non-Zero-Centered Outputs
A subtler problem: sigmoid outputs are always positive (between 0 and 1). When these outputs feed into the next layer as inputs, all gradients for the next layer's weights will have the same sign. This forces weight updates to be either all positive or all negative simultaneously, producing "zig-zag" dynamics during optimization that slow convergence. Tanh addresses exactly this issue by centering its outputs around 0.
| Property | Value | Notes |
|---|---|---|
| Output range | (0, 1) | Probability interpretation |
| Differentiable? | Yes, everywhere | Smooth gradient |
| Max gradient | 0.25 at x=0 | Vanishing in deep nets |
| Zero-centered? | No (0 to 1) | Zig-zag updates |
| Best use | Output layer only | Binary classification |
The Tanh Function

The hyperbolic tangent function — tanh — emerged as a direct improvement over sigmoid by solving one of its main problems: non-zero-centered outputs. While sigmoid outputs values between 0 and 1, tanh outputs values between -1 and 1, with the mean of the distribution centered at 0. This seemingly small change has significant consequences for how efficiently a network can learn.
Mathematically, tanh is actually a scaled and shifted version of sigmoid: tanh(x) = 2·σ(2x) − 1. The shape is identical — an S-curve — but stretched vertically to span from −1 to +1 instead of 0 to 1. The maximum gradient of tanh is 1.0 (four times higher than sigmoid's 0.25), meaning gradient information can travel further back through the network before vanishing.
Real-World Analogy
Imagine a film critic rating movies on a scale from −1 (complete disaster) to +1 (masterpiece), with 0 meaning "perfectly average." A critic who has seen 10,000 films rarely goes to extremes — a film has to be extraordinarily bad or extraordinarily good to move them far from center. Most films cluster around the middle. This is exactly how tanh behaves: it expresses strong opinions only when the evidence is overwhelming; otherwise, it returns a nuanced, centered signal.
Why Zero-Centering Matters
Consider what happens when a layer's outputs feed into the next layer's weights. If every input to a weight is positive (as with sigmoid), then the gradient with respect to that weight is either always positive or always negative (depending on the upstream gradient). This means all weights in a given neuron update in the same direction simultaneously. The optimizer can't decrease some weights while increasing others — it has to do a series of zig-zagging steps. Zero-centered inputs from tanh allow weight updates to have mixed signs, enabling more direct paths through the loss landscape.
A sentiment analysis network processes the phrase "the movie was not bad." The hidden layer neuron receives z = −1.5 (a moderately negative signal before tanh):
tanh(−1.5) = (e⁻¹·⁵ − e¹·⁵) / (e⁻¹·⁵ + e¹·⁵) = (0.223 − 4.482) / (0.223 + 4.482) ≈ −0.905
The gradient for backprop: tanh'(−1.5) = 1 − (−0.905)² = 1 − 0.819 = 0.181
Now for a weakly positive signal z = 0.3:
tanh(0.3) ≈ 0.291 and tanh'(0.3) = 1 − 0.291² ≈ 0.915
Notice how the gradient 0.915 is more than 3.5× larger than the sigmoid's maximum gradient of 0.25. This is the practical advantage in shallow networks.
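The sentiment example, and the scaled-sigmoid identity from earlier, can both be checked in a few lines:

```python
import math

def tanh_grad(x):
    t = math.tanh(x)
    return 1.0 - t * t  # tanh'(x) = 1 - tanh(x)^2

print(round(math.tanh(-1.5), 3))  # -0.905
print(round(tanh_grad(-1.5), 3))  # 0.181
print(round(math.tanh(0.3), 3))   # 0.291
print(round(tanh_grad(0.3), 3))   # 0.915

# tanh really is a rescaled sigmoid: tanh(x) = 2*sigma(2x) - 1
sigma = lambda v: 1.0 / (1.0 + math.exp(-v))
print(abs(math.tanh(0.7) - (2 * sigma(1.4) - 1)) < 1e-12)  # True
```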
Still Suffers at the Extremes
For all its advantages over sigmoid, tanh shares the same fundamental problem: saturation at the extremes. When |x| is large, tanh(x) approaches ±1 and the gradient approaches 0. In a deep network, neurons that consistently receive large inputs will saturate, and their gradients will vanish. The problem is less severe than sigmoid (because the gradient peak is 1.0 vs 0.25), but it's not solved.
This is why tanh remained the preferred choice for shallow networks (2–4 layers) through the 2000s, but couldn't unlock the performance of very deep architectures. That breakthrough had to wait for ReLU.
Where Tanh Still Dominates
Even today, tanh remains the standard activation inside LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) cells. The reason is specific to recurrent networks: the values flowing through the recurrent state need to stay bounded and centered, or they explode or vanish over time steps. Tanh's bounded output range (−1, 1) provides natural stability that ReLU cannot — ReLU's unbounded positive output would allow recurrent states to grow without limit.
| Property | Value | Notes |
|---|---|---|
| Output range | (−1, 1) | Zero-centered |
| Max gradient | 1.0 at x=0 | 4× stronger than sigmoid |
| Vanishing gradient? | Yes, at extremes | Same issue as sigmoid |
| Best use | RNNs, LSTMs, shallow nets | Industry standard for RNN |
ReLU: The Rectified Linear Unit

When Krizhevsky, Sutskever, and Hinton published AlexNet in 2012 and won the ImageNet competition by a margin that shocked the field, one of their key design choices was deceptively simple: they used Rectified Linear Units everywhere instead of sigmoid or tanh. ReLU is, mathematically speaking, the simplest possible nonlinear function — it sets all negative values to zero and passes positive values through unchanged. And yet this trivial-looking function transformed what was possible in deep learning.
Real-World Analogy
Think of a sales commission structure at a startup. If a salesperson makes zero or negative profit for the company this month, they earn nothing extra — zero commission. But for every dollar of profit above zero, they earn a proportional cut. There's no cap on how much they can earn for extraordinary performance — the reward scales linearly with results. This is ReLU: no reward for negative performance, proportional reward for positive performance, unlimited upside. Simple, fair, and computationally trivial to evaluate.
Why ReLU Works So Well
Three properties explain ReLU's dominance. First, it solves the vanishing gradient problem for positive inputs: the gradient is exactly 1 for all x > 0. No matter how deep the network, the gradient signal travels back through active ReLU neurons with no attenuation whatsoever. This allowed networks with 10, 50, and eventually hundreds of layers to be trained effectively for the first time.
Second, ReLU induces sparsity. In any given forward pass, roughly half of the neurons in a ReLU network output exactly zero — they are "off." This sparse representation has desirable properties: it reduces the effective model complexity, makes computations faster, and has been argued to mirror how the brain encodes information (most neurons are silent at any given moment).
Third, ReLU is computationally trivial. Evaluating max(0, x) requires one comparison and one branch. There is no exponentiation, no division. This makes ReLU orders of magnitude faster to compute than sigmoid or tanh, which matters enormously when you're applying it millions of times per second during training.
A convolutional neural network detects edges in an image. After a convolution, a filter produces the following activations for a 1×5 slice: [−2.1, 0.8, 3.4, −0.3, 1.2]
After ReLU: [0, 0.8, 3.4, 0, 1.2]
Two positions are zeroed out in this pass — inactive for this input, though not permanently "dead". The network only propagates information from positions 2, 3, and 5, where the edge signal was strong and positive. The 3.4 remains fully intact; no sigmoid squashing here.
Gradient during backprop for each position:
[0, 1, 1, 0, 1] — only positions 2, 3, 5 receive gradient. Positions 1 and 4 are blocked.
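The forward and backward pass for this slice, as a minimal sketch:

```python
def relu_forward(z):
    """max(0, x) applied elementwise."""
    return [max(0.0, v) for v in z]

def relu_backward(z):
    # ReLU's gradient w.r.t. its input: 1 where z > 0, else 0.
    return [1.0 if v > 0 else 0.0 for v in z]

z = [-2.1, 0.8, 3.4, -0.3, 1.2]  # the 1x5 slice after the convolution
print(relu_forward(z))   # [0.0, 0.8, 3.4, 0.0, 1.2]
print(relu_backward(z))  # [0.0, 1.0, 1.0, 0.0, 1.0]
```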
The Dying ReLU Problem
ReLU's most significant weakness is the "dying ReLU" phenomenon. A neuron "dies" when it gets stuck in a state where it always outputs 0. This happens when the weights become arranged such that the neuron's pre-activation value is negative for every input in the training dataset. In this state, the gradient is always 0, the weights never update, and the neuron is permanently frozen — it contributes nothing to the network for the rest of training.
In practice, 10–20% of neurons can die in poorly initialized or poorly configured networks. A large learning rate is a common culprit: a single aggressive gradient update can push many neurons into the permanently-off region. Proper weight initialization (He initialization is designed specifically for ReLU) and careful learning rate tuning mitigate this, but don't eliminate it. This vulnerability directly motivated Leaky ReLU and its variants.
"Using ReLUs instead of sigmoid functions is probably the single most important practical improvement to training deep networks. Not because of elegance — max(0,x) is about as inelegant as it gets — but because it just works."
Common sentiment in the deep learning community post-AlexNet
He Initialization: ReLU's Partner
Because ReLU zeros out negative inputs, it changes the effective variance of activations as you go deeper. Without correction, the variance either explodes or collapses as layers pile up. He initialization (Kaiming He, 2015) sets the initial weight variance to 2/n (where n is the number of incoming connections), exactly compensating for ReLU's zeroing. Using sigmoid-style initialization (1/n) with ReLU networks is a common mistake that degrades performance significantly.
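A sketch of the He scheme using only the standard library. The helper name `he_init` is ours; deep learning frameworks ship their own initializers with different names:

```python
import math
import random

def he_init(n_in, n_out, rng=None):
    """Draw an (n_out x n_in) weight matrix from N(0, 2/n_in)."""
    rng = rng or random.Random(0)
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

W = he_init(512, 256)
flat = [w for row in W for w in row]
var = sum(w * w for w in flat) / len(flat)  # empirical variance of the draws
print(f"target variance {2.0 / 512:.5f}, empirical {var:.5f}")
```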
| Property | Value | Notes |
|---|---|---|
| Output range | [0, ∞) | Unbounded positive |
| Gradient (positive) | Always 1 | No vanishing gradient |
| Gradient (negative) | Always 0 | Dying ReLU risk |
| Computation | Trivial | Fastest activation |
| Zero-centered? | No (non-negative) | Slight optimization inefficiency |
| Best use | Hidden layers, CNNs | Default for most tasks |
Leaky ReLU

Leaky ReLU was proposed in 2013 as a direct response to the dying neuron problem. The idea is almost embarrassingly simple: instead of clamping all negative values to exactly zero, allow a small, non-zero slope for negative inputs. A neuron that receives a negative input doesn't go silent — it passes along a tiny, attenuated signal. This tiny signal keeps the gradient alive. The neuron remains "on" in a minimal sense, and backpropagation can still update its weights.
Real-World Analogy
Imagine a consultant who normally earns a full day rate when they have active client work. In months with no active projects, they don't earn zero — they do small retainer work, write articles, or take training courses at a small fixed rate. They never go completely dark. They keep their skills current. When a new project arrives, they're ready to re-engage at full capacity. The small "leaky" stipend is exactly α — usually 0.01 times the normal rate — just enough to stay in the game.
The α Hyperparameter
The slope α for negative inputs is a hyperparameter. The standard value is 0.01, chosen to be small enough not to fundamentally change the function's behavior for positive inputs, but large enough to keep gradients alive. Some variants treat α as a learnable parameter (Parametric ReLU, or PReLU, proposed by Kaiming He in 2015), allowing the network to discover the optimal negative slope for each neuron independently.
Suppose a neuron consistently receives z = −0.8 across all training examples in a batch. With α = 0.01:
Standard ReLU: f(−0.8) = 0. Gradient = 0. The weight update is: Δw = 0 × (upstream_gradient). Dead neuron.
Leaky ReLU: f(−0.8) = 0.01 × −0.8 = −0.008. Gradient = 0.01.
If the upstream gradient is −0.5: Δw = 0.01 × (−0.5) = −0.005. Small, but non-zero. The neuron nudges its weights in the right direction and can recover.
Over 1000 gradient steps: ReLU accumulates Δw = 0 (perpetually dead). Leaky ReLU accumulates Δw = −5, potentially enough to push the neuron back into positive territory and restore full activity.
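The recovery arithmetic above, as a sketch; `upstream` stands in for the gradient arriving from the next layer:

```python
def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

z, upstream = -0.8, -0.5               # values from the example above
print(leaky_relu(z))                    # about -0.008: small, but alive
per_step = leaky_relu_grad(z) * upstream
print(per_step)                         # about -0.005 per gradient step
print(per_step * 1000)                  # about -5 accumulated over 1000 steps
```

With standard ReLU, `leaky_relu_grad` would return 0 for every negative z and the accumulated update would stay at exactly zero forever.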
ELU and SELU: Taking It Further
The Exponential Linear Unit (ELU) takes the leaking idea further by using an exponential curve for negative inputs: f(x) = α(eˣ − 1) for x ≤ 0. This produces a smooth transition at x=0 (no kink) and saturates at −α for very negative inputs. The Scaled ELU (SELU) adds a self-normalizing property: under certain conditions, SELU networks automatically maintain mean-zero and unit-variance activations through all layers, eliminating the need for batch normalization entirely.
In practice, Leaky ReLU with α=0.01 remains the most widely used variant because it's simple, computationally identical to ReLU, and doesn't introduce the exponential computation that ELU requires.
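For reference, ELU's negative tail in code — a sketch with α = 1, the common default:

```python
import math

def elu(x, alpha=1.0):
    # Identity for positive input; smooth exponential tail that
    # saturates at -alpha for very negative input.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

print(elu(2.0))              # 2.0 (unchanged)
print(round(elu(-0.5), 3))   # -0.393 (gentler than Leaky ReLU's -0.005)
print(round(elu(-10.0), 3))  # -1.0 (saturated near -alpha)
```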
| Property | Value | Notes |
|---|---|---|
| Output range | (−∞, ∞) | Fully unbounded |
| Gradient (positive) | 1 | Same as ReLU |
| Gradient (negative) | α (e.g., 0.01) | Neurons stay alive |
| Dying neurons? | No | Core improvement over ReLU |
| Computation | Same as ReLU | No overhead |
| Hyperparameter | α to tune | Usually just use 0.01 |
Softmax

Softmax is fundamentally different from all other activation functions in this article. Every other function operates on a single scalar value and transforms it independently. Softmax operates on an entire vector: it takes a vector of raw scores (called logits) and transforms them into a probability distribution — a set of non-negative values that sum to exactly 1. This property makes it the natural choice for the output layer of any multi-class classification problem.
Real-World Analogy
Imagine an election with five candidates. Each candidate has a raw "popularity score": some positive, some negative, some wildly different in scale. An election consultant needs to convert these messy scores into a percentage breakdown — a proper probability distribution over who will win. Softmax is exactly this conversion process: it exponentiates every score to make them all positive, then divides each by the total. The most popular candidate gets the largest slice. The least popular gets the smallest. Every candidate gets something. And the percentages sum to exactly 100%.
The Role of the Exponential
The choice of the exponential function in softmax isn't arbitrary — it has two important properties. First, exponentiation makes all values positive, regardless of whether the logits are negative. Second, it amplifies differences: a score of 3.0 vs a score of 2.0 leads to e³/e² = e ≈ 2.72× ratio in the numerators, not 1.5×. This means softmax has a "winner-takes-more" behavior: the class with the highest score gets disproportionately amplified, making the output distribution sharper and more decisive.
An image classifier outputs logits for 4 classes: Cat=2.1, Dog=1.5, Bird=−0.3, Fish=0.8
Step 1 — Exponentiate: e²·¹=8.17, e¹·⁵=4.48, e⁻⁰·³=0.74, e⁰·⁸=2.23
Step 2 — Sum: 8.17 + 4.48 + 0.74 + 2.23 = 15.62
Step 3 — Normalize:
Cat: 8.17 / 15.62 = 0.523 (52.3%)
Dog: 4.48 / 15.62 = 0.287 (28.7%)
Bird: 0.74 / 15.62 = 0.047 (4.7%)
Fish: 2.23 / 15.62 = 0.143 (14.3%)
Sum: 0.523 + 0.287 + 0.047 + 0.143 = 1.000 ✓
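The three steps map directly onto code:

```python
import math

def softmax(logits):
    exps = [math.exp(v) for v in logits]  # step 1: exponentiate
    total = sum(exps)                     # step 2: sum
    return [e / total for e in exps]      # step 3: normalize

probs = softmax([2.1, 1.5, -0.3, 0.8])   # Cat, Dog, Bird, Fish
print([round(p, 3) for p in probs])       # [0.523, 0.287, 0.047, 0.143]
print(sum(probs))                         # 1.0, up to float rounding
```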
Softmax + Cross-Entropy: The Standard Pairing
Softmax is almost never used alone — it's always paired with cross-entropy loss. The cross-entropy loss for a correct class c is: L = −log(softmax(x)_c). When you compute the gradient of this combined loss with respect to the logits, you get an extraordinarily clean result: ∂L/∂xᵢ = softmax(x)ᵢ − 1 for the correct class, and ∂L/∂xᵢ = softmax(x)ᵢ for all other classes. This elegant gradient is one reason the softmax + cross-entropy combination is so widely used — the math works out beautifully.
Numerical Stability: The Log-Sum-Exp Trick
A practical caveat: naively computing exp(x) for large x overflows floating-point arithmetic. exp(1000) is infinity in IEEE 754 float32. The standard solution is to subtract the maximum logit before exponentiating: compute exp(xᵢ − max(x)) instead of exp(xᵢ). This doesn't change the output (the normalization cancels the shift), but ensures that at least one logit exponentiates to exp(0) = 1, keeping all values in a numerically stable range. All major deep learning frameworks implement this automatically.
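A sketch of the stabilized version; subtracting the maximum leaves the output mathematically unchanged:

```python
import math

def stable_softmax(logits):
    m = max(logits)                           # shift so the largest logit is 0
    exps = [math.exp(v - m) for v in logits]  # every argument is now <= 0
    total = sum(exps)
    return [e / total for e in exps]

# The naive version would call math.exp(1000.0) and raise OverflowError.
print([round(p, 3) for p in stable_softmax([1000.0, 999.0, 998.0])])  # [0.665, 0.245, 0.09]
```

Shift invariance means `stable_softmax([1000, 999, 998])` gives exactly the same distribution as `softmax([0, -1, -2])`.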
| Property | Value | Notes |
|---|---|---|
| Input | A vector of logits | Not a scalar function |
| Output | Probability distribution | Sums to 1 |
| Winner-takes-more | Yes | Sharp distributions for large gaps |
| Used in | Output layer only | Multi-class classification |
| Pair with | Cross-entropy loss | Clean, stable gradient |
GELU: The Gaussian Error Linear Unit

Introduced by Hendrycks and Gimpel in 2016, the Gaussian Error Linear Unit quietly became the most important activation function of the transformer era. While ReLU makes a hard, binary decision about each input (positive → pass through, negative → block), GELU makes a probabilistic decision: it weights each input by the probability that a Gaussian random variable would be less than that input. The result is a smooth, non-monotonic function that combines the sparsity benefits of ReLU with the smooth probabilistic intuition of sigmoid.
Real-World Analogy
Imagine a high-frequency trading algorithm deciding how much of an incoming signal to act on. For a very strong positive signal, the algorithm trusts it almost completely — the probability that such a strong signal is noise is near zero, so it acts on nearly 100% of it. For a very strong negative signal, it also rejects it almost completely. But for ambiguous signals near zero — signals that could plausibly be noise — the algorithm acts on them proportionally to how likely they are to be genuine. A signal at x=0.5 might only be acted on 60% of the way. This probability-weighted approach is exactly GELU: multiply the input by the probability that it's a meaningful signal.
Why GELU Became the Default for Transformers
The original BERT paper (2018) used GELU and reported improvements over ReLU and tanh. GPT-2 (2019) adopted GELU. GPT-3, LLaMA, PaLM, and nearly every transformer architecture since have followed suit. The empirical evidence is clear: on language tasks especially, GELU consistently outperforms ReLU and its variants.
The theoretical intuition is that language is probabilistic in nature. A token's "activation" in a neural network represents evidence for some feature. The question "how much should I activate on this evidence?" is naturally answered probabilistically — stronger evidence earns proportionally stronger activation, with no hard cutoff. GELU's soft gating matches this intuition in a way that ReLU's hard threshold does not.
Suppose a transformer attention layer output has pre-activation values: [−1.0, −0.3, 0.0, 0.5, 2.0]
ReLU output: [0, 0, 0, 0.5, 2.0]
GELU output:
GELU(−1.0) ≈ −1.0 × Φ(−1.0) = −1.0 × 0.159 ≈ −0.159
GELU(−0.3) ≈ −0.3 × Φ(−0.3) = −0.3 × 0.382 ≈ −0.115
GELU(0.0) = 0.0 × 0.5 = 0.000
GELU(0.5) ≈ 0.5 × Φ(0.5) = 0.5 × 0.691 ≈ 0.346
GELU(2.0) ≈ 2.0 × Φ(2.0) = 2.0 × 0.977 ≈ 1.954
GELU output: [−0.159, −0.115, 0.000, 0.346, 1.954]
Key difference: GELU passes small negative values with attenuated negative output. ReLU kills all negatives. GELU also attenuates positive values near zero — 0.5 becomes 0.346, not a full 0.5.
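Both the exact form and the widely used tanh approximation can be sketched with the standard library:

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), with the standard normal CDF written via erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Common tanh approximation, used where erf is unavailable or slow.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

z = [-1.0, -0.3, 0.0, 0.5, 2.0]
print([round(gelu(v), 3) for v in z])  # [-0.159, -0.115, 0.0, 0.346, 1.954]
print(max(abs(gelu(v) - gelu_tanh(v)) for v in z))  # approximation error under 1e-3
```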
The Non-Monotonic Property
GELU is non-monotonic near x=0: it dips below 0 for moderately negative inputs, reaching its minimum of approximately −0.17 around x ≈ −0.75. Between that minimum and 0, a slightly more negative input produces a slightly more negative output; below it, the output curves back toward zero. This seems counterintuitive but is mathematically meaningful, and the non-monotonicity has been theorized to help the model represent more complex feature interactions.
SwiGLU: GELU's Successor in Modern LLMs
The most recent large language models — LLaMA 2, Mistral, Gemma, and others — use a variant called SwiGLU (Swish-Gated Linear Unit). SwiGLU combines a gating mechanism with the Swish activation (a close relative of GELU) to create a two-stream architecture in the feedforward blocks of transformers. SwiGLU has empirically outperformed GELU on large-scale language modeling, and has become the new default in frontier models.
| Property | Value | Notes |
| Output range | (−∞, ∞) | Unbounded but soft-clipped |
| Smooth? | Yes, infinitely | No kink at x=0 unlike ReLU |
| Non-monotonic? | Yes, near x=0 | More expressive |
| Computation | Slower than ReLU | Needs erf or tanh approx |
| Standard in | BERT, GPT, LLaMA | Transformer default |