
Beyond Residual Connections: Rethinking the Vanishing Gradient Problem

4 min read
deep-learning · optimization · gradients

The Problem That Shaped Modern Architectures

If you've trained anything deeper than a few layers, you've run into vanishing gradients. The mechanism is straightforward: during backpropagation, gradients are multiplied through each layer. When activation functions like sigmoid or tanh produce derivatives in the range (0, 1), those multiplications compound. By the time you reach the early layers of a deep network, the gradient signal has decayed to near-zero. The network's early layers barely learn.
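A toy calculation makes the decay concrete. This sketch (my illustration, not from the paper) multiplies per-layer sigmoid derivatives through 30 layers; since sigmoid's derivative never exceeds 0.25, the product collapses fast:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
grad = 1.0
for layer in range(30):
    x = rng.normal()        # pre-activation at this layer
    s = sigmoid(x)
    grad *= s * (1.0 - s)   # chain rule: multiply in the local derivative

print(grad)  # vanishingly small after 30 layers
```

With the maximum derivative at 0.25, the gradient after 30 layers is at most 0.25^30 ≈ 9e-19, and typically far smaller.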

This isn't just a historical footnote — it's the reason modern deep learning looks the way it does. ReLU became dominant specifically because its gradient is either 0 or 1 in the positive region, sidestepping the multiplicative decay. Batch normalization was introduced partly to keep activations from drifting into saturated regimes where gradients collapse. And residual connections — the backbone of both ResNets and transformers — solve the problem by providing gradient highways that bypass the vanishing multiplication chain entirely.
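The "gradient highway" effect of a skip connection is visible even in a scalar toy example (again, an illustration under my own assumptions). For y = x + tanh(w·x), the derivative is 1 + w·tanh′(w·x): the additive 1 from the identity path keeps the gradient near one regardless of how saturated the tanh branch is:

```python
import numpy as np

def tanh_grad(x):
    # derivative of tanh: 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

w, x = 0.5, 2.0
plain = w * tanh_grad(w * x)           # dy/dx for y = tanh(w*x)
residual = 1.0 + w * tanh_grad(w * x)  # dy/dx for y = x + tanh(w*x)
print(plain, residual)  # the residual path keeps the derivative above 1
```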

These solutions work. But they also constrain what architectures we build. We've collectively moved away from smooth, bounded activation functions because they seemed incompatible with depth. That's worth questioning.

Pseudo-Normalization: A Different Approach

A recent paper in Entropy by Bu et al. proposes something interesting: what if instead of avoiding activations that cause vanishing gradients, we periodically amplify the gradients that are vanishing?

The idea is called pseudo-normalization. Every few layers, you divide the gradient by its root mean square (RMS). If the gradient magnitudes have been shrinking through successive multiplications, this division by a small RMS value rescales them back to unit RMS — halting the exponential decay.

The formulation is straightforward. Given a gradient vector, the pseudo-normalization operator computes the RMS across its components and divides through. It's not full normalization in the batch-norm sense — there's no learnable scale/shift, no running statistics. It's a lightweight correction applied periodically during backpropagation to keep the gradient signal alive.
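A minimal sketch of the operator as described — divide a gradient vector by its RMS. How often to apply it and the epsilon guard are my assumptions, not details from the paper:

```python
import numpy as np

def pseudo_normalize(grad, eps=1e-8):
    # Divide the gradient by its root mean square; a small eps guards
    # against division by zero when the gradient has fully collapsed.
    rms = np.sqrt(np.mean(grad ** 2))
    return grad / (rms + eps)

g = np.array([1e-4, -2e-4, 3e-4])  # a gradient that has nearly vanished
g_rescaled = pseudo_normalize(g)
print(np.sqrt(np.mean(g_rescaled ** 2)))  # ≈ 1.0 after rescaling
```

Note there is nothing to learn here: no parameters, no running statistics, just a rescaling applied to the backward signal.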

What makes this compelling is what it enables: training deep networks with tanh activations that actually converge. The authors demonstrate this on image classification tasks, achieving reasonable performance with architectures that would normally be untrainable at depth using bounded activations.

Why This Matters Beyond the Obvious

You might ask: why bother with tanh when ReLU works fine? A few reasons are worth considering.

Smoothness has theoretical advantages. ReLU is piecewise linear — its derivative is piecewise constant, so its second derivative is zero everywhere except at the origin, where it is undefined. This limits the class of functions a network can efficiently approximate in certain regimes. Smooth activations like tanh have well-behaved higher-order derivatives, which matters for optimization landscapes and certain theoretical guarantees.

ReLU has real failure modes. The "dying ReLU" problem — where neurons get stuck outputting zero and never recover — is well-documented. Leaky ReLU and its variants patch this, but they're patches. ReLU also produces unbounded positive outputs, which can cause internal covariate shift and requires careful initialization to manage.

Architectural flexibility. If pseudo-normalization makes bounded activations viable at depth, it opens design space. You're no longer locked into the ReLU family just to make training work. That freedom to choose activation functions based on the problem rather than the optimization constraint is valuable.

Connection to What Transformers Already Do

The interesting thing is that transformers already use a closely related idea. RMSNorm — now standard in architectures like LLaMA — normalizes hidden states by dividing by their root mean square. It's a simplified version of layer normalization that drops the mean-centering step.
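For comparison, here is a minimal RMSNorm forward pass in the LLaMA style — normalize by RMS along the feature axis, then apply a learnable per-feature gain, with no mean subtraction:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Normalize each row to unit RMS, then scale by a learnable gain.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * gain

h = np.array([[2.0, -4.0, 4.0, -2.0]])
out = rms_norm(h, gain=np.ones(4))
print(np.sqrt(np.mean(out ** 2)))  # ≈ 1.0
```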

Pseudo-normalization applies a similar operation, but targeted at the gradient signal during backpropagation rather than the forward activations. You could view it as RMSNorm's mirror image: one stabilizes the forward pass, the other stabilizes the backward pass.

This symmetry suggests something deeper about normalization in deep networks. We've spent a decade refining how we normalize activations (batch norm, layer norm, group norm, RMSNorm). Maybe there's equivalent ground to cover in normalizing gradients directly.

The Tradeoff

The authors are upfront about the cost: smooth activation functions are more expensive to compute than ReLU. For most production workloads where ReLU-family activations work well, there's no reason to switch. The value is in the cases where you specifically want or need bounded, smooth activations — certain physics-informed networks, architectures with specific stability requirements, or research settings exploring activation function design.

The bigger takeaway isn't about replacing ReLU. It's that the vanishing gradient problem may have more solutions than we've settled on, and some of them are surprisingly simple.


Reference: Bu, Y.; Jiang, W.; Lu, G.; Zhang, Q. Mitigating the Vanishing Gradient Problem Using a Pseudo-Normalizing Method. Entropy 2026, 28(1), 57.