Beyond Residual Connections: Rethinking the Vanishing Gradient Problem
Residual connections solved vanishing gradients for transformers. But what if we revisited the problem with a different lens — pseudo-normalization offers a surprisingly simple alternative.
4 min read
deep-learningoptimizationgradients