Blog

Writing about machine learning, data science, and engineering.

Beyond Residual Connections: Rethinking the Vanishing Gradient Problem

Mar 8, 2026

Residual connections solved vanishing gradients for transformers. But what if we revisited the problem through a different lens? Pseudo-normalization offers a surprisingly simple alternative.

4 min read

deep-learning, optimization, gradients

The Quadratic Wall: Why Transformers Struggle with Length

Mar 6, 2026

The cost of self-attention scales as O(n²) in the sequence length n. As context windows push into the millions of tokens, this bottleneck is reshaping how we think about sequence modeling.

5 min read

transformers, attention, efficiency