
The Quadratic Wall: Why Transformers Struggle with Length

5 min read · transformers · attention · efficiency

The Bottleneck

Every transformer's core operation is self-attention: each token attends to every other token in the sequence. For a sequence of length n, this means computing an n × n attention matrix. The time complexity is O(n²d) and the memory footprint is O(n²), where d is the hidden dimension.

When n is 512 or even 2048, this is manageable. But modern LLMs are pushing context lengths to 128K, 1M, and beyond. At those scales, the quadratic cost dominates everything. Doubling the context length quadruples the compute and memory for attention. This is the wall.

What makes it worse: there's theoretical evidence (tied to the Strong Exponential Time Hypothesis) suggesting that exact self-attention cannot be computed in truly sub-quadratic time. If you want to go faster, you have to approximate or fundamentally change what attention does.

Where the Compute Actually Goes

It helps to be specific about what's expensive. Standard self-attention involves three steps:

  1. QK^T computation — multiplying the query and key matrices to produce the n × n attention scores. This is the O(n²d) step.
  2. Softmax — normalizing each row of the attention matrix. This is O(n²) but memory-bound rather than compute-bound.
  3. Attention × V — multiplying the attention weights by the value matrix. Another O(n²d) step.

The attention matrix itself is the problem. It's n × n, it needs to be materialized (or at least streamed through), and every element depends on global information through the softmax normalization.
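The three steps above can be sketched directly in NumPy. This is a toy single-head version with illustrative shapes, not a production implementation:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Single-head attention, written to mirror the three steps above."""
    n, d = Q.shape
    # Step 1: Q @ K^T -> the n x n score matrix. This is the O(n^2 d) matmul.
    scores = Q @ K.T / np.sqrt(d)
    # Step 2: row-wise softmax over the n x n matrix, O(n^2), memory-bound.
    scores -= scores.max(axis=1, keepdims=True)  # subtract max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    # Step 3: weights @ V, the second O(n^2 d) matmul.
    return weights @ V

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = naive_attention(Q, K, V)  # shape (8, 4)
```

The `scores` array is the n × n matrix the rest of this post is about: everything downstream exists to avoid paying for it.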

The Solution Landscape

The approaches to this problem fall into a few categories, each with different tradeoffs.

Hardware-Aware: FlashAttention

FlashAttention doesn't change the O(n²) complexity — it makes the constant factor much smaller by exploiting GPU memory hierarchies. Instead of materializing the full attention matrix in HBM (slow GPU memory), it tiles the computation and keeps intermediate results in SRAM (fast on-chip memory). The result: linear memory usage and 2–4× wall-clock speedups without any approximation.

FlashAttention-2 improved parallelism across thread blocks, and FlashAttention-3 added asynchronous execution and FP8 support on Hopper GPUs. This is now the default attention implementation in most production systems.

The limitation: the compute is still quadratic. You're doing the same number of FLOPs, just faster. Eventually, n gets large enough that even optimized quadratic compute is too slow.
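The tiling idea can be sketched in NumPy using the online-softmax trick FlashAttention builds on: stream over the keys and values one block at a time, keeping a running row-max and normalizer per query so the full n × n matrix is never materialized. This is a minimal sketch of the algorithm's math, not its GPU kernel:

```python
import numpy as np

def dense_attention(Q, K, V):
    """Reference implementation: materializes the full n x n matrix."""
    s = Q @ K.T / np.sqrt(Q.shape[1])
    w = np.exp(s - s.max(axis=1, keepdims=True))
    return (w / w.sum(axis=1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    """Streams over key/value tiles; only an n x block score tile exists
    at any moment, giving linear memory in sequence length."""
    n, d = Q.shape
    out = np.zeros((n, V.shape[1]))
    m = np.full(n, -np.inf)   # running row-wise max of scores seen so far
    l = np.zeros(n)           # running softmax denominator
    for start in range(0, n, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d)      # scores for this tile only
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)      # rescale previously accumulated sums
        p = np.exp(s - m_new[:, None])
        out = out * scale[:, None] + p @ Vb
        l = l * scale + p.sum(axis=1)
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
# The tiled result matches the dense reference: no approximation involved.
```

Note that the loop does the same total FLOPs as the dense version — the win is purely in what has to live in memory at once, which is the point made above.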

Approximation: Linear Attention

The key insight behind linear attention is that the softmax attention kernel can be approximated with feature maps. Apply a feature map φ to the queries and keys, and associativity does the rest: computing φ(K)ᵀV first yields a small d × d summary whose size is independent of sequence length, and multiplying by φ(Q) afterward brings the overall complexity down to O(n) — the full n × n matrix is never formed.

In practice, the approximation quality varies. Linear attention tends to struggle with tasks that require sharp, precise attention patterns — exactly the cases where the full attention matrix carries the most information. For tasks with diffuse attention (many tokens contributing roughly equally), the approximation is tighter.
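A minimal sketch of the reordering, using the elu(x) + 1 feature map popularized by the "transformers are RNNs" line of work — one of several possible feature-map choices, picked here only for illustration:

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: keeps features strictly positive so the weights stay valid.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Reorders (phi(Q) phi(K)^T) V as phi(Q) (phi(K)^T V): O(n) in length."""
    Qf, Kf = feature_map(Q), feature_map(K)
    KV = Kf.T @ V          # d x d_v summary -- size independent of n
    Z = Kf.sum(axis=0)     # normalizer, also independent of n
    return (Qf @ KV) / (Qf @ Z)[:, None]

rng = np.random.default_rng(2)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
out = linear_attention(Q, K, V)
```

Because `KV` and `Z` are fixed-size summaries, they can also be updated incrementally token by token, which is what makes this family attractive for streaming inference.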

Hybrid: Combining Attention with State Space Models

Mamba and its successors showed that state space models (SSMs) can match transformer performance on many tasks with linear complexity. But pure SSMs have their own weaknesses — they struggle with tasks that require precise recall over long contexts, exactly where attention excels.

The current trend is hybrid architectures that interleave attention and SSM layers. Jamba (AI21) alternates transformer, Mamba, and MoE layers. Samba combines sliding window attention with Mamba blocks. The idea: use attention where you need exact retrieval, SSMs where you need efficient long-range propagation.

These hybrids are showing strong results. Jamba matches Llama-2 70B quality with 2–7× longer effective context and 3× higher throughput. The architecture overhead of mixing layer types turns out to be worth the efficiency gain.
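Concretely, interleaving just means a layer schedule. The sketch below is hypothetical; the 1:7 attention-to-SSM ratio echoes what Jamba reportedly uses, but treat the exact schedule as illustrative:

```python
def layer_schedule(n_layers, attn_every=8):
    """One attention layer per `attn_every` layers, SSM layers elsewhere.
    Hypothetical schedule -- real hybrids tune this ratio empirically."""
    return ["attn" if i % attn_every == 0 else "ssm" for i in range(n_layers)]
```

For a 16-layer stack this yields two attention layers for exact retrieval and fourteen linear-cost SSM layers carrying long-range state.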

Rethinking Attention: Sparse and Contextual

Rather than approximating or replacing attention, some approaches make it sparse. Instead of every token attending to every other token, you identify which attention entries actually matter and compute only those.

Contextual Priority Attention (CPA) takes an interesting approach: it first computes a global context vector for the entire sequence, then uses that context to assign priority scores to individual tokens. Only high-priority tokens receive full attention. This reduces the theoretical complexity to O(n log n), with reported scaling that is close to linear in experiments.
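CPA's exact mechanism isn't reproduced here, but the general flavor of score-based sparsification can be illustrated with a top-k sketch. Note that this toy version still forms the full n × n score matrix before masking — exactly what a real sparse method must avoid:

```python
import numpy as np

def topk_sparse_attention(Q, K, V, k=3):
    """Keep only the k largest scores per query row, drop the rest.
    Illustrative only: a genuinely sub-quadratic method must select
    entries WITHOUT computing the full score matrix first."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    # indices of everything *except* each row's top-k entries
    drop = np.argpartition(scores, -k, axis=1)[:, :-k]
    np.put_along_axis(scores, drop, -np.inf, axis=1)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)  # exactly k nonzero weights per row
    return w @ V

rng = np.random.default_rng(3)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
out = topk_sparse_attention(Q, K, V)
```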

DeepSeek's Multi-Head Latent Attention (MLA) attacks a different angle of the problem — the KV cache memory during inference. By sharing a compressed latent representation across attention heads and projecting back individually, MLA dramatically reduces the memory needed to serve long-context models without approximating the attention computation itself.
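A rough sketch of the caching arithmetic behind that idea — all dimension names and sizes below are assumptions for illustration, not DeepSeek's actual configuration. The point is that only the shared latent matrix `C` needs to be cached, while per-head keys and values are reconstructed on the fly:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d_model, d_latent = 16, 64, 8      # illustrative sizes, not MLA's real config
n_heads, d_head = 4, 16

H = rng.standard_normal((n, d_model))                    # hidden states
W_dkv = rng.standard_normal((d_model, d_latent))         # shared down-projection
W_uk = rng.standard_normal((n_heads, d_latent, d_head))  # per-head key up-proj
W_uv = rng.standard_normal((n_heads, d_latent, d_head))  # per-head value up-proj

# Cache only C: n x d_latent, shared across all heads.
C = H @ W_dkv
# Per-head K and V are reconstructed from C when attention runs.
K = np.einsum("nc,hcd->hnd", C, W_uk)
V = np.einsum("nc,hcd->hnd", C, W_uv)

# Cached bytes vs. what a full per-head KV cache would hold:
print(C.nbytes, K.nbytes + V.nbytes)  # 1024 vs 16384 at float64
```

Even in this toy configuration the cached latent is 16× smaller than the full per-head KV tensors, and the attention computation itself is unchanged.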

What's Actually Winning in Production

As of early 2026, the production landscape looks like:

  • FlashAttention is everywhere. If you're running standard transformers, you're using it.
  • GQA / MLA for KV cache compression during inference. Most new models use grouped-query attention at minimum.
  • Hybrid architectures are gaining traction for new model training, especially where long context is a primary requirement.
  • Pure linear alternatives remain mostly in research. The quality gap on retrieval-heavy tasks is still real.

The honest summary: no single architecture has definitively replaced quadratic attention across all tasks. The field is converging on modular, hybrid designs where different layers use different mechanisms depending on what that layer needs to do. Full attention for precise retrieval, SSMs for efficient propagation, sparse attention where the pattern is predictable.

The Bigger Picture

The quadratic bottleneck forced the field to ask a good question: does every token really need to attend to every other token? For most practical tasks, the answer is no. Attention matrices are typically sparse — most entries are near-zero after softmax. We're doing O(n²) work to discover that most of it doesn't matter.

The architectures that will win long-term are the ones that figure out which interactions matter without computing all of them first. That's a harder problem than it sounds, but it's the right one to solve.