
A unifying view of linear attention (part 2)

Part 2 — Four Axes of Finite-Size Memory Models

1. Recap

Part 1 showed that linear attention and the other sub-quadratic models we discussed all fit

$$S_t = S_{t-1} A_t + b_t k_t^\top$$

with different choices of $(A_t, b_t)$. Softmax attention can be seen as a special case with an infinite state; in practice the “memory” is the entire growing KV cache, which gives perfect recall at quadratic compute. If we pick a finite-dimensional feature map $\phi$ we get linear attention as a compressed form of softmax with the fixed-size state $S_t = \sum_{j \le t} v_j \phi(k_j)^\top$ in matrix form. We accept lossy recall to escape quadratic compute.

We also showed that a state $S$ of size $d_v \times d_k$ holds at most $n \lesssim d_k$ near-orthogonal $(k, v)$ pairs cleanly; above that, cross-talk between the stored vectors drowns the signal.
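
To make that ceiling concrete, here is a minimal NumPy sketch (dimensions and data are illustrative choices of ours, not from Part 1): store $n$ key-value pairs as a sum of outer products, using orthonormal keys while $n \le d_k$ and random unit-norm keys beyond, then read every key back.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = d_v = 64

for n in [16, 64, 256]:
    K = rng.standard_normal((n, d_k))
    if n <= d_k:
        Q, _ = np.linalg.qr(K.T)        # orthonormal keys while they still fit
        K = Q.T
    else:
        K /= np.linalg.norm(K, axis=1, keepdims=True)  # can't be orthogonal anymore
    V = rng.standard_normal((n, d_v))
    S = V.T @ K                          # state = sum of outer products v_j k_j^T
    V_hat = (S @ K.T).T                  # recall every stored key
    err = np.linalg.norm(V_hat - V) / np.linalg.norm(V)
    print(f"n={n:3d}  relative recall error: {err:.2f}")   # ~0, ~0, then large
```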

The recurrence of DeltaNet is one gradient step on $L_t(S) = \tfrac{1}{2}\|v_t - S k_t\|^2$, giving us

$$S_t = S_{t-1}(I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top.$$

The main benefit is that seeing the same key twice doesn’t grow the associated value, and already-stored orthogonal keys are preserved. The capacity ceiling is unchanged, but the budget is used more cleanly.

Gating is introduced to exponentially decay the state $S$, via Mamba2’s $A_t = \alpha_t I$ and Gated DeltaNet’s $A_t = \alpha_t (I - \beta_t k_t k_t^\top)$. This doesn’t raise the capacity ceiling, but it lets us recycle the $d_k$ slots by decaying old writes so new ones can take their place.
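
As a reference, a per-token sketch of one step of each gated recurrence; the function names and toy calling convention are ours, and real implementations use the chunkwise formulations deferred to Part 3.

```python
import numpy as np

def mamba2_step(S, k, v, alpha, beta):
    """One step of S_t = alpha_t * S_{t-1} + beta_t * v_t k_t^T (A_t = alpha_t I)."""
    return alpha * S + beta * np.outer(v, k)

def gated_deltanet_step(S, k, v, alpha, beta):
    """One step of S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T."""
    I = np.eye(k.shape[0])
    return alpha * S @ (I - beta * np.outer(k, k)) + beta * np.outer(v, k)
```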

The state was always a matrix $S$, and DeltaNet’s update rule followed from one step of gradient descent on $L_t(S) = \tfrac{1}{2}\|v_t - S k_t\|^2$. What if $S$ is instead a neural network with multiple layers? That’s the step that TTT and Titans take and build upon.

2. Test-time training (TTT)

Reframe the state update — sometimes referred to as the inner loop — as one GD step on a parameterized inner model:

$$L_t(\theta) = \tfrac{1}{2}\|v_t - M_\theta(k_t)\|^2$$
$$\theta_t = \theta_{t-1} - \beta_t \nabla_\theta L_t(\theta_{t-1})$$

For $M_\theta(k) = S k$ with $\theta = S$ we get what we had before:

  • $\nabla_\theta L_t \big|_{\theta_{t-1}} = -(v_t - S_{t-1} k_t)\, k_t^\top$
  • $S_t = S_{t-1} + \beta_t (v_t - S_{t-1} k_t)\, k_t^\top$ (checked numerically in the sketch below)
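
A quick numerical check (toy dimensions of our choosing) that one generic GD step and the closed-form delta rule coincide, as the bullet above claims:

```python
import numpy as np

rng = np.random.default_rng(1)
d_k = d_v = 8
beta = 0.5
S = rng.standard_normal((d_v, d_k))
k, v = rng.standard_normal(d_k), rng.standard_normal(d_v)

grad = -np.outer(v - S @ k, k)       # dL/dS = -(v - S k) k^T
S_gd = S - beta * grad               # generic one-step GD
S_delta = S @ (np.eye(d_k) - beta * np.outer(k, k)) + beta * np.outer(v, k)
assert np.allclose(S_gd, S_delta)    # the two forms coincide
```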

This is DeltaNet’s update rule (Yang et al., 2024) from the last part, also called TTT-Linear (Sun et al., 2024), and it shows DeltaNet wasn’t a choice of a specific recurrence: it follows from one GD step on $\tfrac{1}{2}\|v - M_\theta(k)\|^2$ with $M_\theta$ being a single linear layer.

(Sun et al., 2024) also introduce a second variant, TTT-MLP, with a richer $M$: a two-layer MLP $M_\theta(k) = W_2 \sigma(W_1 k)$, $\theta = (W_1, W_2)$, where we can also derive the gradients explicitly. Let $u = W_1 k_t$, $h = \sigma(u)$, $f = W_2 h$, $r = v_t - f$:

$$\nabla_{W_2} L_t = -r\, h^\top$$
$$\nabla_{W_1} L_t = -\big[W_2^\top r \odot \sigma'(u)\big]\, k_t^\top$$

An explicit update rule for each weight matrix.
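
A sketch of these gradients with a finite-difference spot-check; the choice $\sigma = \tanh$ and all dimensions are illustrative assumptions, not fixed by the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
d_k, d_h, d_v = 6, 10, 6
W1, W2 = rng.standard_normal((d_h, d_k)), rng.standard_normal((d_v, d_h))
k, v = rng.standard_normal(d_k), rng.standard_normal(d_v)
sigma, dsigma = np.tanh, lambda u: 1.0 - np.tanh(u) ** 2

u = W1 @ k; h = sigma(u); r = v - W2 @ h      # forward pass, residual r
gW2 = -np.outer(r, h)                         # explicit grad wrt W2
gW1 = -np.outer((W2.T @ r) * dsigma(u), k)    # explicit grad wrt W1

def loss(W1, W2):
    return 0.5 * np.sum((v - W2 @ sigma(W1 @ k)) ** 2)

eps, i, j = 1e-6, 2, 3                        # spot-check one entry of W1
E = np.zeros_like(W1); E[i, j] = eps
num = (loss(W1 + E, W2) - loss(W1 - E, W2)) / (2 * eps)
assert np.isclose(gW1[i, j], num, atol=1e-5)
```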

A note on the name.

In my opinion, “test-time training” is misleading. What we have is an inner-loop state-update rule applied during forward inference, not training in the conventional sense. There’s no held-out set and no “done” condition. The outer parameters $(W_q, W_k, W_v, W_1, W_2)$ are trained once on a training set, the normal way, and are not changed during inference. What runs at inference is a per-token state update of what we can casually refer to as “memory”.

TTT-MLP doesn’t fit our recurrence form.

So far our general form $S_t = S_{t-1} A_t + b_t k_t^\top$ rested on two assumptions:

  • the state is a single matrix,
  • the readout is linear in the state ($S k_q$ — a single matrix-vector product).

TTT-MLP breaks both. With $M_\theta(k_q) = W_2 \sigma(W_1 k_q)$ the readout is nonlinear, so there is no $A_t$ that captures it. The general form from Part 1 stops being expressive enough, so we will define a new general form across four axes.

3. Titans

Titans (Behrouz et al., 2024) extends TTT-MLP with two textbook optimizer ingredients: momentum and weight decay. Because of momentum, our state is now the pair $(\theta_t, m_t)$, and the update rule reads

$$m_t = \nu_t\, m_{t-1} + \nabla_\theta L_t(\theta_{t-1})$$
$$\theta_t = \alpha_t\, \theta_{t-1} - \beta_t\, m_t.$$

Note: the Titans paper writes $\theta_t$ for the momentum decay scalar (not the parameters!) and $\mathcal{S}_t$ for the momentum buffer. To keep $\theta$ as parameters across the post, we use $\nu_t$ for the momentum decay and $m_t$ for the buffer (Adam-style first-moment convention). The Titans paper also writes the retention as $(1 - \alpha_t)$; we follow Mamba2 / Gated DeltaNet and use $\alpha_t$ directly, so $\alpha_t = 1$ means full retention across all gated models.

$m_t$ is an EMA of recent gradient directions (a first moment), which acts as a low-pass filter on write directions (not on what’s already been written).

$\alpha_t$ is a scalar retention on $\theta_{t-1}$, structurally identical to Gated DeltaNet’s retention: the same idea as in Part 1, applied to all MLP parameters instead of a single matrix.

Both moves are standard SGD ingredients (momentum, weight decay) ported into the memory-state update step. Since this is just SGD on the memory state, anything we know about training neural nets is reusable here; we’ll see further such techniques later in this post.
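
In our notation, the whole update is a few lines. A sketch (the helper name and toy values are ours; in the real model $\nu_t, \alpha_t, \beta_t$ are produced per token by the outer network, and the step is applied to each weight matrix of the MLP memory):

```python
import numpy as np

def titans_step(theta, m, grad, nu, alpha, beta):
    """m_t = nu * m_{t-1} + grad;  theta_t = alpha * theta_{t-1} - beta * m_t."""
    m = nu * m + grad                 # EMA of gradient directions (first moment)
    theta = alpha * theta - beta * m  # scalar retention + momentum step
    return theta, m

# applied independently to each weight matrix of the MLP memory, e.g. W1:
rng = np.random.default_rng(0)
W1, m1 = rng.standard_normal((10, 6)), np.zeros((10, 6))
gW1 = rng.standard_normal((10, 6))    # stand-in for the TTT-MLP gradient above
W1, m1 = titans_step(W1, m1, gW1, nu=0.9, alpha=0.99, beta=0.1)
```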

4. MIRAS

We moved from the linear recurrence $S_t = S_{t-1} A_t + b_t k_t^\top$ of a matrix to the general case where we optimise the parameters $\theta$ of some model $M_\theta$ on an inner objective $L_t$ that matches $v \approx M_\theta(k)$, with optional retention on $\theta$. This gives us four design choices: (1) memory architecture $M_\theta(k)$, (2) inner objective $L_t(\theta)$, (3) inner optimizer, (4) retention on $\theta$. We take them in turn.

Axis 1 — memory architecture $M_\theta(k)$.

Instead of a single matrix $S$ as before, we generalise our memory architecture to any model $M$ with parameters $\theta$. We have already seen these special cases:

  • Linear: $M_\theta(k) = \theta k$ — linear attention, DeltaNet, Gated DeltaNet.
  • 2-layer MLP: $M_\theta(k) = W_2 \sigma(W_1 k)$ — TTT-MLP, Titans.

This also includes all variants that use kernel mappings $M_\theta(\phi(k))$ to increase the capacity of the state (Zhong et al., 2025).

Axis 2 — memory objective $L_t(\theta)$.

Next we define the loss that $M_\theta$ is optimised on. In the general case we want $M_\theta(k)$ to reconstruct its associated value $v$.

For the simple case $M_\theta(k) = \theta k$, two objectives (dot-product, L2) generate two well-known models from the same one-step-GD recipe:

$$L_t(\theta) = -v_t^\top M_\theta(k_t) \;\Longrightarrow\; \theta_t = \theta_{t-1} + \beta_t\, v_t k_t^\top \quad\text{(linear attention)}$$
$$L_t(\theta) = \tfrac{1}{2}\|v_t - M_\theta(k_t)\|^2 \;\Longrightarrow\; \theta_t = \theta_{t-1} + \beta_t (v_t - \theta_{t-1} k_t)\, k_t^\top \quad\text{(DeltaNet)}$$
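
Side by side as a sketch, with toy data, so the only difference between the two models is the gradient being applied:

```python
import numpy as np

rng = np.random.default_rng(3)
d_k, d_v, beta = 8, 8, 0.5
theta = rng.standard_normal((d_v, d_k))
k, v = rng.standard_normal(d_k), rng.standard_normal(d_v)

# dot-product objective: grad of -v^T (theta k) wrt theta is -v k^T -> Hebbian write
theta_la = theta + beta * np.outer(v, k)                 # linear attention
# L2 objective: grad is -(v - theta k) k^T -> error-correcting write
theta_dn = theta + beta * np.outer(v - theta @ k, k)     # DeltaNet
```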

Axis 3 — inner optimizer.

This is how the next $\theta_t$ is computed from $\nabla L_t$.

The default is one GD step. Linear attention, Mamba2, DeltaNet, Gated DeltaNet, and TTT-MLP all use it; what differs across them is the objective, not the optimizer. Titans is the first to move on the optimizer axis by introducing momentum. Later we will see Muon (ATLAS), the second optimizer move, which is also heavily used today for training LLMs.

Axis 4 — retention on $\theta$.

We motivated this in Part 1 as a necessity for erasing stale writes (key-value pairs that no longer matter but take up space). MIRAS (Behrouz et al., 2025) further shows that this can be derived as regularization on our state $\theta$, in the form of adding $\|\theta_t - \alpha_t \theta_{t-1}\|^2$ to $L_t$.

Whereas DeltaNet and TTT-MLP used none, Mamba2, Gated DeltaNet, and Titans used a scalar $\alpha_t$. In §5 we will see KDA introduce an extension.

Where this leaves us.

The recurrence $S_t = S_{t-1} A_t + b_t k_t^\top$ from Part 1 is the special case where the memory is linear (a single matrix), the objective is dot-product or L2, the optimizer is one GD step, and retention is at most a scalar. The four axes generalize each of those choices. Every model in this post (including the two we still have to introduce) slots into a row of one shared table; we’ll fill it in in §7, once KDA and ATLAS are placed.

5. Kimi Linear / KDA

So far retention was always a scalar $\alpha_t \in \mathbb{R}$ acting on the whole state. What if we make it a vector $\alpha_t \in \mathbb{R}^{d_k}$, where each channel/dimension of $k$ gets its own retention scalar?

This is what Kimi Linear (Moonshot AI, 2025) introduces, giving us the following recurrence

$$S_t = S_{t-1}\, \mathrm{Diag}(\alpha_t)\, (I - \beta_t k_t k_t^\top) + \beta_t v_t k_t^\top.$$

Note that their paper writes $S \in \mathbb{R}^{d_k \times d_v}$; transposed to our convention, $S \in \mathbb{R}^{d_v \times d_k}$.

The transition matrix on the right of $S_{t-1}$ is a DPLR matrix (diagonal-plus-low-rank, rank-1 here), supporting fast matrix-vector ops and a closed-form inverse via Sherman–Morrison:

$$\mathrm{Diag}(\alpha_t)\,(I - \beta_t k_t k_t^\top) = \mathrm{Diag}(\alpha_t) - \beta_t (\alpha_t \odot k_t)\, k_t^\top.$$

With this per-channel gating, each $\alpha_{t,i}$ controls the decay of the $i$-th key coordinate independently, giving the model fine-grained control over what to forget. This is analogous to how the Adam optimizer has per-parameter learning rates.

A scalar gate $\alpha_t$ forces the same forgetting rate across all key coordinates, but different coordinates might encode information that ages at different rates (positional vs. semantic information). A diagonal $\alpha_t$ lets the model trade off retention per coordinate.
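
A minimal sketch of one KDA step using this expanded form, so the $d_k \times d_k$ transition matrix is never materialized (per-token loop for clarity; the actual kernel is chunkwise):

```python
import numpy as np

def kda_step(S, k, v, alpha, beta):
    """One KDA step, S_t = S_{t-1} Diag(alpha)(I - beta k k^T) + beta v k^T,
    via the expanded form Diag(alpha) - beta (alpha * k) k^T."""
    decayed = S * alpha  # S_{t-1} Diag(alpha): scales column i by alpha_i
    return decayed - beta * np.outer(decayed @ k, k) + beta * np.outer(v, k)
```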

MIRAS lens. Same L2 reconstruction loss as DeltaNet, same one-step GD optimizer; the only change is the channel-wise retention, now acting as a regularizer toward the channel-decayed prior:

$$\tilde L_t(S) = \tfrac{1}{2}\|v_t - S k_t\|^2 + \tfrac{1}{2\eta_R}\big\|S - S_{t-1}\,\mathrm{Diag}(\alpha_t)\big\|_F^2.$$

The exact (closed-form) minimizer of $\tilde L_t$ is the KDA recurrence above, with the reparameterization $\beta_t = \eta_R / (1 + \eta_R)$. (Derivation: set $\nabla \tilde L_t = 0$ and invert via Sherman–Morrison.)
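
We can check this numerically. A sketch assuming a unit-norm key, which is what makes the scalar reparameterization exact; for a general key the same derivation gives $\beta_t = \eta_R / (1 + \eta_R \|k_t\|^2)$:

```python
import numpy as np

rng = np.random.default_rng(4)
d_k, d_v, eta = 8, 8, 3.0
S_prev = rng.standard_normal((d_v, d_k))
alpha = rng.uniform(0.5, 1.0, d_k)                      # per-channel retention
k = rng.standard_normal(d_k); k /= np.linalg.norm(k)    # unit-norm key (assumption)
v = rng.standard_normal(d_v)

P = S_prev * alpha                                      # channel-decayed prior
# gradient = 0: -(v - S k) k^T + (1/eta)(S - P) = 0
# => S = (P + eta v k^T)(I + eta k k^T)^{-1}
S_exact = (P + eta * np.outer(v, k)) @ np.linalg.inv(np.eye(d_k) + eta * np.outer(k, k))

beta = eta / (1 + eta)                                  # Sherman-Morrison reparameterization
S_kda = P @ (np.eye(d_k) - beta * np.outer(k, k)) + beta * np.outer(v, k)
assert np.allclose(S_exact, S_kda)
```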

6. ATLAS

ATLAS (Behrouz et al., 2025) introduces three independent changes vs. Titans that can each be placed on Axes 1, 2, and 3.

6.1 Omega rule — windowed inner objective (Axis 2).

Up to here, every model’s memory loss has been a function of the current token only. ATLAS’s Omega rule sums the loss over a window of the $c$ most recent tokens, with per-token in-window gates $\gamma_i^{(t)} \in [0,1]$:

$$L_t(\theta) = \sum_{i=t-c+1}^{t} \gamma_i^{(t)}\, \tfrac{1}{2}\|v_i - M_\theta(k_i)\|^2.$$

For linear memory ($M_\theta(k) = S k$, $\theta = S$), one GD step gives:

$$S_t = S_{t-1}\Big(I - \beta_t \sum_{i=t-c+1}^{t} \gamma_i^{(t)}\, k_i k_i^\top\Big) + \beta_t \sum_{i=t-c+1}^{t} \gamma_i^{(t)}\, v_i k_i^\top.$$

The edge case $c = 1$ recovers DeltaNet/TTT-Linear, while $c \to \infty$ recovers Mesa-layer-style global least-squares (which we will not cover further). Each token’s gradient enters the update sum at $c$ different timesteps. This is different from momentum, which stores the gradient once and lets it decay through a buffer, reusing a potentially outdated gradient direction. The Omega rule is closer to mini-batch gradient descent on a sliding window, where the gradients are recomputed at each step. (Footnote: not quite true in practice due to chunkwise computation, introduced in Part 3.)
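
A sketch of the windowed one-step update for linear memory (window contents and gates are passed in explicitly; with $c = 1$ and $\gamma = 1$ it reduces to the DeltaNet step above):

```python
import numpy as np

def omega_step(S, K_win, V_win, gamma, beta):
    """One GD step on the windowed Omega loss for linear memory.
    K_win: (c, d_k) recent keys, V_win: (c, d_v) recent values, gamma: (c,) gates."""
    d_k = K_win.shape[1]
    G = K_win.T @ (gamma[:, None] * K_win)   # sum_i gamma_i k_i k_i^T
    W = V_win.T @ (gamma[:, None] * K_win)   # sum_i gamma_i v_i k_i^T
    return S @ (np.eye(d_k) - beta * G) + beta * W
```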

6.2 Kernel feature maps (Axis 1).

ATLAS adds (or rather, from our view, reintroduces) a feature map $\phi$ on keys (and queries). The inner loss applies $M_\theta$ to $\phi(k_i)$ instead of $k_i$:

$$\ell(\theta; k_i, v_i) = \tfrac{1}{2}\|v_i - M_\theta(\phi(k_i))\|^2.$$

For a polynomial $\phi_p$ of degree $\le p$, the effective key dimension grows from $d_k$ to $\binom{d_k + p}{p}$, and the capacity ceiling rises from $\mathcal{O}(d_k)$ to $\mathcal{O}(d_k^p)$ (the paper’s Proposition 2). The exponential kernel $\phi^*$ — the Taylor expansion of $\exp$ — is the $p \to \infty$ limit and recovers softmax attention as the global-window special case (paper §4.2).
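
A small sketch enumerating monomials of degree $\le p$ to confirm that count; a real kernel map would also fold in multinomial weights so inner products match the kernel, which we skip here:

```python
from itertools import combinations_with_replacement
from math import comb

import numpy as np

def poly_features(k, p):
    """All monomials of k up to degree p (degree 0..p), as a flat vector."""
    feats = [np.prod(k[list(idx)]) for deg in range(p + 1)
             for idx in combinations_with_replacement(range(len(k)), deg)]
    return np.array(feats)

d_k, p = 8, 3
phi = poly_features(np.random.default_rng(5).standard_normal(d_k), p)
assert len(phi) == comb(d_k + p, p)   # effective key dimension: (d_k + p choose p)
print(len(phi))                       # 165 for d_k = 8, p = 3
```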

6.3 Muon (Axis 3).

Notation note: the ATLAS paper writes $\mathcal{M}_t$ for the parameters, $\mathcal{S}_t$ for the momentum buffer, $\theta_t$ for the momentum decay scalar, and $k$ for the Newton–Schulz iteration count (which clashes with our key vector). Stripped to our convention: parameters $\theta_t$, momentum buffer $m_t$, momentum decay $\nu_t$, learning rate $\beta_t$, retention $\alpha_t$, Newton–Schulz iteration count $J$.

The full ATLAS recurrence:

$$m_t = \nu_t\, m_{t-1} + \nabla_\theta L_t^{\text{Omega}}(\theta_{t-1}),$$
$$\theta_t = \alpha_t\, \theta_{t-1} - \beta_t \cdot \texttt{NewtonSchulz}_J(m_t),$$

where $L_t^{\text{Omega}}(\theta) = \sum_{i=t-c+1}^{t} \gamma_i^{(t)}\, \tfrac{1}{2}\|v_i - M_\theta(\phi(k_i))\|^2$ is the windowed loss over the past $c$ tokens.

This is a two-state recurrence with the same shape as Titans: $m_t$ is the EMA of recent (windowed) gradients with decay $\nu_t$, and $\theta_t$ holds the memory parameters with scalar retention $\alpha_t$ per step. The novel piece is the $\texttt{NewtonSchulz}_J$ wrapper around the buffer.

Muon intuition. Momentum buffers for matrix-shaped parameters tend to be low-rank and ill-conditioned (a few dominant directions). Orthogonalization rescales every singular value to 1, producing a step that “moves equally in every direction” of the gradient subspace. As $J \to \infty$, $\texttt{NewtonSchulz}_J(m_t)$ converges to the nearest semi-orthogonal matrix to $m_t$ — a second-order approximation in the sense that it inverts the local curvature scale.
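
A sketch of the classic cubic Newton–Schulz iteration; Muon in practice uses a tuned quintic polynomial, but the cubic shows the mechanism of pushing all singular values toward 1:

```python
import numpy as np

def newton_schulz(M, J=20):
    """Cubic Newton-Schulz orthogonalization of a matrix-shaped buffer."""
    X = M / np.linalg.norm(M)             # Frobenius norm keeps the spectrum in range
    for _ in range(J):
        X = 1.5 * X - 0.5 * X @ X.T @ X   # pushes every singular value toward 1
    return X

M = np.random.default_rng(6).standard_normal((16, 8))
O = newton_schulz(M)
print(np.linalg.svd(O, compute_uv=False).round(3))   # all singular values close to 1
```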

Summary. ATLAS introduces three changes. (1) Omega: the per-token loss is essentially online SGD with batch size 1; window $c$ gives the inner step access to recent context, like minibatch GD. (2) Kernels: Titans’s MLP memory still bottlenecks at the matrix-output dimension $d_k$; polynomial features lift the capacity ceiling to $\mathcal{O}(d_k^p)$. (3) Muon: matrix-shaped momentum buffers concentrate energy in a few directions; Newton–Schulz redistributes step size across the gradient subspace.

7. Overview

This gives us the following complete overview of all architectures, including those from Part 1:

Figure 2: The MIRAS 4-axis lens, updated with KDA and ATLAS.

| model | memory architecture | objective | optimizer | retention |
| --- | --- | --- | --- | --- |
| linear attention | linear $M_\theta(k) = \theta k$ | $-v^\top M_\theta(k)$ | 1-step GD, scalar $\beta_t$ | identity |
| Mamba2 | linear $M_\theta(k) = \theta k$ | $-v^\top M_\theta(k)$ | 1-step GD, scalar $\beta_t$ | scalar $\alpha_t$ |
| DeltaNet | linear $M_\theta(k) = \theta k$ | $\tfrac{1}{2}\lVert v - M_\theta(k)\rVert^2$ | 1-step GD, scalar $\beta_t$ | identity |
| Gated DeltaNet | linear $M_\theta(k) = \theta k$ | $\tfrac{1}{2}\lVert v - M_\theta(k)\rVert^2$ | 1-step GD, scalar $\beta_t$ | scalar $\alpha_t$ |
| TTT-MLP | 2-layer MLP $M_\theta(k) = W_2\sigma(W_1 k)$ | $\tfrac{1}{2}\lVert v - M_\theta(k)\rVert^2$ | 1-step GD, scalar $\beta_t$ | identity |
| Titans | 2-layer MLP $M_\theta(k) = W_2\sigma(W_1 k)$ | $\tfrac{1}{2}\lVert v - M_\theta(k)\rVert^2$ | 1-step GD + momentum (buffer $m_t$) | scalar $\alpha_t$ |
| KDA | linear $M_\theta(k) = \theta k$ | $\tfrac{1}{2}\lVert v - M_\theta(k)\rVert^2$ | 1-step GD, scalar $\beta_t$ | diagonal $\alpha_t \in \mathbb{R}^{d_k}$ |
| ATLAS | MLP $\circ\,\phi$ (poly/exp kernel on $k$) | $\sum_{i=t-c+1}^{t} \gamma_i^{(t)} \tfrac{1}{2}\lVert v_i - M_\theta(\phi(k_i))\rVert^2$ | GD + momentum + Newton–Schulz (Muon) | scalar $\alpha_t$ |

Each paper typically introduces a single change from the baseline of linear attention; some are combinations of earlier ones rather than genuinely new (Gated DeltaNet pairs DeltaNet’s L2 objective with Mamba2’s retention). ATLAS is the exception, moving three axes at once.

8. Outlook

What we’ve neglected so far is the practicality of training these methods on modern accelerated hardware. Per-token updates are inherently sequential (nice for inference, and what motivated our derivation from softmax attention), but sequentiality kills training throughput.

Part 3 covers chunkwise parallelization (DeltaNet) and further practical tricks that we will have to employ to train these recurrent algorithms fast.

We will implement all architectures mentioned so far (including Part 1) and benchmark them on MQAR to read off the capacity ceilings empirically.

Using the four axes from MIRAS lets us clearly see what changes across these architectures and which axis each paper moves. Part 3 will be more about what it actually takes to run them at modern-LLM scale and how they benchmark on MQAR.

References

  1. Yang, Wang, Zhang, Shen, Kim (2024). Parallelizing Linear Transformers with the Delta Rule over Sequence Length. NeurIPS 2024. arXiv:2406.06484
  2. Sun, Li, Geng, Hua, Wang, Zhao, Liu, Hardt, Chen, Pan, Lin, Wang, Han, Guestrin (2024). Learning to (Learn at Test Time): RNNs with Expressive Hidden States. arXiv:2407.04620
  3. Behrouz, Zhong, Mirrokni (2024). Titans: Learning to Memorize at Test Time. arXiv:2501.00663
  4. Behrouz, Razaviyayn, Zhong, Mirrokni (2025). It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization. arXiv:2504.13173
  5. Zhong, Xu, Ao, Shi (2025). Understanding Transformer from the Perspective of Associative Memory. arXiv:2505.19488
  6. Behrouz, Razaviyayn, Zhong, Mirrokni (2025). ATLAS: Learning to Optimally Memorize the Context at Test Time. arXiv:2505.23735
  7. Moonshot AI (2025). Kimi Linear: An Expressive, Efficient Attention Architecture. arXiv:2510.26692