The Illusion of Deep Learning

I made a new year's resolution (almost two months late, but who's counting) to read and study at least one paper every few weeks, properly. Not skim-it-on-the-train properly, but sit-down-and-work-through-the-math properly. To keep myself accountable, I'm writing up a few words about each one. Partly for this blog, partly because my memory is terrible and I need somewhere to store things. These are semi-technical notes, mostly for personal use, but if they're useful to someone else, great.

I should be upfront: I'm not an ML expert. I come from pure mathematics and only recently got interested in machine learning. That background shapes how I see things, for better or worse.

The thing that kept bugging me

When I started studying basic data science and ML, something kept nagging at me. Everything looked like gradient descent. Nine times out of ten, it was "pick a loss function, run gradient descent, done." To my very category-theory-poisoned brain, all these techniques looked the same up to isomorphism. People kept inventing fancy new names for what felt like the same underlying pattern with a different objective function bolted on.

I know I'm oversimplifying. But I'm a simple-minded person and I like simple things.

Today's paper: "The Illusion of Deep Learning"

This paper went viral a while ago (a long time ago in "ML years," though by pure math standards it would still count as hot off the press). The gap between ML time and math time is something like 10,000x, and I'm still adjusting.

The core claim is wild: the paper argues that the distinction between "model architecture" and "optimizer" is an illusion. In their framework, called Nested Learning (NL), everything is an optimization problem. A single neuron, an attention head, the Adam optimizer, all of it. They're all doing the same thing at different levels of abstraction.

This is the part that made me sit up, because it's precisely what my intuition was telling me, now dressed up in proper formalism.

The fundamental building block: Associative Memory

In classical ML, we think of layers as functions f: \mathbb{R}^n \to \mathbb{R}^m. In NL, we view a layer as the solution to an optimization problem.

Given a stream of keys \{k_i\} and values \{v_i\}, an associative memory operator M is defined as the minimizer of a regression objective:

M^* = \arg\min_M \sum_i \| M k_i - v_i \|^2

From this single definition, the paper derives everything else. The key insight is that the specific equations you'll see in the rest of this post are all instances of this general definition, obtained by choosing a particular objective function \mathcal{L}.
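To make the definition concrete, here is a minimal numerical sketch (my own toy setup, not code from the paper): for a linear M, the argmin is an ordinary least-squares problem, so NumPy's lstsq recovers the optimal memory directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# A stream of key/value pairs (4-dim keys, 3-dim values), rows are k_i and v_i.
K = rng.normal(size=(32, 4))
V = rng.normal(size=(32, 3))

# argmin_M sum_i ||M k_i - v_i||^2 is ordinary least squares:
# solve K M^T ≈ V for M^T, then transpose.
M_T, *_ = np.linalg.lstsq(K, V, rcond=None)
M = M_T.T

# At the optimum the residual gradient sum_i (M k_i - v_i) k_i^T vanishes.
grad = (K @ M.T - V).T @ K
print(np.allclose(grad, 0, atol=1e-8))  # True
```

The vanishing gradient check is exactly the first-order optimality condition of the regression objective above.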

How the equations relate: the hierarchy

The three main equations in the paper form a hierarchy of abstraction. They move from the general definition to specific implementations used in different learning algorithms. Getting this hierarchy straight is what made the paper click for me.

Level 0: The general definition \mathcal{L}(M(K); V)

This is the root of the entire framework. It says: any learning process, whether a layer, an optimizer, or a neuron, is an associative memory problem. The components are the memory operator M, the keys K it is fed, the values V it must associate with them, and the objective \mathcal{L} that scores the retrieval.

This is the abstract parent. The next two equations are "subclasses" derived by choosing a specific \mathcal{L}.

Level 1a: The regression instance (Delta Rule)

M^* = \arg\min_M \sum_i \| M k_i - v_i \|^2

Derived by choosing the L_2 norm (MSE) as the objective. Set M to be a linear map, set \mathcal{L}(y, v) = \|y - v\|^2, and substitute into the general definition. This choice of objective leads to the Delta Rule (used in models like Mamba/RWKV and the paper's HOPE architecture). It forces the model to minimize distance to the target, which allows it to overwrite or delete incorrect information from memory.
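A quick sanity check of that overwrite property (my own toy example; the unit-norm key and step size of 1 are chosen so the algebra comes out exactly): a Delta Rule step replaces the stored value rather than adding to it.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

def delta_rule_step(M, k, v, eta):
    # Step along -(M k - v) k^T, i.e. the (1/2-scaled) gradient of ||M k - v||^2.
    return M - eta * np.outer(M @ k - v, k)

k = rng.normal(size=d); k /= np.linalg.norm(k)  # unit-norm key
v = rng.normal(size=d)

M = np.zeros((d, d))
M = delta_rule_step(M, k, v, eta=1.0)       # store v under k: now M k == v
M = delta_rule_step(M, k, 2 * v, eta=1.0)   # overwrite with a new value
print(np.allclose(M @ k, 2 * v))  # True: the old association was corrected
```

With a pure Hebbian write the second step would have produced 3v at key k; the distance-minimizing update corrects instead of accumulating.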

Level 1b: The gradient descent instance (Standard Backpropagation)

W' = \arg\min_{W'} \left[ \langle u_{t+1}, W' x_t \rangle + \frac{1}{2\eta} \| W' - W \|^2 \right]

This one is tricky and I think it's where most confusion lives. Standard gradient descent does not minimize L_2 distance. It maximizes alignment. The objective here is a negative dot product, not a squared error.

The derivation: set the "value" V to be the negative gradient (the error signal), V = -u_{t+1}. Set the internal objective to \mathcal{L}_{\text{inner}}(y, V) = -\langle y, V \rangle. This asks the memory to find a weight W' such that the output W' x_t is maximally aligned with the error-signal direction. Add a proximal regularization term to keep the weights stable, and you get:

W' = \arg\min_{W'} \left[ \underbrace{-\langle W' x_t, V \rangle}_{\text{maximize alignment}} + \underbrace{\frac{1}{2\eta} \|W' - W\|^2}_{\text{proximal constraint}} \right]

Substituting V = -u_{t+1} flips the sign, giving the form in the paper.

Why this matters

The paper argues that Level 1b (standard GD / Transformers) is limited because it only learns directions (Hebbian-style updates: M \leftarrow M + v k^\top). Level 1a (HOPE / Delta Rule) is more expressive because it learns distances, enabling precise memory overwriting. That's the difference between a system that accumulates information and one that corrects itself.

Backpropagation as proximal optimization

Expanding on Level 1b from the hierarchy above: a standard gradient descent step W_{t+1} = W_t - \eta \, u_{t+1} x_t^\top turns out to be equivalent to the proximal operator:

W' = \arg\min_{W'} \left[ \langle u_{t+1}, W' x_t \rangle + \frac{1}{2\eta} \| W' - W \|^2 \right]

where u_{t+1} is the local error signal (the gradient with respect to the layer's output, not the weight gradient). The weight matrix W is trying to memorize the mapping between the input x_t and this error signal. The first term is the linearization of the loss (a first-order Taylor expansion). The second term is proximal regularization, keeping the new weight close to the old one.
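The equivalence is easy to check numerically (a toy sketch under arbitrary shapes of my choosing): the gradient-descent step is the exact minimizer of the proximal objective, and any perturbation of it only increases the objective.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 5
W = rng.normal(size=(m, n))
x = rng.normal(size=n)
u = rng.normal(size=m)   # local error signal (gradient w.r.t. the layer output)
eta = 0.1

def proximal_objective(Wp):
    # <u, W' x> + (1 / 2 eta) ||W' - W||^2
    return u @ (Wp @ x) + np.linalg.norm(Wp - W) ** 2 / (2 * eta)

# Setting the gradient u x^T + (W' - W) / eta to zero gives the closed form:
W_gd = W - eta * np.outer(u, x)   # exactly the standard gradient-descent step

# Sanity check: random perturbations of W_gd all increase the objective.
best = proximal_objective(W_gd)
worse = all(proximal_objective(W_gd + 1e-3 * rng.normal(size=(m, n))) > best
            for _ in range(100))
print(worse)  # True
```

Because the objective is a strictly convex quadratic in W', the stationary point is the unique minimum, which is what the perturbation check confirms.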

Optimizers are memories of gradients

If the weights compress the data, the optimizer compresses the gradients.

Standard momentum (m_t) is a linear associative memory solving:

m^* = \arg\min_m \sum_t \| m - g_t \|^2

It acts as a low-pass filter on the gradient manifold. The paper argues this is not enough for complex loss landscapes (think orthogonal tasks in continual learning), which motivates their new optimizer.
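As a sketch of the "momentum is a memory of gradients" claim (my own toy setup, not the paper's code): the unweighted objective above is minimized by the plain mean of the gradients, and the familiar EMA momentum is the online, recency-weighted version of the same regression.

```python
import numpy as np

rng = np.random.default_rng(3)
grads = rng.normal(size=(200, 10))  # a history of gradient vectors

# argmin_m sum_t ||m - g_t||^2 is the sample mean of the gradients.
m_star = grads.mean(axis=0)

# EMA momentum solves (approximately) the recency-weighted version,
# sum_s beta^{t-s} ||m - g_s||^2, one gradient at a time.
beta = 0.9
m = np.zeros(10)
for g in grads:
    m = beta * m + (1 - beta) * g

# Both act as low-pass filters: the momentum buffer sits much closer to
# the mean than any single noisy gradient does.
print(np.linalg.norm(m - m_star) < np.linalg.norm(grads[-1] - m_star))  # True
```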

Architectures are optimization steps

The paper shows that common layers are solutions to specific regression problems.

Self-Attention in Transformers is the non-parametric solution to an \ell_2 regression objective. It finds the matrix M that minimizes the error between projected values and retrieved values. Similarly, modern RNNs (Linear Transformers, Mamba-style models) perform gradient descent steps on a hidden-state matrix M_t to minimize a local reconstruction loss.

The punchline: when you design a new layer, you're implicitly defining a new internal loss function that the layer tries to minimize during the forward pass. You're not "stacking blocks." You're choosing objectives.

Optimizers are models

This is the part that felt most radical to me. We usually think of the optimizer as the tool that trains the model. The paper flips this: Adam itself is a model.

Momentum-based optimizers are associative memory modules trying to compress the history of gradients into a momentum buffer. Adam in particular is the optimal solution to a specific \ell_2 regression problem where the optimizer state tries to predict the variance of the gradients. If Adam is a model minimizing a loss, then training a network is a meta-optimization process: one model (the optimizer) trains another model (the network), and both minimize their own objectives via gradient descent.

For me this was the satisfying bit. The thing I'd been vaguely gesturing at ("it's all gradient descent!") has a real mathematical backbone here.

Backpropagation as self-referential learning

The paper reinterprets backpropagation as a self-referential process. The weight matrix W is an associative memory that maps the input x_t to the local error signal u_{t+1}. But the target for this memory, the error signal, is generated by the model itself. So the model produces its own training data (error signals) and then updates itself to memorize that data. The model and the learning algorithm are intertwined, not separate.

The HOPE architecture

The paper proposes a concrete implementation of NL called HOPE, which replaces standard Transformer blocks with two components.

Self-Modifying Titans (high frequency, replaces Attention)

In a standard RNN or Linear Attention, the memory update is fixed (e.g., the Hebbian rule M \leftarrow M + v k^\top). In HOPE, the model generates its own update rules per token.

Given input x_t, the model projects x_t to generate a key k_t, a value v_t, a learning rate \eta_t, and a decay gate \alpha_t.

The memory matrix M_t is updated by minimizing the reconstruction error of the value v_t given the key k_t:

\ell_t = \| M_{t-1} k_t - v_t \|^2

Applying one step of gradient descent with the model-generated learning rate \eta_t:

M_t = (1 - \alpha_t) M_{t-1} - \eta_t \nabla_M \ell_t

The self-referential part: \eta_t is output by the network itself. The network learns how fast to learn at any given moment.
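Putting the pieces together, a minimal sketch of the update (entirely my own stand-in code: the projections that would produce k_t, v_t, \eta_t, \alpha_t from x_t are replaced here by random draws and fixed constants):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 8
M = np.zeros((d, d))

def titans_step(M, k, v, eta, alpha):
    # Gradient of l_t = ||M k - v||^2 with respect to M is 2 (M k - v) k^T.
    grad = 2 * np.outer(M @ k - v, k)
    # Decayed memory minus a gradient step, with token-dependent eta and alpha.
    return (1 - alpha) * M - eta * grad

# In HOPE, k, v, eta, alpha all come from projections of x_t; here they are
# random stand-ins just to exercise the update rule.
for _ in range(50):
    k = rng.normal(size=d); k /= np.linalg.norm(k)
    v = rng.normal(size=d)
    M = titans_step(M, k, v, eta=0.4, alpha=0.01)

print(np.isfinite(M).all())  # True: the decayed update stays bounded
```

Along the current key direction each step contracts the residual M k - v, while the \alpha decay slowly forgets directions that are never refreshed.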

Continuum Memory System (low frequency, replaces the MLP)

Standard MLPs are static after pre-training. CMS is a stack of MLPs that update at different timescales.

The structure is a sequence of MLPs \{f_1, \ldots, f_L\}, where block f_i updates its weights every C_i steps (its chunk size). At each boundary it runs a local optimization loop (like SGD) on the chunk of data it just saw, permanently altering its weights. This is what enables "learning at test time" without catastrophic forgetting.

A nice engineering property: since the weights only change at chunk boundaries, you get sequence parallelism for free. You can parallelize the forward/backward pass within a chunk, unlike standard RNNs which are sequential.
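A schematic of the chunked-update idea (entirely my own sketch: the class name ChunkedBlock and all parameters are made up, and a linear map stands in for each MLP): each level buffers its chunk and only touches its weights at chunk boundaries.

```python
import numpy as np

rng = np.random.default_rng(5)

class ChunkedBlock:
    """One CMS-style level: a linear stand-in for an MLP whose weights
    update only every C steps (its chunk size)."""
    def __init__(self, d, chunk_size, lr=0.01):
        self.W = rng.normal(size=(d, d)) * 0.1
        self.C = chunk_size
        self.lr = lr
        self.buffer = []   # (input, target) pairs seen in the current chunk

    def forward(self, x, target):
        self.buffer.append((x, target))
        if len(self.buffer) == self.C:
            # Local optimization loop at the chunk boundary: one SGD pass
            # over the buffered chunk, permanently altering the weights.
            for xi, ti in self.buffer:
                self.W -= self.lr * np.outer(self.W @ xi - ti, xi)
            self.buffer.clear()
        return self.W @ x

# Levels with geometrically increasing chunk sizes: fast to slow timescales.
d = 6
levels = [ChunkedBlock(d, chunk_size=c) for c in (1, 4, 16)]

x = rng.normal(size=d)
for step in range(32):
    h = x
    for block in levels:
        h = block.forward(h, target=x)  # toy target: reconstruct the input
print(h.shape)  # (6,)
```

Within a chunk the weights are frozen, which is exactly why the forward pass over a chunk can be parallelized in the real architecture.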

The M3 optimizer

Standard SGD is not covariant: it depends on the coordinate system. Newton's method fixes this using the inverse Hessian H^{-1}, but computing that is O(n^3).

M3 (Multi-scale Momentum Muon) approximates the whitening of the update using the Newton-Schulz iteration. Given a matrix X (gradient/momentum), we want to map it to an orthogonal matrix Y (where Y^\top Y = I):

Y_0 = X / \|X\|
Y_{k+1} = Y_k \left( \frac{3I - Y_k^\top Y_k}{2} \right)

This converges quadratically to the polar factor of X.
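The iteration is easy to check numerically (my own sketch; the spectral-norm normalization and the iteration count are choices I made to guarantee convergence for a random matrix):

```python
import numpy as np

rng = np.random.default_rng(6)

def newton_schulz(X, iters=50):
    # Normalize so all singular values lie in (0, 1]; the iteration converges
    # when they start inside (0, sqrt(3)).
    Y = X / np.linalg.norm(X, ord=2)
    for _ in range(iters):
        Y = Y @ (3 * np.eye(X.shape[1]) - Y.T @ Y) / 2
    return Y

X = rng.normal(size=(5, 5))
Y = newton_schulz(X)

# Y converges to the polar factor of X: an orthogonal matrix.
print(np.allclose(Y.T @ Y, np.eye(5), atol=1e-6))  # True
```

Each step uses only matrix multiplies, which is the whole point: it approximates the whitening that Newton's method would buy, without any inverse.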

The full M3 algorithm:

  1. Maintain multiple momentum buffers at different timescales: a fast momentum m_{\text{fast}} and a slow momentum m_{\text{slow}}.
  2. Update the slow momentum only every K steps (accumulating gradients in between).
  3. Apply Newton-Schulz to orthogonalize the update.
  4. Combine: \Delta W = \text{NS}(\gamma_1 m_{\text{fast}} + \gamma_2 m_{\text{slow}}).

This ensures the optimizer remembers global directions (slow momentum) while reacting to local curvature (fast momentum), all in an orthogonalized basis for stability.
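A schematic implementation of those four steps (my own sketch: the names \gamma_1, \gamma_2, K follow the list above, everything else is made up, and an exact SVD polar factor stands in for the Newton-Schulz approximation):

```python
import numpy as np

def polar_factor(X):
    # Exact orthogonalization via SVD; Newton-Schulz approximates this factor.
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

class M3Sketch:
    """Schematic multi-timescale momentum with an orthogonalized update."""
    def __init__(self, shape, beta_fast=0.9, beta_slow=0.999, K=4,
                 gamma1=0.7, gamma2=0.3, lr=0.01):
        self.m_fast = np.zeros(shape)
        self.m_slow = np.zeros(shape)
        self.acc = np.zeros(shape)   # gradients accumulated between slow updates
        self.K, self.t = K, 0
        self.beta_fast, self.beta_slow = beta_fast, beta_slow
        self.gamma1, self.gamma2, self.lr = gamma1, gamma2, lr

    def step(self, W, grad):
        self.t += 1
        self.m_fast = self.beta_fast * self.m_fast + (1 - self.beta_fast) * grad
        self.acc += grad
        if self.t % self.K == 0:   # slow momentum updates only every K steps
            avg = self.acc / self.K
            self.m_slow = self.beta_slow * self.m_slow + (1 - self.beta_slow) * avg
            self.acc[:] = 0
        # Orthogonalize the combined update before applying it.
        update = polar_factor(self.gamma1 * self.m_fast + self.gamma2 * self.m_slow)
        return W - self.lr * update

rng = np.random.default_rng(7)
W = rng.normal(size=(4, 4))
opt = M3Sketch(W.shape)
for _ in range(10):
    W = opt.step(W, grad=rng.normal(size=(4, 4)))
print(np.isfinite(W).all())  # True
```

Since the polar factor has unit singular values, each applied update has the same scale in every direction, which is the stability property the orthogonalized basis is meant to buy.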

Delta Gradient Descent (DGD)

One more formula worth keeping for reference. Instead of standard Linear Attention, the paper proposes the DGD update rule, which adds a data-dependent decay term:

M_t = M_{t-1} + v_t k_t^\top - \alpha (M_{t-1} k_t) k_t^\top

Standard Hebbian learning only adds information (M \leftarrow M + v k^\top). DGD allows the model to erase specific directions (the (M_{t-1} k_t) k_t^\top term) in its memory matrix when they become irrelevant. This is essential for recall tasks where you need to overwrite old data.
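A toy comparison of the pure Hebbian update against DGD's erase term (my own example; the unit-norm key and \alpha = 1 are chosen so the overwrite is exact):

```python
import numpy as np

rng = np.random.default_rng(8)
d = 4
k = rng.normal(size=d); k /= np.linalg.norm(k)   # unit-norm key
v_old = rng.normal(size=d)
v_new = rng.normal(size=d)

def dgd_step(M, k, v, alpha):
    # Hebbian write plus a data-dependent erase along the key direction.
    return M + np.outer(v, k) - alpha * np.outer(M @ k, k)

M = np.outer(v_old, k)   # memory already holds v_old under key k

# Purely Hebbian (alpha = 0): associations accumulate and interfere.
M_hebb = dgd_step(M, k, v_new, alpha=0.0)
print(np.allclose(M_hebb @ k, v_old + v_new))  # True: old value still mixed in

# DGD with alpha = 1 and a unit key: the old association is fully overwritten.
M_dgd = dgd_step(M, k, v_new, alpha=1.0)
print(np.allclose(M_dgd @ k, v_new))  # True
```

With intermediate \alpha the erase is partial, interpolating between accumulation and overwriting.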

What I'm taking home

The paper's central message, and the thing that validates my half-formed intuition: stop treating architecture and optimization as separate fields. They are the same thing at different levels of abstraction. Designing a better model is designing a better optimization landscape, and designing a better optimizer is designing a better memory architecture for gradients.

Layers are gradient descent steps on local losses, compressing input-output mappings. Optimizers are gradient descent steps on gradient losses, compressing gradient history. Learning is the interaction of these nested loops passing signals between each other.

For someone coming from pure math, this is a satisfying kind of unification. It reminds me of how category theory collapses apparently different constructions into instances of the same universal property. The paper does something analogous for deep learning: it finds the common structure hiding underneath the zoo of architectures, optimizers, and learning rules.

Questions I want to work through

A few exercises for future me:

  1. Derive how the \ell_2 regression objective leads to the DGD update rule using the Sherman-Morrison inverse formula (Appendix C of the paper).
  2. Why does orthogonalizing the momentum via Newton-Schulz help with feature learning? (It forces the update to be isometric, preventing the vanishing gradient effect in deep linear networks.)
  3. How does the Continuum Memory System interact with the nested frequency hierarchy when you scale to very long contexts?

Whether this leads to practical improvements at scale is a different question, and one I'm not qualified to answer. But as a way of thinking about what ML is doing, I find it clarifying. And for now, that's enough.