The Illusion of Deep Learning
I made a New Year's resolution (almost two months late, but who's counting) to read and study at least one paper every few weeks, properly. Not skim-it-on-the-train properly, but sit-down-and-work-through-the-math properly. To keep myself accountable, I'm writing up a few words about each one. Partly for this blog, partly because my memory is terrible and I need somewhere to store things. These are semi-technical notes, mostly for personal use, but if they're useful to someone else, great.
I should be upfront: I'm not an ML expert. I come from pure mathematics and only recently got interested in machine learning. That background shapes how I see things, for better or worse.
The thing that kept bugging me
When I started studying basic data science and ML, something kept nagging at me. Everything looked like gradient descent. Nine times out of ten, it was "pick a loss function, run gradient descent, done." To my very category-theory-poisoned brain, all these techniques looked the same up to isomorphism. People kept inventing fancy new names for what felt like the same underlying pattern with a different objective function bolted on.
I know I'm oversimplifying. But I'm a simple-minded person and I like simple things.
Today's paper: "The Illusion of Deep Learning"
This paper went viral a while ago (a long time ago in "ML years," though by pure math standards it would still count as hot off the press). The gap between ML time and math time is something like 10,000x, and I'm still adjusting.
The core claim is wild: the paper argues that the distinction between "model architecture" and "optimizer" is an illusion. In their framework, called Nested Learning (NL), everything is an optimization problem. A single neuron, an attention head, the Adam optimizer, all of it. They're all doing the same thing at different levels of abstraction.
This is the part that made me sit up, because it's precisely what my intuition was telling me, now dressed up in proper formalism.
The fundamental building block: Associative Memory
In classical ML, we think of layers as functions $y = f_\theta(x)$ mapping inputs to outputs. In NL, we view a layer as the solution to an optimization problem.

Given a stream of keys $k_t$ and values $v_t$, an associative memory operator $\mathcal{M}$ is defined as the minimizer of a regression objective:

$$\mathcal{M}^\ast = \arg\min_{\mathcal{M}} \; \tilde{\mathcal{L}}\big(\mathcal{M}(K),\, V\big)$$

From this single definition, the paper derives everything else. The key insight is that the specific equations you'll see in the rest of this post are all instances of this general definition, obtained by choosing a particular objective function $\tilde{\mathcal{L}}$.
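To make this concrete, here is a minimal numpy sketch (mine, not the paper's): for a linear memory with an $\ell_2$ objective, the argmin above is just a least-squares problem.

```python
import numpy as np

# Toy associative memory: find the linear map M that best sends keys to values,
# i.e. M = argmin_M sum_t ||M k_t - v_t||^2 -- the L2 instance of the general
# definition, solvable in closed form via least squares.
rng = np.random.default_rng(0)
K = rng.normal(size=(8, 4))   # 8 keys of dimension 4
V = rng.normal(size=(8, 4))   # 8 target values

M_T, *_ = np.linalg.lstsq(K, V, rcond=None)  # solves K @ M.T ~= V
M = M_T.T

# With more keys than dimensions the system is overdetermined: M is the best
# *compression* of the key->value stream, not an exact lookup table.
residual = np.linalg.norm(K @ M.T - V)
print(residual)
```

Everything below swaps out the objective or the memory's parametrization, but keeps this same argmin shape.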
How the equations relate: the hierarchy
The three main equations in the paper form a hierarchy of abstraction. They move from the general definition to specific implementations used in different learning algorithms. Getting this hierarchy straight is what made the paper click for me.
Level 0: The general definition
This is the root of the entire framework. It says: any learning process, whether a layer, an optimizer, or a neuron, is an associative memory problem. The components are:
- $\mathcal{M}$: the memory operator (a weight matrix, a neural network, an optimizer state).
- $K$ (keys): the input context or trigger data (input tokens $x_t$, or gradients $g_t$).
- $V$ (values): the target signal the memory should retrieve or compress (the next token, the error signal).
- $\tilde{\mathcal{L}}$: the objective function measuring the quality of the mapping. It asks: how well does $\mathcal{M}$, when triggered by $K$, reconstruct or align with $V$?

This is the abstract parent. The next two equations are "subclasses" derived by choosing a specific $\tilde{\mathcal{L}}$.
Level 1a: The regression instance (Delta Rule)
Derived by choosing the $\ell_2$ norm (MSE) as the objective. Set $\mathcal{M}$ to be a linear map $M$, set $\tilde{\mathcal{L}} = \|M k_t - v_t\|_2^2$, and substitute into the general definition. One gradient step on this objective gives the Delta Rule (used in models like Mamba/RWKV and the paper's HOPE architecture):

$$M_t = M_{t-1} - \eta\,(M_{t-1} k_t - v_t)\,k_t^\top$$

It forces the model to minimize distance to the target, which allows it to overwrite or delete incorrect information from memory.
Level 1b: The gradient descent instance (Standard Backpropagation)
This one is tricky and I think it's where most confusion lives. Standard gradient descent does not minimize distance. It maximizes alignment. The objective here is a negative dot product, not a squared error.
The derivation: set the "value" to be the negative of the error signal, $v_t = -\nabla_{y_t}\mathcal{L}$ (the gradient of the loss with respect to the layer's output). Set the internal objective to $\tilde{\mathcal{L}} = -\langle W x_t,\, v_t \rangle$. This asks the memory to find a weight such that the output $W x_t$ is maximally aligned with the error signal direction. Add a proximal regularization term to keep weights stable, and you get:

$$W_{t+1} = \arg\min_W \; -\langle W x_t,\, v_t \rangle + \frac{1}{2\eta}\,\|W - W_t\|_F^2$$

Substituting $v_t = -\nabla_{y_t}\mathcal{L}$ flips the sign, giving the form in the paper:

$$W_{t+1} = \arg\min_W \; \langle W x_t,\, \nabla_{y_t}\mathcal{L} \rangle + \frac{1}{2\eta}\,\|W - W_t\|_F^2$$
Why this matters
The paper argues that Level 1b (standard GD / Transformers) is limited because it only learns directions (Hebbian-style outer-product updates, $\Delta W_t \propto -\nabla_{y_t}\mathcal{L}\; x_t^\top$). Level 1a (HOPE / Delta Rule) is more expressive because it learns distances, enabling precise memory overwriting. That's the difference between a system that accumulates information and one that corrects itself.
Backpropagation as proximal optimization
Expanding on Level 1b from the hierarchy above: a standard gradient descent step turns out to be equivalent to the proximal operator:

$$W_{t+1} = \arg\min_W \; \langle W x_t,\, u_t \rangle + \frac{1}{2\eta}\,\|W - W_t\|_F^2 \;=\; W_t - \eta\, u_t x_t^\top$$

where $u_t = \nabla_{y_t}\mathcal{L}$ is the local error signal (the gradient with respect to the layer's output, not the weight gradient). The weight matrix is trying to memorize the mapping between the input $x_t$ and this error signal. The first term is the linearization of the loss (first-order Taylor expansion). The second term is proximal regularization, keeping the new weight close to the old one.
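A quick numerical sanity check (my code, not the paper's): the proximal objective is a strictly convex quadratic in $W$, so its minimizer can only be the plain SGD step.

```python
import numpy as np

# Check numerically that one SGD step equals the proximal solution:
# argmin_W <W x, u> + ||W - W_old||_F^2 / (2*eta)  ==  W_old - eta * u x^T.
rng = np.random.default_rng(1)
W_old = rng.normal(size=(3, 5))
x = rng.normal(size=5)        # layer input
u = rng.normal(size=3)        # local error signal dL/dy
eta = 0.1

def objective(W):
    return (W @ x) @ u + np.sum((W - W_old) ** 2) / (2 * eta)

W_gd = W_old - eta * np.outer(u, x)   # the claimed minimizer

# Strict convexity: any perturbation away from W_gd must increase the objective.
for _ in range(100):
    W_perturbed = W_gd + 1e-3 * rng.normal(size=W_gd.shape)
    assert objective(W_perturbed) > objective(W_gd)
print("proximal minimizer == gradient descent step")
```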
Optimizers are memories of gradients
If the weights compress the data, the optimizer compresses the gradients.
Standard momentum ($m_{t+1} = \alpha\, m_t - \eta\, g_t$, with gradient $g_t$) is a linear associative memory solving:

$$m_{t+1} = \arg\min_m \; \langle m,\, g_t \rangle + \frac{1}{2\eta}\,\|m - \alpha\, m_t\|_2^2$$

It acts as a low-pass filter on the gradient manifold. The paper argues this is not enough for complex loss landscapes (think orthogonal tasks in continual learning), which motivates their new optimizer.
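A small numpy check (mine, not the paper's): the familiar momentum update is exactly the stationary point of that proximal objective.

```python
import numpy as np

# Sanity check: the momentum buffer m_new = alpha*m - eta*g is the stationary
# point of the proximal objective J(m') = <m', g> + ||m' - alpha*m||^2/(2*eta).
rng = np.random.default_rng(2)
m, g = rng.normal(size=4), rng.normal(size=4)
alpha, eta = 0.9, 0.01
m_new = alpha * m - eta * g

def J(mp):
    return mp @ g + np.sum((mp - alpha * m) ** 2) / (2 * eta)

# The finite-difference gradient of J at m_new should vanish.
for i in range(4):
    e = np.zeros(4)
    e[i] = 1e-6
    fd_grad = (J(m_new + e) - J(m_new - e)) / 2e-6
    assert abs(fd_grad) < 1e-4
print("momentum step is the proximal minimizer")
```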
Architectures are optimization steps
The paper shows that common layers are solutions to specific regression problems.
Self-Attention in Transformers is the non-parametric solution to an $\ell_2$ regression objective. It finds the matrix that minimizes the error between projected values and retrieved values. Similarly, modern RNNs (Linear Transformers, Mamba-style models) perform gradient descent steps on a hidden state matrix to minimize a local reconstruction loss.
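Here's a tiny numpy illustration (my toy setup, not the paper's formulation) of the RNN half of that claim: a linear-attention state is literally an associative memory built from outer products, queried by key.

```python
import numpy as np

# Linear attention as an associative memory: the hidden state S accumulates
# Hebbian outer products v k^T, and a query retrieves by key similarity.
rng = np.random.default_rng(3)
d = 16
# Orthonormal keys make retrieval exact in this toy setting.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
keys = Q[:3]                          # three stored keys (orthonormal rows)
values = rng.normal(size=(3, d))

S = np.zeros((d, d))                  # recurrent memory state
for k, v in zip(keys, values):
    S = S + np.outer(v, k)            # Hebbian update (no forgetting)

retrieved = S @ keys[1]               # query with the second key
assert np.allclose(retrieved, values[1])
```

With non-orthogonal keys the retrieval is only approximate, which is exactly the "compression, not lookup" behavior the regression view predicts.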
The punchline: when you design a new layer, you're implicitly defining a new internal loss function that the layer tries to minimize during the forward pass. You're not "stacking blocks." You're choosing objectives.
Optimizers are models
This is the part that felt most radical to me. We usually think of the optimizer as the tool that trains the model. The paper flips this: Adam itself is a model.
Momentum-based optimizers are associative memory modules trying to compress the history of gradients into a momentum buffer. Adam in particular is the optimal solution to a specific regression problem where the optimizer state tries to predict the variance of the gradients. If Adam is a model minimizing a loss, then training a network is a meta-optimization process: one model (the optimizer) trains another model (the network), and both minimize their own objectives via gradient descent.
For me this was the satisfying bit. The thing I'd been vaguely gesturing at ("it's all gradient descent!") has a real mathematical backbone here.
Backpropagation as self-referential learning
The paper reinterprets backpropagation as a self-referential process. The weight matrix $W$ is an associative memory that maps the input $x_t$ to the local error signal $u_t = \nabla_{y_t}\mathcal{L}$. But the target for this memory, the error signal, is generated by the model itself. So the model produces its own training data (error signals) and then updates itself to memorize that data. The model and the learning algorithm are intertwined, not separate.
The HOPE architecture
The paper proposes a concrete implementation of NL called HOPE, which replaces standard Transformer blocks with two components.
Self-Modifying Titans (high frequency, replaces Attention)
In a standard RNN or Linear Attention, the memory update is fixed (e.g., the Hebbian rule $M_t = M_{t-1} + v_t k_t^\top$). In HOPE, the model generates its own update rules per token.
Given input $x_t$, the model projects it to generate:
- $k_t, v_t, q_t$ (standard key/value/query)
- $\eta_t$ (dynamic learning rate)
- $\alpha_t$ (dynamic decay/forget gate)
The memory matrix $M$ is updated by minimizing the reconstruction error of the value $v_t$ given the key $k_t$:

$$\mathcal{L}_t(M) = \|M k_t - v_t\|_2^2$$

Applying one step of gradient descent with the model-generated learning rate $\eta_t$, and decaying the old state with the forget gate $\alpha_t$:

$$M_t = \alpha_t\, M_{t-1} - \eta_t\,(M_{t-1} k_t - v_t)\,k_t^\top$$
The self-referential part: $\eta_t$ and $\alpha_t$ are output by the network itself. The network learns how fast to learn (and how fast to forget) at any given moment.
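A hedged numpy sketch of the mechanism — the projections `W_eta`/`W_alpha` and the sigmoid gating are my stand-ins, not necessarily the parametrization the paper uses:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy self-modifying memory: per-token learning rate and forget gate are
# generated from the input itself (projection matrices are hypothetical).
rng = np.random.default_rng(4)
d = 8
W_k, W_v = rng.normal(size=(2, d, d)) * 0.1   # key/value projections
W_eta, W_alpha = rng.normal(size=(2, d)) * 0.1

M = np.zeros((d, d))
for x in rng.normal(size=(10, d)):            # a stream of 10 tokens
    k, v = W_k @ x, W_v @ x
    eta = sigmoid(W_eta @ x)                  # token-dependent learning rate
    alpha = sigmoid(W_alpha @ x)              # token-dependent retention
    # one gradient step on ||M k - v||^2, with decay on the old state
    M = alpha * M - eta * np.outer(M @ k - v, k)
print(np.linalg.norm(M))
```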
Continuum Memory System (low frequency, replaces the MLP)
Standard MLPs are static after pre-training. CMS is a stack of MLPs that update at different timescales.
The structure is a sequence of MLPs $\mathrm{MLP}^{(1)}, \dots, \mathrm{MLP}^{(L)}$, where block $\ell$ updates its weights every $C^{(\ell)}$ steps (its chunk size). It runs a local optimization loop (like SGD) on the data chunk it saw, permanently altering its weights. This is what enables "learning at test time" without catastrophic forgetting.
A nice engineering property: since the weights only change at chunk boundaries, you get sequence parallelism for free. You can parallelize the forward/backward pass within a chunk, unlike standard RNNs which are sequential.
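The update schedule is the easiest part to pin down in code. This sketch (chunk sizes as powers of a base $C$ are my assumption, not the paper's) just counts who updates when:

```python
# Sketch of a CMS-style update schedule: level l refreshes its weights only
# every C**(l+1) tokens, so deeper levels change ever more slowly. Within a
# chunk the weights are frozen, which is what makes the chunk parallelizable.
C = 4                      # base chunk size (assumption)
levels = 3
updates = {l: [] for l in range(levels)}

for t in range(1, 65):     # a stream of 64 tokens
    for l in range(levels):
        if t % (C ** (l + 1)) == 0:
            updates[l].append(t)   # run a local optimization step on the chunk

print([len(updates[l]) for l in range(levels)])  # -> [16, 4, 1]
```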
The M3 optimizer
Standard SGD is not covariant: it depends on the coordinate system. Newton's method fixes this using the inverse Hessian $H^{-1}$, but computing that is $\mathcal{O}(n^3)$ in the number of parameters.
M3 (Multi-scale Momentum Muon) approximates the whitening of the update using the Newton-Schulz iteration. Given a matrix $G$ (gradient/momentum), we want to map it to an orthogonal matrix $O$ (where $O O^\top = I$):

$$X_0 = \frac{G}{\|G\|_F}, \qquad X_{k+1} = \tfrac{3}{2}\, X_k - \tfrac{1}{2}\, X_k X_k^\top X_k$$

This converges quadratically to the polar factor of $G$ (the $U V^\top$ from its SVD $G = U \Sigma V^\top$).
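The iteration is easy to check numerically; this numpy snippet (mine) verifies both the orthogonality and the polar-factor claim on a random matrix:

```python
import numpy as np

# Newton-Schulz iteration: X <- 1.5*X - 0.5*X X^T X acts only on the singular
# values of X, pushing them all toward 1 while leaving the singular vectors
# fixed -- so it converges to the polar factor U V^T of the input.
rng = np.random.default_rng(5)
G = rng.normal(size=(6, 6))
X = G / np.linalg.norm(G)      # scale so all singular values are <= 1

for _ in range(100):
    X = 1.5 * X - 0.5 * X @ X.T @ X

U, _, Vt = np.linalg.svd(G)
assert np.allclose(X @ X.T, np.eye(6), atol=1e-6)   # X is now orthogonal
assert np.allclose(X, U @ Vt, atol=1e-6)            # and equals the polar factor
```

The Frobenius normalization matters: the iteration only converges when the starting singular values sit below $\sqrt{3}$.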
The full M3 algorithm:
- Maintain multiple momentum buffers at different timescales: fast momentum $m^{\text{fast}}$, slow momentum $m^{\text{slow}}$.
- Update slow momentum only every $C$ steps (accumulate gradients in between).
- Apply Newton-Schulz to orthogonalize the update.
- Combine: $\Delta W_t \propto \mathrm{NewtonSchulz}\big(m_t^{\text{fast}} + m_t^{\text{slow}}\big)$ (up to scaling; the exact weighting is in the paper).
This ensures the optimizer remembers global directions (slow momentum) while reacting to local curvature (fast momentum), all in an orthogonalized basis for stability.
Delta Gradient Descent (DGD)
One more formula worth keeping for reference. Instead of standard Linear Attention, the paper proposes the DGD update rule, which adds a data-dependent decay term:

$$M_t = M_{t-1}\,\big(I - \eta_t\, k_t k_t^\top\big) + \eta_t\, v_t k_t^\top$$

Standard Hebbian learning only adds information ($M_t = M_{t-1} + v_t k_t^\top$). DGD allows the model to erase specific directions (the $M_{t-1} k_t k_t^\top$ term) in its memory matrix when they become irrelevant. This is essential for recall tasks where you need to overwrite old data.
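A toy numpy demonstration (mine) of that difference: write a value under a key, then try to replace it.

```python
import numpy as np

# Hebbian vs. DGD/delta-style updates when a key's value must be overwritten.
d = 4
k = np.zeros(d); k[0] = 1.0            # a fixed unit-norm key
v_old = np.array([1.0, 0.0, 0.0, 0.0])
v_new = np.array([0.0, 1.0, 0.0, 0.0])

# Hebbian: write v_old, then write v_new under the same key.
M_hebb = np.outer(v_old, k) + np.outer(v_new, k)
print(M_hebb @ k)                      # retrieves v_old + v_new: stale data remains

# DGD with eta = 1: the decay term M k k^T erases the old entry first.
eta = 1.0
M = np.outer(v_old, k)                 # memory after the first write
M = M @ (np.eye(d) - eta * np.outer(k, k)) + eta * np.outer(v_new, k)
print(M @ k)                           # retrieves exactly v_new
```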
What I'm taking home
The paper's central message, and the thing that validates my half-formed intuition: stop treating architecture and optimization as separate fields. They are the same thing at different levels of abstraction. Designing a better model is designing a better optimization landscape, and designing a better optimizer is designing a better memory architecture for gradients.
Layers are gradient descent steps on local losses, compressing input-output mappings. Optimizers are gradient descent steps on gradient losses, compressing gradient history. Learning is the interaction of these nested loops passing signals between each other.
For someone coming from pure math, this is a satisfying kind of unification. It reminds me of how category theory collapses apparently different constructions into instances of the same universal property. The paper does something analogous for deep learning: it finds the common structure hiding underneath the zoo of architectures, optimizers, and learning rules.
Questions I want to work through
A few exercises for future me:
- Derive how the $\ell_2$ regression objective leads to the DGD update rule using the Sherman-Morrison inverse formula (Appendix C of the paper).
- Why does orthogonalizing the momentum via Newton-Schulz help with feature learning? (It forces the update to be isometric, preventing the vanishing gradient effect in deep linear networks.)
- How does the Continuum Memory System interact with the nested frequency hierarchy when you scale to very long contexts?
Whether this leads to practical improvements at scale is a different question, and one I'm not qualified to answer. But as a way of thinking about what ML is doing, I find it clarifying. And for now, that's enough.