Multi-level optimization with test-time learning (NeurIPS 2025)
A model is a hierarchy of nested optimization problems, each compressing its own "context flow"
Memory: A neural update caused by an input
Learning: The process for acquiring effective and useful memory
Higher level = lower update frequency; in the frequency ordering, A ≻ B means fA > fB (A updates more often, so A sits at a lower level)
Training: All levels update at their respective frequencies. Each level has its own gradient flow and context.
Inference: Only Level 1 (Memory) updates! This enables "test-time learning" - the model continues learning without backpropagation.
Transformers: A special case of CMS with k=1 (a single MLP). All components freeze at inference → "anterograde amnesia".
Static projections
❄ All frozen at inference
❄ Frozen at inference
(+ potential data-dependent components)
Delta rule (Eq. 28-29): considers token dependencies
✨ Online learning at test time!
f1 > f2 > ... > fk (frequency hierarchy)
Each θ(fℓ) updates every C(ℓ) steps (Eq. 31)
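The multi-frequency schedule above can be sketched in a few lines (a minimal sketch; the chunk sizes are hypothetical, with level 0 as the fastest):

```python
def levels_to_update(step, chunk_sizes):
    """Return the levels whose parameters update at this step.

    Level l updates every chunk_sizes[l] steps (cf. Eq. 31); level 0 is
    the fastest (per-token memory), the last level the slowest.
    """
    return [l for l, c in enumerate(chunk_sizes) if step % c == 0]

# Hypothetical hierarchy with f1 > f2 > f3: chunk sizes 1, 4, 16.
C = [1, 4, 16]
schedule = {t: levels_to_update(t, C) for t in range(17)}
```

At step 0 every level fires; at most steps only level 0 does, which is exactly the frequency hierarchy f1 > f2 > ... > fk.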
❄ Frozen
❄ Frozen
Multi-frequency
How gradient descent with momentum becomes a 2-level nested optimization
Reinterpret as Nested Optimization
Use L2 regression instead of dot-product → Delta rule (Eq. 21-22)
Replace linear momentum with MLP → Deep Momentum GD (Eq. 23)
σ(·) = Newton-Schulz → Muon optimizer (Eq. 24)
L2 objective considers token dependencies → Eq. 28-29
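The reinterpretation above can be sketched on a toy problem (hedged: plain momentum GD on a scalar quadratic, written as the two nested updates of Eq. 17; step sizes are illustrative):

```python
# Level 1 (fast): the momentum buffer m compresses the gradient stream
# via a linear recurrence (a key-less associative memory).
# Level 2 (slow): the parameters theta consume the compressed state.
def momentum_step(theta, m, grad, lr=0.1, beta=0.5):
    m = beta * m - lr * grad      # inner update: compress gradients
    theta = theta + m             # outer update: apply the memory
    return theta, m

# Minimize f(x) = x^2 (grad = 2x); during training both levels update.
theta, m = 5.0, 0.0
for _ in range(100):
    theta, m = momentum_step(theta, m, 2 * theta)
```

Swapping the inner dot-product objective for L2 regression, or the linear recurrence for an MLP, gives the delta-rule and Deep Momentum GD variants listed above.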
All levels update at their respective frequencies
Level 1 (fastest): Memory updates every token. Inner optimization with dot-product objective.
Level 2 activates: Projection layers optimized with accumulated gradients.
Level 3: Momentum term is itself an associative memory (key-less).
Level 4 (slowest): Outermost optimization aggregates all nested updates.
Only Level 1 (Memory) updates - Test-time learning!
Eq. 12: kt = xt Wk, vt = xt Wv, qt = xt Wq
Delta rule (Eq. 28-29) - handles token dependencies
🔥 Only this updates at inference!
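A minimal sketch of the pieces above (hypothetical shapes; Wk, Wv, Wq stand in for the frozen pre-trained projections of Eq. 12, and the memory M takes a simplified per-token delta-rule write with learning rate 1 and normalized keys):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W_k, W_v, W_q = (rng.standard_normal((d, d)) for _ in range(3))  # frozen

M = np.zeros((d, d))                      # fast memory: updates at inference
for x_t in rng.standard_normal((8, d)):   # stream of test-time tokens
    k_t, v_t = x_t @ W_k, x_t @ W_v       # Eq. 12 projections
    k_t = k_t / np.linalg.norm(k_t)       # unit-norm key for the toy write
    M = M + np.outer(v_t - M @ k_t, k_t)  # delta rule: write the error only
    y_t = M @ (x_t @ W_q)                 # read memory with the query
```

Only M changes in this loop; the projections stay exactly as pre-training left them.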
Each MLP updates at frequency fℓ
At inference, only Level 1 (Memory) updates. All other levels are frozen.
Projections are 'slow weights' - consolidated knowledge from pre-training.
Unlike the simple Hebbian write (Mt + vt·kt⊤), the delta rule first subtracts the memory's current prediction Mt·kt, so it overwrites stale associations instead of accumulating them and manages memory capacity better.
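The contrast can be shown directly (a minimal sketch assuming learning rate 1 and a unit-norm key written twice):

```python
import numpy as np

def hebbian_write(M, k, v):
    return M + np.outer(v, k)        # always adds, even for a known key

def delta_write(M, k, v):
    err = v - M @ k                  # prediction error for this key
    return M + np.outer(err, k)      # write only what is missing

k = np.array([1.0, 0.0])             # unit-norm key, used twice
v1, v2 = np.array([1.0, 2.0]), np.array([3.0, 4.0])

M_h = hebbian_write(hebbian_write(np.zeros((2, 2)), k, v1), k, v2)
M_d = delta_write(delta_write(np.zeros((2, 2)), k, v1), k, v2)
# Hebbian recall M_h @ k blends both values; delta-rule recall M_d @ k
# returns only the latest one, preserving capacity for other keys.
```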
Model combines fast (memory) and slow (frozen) knowledge for prediction.
Eq. 1: M* = arg min_M L̃(M(K); V)
Eq. 12-14: Linear attention formulation
Eq. 17: GD with momentum
Eq. 28-29: Delta rule for HOPE
Eq. 30-31: CMS formulation
Def. 2: Update frequency