Hierarchical Reasoning Model

Brain-Inspired Architecture for Deep Reasoning | 27M Parameters | Solves Complex Puzzles with Only 1,000 Training Examples

Model Statistics

27M Parameters
512 Hidden Size
8 Attention Heads
4+4 H+L Layers

Architecture Overview

[Architecture diagram] Input processing (token embedding + puzzle embedding + position encoding) feeds two coupled recurrent modules. The L-module performs fast, detailed low-level reasoning, z_L^(t) = L(z_L^(t-1), z_H^(c), x), with 4 Transformer layers updated every timestep (participation ratio 30.22). The H-module performs slow, abstract high-level planning, z_H^(c+1) = H(z_H^(c), z_L^(final)), with 4 Transformer layers updated every L_cycles timesteps (participation ratio 89.95). The H-module's final state feeds the output head: the LM head projects to the vocabulary, y = O(z_H^(final)), and the Q-head produces the halt/continue decision, Q(halt) vs. Q(continue). Configuration: L_cycles = 2, H_cycles = 2.

Input Processing

Token Embedding + Puzzle Embedding + Position Encoding
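A minimal NumPy sketch of this input construction, assuming hidden size 512 and a single prepended puzzle-embedding position; the table names, vocabulary size, and sequence length here are illustrative, not the model's actual values:

```python
import numpy as np

# Hedged sketch: combine token, puzzle, and position embeddings.
hidden = 512
vocab_size = 64
num_puzzles = 10
puzzle_emb_len = 1  # number of prepended puzzle-embedding positions (assumed)

rng = np.random.default_rng(0)
token_table = rng.normal(size=(vocab_size, hidden))
puzzle_table = rng.normal(size=(num_puzzles, hidden))

def embed_input(tokens, puzzle_id, pos_enc):
    tok = token_table[tokens]               # (seq, 512)
    puz = puzzle_table[puzzle_id][None, :]  # (1, 512), prepended
    x = np.concatenate([puz, tok], axis=0)  # (1 + seq, 512)
    return x + pos_enc                      # position encoding added

seq = 81  # e.g. a 9x9 Sudoku board flattened
pos = rng.normal(size=(puzzle_emb_len + seq, hidden))
x = embed_input(rng.integers(0, vocab_size, size=seq), 3, pos)
print(x.shape)  # (82, 512)
```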

L-Module

Fast, Detailed Computations

  • 4 Transformer Layers
  • Updates every timestep
  • Participation Ratio: 30.22

H-Module

Slow, Abstract Strategy

  • 4 Transformer Layers
  • Updates every L_cycles
  • Participation Ratio: 89.95

Q-Head for ACT

Halt/Continue Decision

Outputs: Q(halt) and Q(continue)

Forward Pass Flow

[Forward-pass diagram: single segment]

Input tokens -> embeddings (token: vocab -> 512; puzzle: sparse -> 512; position: RoPE/learned). The hierarchical loop then runs:

    for t in range(L_cycles * H_cycles):
        z_L = L_level(z_L, z_H + input)
        if (t + 1) % L_cycles == 0:
            z_H = H_level(z_H, z_L)

Output generation: logits = lm_head(z_H[:, puzzle_emb_len:]). ACT decision: q_logits = q_head(z_H[:, 0]); halt if Q(halt) > Q(continue), otherwise recur into the next segment.

Timestep accounting: H_cycles x L_cycles = 4 timesteps per segment, giving 4 L-updates and 2 H-updates at hidden dimensionality 512; at most 16 segments.

Processing Pipeline Details

  • Each segment processes input through hierarchical modules
  • L-Module runs 4 times per segment (fast, detailed)
  • H-Module updates 2 times per segment (slow, strategic)
  • ACT decides whether to continue or halt after each segment
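The pipeline above can be sketched in a few lines; here L_level and H_level are tiny stand-in functions (a single linear-plus-tanh step with a toy hidden size), not the 4-layer Transformer blocks of the real model:

```python
import numpy as np

# Hedged sketch of one segment's hierarchical loop.
rng = np.random.default_rng(0)
hidden, H_cycles, L_cycles = 16, 2, 2  # toy hidden size; real model uses 512
W_L = rng.normal(scale=0.1, size=(2 * hidden, hidden))
W_H = rng.normal(scale=0.1, size=(2 * hidden, hidden))

def L_level(z_L, ctx):
    # fast module: reads its own state plus the H-state/input context
    return np.tanh(np.concatenate([z_L, ctx]) @ W_L)

def H_level(z_H, z_L):
    # slow module: reads the L-module's final state
    return np.tanh(np.concatenate([z_H, z_L]) @ W_H)

x = rng.normal(size=hidden)          # input injection
z_L, z_H = np.zeros(hidden), np.zeros(hidden)
for _ in range(H_cycles):            # H updates: 2 per segment
    for _ in range(L_cycles):        # L updates: 2 per H cycle (4 total)
        z_L = L_level(z_L, z_H + x)
    z_H = H_level(z_H, z_L)
print(z_H.shape)  # (16,)
```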

One-Step Gradient Approximation

[Gradient-flow diagram]

Loss computation: L = L_seq2seq + 0.5 x (L_q_halt + L_q_continue).

Gradients flow only through the final states: z_H^(final) and z_L^(final) receive gradients, which propagate to the input embeddings x. All intermediate states are detached, so there is no backpropagation through time; memory usage is O(1) for HRM versus O(T) for BPTT, where T is the number of timesteps.

Gradient approximation formula:

    grad_theta L ~ dL/dy * dy/dz_H^(final) * dz_H^(final)/dz_L^(final) * dz_L^(final)/dx * dx/d_theta

This is based on Deep Equilibrium Model (DEQ) theory and uses the Implicit Function Theorem at convergence.

Detached States

All intermediate states are detached (no BPTT)

This enables O(1) memory complexity

Memory Comparison

HRM: O(1)
BPTT: O(T)
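To see why truncating the gradient to one step is reasonable near convergence, here is a toy scalar example (not the HRM code): a fixed-point iteration z <- tanh(w*z + x), where differentiating only the final update approximates the full unrolled gradient at O(1) memory:

```python
import numpy as np

# Toy illustration of the one-step gradient approximation.
x_in, w, y = 1.0, 0.5, 0.3         # input, parameter, regression target

def iterate(w, T=50):
    z = 0.0
    for _ in range(T):
        z_prev, z = z, np.tanh(w * z + x_in)
    return z_prev, z

z_prev, z_T = iterate(w)
loss_grad = z_T - y                # dL/dz_T for L = 0.5 * (z_T - y)^2

# One-step approximation: gradient through the final update only.
sech2 = 1.0 - np.tanh(w * z_prev + x_in) ** 2   # derivative of tanh
one_step = loss_grad * sech2 * z_prev

# Reference: full-unroll gradient via central finite differences.
h = 1e-5
L = lambda w_: 0.5 * (iterate(w_)[1] - y) ** 2
full = (L(w + h) - L(w - h)) / (2 * h)
print(one_step, full)              # same sign, within ~10% here
```

Because the iteration has converged (z_T is essentially a fixed point), the omitted earlier-step terms are suppressed by powers of the contraction factor, which is why the truncation stays accurate.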

Adaptive Computation Time

[ACT decision diagram]

Q-learning setup: state z_H^(segment); actions {halt, continue}; reward = 1 if the answer is correct, else 0.

Halting decision:

    if segment >= max_steps: HALT
    if Q(halt) > Q(continue) and segment >= min_steps: HALT

Exploration: epsilon-greedy with eps = 0.1; min_steps ~ U(2, max_steps).

Q-value updates:

    Q_target(halt) = reward
    Q_target(continue) = sigma(max(Q'(halt), Q'(continue)))
    Loss = BCE(Q_pred, Q_target)

Deep supervision loop: segment 1 forward -> segment 2 forward -> ... -> segment N HALT, with carry states detached between segments.

Q-Learning Configuration

  • State: Hidden representation z_H
  • Actions: Binary (halt/continue)
  • Reward: Task accuracy signal
  • Learning: Online during training

Exploration Strategy

  • epsilon-greedy with eps = 0.1
  • Dynamic min_steps sampling
  • Ensures sufficient computation
  • Prevents premature halting
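The Q-targets described above can be written out directly; the function and argument names here are illustrative (the real code operates on batched tensors):

```python
import math

# Hedged sketch of the ACT Q-targets; sigma is the logistic sigmoid.
def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def q_targets(reward, next_q_halt, next_q_continue, is_last_segment):
    # Halting target: the immediate reward (1 if the answer is correct).
    target_halt = reward
    # Continue target: bootstrapped from the next segment's best Q-value;
    # at the final segment there is nothing left to continue into.
    if is_last_segment:
        target_continue = reward
    else:
        target_continue = sigmoid(max(next_q_halt, next_q_continue))
    return target_halt, target_continue

print(q_targets(1.0, 0.2, -0.1, False))  # (1.0, sigmoid(0.2))
```

Both targets lie in [0, 1], which is why the diagram's loss is a binary cross-entropy (BCE) against the predicted Q-values.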

Key Features

  • Hierarchical convergence: the L-module converges locally while the H-module provides global context
  • Multi-timescale: fast L-module (updates every step) vs. slow H-module (updates every L_cycles)
  • No BPTT: one-step gradient approximation with O(1) memory
  • ACT: adaptive computation via Q-learning
  • Deep supervision: a learning signal at every segment
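The deep-supervision outer loop can be sketched framework-free; model_step and stop_gradient are illustrative stand-ins (the real code uses tensor.detach() and a full Transformer forward pass), and the toy model_step simply halves a fake loss each segment:

```python
# Hedged sketch: each segment gets its own loss (and optimizer step),
# and the carry state is detached before the next segment begins.
def stop_gradient(state):
    # stand-in for tensor.detach(): the next segment treats the carried
    # state as plain data, so no gradient crosses the segment boundary
    return dict(state)

def deep_supervision(model_step, x, target, max_segments=16):
    carry = {"z_H": 0.0, "z_L": 0.0}
    losses = []
    for seg in range(max_segments):
        carry, loss, halt = model_step(carry, x, target)  # one segment
        losses.append(loss)          # learning signal at every segment
        carry = stop_gradient(carry)
        if halt:
            break
    return losses

def model_step(carry, x, target):
    # toy stand-in: loss halves each segment, halts once small enough
    loss = carry.get("loss", 1.0) / 2
    new_carry = {"z_H": 0.0, "z_L": 0.0, "loss": loss}
    return new_carry, loss, loss < 0.1

print(deep_supervision(model_step, None, None))  # [0.5, 0.25, 0.125, 0.0625]
```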

Performance Highlights

  • ARC-AGI-1: 40.3% (beats o3-mini)
  • Sudoku Extreme: 98.5% accuracy
  • Maze 30x30: 99% optimal paths
  • Training data: only 1000 examples

Brain Correspondence

  • Dimensionality hierarchy: H-module PR = 89.95 vs. L-module PR = 30.22
  • Similar to cortex: the H/L ratio matches the mouse cortical hierarchy
  • Oscillatory learning: inspired by theta-gamma coupling
  • Local credit assignment: a biologically plausible gradient scheme

Implementation Details

# Core HRM forward pass (sketch; module and helper definitions elided)
for segment in range(max_segments):
    # Reset carry states for sequences that halted in the previous segment
    carry = reset_carry(halted, carry)

    # Hierarchical computation: the L-module updates every timestep,
    # the H-module once per completed L cycle
    for h_step in range(H_cycles):
        for l_step in range(L_cycles):
            z_L = L_level(z_L, z_H + input)
        z_H = H_level(z_H, z_L)

    # ACT decision: halt when Q(halt) exceeds Q(continue),
    # subject to the min_steps floor sampled during training
    q_halt, q_continue = q_head(z_H)
    if q_halt > q_continue and segment + 1 >= min_steps:
        break