What is a hierarchical reasoning model (HRM)?


Hierarchical reasoning models (HRMs), explained

A hierarchical reasoning model (HRM) is an experimental AI architecture designed to mimic the way the human brain processes information at different timescales and levels of complexity. Notably, an HRM outperformed then-state-of-the-art large language models (LLMs) on multiple benchmarks that measure performance on complex reasoning tasks, despite being many times smaller and training on a drastically smaller dataset.

More specifically, HRMs are a distinct neural network architecture that pairs its own algorithm for generating outputs with novel techniques for optimizing model parameters during training. While they’re typically compared to LLMs by way of performance on certain benchmarks that have historically been dominated by reasoning LLMs, this is an apples-to-oranges comparison. HRMs are narrow, task-specific models designed explicitly for reasoning problems, whereas reasoning LLMs are generalist models that can be applied to reasoning problems (among many other tasks).

Though capable of complex problem-solving, HRMs are not capable of conversation, code generation, summarization or other tasks usually associated with generative AI models. An HRM must be trained directly on the kind of problem you want it to solve. LLMs, conversely, are typically pretrained on a massive quantity and variety of data, then instructed (through few-shot prompting) to solve novel problems by inferring the rules.

Central to the concept of HRMs are a “hierarchy” of recurrent loops that take inspiration from how the human brain processes information at different levels and frequencies. An “inner loop” consists of a module that rapidly performs low-level computations and another, slower module whose high-level computations guide the low-level module. An “outer loop” guides the inner loop to iteratively repeat its computations in order to refine and improve the model’s output.

HRMs were first introduced as an open source model described in a June 2025 paper by Guan Wang et al. At a size of only 27 million parameters, the model beat dramatically larger models, such as OpenAI’s o3, Anthropic’s Claude 3.7 Sonnet and DeepSeek-R1—which has 671 billion parameters—on challenging benchmarks including ARC-AGI, Sudoku-Extreme and Maze-Hard.

The model itself is largely experimental, and the paper notes both practical constraints and unexplored avenues for future improvements. Nevertheless, its success—especially given its extreme data efficiency in training and a model size thousands of times smaller than most LLMs—makes it a fascinating alternative approach to scaling reasoning systems. Subsequent research explorations, such as tiny recurrent models (TRMs), have achieved further advances by refining HRM’s basic approach and taking inspiration from the novel techniques it introduced.


How HRMs "reason"

Conventional reasoning models are LLMs that have been fine-tuned through reinforcement learning to output a step-by-step chain of thought (CoT) before providing a final response to the user. This process of “verbalizing” a reasoning process has been empirically shown to improve the model’s accuracy on math, coding and other complex logical tasks.

Despite the proven success of this approach, it has been argued that LLMs—even frontier reasoning LLMs—are not, and will not be, a path to artificial general intelligence (AGI). On a neurological level, language is primarily a tool for communication, not thought.

Broadly speaking, an HRM’s more neuroscience-inspired approach is closer to how the human brain works through abstract problems. Unlike LLMs, HRMs reason internally without “verbalizing” this process. In more technical terms, whereas conventional reasoning models reason “out loud” in the token space, HRMs reason internally in latent space. LLMs “reason” by iteratively refining the actual words (tokens) they output, but an HRM works through problems by iteratively refining its hidden state—the model’s internal, thought-like intermediate computations that are used to (eventually) generate its final output.

Consider a recent time when you solved a complex problem: you may have had an inner monologue, but you probably didn’t literally verbalize your entire thought process in your head (or out loud) in neat, complete sentences. More likely, your brain sprang into action instinctively and wordlessly. From those initial, instinctive thoughts, some semblance of a higher-level plan emerged in your mind. You then mentally worked through the individual steps that strategy entailed, refining the overall plan as you went along. Eventually, you arrived at what felt like a satisfactory solution.

Whereas fine-tuning LLMs with reinforcement learning techniques can teach a model to generate outputs that imitate a thought process, HRMs—borrowing some principles from systems neuroscience—aim to replicate a thought process.


How hierarchical reasoning models (HRMs) work

As described in the “Hierarchical Reasoning Model” paper, the design of HRMs was influenced by the concept of “System 1” and “System 2” thinking, metaphorical terms coined by the late Nobel laureate Daniel Kahneman in his book Thinking, Fast and Slow to describe the different levels at which the human mind operates. “System 1” is fast, unconscious and intuitive. “System 2” thinking is slow, deliberate and logical. HRMs therefore implement a hierarchy in which the computations of a fast system that handles low-level computations are guided by a slower system that handles high-level planning.

HRMs vs. standard RNNs

In terms of machine learning principles, hierarchical reasoning models can be understood as a highly specialized form of recurrent neural networks (RNNs), with modifications that mitigate the practical shortcomings of standard RNNs. The most notable of those shortcomings is premature convergence: the tendency of RNNs to stop learning long before they’ve fully absorbed all the patterns and dependencies within training data sequences.

During model training, RNNs tend to quickly converge on model weights that aren’t sufficiently optimized to achieve accurate performance. This is generally due to vanishing gradients: after too many computational steps or too long a sequence, the parameter updates computed during backpropagation become so small that they effectively shrink to zero. The model weights reach a local equilibrium that reflects short-term patterns, preventing them from reaching a global equilibrium that fully and comprehensively reflects the patterns of the training data.
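A toy numeric sketch can make this concrete. The recurrence and weight below are hypothetical (this is not HRM or RNN library code), but they show how the chain-rule factor applied at each timestep of backpropagation through a tanh recurrence drives the gradient toward zero:

```python
import math

# Toy illustration (hypothetical recurrence, not HRM code): backpropagating
# through many tanh steps multiplies the gradient by a factor < 1 per step.

def backprop_factor(w, z0, steps):
    """Product of per-timestep chain-rule factors for z_{t+1} = tanh(w * z_t)."""
    z, grad = z0, 1.0
    for _ in range(steps):
        z = math.tanh(w * z)
        grad *= w * (1.0 - z * z)  # d tanh(w*z_t)/d z_t at each timestep
    return grad
```

With w = 0.9 and 50 timesteps, the surviving gradient factor falls below 1 percent, which is why learning stalls in deeply unrolled recurrences.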

Many modifications of the standard RNN structure, such as long short-term memory (LSTM), have been proposed to rectify this flaw, but HRMs take a novel approach. The high-level, “System 2”-like module is designed to learn from each time the low-level module converges upon a local equilibrium. This update to the high-level system then provides a new context for the low-level system to operate within, allowing it to continue learning until it converges upon a new local equilibrium (at which point the high-level system is updated again).

The output of this “inner loop” is fed into an “outer loop” that learns how to iteratively improve upon its past outputs. All told, this setup takes advantage of the speed and simplicity of RNNs while enabling stabler, much “deeper” learning than would otherwise be possible with a recurrent network.

HRM architecture

The “inner loop” of the HRM model architecture comprises two recurrent modules. Both modules use an attention mechanism in a standard transformer block setup. One, the “L-module,” is designed to rapidly handle low-level computations. The other, the “H-module,” is designed to handle long-term planning and higher-level reasoning.

The L-module essentially functions like a standard RNN, with its tendency to quickly zero in on short-term patterns and stop updating its hidden state. But whereas a standard RNN’s state update at timestep t is conditioned only by its hidden state at the previous timestep t-1, updates to the L-module’s hidden state zL—and therefore, the things it zeroes in on—are also conditioned by the H-module’s current hidden state zH.

The H-module’s hidden state changes much more slowly than that of the L-module. The inner loop operates in cycles of T timesteps: after the L-module has updated its hidden state zL T times, the H-module uses the final state of zL to update zH. By timestep T, the L-module will often have already converged upon a local equilibrium and stopped updating. But because updates to zL are conditioned on the current value of zH, each update of zH establishes fresh context for the L-module. This initiates a new “convergence phase,” enabling the low-level module to keep learning.

In short, each time the L-module “solves” some short-term task, the H-module gets updated. That update to the H-module directs the L-module to solve some new short-term task. The H-module is, essentially, doing the long-term planning—and the L-module is carrying out the smaller subtasks entailed by that long-term plan. This loop, comprising T updates of the L-module, is carried out N times. Both T and N are adjustable hyperparameters.
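This nested structure can be sketched with hypothetical scalar stand-ins for the two modules (the real L- and H-modules are transformer blocks operating on vectors):

```python
# Illustrative sketch of the HRM inner loop (hypothetical toy update rules,
# not the paper's modules).

def l_update(z_l, z_h, x):
    # Fast low-level step: conditioned on its own state, the H-state and input
    return 0.5 * z_l + 0.3 * z_h + 0.2 * x

def h_update(z_h, z_l):
    # Slow high-level step: updated once per cycle, from the final L-state
    return 0.9 * z_h + 0.1 * z_l

def inner_loop(x, z_l=0.0, z_h=0.0, N=2, T=4):
    for _ in range(N):            # N high-level cycles
        for _ in range(T):        # T rapid low-level timesteps per cycle
            z_l = l_update(z_l, z_h, x)
        z_h = h_update(z_h, z_l)  # each zH update gives the L-module fresh context
    return z_l, z_h
```

Note how zH only changes once per cycle of T low-level steps, mirroring the two timescales; N and T are the adjustable hyperparameters described above.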

All told, the core HRM architecture powering the inner loop contains four learnable components:

  • An input network that converts tokens (representing the details of the puzzle that the model must solve) into vector embeddings.

  • The low-level recurrent module (L-module).

  • The high-level recurrent module (H-module), whose final hidden state after N cycles is passed to the output network.

  • An output network that takes the final value of zH and uses a softmax function to convert that hidden state into probabilities it uses to predict the values of output tokens (that collectively represent the puzzle’s solution).

HRM training data

Unlike reasoning LLMs, HRMs are not generalist models. They must be trained directly on the narrow task they are to solve. Though the paper reports that “HRM” achieved excellent performance on Sudoku, maze path-finding and ARC-AGI puzzles, the authors are really referring to three separate HRMs. One was trained on Sudoku, another on mazes, another on ARC-AGI puzzles.

Reasoning LLMs undergo their initial pretraining through self-supervised learning on massive amounts of unlabeled data points. They then undergo supervised fine-tuning (SFT) to learn proper response structures, instruction tuning to learn how to complete tasks as desired, and then further fine-tuning through reinforcement learning to instill CoT reasoning. All told, this entails millions or billions of data points and weeks of training.

To create training data for HRMs, the authors used data augmentation. From a seed of just a handful of original training examples (comprising labeled pairs of unsolved puzzles and their solutions), additional examples are created using small transformations (such as rotations, flips or color swaps). Each of the HRMs described in the paper was trained on only (roughly) 1,000 training examples created by applying such data augmentation to a small set of original samples.
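The augmentation idea can be sketched for a square grid puzzle as follows; the exact set of transformations used in the paper may differ:

```python
# Sketch of grid augmentation via rotations and flips (illustrative; the
# paper's actual augmentation pipeline may use different transforms).

def rotations_and_flips(grid):
    """Yield the 8 dihedral variants of a square grid (list of lists)."""
    def rot90(g):
        # Rotate 90 degrees clockwise: reverse rows, then transpose
        return [list(row) for row in zip(*g[::-1])]
    variants = []
    g = grid
    for _ in range(4):
        variants.append(g)
        variants.append([row[::-1] for row in g])  # horizontal flip
        g = rot90(g)
    return variants

puzzle = [[1, 2], [3, 4]]
augmented = rotations_and_flips(puzzle)
# The same transform must be applied to the solution grid so that each
# augmented (puzzle, solution) pair stays consistent
```

An asymmetric grid yields 8 distinct variants per seed example, which is how a small set of originals can be expanded toward the roughly 1,000 examples used per task.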

Both approaches have their benefits. Reasoning LLMs are able to infer the rules of a given puzzle without explicit instructions, but require trillions of tokens of data to obtain that ability. HRMs can only perform the narrow task they were trained on, but can achieve comparable or even superior performance with dramatically fewer parameters and training examples.

HRM optimization

HRMs utilize a clever optimization trick to simplify and stabilize the process of optimizing model parameters, once again avoiding an inherent shortcoming of standard RNNs.

RNNs use a recurrence-specific form of backpropagation, called backpropagation through time (BPTT), to compute the gradients of how loss is accumulated at each timestep. As a standard RNN increases the number of timesteps, BPTT inevitably runs into the problem of vanishing gradients.

To avoid this, as well as greatly reduce memory requirements, HRMs simplify their optimization objective. Instead of calculating the gradients at every timestep, HRMs perform backpropagation only through the final state of the L-module and final state of the H-module. This relies on a straightforward assumption: if you know how the final output must change and optimize model weights to change the final states of the L- and H-modules accordingly, everything else will take care of itself.
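A toy scalar recurrence can illustrate the approximation. The recurrence, weights and loss below are invented for illustration (not the paper's code); the point is that the one-step gradient, which differentiates only through the final update while treating the previous hidden state as a constant, lands close to the full BPTT gradient once the recurrence has settled:

```python
import math

# Toy comparison: full BPTT vs. a one-step gradient approximation for the
# invented recurrence z_{t+1} = tanh(w*z_t + x), with loss L = (z_T - y)^2.

def unroll(w, x, z0=0.0, steps=8):
    zs = [z0]
    for _ in range(steps):
        zs.append(math.tanh(w * zs[-1] + x))
    return zs

def full_bptt_grad(w, x, y, steps=8):
    # Exact dL/dw, backpropagated through every timestep
    zs = unroll(w, x, steps=steps)
    dL_dz = 2.0 * (zs[-1] - y)
    grad = 0.0
    for t in reversed(range(steps)):
        dtanh = 1.0 - zs[t + 1] ** 2   # tanh'(w*zs[t] + x)
        grad += dL_dz * dtanh * zs[t]  # w's direct contribution at step t
        dL_dz *= dtanh * w             # propagate back through z_t
    return grad

def one_step_grad(w, x, y, steps=8):
    # Approximation: differentiate only through the final update, treating
    # the penultimate hidden state as a constant
    zs = unroll(w, x, steps=steps)
    dL_dz = 2.0 * (zs[-1] - y)
    dtanh = 1.0 - zs[-1] ** 2
    return dL_dz * dtanh * zs[-2]
```

Because only the final step is differentiated, nothing earlier in the trajectory needs to be stored for the backward pass, which is where the memory savings come from.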

As with other elements of HRM, this takes inspiration from both neuroscience and anecdotal experience. Imagine a person (or model) trying to learn the block balancing game of Jenga. One need not learn to optimize every individual poke and prod of a block for every move. Assuming the blocks are set up in a certain way (the input) and that the move you made resulted in everything toppling over (the loss of your output), improving your technique requires a firm grasp of just two things:

  1. Which piece should I have moved instead? This is analogous to the optimal final state of the high-level module.
  2. How should I have manipulated that piece to make it safe to remove? This is analogous to the final state of the low-level module.

The paper’s authors found that this one-step approximation of BPTT works: optimizing only for those two considerations is enough to establish strong, stable learning dynamics.

HRM's outer loop: deep supervision

The HRM also employs an outer loop that enables the model to iteratively refine its final outputs in a process that the HRM paper’s authors call “deep supervision.” Subsequent research has suggested that the outer loop, more than the inner loop, is ultimately HRM’s most important component.

In standard supervised learning for neural networks, the model being trained is provided an input and performs a single forward pass to generate an output. A loss function measures the error of that output. Then, backpropagation is used to calculate the gradients of loss: how any change to any variable of the neural network would increase or decrease overall loss. Finally, some gradient descent algorithm uses that information to update model parameters. This iterative process then restarts, repeating until loss has minimized to some acceptable threshold.

Deep supervision doesn’t restart the entire process after the model generates that initial output through a single forward pass. Instead, it entails multiple forward passes, each of which is referred to as a “segment.” After each segment m, loss is calculated and model parameters are optimized accordingly—and the final hidden states of the H-module (zH) and L-module (zL) are then fed back into the model as the starting point for the next forward pass. This allows the model to iteratively refine its outputs, using what it has “learned” from the model parameter updates from the previous segment.

This process is repeated for M segments, in which the inner loop’s starting points for each subsequent segment m+1 are the final hidden states of the H-module and L-module after N inner loops of T timesteps during the prior segment m.
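The outer loop can be outlined schematically with toy scalar stand-ins for the modules and output head (this is not the paper's training code):

```python
# Schematic sketch of deep supervision (toy scalar stand-ins): each segment
# runs one forward pass, applies a loss, and carries the final hidden states
# into the next segment.

def run_inner_loop(x, z_l, z_h, N=2, T=4):
    # Toy stand-in for one full forward pass of the inner loop
    for _ in range(N):
        for _ in range(T):
            z_l = 0.5 * z_l + 0.3 * z_h + 0.2 * x
        z_h = 0.9 * z_h + 0.1 * z_l
    return z_l, z_h

def deep_supervision(x, target, M=3):
    z_l, z_h = 0.0, 0.0               # fresh hidden states for segment 1
    losses = []
    for m in range(M):                # M supervised segments
        z_l, z_h = run_inner_loop(x, z_l, z_h)
        prediction = z_h              # toy stand-in for the output network
        losses.append((prediction - target) ** 2)
        # In the real model, parameters are optimized here, and the carried
        # states are detached from the gradient graph before segment m+1
    return prediction, losses
```

Even in this toy, the carried-over states let each segment start closer to the answer than the last, so the per-segment loss shrinks: a crude analogue of the iterative refinement the paper describes.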

Adaptive computational time (ACT)

To maintain the model’s efficiency, the creators of HRM added an adaptive computation time mechanism to help the model learn when a given output is good enough (or, conversely, if it should begin another refinement loop). To make this possible, the model incorporates Q-learning, a common type of reinforcement learning algorithm. 

After each segment, the final state of the high-level module, zH, is passed not only to the output network but also to another module the authors call the “Q-head,” which has its own learnable weights. After zH is multiplied by the Q-head’s weights, a sigmoid function—which squashes any input to a value between 0 and 1—outputs a value for halt and a value for continue. If the value for halt is larger, the model generates a final output. If the value for continue is larger, the model begins another segment.
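A toy sketch of that halting decision, with invented scalar weights standing in for the learned Q-head projection (the real Q-head operates on the full hidden-state vector):

```python
import math

# Toy sketch of the ACT halting decision (invented scalar weights; the real
# Q-head is a learned linear projection of the H-module's hidden state).

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def q_head(z_h, w_halt=1.5, w_cont=-0.5):
    # Squash the weighted hidden state into values between 0 and 1
    return sigmoid(w_halt * z_h), sigmoid(w_cont * z_h)

def should_halt(z_h):
    q_halt, q_cont = q_head(z_h)
    return q_halt >= q_cont  # halt when the "halt" value is larger
```

During training, the Q-head's weights are themselves updated (via Q-learning) so that the halt/continue values come to reflect whether stopping at that segment was the right call.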

The overall loss function for the deep supervision process after each segment therefore combines two terms:

  • One part reflects loss for the task itself: how accurate was the model’s output?

  • The other reflects loss from the Q-head: if the model predicted a higher value for “halt” than for “continue,” did it make the correct decision?

Over time, the model learns to spend more compute—that is, perform more refinement loops—on harder problems and spend less compute on easier problems. It’s worth noting that a similar idea, albeit with different implementation, was explored quite early in the history of transformers.

Importance of the outer loop

The ARC Prize, the non-profit organization that administers the ARC-AGI benchmark, performed an external analysis of HRMs and found that “the refinement outer loop is an essential driver for HRM’s performance.”

  • During inference, adding just one refinement loop nearly doubled the HRM’s accuracy (from 18.6% to 35.5%). Additional performance gains, albeit with significantly diminishing returns, came at 8 loops (38.1%) and 16 loops (39.0%). Even for a standard transformer model with no inner loop (but an otherwise identical architecture, model size and training pipeline to HRMs), adding outer refinement loops yielded similar performance increases.

  • The outer loop is also essential to training. Even when keeping the number of refinement loops at inference constant, adding just one refinement loop in training increased the model’s accuracy from 19% (with no refinement) to 32% (with 1 refinement). In fact, further experiments showed that increasing refinement loops during training had a significantly greater impact than increasing refinement loops during inference. With no refinement loops in either training or inference, the model scored 18.6%. With no refinement loops during inference and 16 refinement loops during training, the model scored 34.9%.

Conversely, the inner loop was shown to provide only a relatively small advantage over an identically sized model that replaces the H-module and L-module with the attention blocks of a standard transformer model. It’s uncertain whether these findings are particular to the tasks in the ARC-AGI benchmark or universal to all reasoning tasks an HRM might handle.

Uncertainties and limitations of HRMs

Though hierarchical reasoning models introduce meaningful innovations to neural network architectures and training techniques that have already begun to influence deep learning research, the practical usefulness of HRMs themselves is presently uncertain.

Practicality

Relative to the massive reasoning LLMs, HRMs are drastically smaller, cheaper to train and cheaper to run—and can be trained on a very accessible amount of training examples. This runs contrary to the notion that frontier performance can only be achieved through massive models and training datasets beyond the reach of most researchers and organizations.

But the utility of mainstream reasoning models is their remarkable ability to generalize: they can perform highly specialized reasoning tasks in the context of understanding and carrying out a wide variety of natural language tasks and instructions. The extremely narrow capabilities of HRMs make it much harder to integrate them into larger workflows.

An HRM can only solve the very specific kinds of puzzles it has seen during training. Even if a different puzzle format uses very similar rules and logic to one it has seen—so similar that a human good at one puzzle type would obviously be good at the other—an HRM couldn’t handle it. Improvements to the training pipeline that introduce a greater ability to leverage cross-task transfer learning would significantly increase HRMs’ practicality.

Interpretability

Though HRMs empirically demonstrate an ability to reason through problems to refine their outputs, the lack of a traceable “thought process” significantly reduces their interpretability. That said, it should be noted that interpretability is generally an issue in all AI systems trained through deep learning—and that research demonstrates that the reasoning traces an LLM provides to a user are not always faithful to their true “thought process.”

Author

Dave Bergmann

Senior Staff Writer, AI Models

IBM Think
