Reinforcement learning in hyperbolic space for multi-step reasoning

Tao Xu; Dung-Yang Lee; Momiao Xiong; Tao Xu; Dung-Yang Lee; Momiao Xiong

doi:10.48130/stati-0025-0005

Multi-step reasoning is a fundamental challenge in artificial intelligence, with applications ranging from mathematical problem-solving to decision-making in dynamic environments. Reinforcement learning (RL) has shown promise in enabling agents to perform multi-step reasoning by optimizing long-term rewards. However, conventional RL methods struggle with complex reasoning tasks due to issues such as credit assignment, high-dimensional state representations, and stability concerns. Recent advancements in transformer architectures and hyperbolic geometry have provided novel solutions to these challenges. This paper introduces a new framework that integrates hyperbolic transformers into RL for multi-step reasoning. The proposed approach leverages hyperbolic embeddings to model hierarchical structures effectively. Theoretical insights, algorithmic details, and experimental results are presented, including Frontier Math and nonlinear optimal control problems. Compared to RL with vanilla transformer, the hyperbolic RL largely improves accuracy by 32%–44% on FrontierMath benchmark, 43%–45% on nonlinear optimal control benchmark, while achieving impressive reduction in computational time by 16%–32% on FrontierMath benchmark, 16%–17% on nonlinear optimal control benchmark. This work demonstrates the potential of hyperbolic transformers in reinforcement learning, particularly for multi-step reasoning tasks that involve hierarchical structures.

HTML

Introduction

Despite great progress in artificial intelligence, e.g., OpenAI's o1 and o3, DeepSeek-V3, and Alibaba's QwQ, solving reasoning tasks − particularly multi-step complex reasoning problems − remains a fundamental challenge due to high costs, proprietary nature, and complex architectures^[1]. Multi-step reasoning refers to the ability of AI systems to make logical connections between different pieces of context or different sources of information. Multi-step reasoning moves toward more human-like understanding and decision-making for AI systems. Its ability to interact with context, combine different information sources, and make logical connections is essential for artificial general intelligence (AGI)^[2]. AGI is the new frontier in artificial intelligence, where the aim is to create human-like cognitive abilities. Recently, OpenAI announced the Deep Research AI system, which combines advanced multi-step reasoning capabilities with extensive internet search and synthesis functions, marking a significant milestone on the path to AGI^[3].

Reinforcement learning (RL) that incorporates multi-step reasoning is widely seen as one of the promising components on the long road toward AGI, though it is not a silver bullet on its own. However, high costs, proprietary nature, scalability, integration of reasoning mechanisms, and complex architectures present great challenges^[4−9].

RL is a Markov decision process (MDP) and provides a mathematical formalism for sequential decision-making^[10,11]. It is observed that RL can acquire intelligent behaviors automatically. Decision actions are selected by the agent’s optimal policy to maximize the expected cumulative reward. Policy and reward are often approximated by neural networks. However, standard neural network architectures cannot efficiently deal with long-standing problems in RL, including partial observability^[12], high-dimensional state and action spaces^[13]. Recently, the transformer architecture demonstrated its superior performance. The essential idea behind the transformer architecture is to use a self-attention mechanism to capture long-range relationships within the data. The remarkable feature of the transformer is its excellent scalability. Transformers can be used to learn representations, model transition functions, learn reward functions, and learn policies^[14,15].

RL with transformers can perform single reasoning very well. It is also reported that either transformer or RL can perform multi-step reasoning^[16,17]. However, very few papers about using RL incorporating transformers for multi-step reasoning have been published.

Reasoning problems such as mathematical operations, coding, and logical reasoning involve a chain of thought, a tree of thought, and a graph of thought. Reason data are tree-like structured data. Embedding tree-like data, from hierarchies to taxonomies, is a well-defined problem for representing graph knowledge. Hyperbolic geometry provides a natural solution for embedding tree-like and hierarchical data, with demonstrated superior performance over Euclidean embeddings^[18]. Hypformer, developed last year, is a novel and complete hyperbolic transformer^[19]. It includes well-defined modules in hyperbolic space, such as linear transformation layers, LayerNorm layers, activation functions, dropout operations, and addresses the issue of quadratic time complexity of the existing hyperbolic self-attention module. Despite the impressive performance of various hyperbolic transformers, the papers integrating RL with hyperbolic transformers for multi-step reasoning are very limited.

The purpose of this paper is to introduce a novel multi-step reasoning large language model (LLM) with RL incorporating a complete hyperbolic transformer into it. The proposed hyperbolic transformer is designed to encode diverse sequences, such as entities, agents, and stacks of historical information, and to serve as an expressive predictor for the dynamics model. On the other hand, the developed hyperbolic transformers will integrate all subroutines into RL and act as a sequential decision-maker. To facilitate the development of the complete hyperbolic transformer-based RL for multi-step reasoning, applications are outlined in robotics, medicine, multi-step reasoning language modeling, combinatorial optimization, environmental sciences, and hyperparameter optimization. Limitations and challenges for future research are also addressed. This work is intended to stimulate discussions on hyperbolic transformer-based RL for multi-step reasoning, inspire further research, and facilitate the development of RL approaches for real-world applications.

[1]	Patil A. 2025. Advancing reasoning in large language models: promising methods and approache. arXiv 2502.03671v2 doi: 10.48550/arXiv.2502.03671 CrossRef Google Scholar
[2]	Multiworks. 2015. What is multi-hop reasoning. www.moveworks.com/us/en/resources/ai-terms-glossary/multi-hop-reasoning
[3]	Abnave. 2025. OpenAI’s deep research: a leap towards AGI. https://medium.com/@pratikabnave97/openais-deep-research-a-leap-towards-agi-e05339823715
[4]	Zhang Z, Lin P, Wang Z, Zhang Y, Xu JQZ. 2024. Initialization is critical to whether transformers fit composite functions by reasoning or memorizing. arXiv 2405.05409v5 doi: 10.48550/arXiv.2405.05409 CrossRef Google Scholar
[5]	De Asis K, Hernandez-Garcia J, Holland G, Sutton R. 2018. Multi-step reinforcement learning: a unifying algorithm. Proceedings of the AAAI Conference on Artificial Intelligence, 2−7 February 2018, New Orleans, Lousiana, USA, Vol. 32. Palo Alto, CA, USA: AAAI Press. doi: 10.1609/aaai.v32i1.11631
[6]	Deepseek-AI, Liu A, Feng B, Wang B, Wang B, et al. 2024. Deepseek-v2: a strong, economical, and efficient mixture-of-experts language model. CoRR 2024. https://openreview.net/forum?id=MwHAn6R7OS&referrer=%5Bthe%20profile%20of%20Bo%20Liu%5D(%2Fprofile%3Fid%3D~Bo_Liu17
[7]	Ye Y, Zhang T, Jiang W, Huang H. 2025. Process-supervised reinforcement learning for code generation. arXiv 2502.01715v1 doi: 10.48550/arXiv.2502.01715 CrossRef Google Scholar
[8]	Kim S, Kim S. 2024. System-2 reasoning via generality and adaptation. The First Workshop on System-2 Reasoning at Scale, NeurIPS'24: Sys2-Reasoning. https://openreview.net/group?id=NeurIPS.cc/2024/Workshop/Sys2-Reasoning#tab-accept-poster
[9]	Bereska L, Gavves E. 2024. Mechanistic interpretability for AI safety − a review. arXiv 2404.14082v3 doi: 10.48550/arXiv.2404.14082 CrossRef Google Scholar
[10]	Agarwal P, Rahman AA, St-Charles P-L, Prince SJ, Kahou SE. 2023. Transformers in reinforcement learning: a survey. arXiv 2307.05979v1 doi: 10.48550/arXiv.2307.05979 CrossRef Google Scholar
[11]	Li W, Luo H, Lin Z, Zhang C, Lu Z, Ye D. 2023. A survey on transformers in reinforcement learning. arXiv 2301.03044v3 doi: 10.48550/arXiv.2301.03044 CrossRef Google Scholar
[12]	Esslinger K, Platt R, Amato C. 2022. Deep transformer q-networks for partially observable reinforcement learning. arXiv 2404.14082v3 doi: 10.48550/arXiv.2206.01078 CrossRef Google Scholar
[13]	Barto AG, Mahadevan S. 2003. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems 13:341−79 doi: 10.1023/A:1025696116075 CrossRef Google Scholar
[14]	Chen C, Wu YF, Yoon J, Ahn S. 2022. TransDreamer: reinforcement learning with transformer world models. arXiv 2202.09481v2 doi: 10.48550/arXiv.2202.09481 CrossRef Google Scholar
[15]	Ganea O, Bécigneul G, Hofmann T. 2018. Hyperbolic entailment cones for learning hierarchical embeddings. Proceedings of the 35^th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, 10−15 July 2018. PMLR. pp. 1646−55 https://proceedings.mlr.press/v80/ganea18a.html
[16]	Ye J, Yao Z, Huang Z, Pan L, Liu J, et al. 2025. How does transformer learn implicit reasoning? arXiv 2505.23653v1 doi: 10.48550/arXiv.2505.23653 CrossRef Google Scholar
[17]	Wang Z, Wang Y, Zhang Z, Zhou Z, Jin H, et al. 2024. Understanding the language model to solve the symbolic multi-step reasoning problem from the perspective of buffer mechanism. arXiv 2405.15302v3 doi: 10.48550/arXiv.2405.15302 CrossRef Google Scholar
[18]	Liu G, Ji K, Zheng R, Wu Z, Dun C, et al. 2024. Enhancing multi-step reasoning abilities of language models through direct q-function optimization. arXiv 2410.09302v2 doi: 10.48550/arXiv.2410.09302 CrossRef Google Scholar
[19]	Yang M, Verma H, Zhang DC, Liu J, King I, et al. Hypformer: eploring efficient transformer fully in hyperbolic space. Proceedings of the 30^th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 25−29 August 2024, Barcelona, Spain. USA: ACM. pp. 3770−81 doi: 10.1145/3637528.3672039
[20]	Khrulkov V, Mirvakhabova L, Ustinova E, Oseledets I, Lempitsky V. 2020. Hyperbolic image embeddings. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 13−19, 2020, Seattle, WA, USA. USA: IEEE. pp. 6417−27 doi: 10.1109/cvpr42600.2020.00645
[21]	Tifrea A, Bécigneul G, Ganea O-E. 2018. Poincaré glove: hyperbolic word embeddings. arXiv 1810.06546v2 doi: 10.48550/arXiv.1810.06546 CrossRef Google Scholar
[22]	Nickel M, Kiela D. 2018. Learning continuous hierarchies in the lorentz model of hyperbolic geometry. Proceedings of the 35^th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, 10−15 July 2018. PMLR. pp. 3779−88 https://proceedings.mlr.press/v80/nickel18a.html
[23]	Meng F, Yao Z, Zhang M. 2025. TransMLA: multi-head latent attention is all you need. arXiv 2502.07864v5 doi: 10.48550/arXiv.2502.07864 CrossRef Google Scholar
[24]	Su J, Ahmed M, Lu Y, Pan S, Bo W, et al. 2024. Roformer: enhanced transformer with rotary position embedding. Neurocomputing 568:127063 doi: 10.1016/j.neucom.2023.127063 CrossRef Google Scholar
[25]	Wembo. 2025. DeepSeekMoE: bridging efficiency and capacity in large language models using DeepSeek model from China https://levelup.gitconnected.com/deepseekmoe-bridging-efficiency-and-capacity-in-large-language-models-using-deepseek-model-from-dbd4e852a637
[26]	Grootendorst M. 2024. A visual guide to mixture of experts (MoE) https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts

Backbone	Updates to MAE < 10⁻⁶	Final MAE × 10⁻⁶	Wall-clock
Vanilla-T	10,200 ± 800	6.2	1
Hyperbolic-T	6,600 ± 610	3.1	0.84 ×

Backbone	Updates to hit 11	Miss.pred.rate (%)	Wall-clock
Vanilla-T	10,400 ± 800	0.46	1.00
Hyper-T	6,900 ± 600	0.31	0.83

Backbone	Updates to MAE < 10⁻⁶	Final MAE × 10⁻⁶	Wall-clock
Vanilla-T	12,800 ± 900	6.4	1.00
Hyper-T	8,200 ± 670	3.6	0.83

Backbone	Updates to MAE < 10⁻⁶	Final MAE ×10⁻⁶	Wall-clock
Vanilla-T	10,500 ± 800	6.0	1.00
Hyper-T	6,800 ± 620	3.3	0.84

Supplementary Data 1 A list of symbols.
Supplementary Data 2 Sample Problems in FrontierMath.

{{lists.name}}

Reinforcement learning in hyperbolic space for multi-step reasoning

Abstract

Supplementary information

Rights and permissions

References

About this article

Cite this article

Article Metrics

Access History

Other Articles By Authors

Reinforcement learning in hyperbolic space for multi-step reasoning

HTML

Model framework

Methods for multi-step reasoning in RL

Merging transformer reasoning with multi-step RL

Representation learning high-dimensional data in RL

Transformer architecture

Converting Euclidean transformer to a hyperbolic transformer for general representation

Notation and setup

Exponential map to the Poincaré ball

Hyperbolic versions of basic operations

Converting the input embedding

Converting LayerNorm

Converting multi-head attention

Converting the FFN

Hyperbolic residual & LayerNorm

Hyperbolic (Poincaré ball) transformer for transition function

Hyperbolic transformer architecture

Input Embedding (Hyperbolic)

Full architecture: hyperbolic transformer for transition function

Group relative policy optimization (GRPO) incorporating hyperbolic transformer

Background: MLA and DeepSeekMoE in Euclidean space

MLA

Low-rank key-value joint compression

Integrates mixture of experts (MoE) with MLA

Basic architecture

Auxiliary loss for load balance

Converting MLA and DeepSeekMoE to hyperbolic space

Hyperbolic MLA

Hyperbolic DeepSeekMoE

Hyperbolic transformer as a policy

GRPO

An interesting 'aha moment' of an intermediate version of DeepSeek-R1-Zero: the scalar root-finding benchmark

A set of representative example FrontierMath problems

Prime field continuous extensions

The damped Van-der-Pol optimal-control problem

A synthetic but internally-consistent benchmark for the unicycle–vehicle energy-minimization problem

Catalog

Export File

Citation

Format

Content