The Markovian Thinker
Best AI papers explained - A podcast by Enoch H. Kang - Saturdays

Reinforcement learning (RL) methods for training Large Language Models (LLMs) to produce long chains of thought (LongCoT) are constrained by the standard thinking environment, where the state grows without bound, leading to quadratic compute costs as reasoning length increases. This paper proposes Markovian Thinking, a paradigm in which the reasoning policy conditions only on a constant-size state, decoupling thinking length from context size and yielding linear compute and constant memory. The idea is instantiated with Delethink, an RL environment that organizes reasoning into fixed-size chunks: at each chunk boundary the context resets, and the policy must learn to write a concise textual carryover sufficient to continue the reasoning seamlessly in the next chunk. Models trained with Delethink, such as R1-Distill 1.5B, match or exceed LongCoT-RL performance while substantially reducing computational overhead, and they exhibit strong test-time scaling far beyond their training budget. The authors emphasize that redesigning the thinking environment is a powerful lever for efficient, scalable reasoning in LLMs.
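
To make the chunked-reasoning idea concrete, here is a minimal Python sketch of what inference under a Delethink-style environment might look like. The `generate` call, chunk size, and tag names are illustrative assumptions, not the paper's actual interface; the point is only to show how resetting the context each chunk keeps memory constant while compute grows linearly with thinking length.

```python
# Minimal sketch of Delethink-style chunked (Markovian) reasoning at inference
# time. All names here (generate, CARRYOVER_TAG, ANSWER_TAG, chunk sizes) are
# illustrative assumptions, not the paper's actual API.

CHUNK_TOKENS = 8192            # fixed chunk size: context never grows past this
MAX_CHUNKS = 12                # cap on total thinking length (linear in chunks)
CARRYOVER_TAG = "<carryover>"  # hypothetical marker the policy learns to emit
ANSWER_TAG = "<answer>"        # hypothetical marker signaling a final answer

def generate(prompt: str, max_tokens: int) -> str:
    """Placeholder for a call to the trained policy (e.g., an LLM endpoint)."""
    raise NotImplementedError

def markovian_think(question: str) -> str:
    carryover = ""  # the constant-size textual state passed between chunks
    for _ in range(MAX_CHUNKS):
        # The context is reset every chunk: only the question plus the short
        # carryover is visible, so memory stays constant regardless of how
        # long the overall chain of thought becomes.
        prompt = f"{question}\n{CARRYOVER_TAG}{carryover}"
        chunk = generate(prompt, max_tokens=CHUNK_TOKENS)
        if ANSWER_TAG in chunk:
            # Policy signals it has finished reasoning; return the answer.
            return chunk.split(ANSWER_TAG)[-1].strip()
        # Keep only the policy's concise summary as the next chunk's state.
        carryover = chunk.rsplit(CARRYOVER_TAG, 1)[-1].strip()
    return carryover  # fall back to the last state if no answer was emitted
```

Because each forward pass sees at most the question plus a bounded carryover, attention cost per chunk is fixed, and total cost scales linearly in the number of chunks rather than quadratically in total tokens.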