The path to artificial general intelligence (AGI) via large language models (LLMs) increasingly centers on refining existing models rather than simply scaling compute. A crucial question is whether current reinforcement learning (RL) techniques are sufficient to reach AGI. While proponents such as Sam Altman acknowledge progress, the focus has shifted from scaling pre-training alone to improving model performance post-training.
This analysis examines RL with verifiable rewards (RLVR) methods, specifically Group Relative Policy Optimization (GRPO), weighing its strengths and limitations. While RL, as exemplified by AlphaZero, can in principle uncover genuinely novel solutions, GRPO works differently: it reinforces outputs judged “good” and discourages “bad” ones, but it operates within the constraints of the base model’s existing capabilities. This suggests that GRPO primarily surfaces latent ability already present in the base model rather than creating entirely new knowledge.
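To make that mechanism concrete, here is a minimal sketch of the group-relative advantage at the heart of GRPO, assuming binary verifier rewards; the function name, shapes, and example values are illustrative rather than taken from any specific implementation:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: each sampled completion is scored against the
    mean and std of rewards within its group of samples for the same prompt,
    so 'good' outputs are pushed up and 'bad' ones down without a learned value model."""
    # rewards: shape (num_prompts, group_size), e.g. verifier scores in {0, 1}
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: one prompt, four sampled completions, the verifier marked two as correct.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))  # correct samples get positive advantage, incorrect negative
```

Because the advantages are computed only over completions the current policy actually sampled, the update can amplify what the base model already produces but cannot reward behavior it never generates.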
GRPO also lacks an internal exploration mechanism like AlphaZero’s Monte Carlo Tree Search (MCTS), which can leave it stuck in local optima. On the other hand, the surprising out-of-distribution (OOD) generalization shown by reasoning models suggests they acquire transferable skills from the base model rather than merely overfitting the Chain-of-Thought (CoT) training data.
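For contrast, a minimal sketch of the PUCT selection rule used by AlphaZero-style MCTS is shown below; the constant and example numbers are illustrative. The point is that the score contains an explicit bonus for rarely-visited actions, a term GRPO has no analogue of:

```python
import math

def puct_score(q_value: float, prior: float, parent_visits: int,
               child_visits: int, c_puct: float = 1.5) -> float:
    """AlphaZero-style PUCT: exploitation (q_value) plus an exploration bonus
    that grows for actions the search has rarely tried."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + exploration

# A rarely-visited action can outrank a well-explored one with a higher estimated value:
print(puct_score(q_value=0.2, prior=0.3, parent_visits=100, child_visits=1))   # ~2.45
print(puct_score(q_value=0.6, prior=0.3, parent_visits=100, child_visits=50))  # ~0.69
```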
The central question is whether a base model with the same prior knowledge as Einstein could independently discover relativity. This hinges on whether the theory of relativity exists as a non-zero-probability output among the base model’s possible generations. Given the stochastic nature of LLMs, and the fact that virtually every valid English sequence is assigned some non-zero probability, this is conceivable, if highly improbable. An RL system with enough exploration capacity to escape local optima could, in principle, unlock such a reasoning pathway.
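The underlying point can be checked directly: an autoregressive LM assigns a strictly positive (if astronomically small) probability to any token sequence, because the softmax over the vocabulary never outputs exact zeros. A minimal sketch, using "gpt2" purely as a small placeholder model (not one discussed by the author):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sequence_logprob(text: str) -> float:
    """Sum of log-probabilities the model assigns to each token of `text`
    given its prefix (the very first token is not scored)."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)       # predictions for tokens 2..n
    token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

# Finite (non-zero probability), even for a statement the model was never trained to "discover".
print(sequence_logprob("Moving clocks run slow relative to a stationary observer."))
```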
Despite these limitations, the author proposes that with extensive training GRPO might develop emergent exploration skills, effectively prompting itself to think more creatively. Whether current methodologies can lead to AGI remains an open question, but GRPO and RLVR could pave the way by simulating exploration and exploiting the fact that novel discoveries appear with non-zero probability in the base model’s output distribution.