
NVIDIA NeMo-RL Uses GRPO for Advanced Reinforcement Learning
Peter Zhang
Jul 10, 2025 06:07
NVIDIA introduces NeMo-RL, an open-source library for reinforcement learning, enabling scalable training with GRPO and integration with Hugging Face models.
NVIDIA has unveiled NeMo-RL, an open-source library designed to strengthen reinforcement learning (RL) capabilities, according to NVIDIA's official blog. The library supports scalable model training, ranging from single-GPU prototypes to thousand-GPU deployments, and integrates seamlessly with popular frameworks such as Hugging Face.
NeMo-RL's Architecture and Features
NeMo-RL is part of the broader NVIDIA NeMo Framework, known for its versatility and high-performance capabilities. The library includes native integration with Hugging Face models and optimized training and inference processes. It supports popular RL algorithms such as DPO and GRPO and employs Ray-based orchestration for efficiency.
The architecture of NeMo-RL is designed with flexibility in mind. It supports various training and rollout backends, ensuring that high-level algorithm implementations remain agnostic to backend specifics. This design allows models to scale seamlessly without changes to algorithm code, making the library suitable for both small-scale and large-scale deployments.
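To illustrate the idea of keeping algorithm code independent of the backend, the sketch below shows one way such a separation can be expressed in Python. The class and method names here are hypothetical placeholders, not NeMo-RL's actual API; they only show the pattern of algorithm code depending on an interface rather than a specific training or rollout engine.

```python
# Hypothetical sketch of a backend-agnostic rollout interface.
# These names are illustrative only and are NOT NeMo-RL's real API.
from typing import Protocol


class RolloutBackend(Protocol):
    def generate(self, prompts: list[str], max_new_tokens: int) -> list[str]:
        """Produce one completion per prompt."""
        ...


class ToyLocalBackend:
    """Toy single-process backend; a Ray-distributed backend would expose the same method."""

    def generate(self, prompts: list[str], max_new_tokens: int) -> list[str]:
        # Placeholder generation; a real backend would call a model here.
        return [p + " <completion>" for p in prompts]


def collect_rollouts(backend: RolloutBackend, prompts: list[str]) -> list[str]:
    # Algorithm-level code (e.g., a GRPO loop) talks only to the interface,
    # so swapping backends requires no changes to this function.
    return backend.generate(prompts, max_new_tokens=256)


if __name__ == "__main__":
    print(collect_rollouts(ToyLocalBackend(), ["What is 2 + 2?"]))
```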
Implementing DeepScaleR with GRPO
The blog post explores the application of NeMo-RL to reproduce a DeepScaleR-1.5B recipe using the Group Relative Policy Optimization (GRPO) algorithm. This involves training high-performing reasoning models, such as Qwen-1.5B, to compete with OpenAI's O1 on the AIME24 academic math benchmark.
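For context, the core idea of GRPO is to sample a group of responses per prompt and score each one against the group's own statistics, removing the need for a separate learned value model. A minimal sketch of that group-relative advantage computation (the general algorithm, not NeMo-RL's specific implementation) is shown below.

```python
import numpy as np


def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages for one group of sampled responses.

    Each response's reward is normalized by the mean and standard deviation
    of its group, so no value function needs to be trained.
    """
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)


# Example: 4 responses sampled for one math prompt, rewarded 1 if correct else 0.
rewards = np.array([1.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards))  # correct answers get positive advantage
```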
The training process is structured in three stages, each increasing the maximum sequence length used: starting at 8K, then 16K, and finally 24K. This gradual increase helps manage the distribution of rollout sequence lengths, optimizing the training process.
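Expressed schematically, the staged schedule amounts to running the same GRPO recipe three times with a growing context budget, each stage resuming from the previous checkpoint. The function below is a placeholder, assuming that in practice each stage corresponds to launching training with a different configuration; it is not a real NeMo-RL entry point.

```python
# Hypothetical sketch of the three-stage sequence-length schedule described above.
# `run_grpo_stage` is a placeholder, not a real NeMo-RL function.

def run_grpo_stage(max_sequence_length: int, init_from: str | None) -> str:
    """Stand-in for one GRPO training stage; returns a checkpoint identifier."""
    print(f"training with max_sequence_length={max_sequence_length}, init_from={init_from}")
    return f"ckpt_{max_sequence_length}"


checkpoint = None
for stage_length in (8_192, 16_384, 24_576):  # 8K -> 16K -> 24K
    checkpoint = run_grpo_stage(max_sequence_length=stage_length, init_from=checkpoint)
```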
Training Process and Evaluation
The training setup involves cloning the NeMo-RL repository and installing the necessary packages. Training is conducted in stages, with the model evaluated regularly to ensure performance benchmarks are met. The results showed that NeMo-RL achieved a training reward of 0.65 in only 400 steps.
Evaluation on the AIME24 benchmark showed that the trained model surpassed OpenAI O1, highlighting the effectiveness of NeMo-RL when combined with the GRPO algorithm.
Getting Started with NeMo-RL
NeMo-RL is available as open source, with detailed documentation and example scripts in its GitHub repository. This resource is well suited to those looking to experiment with reinforcement learning using scalable and efficient methods.
The library's integration with Hugging Face and its modular design make it a powerful tool for researchers and developers seeking to apply advanced RL techniques in their projects.
Image source: Shutterstock