Feng Yao*, Liyuan Liu*, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, Jianfeng Gao
*: Equal Contributions (Work in Progress)
First Published on August 5, 2025 | https://github.com/yaof20/verl/tree/flash-rl/recipe/flash_rl
<aside>
TL;DR
In modern RL training frameworks (e.g., VeRL), different implementations are used for rollout generation (e.g., vLLM) and model training (e.g., FSDP). Here, we show that this implementation gap implicitly turns on-policy RL into off-policy RL, and discuss a simple yet effective importance sampling technique for handling the discrepancy.
</aside>
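As a concrete preview of the correction summarized above, the sketch below applies a per-token importance weight (the ratio between the training-backend probability and the rollout-engine probability, truncated for stability) to a REINFORCE-style loss. This is a minimal illustration, not the exact implementation in the linked repo; the tensor names and the clipping threshold are assumptions.

```python
import torch

def is_corrected_pg_loss(train_logps, rollout_logps, rewards, clip_max=2.0):
    """REINFORCE-style loss with a per-token importance-sampling correction.

    train_logps   : log-probs of sampled tokens under the training backend (e.g., FSDP), [B, T]
    rollout_logps : log-probs of the same tokens reported by the rollout engine (e.g., vLLM), [B, T]
    rewards       : per-sequence scalar rewards, [B]
    clip_max      : truncation threshold for the importance ratio (illustrative value)
    """
    # Importance ratio pi_fsdp / pi_vllm, detached so it only reweights the gradient.
    ratio = torch.exp(train_logps - rollout_logps).detach()
    ratio = ratio.clamp(max=clip_max)  # truncate large ratios to control variance

    # Policy-gradient surrogate: reward-weighted log-likelihood, reweighted by the ratio.
    per_token = ratio * rewards.unsqueeze(-1) * train_logps
    return -per_token.mean()
```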
Figure 1. Left: Token probability differences caused by the mismatch problem. Right: Performance comparison between normal RL training and training after fixing the mismatch problem. Experiments are conducted on the Qwen2.5-32B dense model using 4 nodes of 8 H100 GPUs. [wandb log]
<aside> 💡
[News] OpenRLHF has implemented our algorithm: https://github.com/OpenRLHF/OpenRLHF/releases/tag/v0.8.9
</aside>
For simplicity, we use the REINFORCE algorithm as an example, which is supposed to update the policy $\pi$ — an LLM parameterized by $\theta$ — via:
$$ \theta \gets \theta + \mu \cdot \mathbb{E}_{\underbrace{a \sim \pi(\theta)}_{\text{rollout}}} \big[ R(a)\cdot \underbrace{\nabla_\theta \log \pi(a, \theta)}_{\text{training}} \big]. $$
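In code, the ideal on-policy update looks roughly like the sketch below (PyTorch-style; `policy.sample`, `policy.log_prob`, and `reward_fn` are illustrative placeholders, not APIs from the linked repo).

```python
import torch

def reinforce_step(policy, optimizer, prompts, reward_fn):
    # Rollout: sample actions (token sequences) from the *current* policy.
    actions = policy.sample(prompts)              # a ~ pi(theta)
    rewards = reward_fn(prompts, actions)         # R(a), shape [B]

    # Training: reward-weighted log-likelihood of the sampled actions
    # under the *same* policy -- the on-policy assumption.
    logps = policy.log_prob(prompts, actions)     # log pi(a, theta), shape [B]
    loss = -(rewards * logps).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```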
In practice, rollout generation is expensive, so modern RL frameworks (e.g., VeRL) typically employ highly optimized inference engines (e.g., vLLM) to boost throughput, while using a separate backend (e.g., FSDP) for model training. This hybrid design turns the update into:
$$ \theta \gets \theta + \mu \cdot \mathbb{E}_{a \sim \textcolor{red}{\pi_{\text{vllm}}}(\theta)} \big[ R(a)\cdot \nabla_\theta \log \textcolor{blue}{\pi_{\text{fsdp}}}(a, \theta) \big]. $$
Here, we use $\pi_{\text{fsdp}}$ to denote the model instantiated with the training backend (e.g., FSDP, Megatron) and $\pi_{\text{vllm}}$ to represent the same model loaded with the inference engine (e.g., vLLM, SGLang).
This reveals an unexpected rollout-training mismatch. As shown in Figure 1, although $\textcolor{blue}{\pi_{\text{fsdp}}}$ and $\textcolor{red}{\pi_{\text{vllm}}}$ share the same model parameters $\theta$, they can produce significantly different token probabilities. For certain tokens $a$, they even yield contradictory predictions — $\textcolor{red}{\pi_{\text{vllm}}}(a, \theta)\!=\!1$ while $\textcolor{blue}{\pi_{\text{fsdp}}}(a, \theta)\!=\!0$. This behavior implicitly breaks the on-policy assumption and silently turns the RL training off-policy.
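A simple way to surface this mismatch yourself (which is what the left panel of Figure 1 visualizes) is to recompute, under the training backend, the log-probabilities of the exact tokens that the inference engine sampled, and compare the two. A minimal diagnostic sketch, assuming both sets of per-token log-probabilities are already gathered into tensors (the function and tensor names are illustrative):

```python
import torch

def mismatch_stats(rollout_logps, train_logps):
    """Quantify the rollout-training probability gap on the sampled tokens.

    rollout_logps : log pi_vllm(a_t, theta) for each sampled token, [B, T]
    train_logps   : log pi_fsdp(a_t, theta) for the same tokens,    [B, T]
    """
    p_rollout = rollout_logps.exp()
    p_train = train_logps.exp()
    diff = (p_rollout - p_train).abs()

    # Tokens where the two backends nearly contradict each other,
    # i.e. pi_vllm ~ 1 while pi_fsdp ~ 0 (or vice versa).
    contradict = ((p_rollout > 0.99) & (p_train < 0.01)) | \
                 ((p_rollout < 0.01) & (p_train > 0.99))

    return {
        "mean_abs_prob_diff": diff.mean().item(),
        "max_abs_prob_diff": diff.max().item(),
        "contradiction_rate": contradict.float().mean().item(),
    }
```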
Does higher-precision vLLM help? We hypothesized that vLLM was the root cause, so we patched it to address two commonly suspected contributors to the mismatch problem.
Inaccessible true sampling probabilities: the vLLM v1 engine does not support directly returning the adjusted probabilities actually used for sampling, which introduces an additional gap.
→ Our patch forces vLLM to return the actual probabilities used for sampling.
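For context, the stock vLLM API already exposes per-token log-probabilities via `SamplingParams(logprobs=...)`; the point above is that, in the v1 engine, these are not the adjusted probabilities actually used for sampling, which is what the patch changes. A rough sketch of collecting rollout log-probs for the sampled tokens with the public API (the model choice is illustrative, and field access may vary across vLLM versions):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-32B")  # illustrative model choice
# logprobs=0: return only the log-prob of the sampled token at each step.
params = SamplingParams(max_tokens=64, logprobs=0)

outputs = llm.generate(["Prove that 1 + 1 = 2."], params)

for req in outputs:
    gen = req.outputs[0]
    # gen.logprobs is a list (one entry per generated token) mapping token_id -> Logprob.
    rollout_logps = [
        token_logprobs[token_id].logprob
        for token_id, token_logprobs in zip(gen.token_ids, gen.logprobs)
    ]
    # Note: without the patch described above, these are the model's raw log-probs,
    # not the adjusted probabilities the sampler actually used.
```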