Feng Yao*, Liyuan Liu*, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, Jianfeng Gao

*: Equal Contributions (Work in Progress)

First Published on August 5, 2025 | https://github.com/yaof20/verl/tree/flash-rl/recipe/flash_rl

<aside>

TL;DR

In modern RL training frameworks (e.g., VeRL), different implementations are used for rollout generation (e.g., vLLM) and model training (e.g., FSDP). Here, we show that this implementation gap implicitly turns on-policy RL into off-policy RL, and discuss a simple yet effective importance-sampling technique for handling the discrepancy.

</aside>

[Figure: dapo_32b.png]

Figure 1. Left: Token probability differences caused by the mismatch problem. Right: Performance comparison between normal RL training and training after fixing the mismatch problem. Experiments are conducted on the Qwen2.5-32B dense model using 4 nodes of 8 H100 GPUs each. [wandb log]

<aside> 💡

[News] OpenRLHF has implemented our algorithm: https://github.com/OpenRLHF/OpenRLHF/releases/tag/v0.8.9

</aside>

The Mismatch Problem

For simplicity, we use the REINFORCE algorithm as an example, which, in principle, updates the policy $\pi$ — an LLM parameterized by $\theta$ — via:

$$ \theta \gets \theta + \mu \cdot \mathbb{E}_{\underbrace{a \sim \pi(\theta)}_{\text{rollout}}} [R(a)\cdot \underbrace{\nabla_\theta \log \pi(a, \theta)}_{\text{training}}]. $$
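For readers who prefer code, below is a minimal sketch of this update in PyTorch, assuming a Hugging Face-style causal LM interface; `policy`, `reward_fn`, and `optimizer` are illustrative placeholders rather than parts of any framework mentioned here.

```python
import torch

def reinforce_step(policy, prompt_ids, reward_fn, optimizer, max_new_tokens=128):
    """One REINFORCE update: the SAME model both samples the rollout and scores it."""
    prompt_len = prompt_ids.size(1)

    # Rollout: sample a ~ pi(theta); no gradients are needed for generation.
    with torch.no_grad():
        full_ids = policy.generate(prompt_ids, do_sample=True, max_new_tokens=max_new_tokens)
    action_ids = full_ids[:, prompt_len:]

    # Training: recompute log pi(a, theta) with gradients enabled.
    logits = policy(full_ids).logits
    # Logits at position t predict token t+1, so shift to align with the sampled tokens.
    gen_logits = logits[:, prompt_len - 1:-1, :]
    token_logp = torch.log_softmax(gen_logits, dim=-1).gather(
        -1, action_ids.unsqueeze(-1)).squeeze(-1)            # [batch, gen_len]

    # REINFORCE: ascend R(a) * grad log pi(a, theta); negate because optimizers minimize.
    reward = reward_fn(action_ids)                            # [batch], one scalar reward per rollout
    loss = -(reward * token_logp.sum(dim=-1)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the same `policy` object produces both the samples and the log-probabilities, this update is genuinely on-policy; the problem discussed next arises when these two roles are handled by different implementations.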

In practice, rollout generation is expensive, so modern RL frameworks (e.g., VeRL) typically employ highly optimized inference engines (e.g., vLLM) to boost throughput, while using a separate backend (e.g., FSDP) for model training. This hybrid design turns the update into:

$$ \theta \gets \theta + \mu \cdot \mathbb{E}_{a \sim \textcolor{red}{\pi_{\text{vllm}}}(\theta)} [R(a)\cdot \nabla_\theta \log \textcolor{blue}{\pi_{\text{fsdp}}}(a, \theta)]. $$

Here, we use $\pi_{\text{fsdp}}$ to denote the model instantiated with the training backend (e.g., FSDP, Megatron) and $\pi_{\text{vllm}}$ to represent the same model loaded with the inference engine (e.g., vLLM, SGLang).
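One way to observe the gap empirically is to ask the inference engine for the log-probabilities of its own sampled tokens and then recompute those log-probabilities with the training-side model. The sketch below uses a plain Hugging Face model as a stand-in for the FSDP-wrapped policy and a small placeholder checkpoint; it is not the authors' measurement code, and exact vLLM output fields may vary across versions.

```python
import torch
from vllm import LLM, SamplingParams
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B"          # small placeholder checkpoint for illustration
prompt = "Question: what is 12 * 13? Answer:"

# --- pi_vllm: sample a rollout and record its per-token log-probabilities ---
llm = LLM(model=MODEL)
sampling = SamplingParams(temperature=1.0, max_tokens=64, logprobs=1)
completion = llm.generate([prompt], sampling)[0].outputs[0]
token_ids = list(completion.token_ids)
vllm_logps = torch.tensor(
    [completion.logprobs[i][tid].logprob for i, tid in enumerate(token_ids)])

# --- pi_fsdp stand-in: recompute log-probs of the SAME tokens with an HF model ---
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
prompt_ids = tok(prompt, return_tensors="pt").input_ids
gen_ids = torch.tensor([token_ids])
full_ids = torch.cat([prompt_ids, gen_ids], dim=-1)
with torch.no_grad():
    logits = model(full_ids).logits.float()
train_logps = torch.log_softmax(
    logits[:, prompt_ids.size(1) - 1:-1, :], dim=-1
).gather(-1, gen_ids.unsqueeze(-1)).squeeze()

# --- Same weights, different implementations: compare token probabilities ---
prob_gap = (train_logps.exp() - vllm_logps.exp()).abs()
print(f"max per-token probability gap: {prob_gap.max().item():.4f}")
```

In an actual VeRL run, the training-side probabilities would come from the FSDP- or Megatron-sharded policy rather than a plain Hugging Face model, but the comparison is the same: identical parameters $\theta$, different implementations, measurably different $\pi(a, \theta)$.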

This creates an unexpected rollout-training mismatch. As shown in Figure 1, despite $\textcolor{blue}{\pi_{\text{fsdp}}}$ and $\textcolor{red}{\pi_{\text{vllm}}}$ sharing the same model parameters $\theta$, they can produce significantly different token probabilities. For certain tokens $a$, they even yield contradictory predictions — $\textcolor{red}{\pi_{\text{vllm}}}(a, \theta)\!=\!1$ and $\textcolor{blue}{\pi_{\text{fsdp}}}(a, \theta)\!=\!0$. This unexpected behavior implicitly breaks the on-policy assumption, silently turning the RL training off-policy.
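Concretely, if samples keep coming from $\textcolor{red}{\pi_{\text{vllm}}}$ while gradients are taken through $\textcolor{blue}{\pi_{\text{fsdp}}}$, the standard importance-sampling identity suggests reweighting each sample by the likelihood ratio between the two policies. The following is a sketch of the general idea behind the fix discussed below, not necessarily the exact FlashRL estimator:

$$ \mathbb{E}_{a \sim \textcolor{red}{\pi_{\text{vllm}}}(\theta)}\left[\frac{\textcolor{blue}{\pi_{\text{fsdp}}}(a, \theta)}{\textcolor{red}{\pi_{\text{vllm}}}(a, \theta)}\cdot R(a)\cdot \nabla_\theta \log \textcolor{blue}{\pi_{\text{fsdp}}}(a, \theta)\right] = \mathbb{E}_{a \sim \textcolor{blue}{\pi_{\text{fsdp}}}(\theta)}\left[R(a)\cdot \nabla_\theta \log \textcolor{blue}{\pi_{\text{fsdp}}}(a, \theta)\right], $$

which recovers the on-policy gradient as long as $\textcolor{red}{\pi_{\text{vllm}}}(a, \theta) > 0$ wherever $\textcolor{blue}{\pi_{\text{fsdp}}}(a, \theta) > 0$. The contradictory-prediction cases above are exactly where this ratio becomes ill-behaved and requires extra care in practice.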

How to Fix It?

Mitigate the system-level mismatch

Does higher-precision vLLM help? We hypothesized that vLLM was the root cause, so we patched vLLM to address two commonly suspected contributors to the mismatch problem.