**Feng Yao\*, Liyuan Liu\*, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, Jianfeng Gao**
*: Equal Contributions (Work in Progress)
Last Updated on October 13, 2025 | First Published on August 5, 2025 | [GitHub]
<aside>
TL;DR
In modern RL training frameworks (e.g., VeRL), different implementations are used for rollout generation (e.g., vLLM) and model training (e.g., FSDP). Here, we show that this implementation gap implicitly turns on-policy RL into off-policy RL, and discuss a simple yet effective importance sampling technique for handling the discrepancy.
</aside>
Figure 1. Left: Token probability differences caused by the mismatch problem. Right: Performance comparison between normal RL training and training after fixing the mismatch problem. Experiments are conducted on the Qwen2.5-32B dense model using 4 nodes of 8 H100 GPUs. [wandb log]
<aside>
[News] Added Policy Gradient, Sequence, and Token: Part I & Part II. 2025/10/13
[News] Blog updated with Rollout-Training Mismatch Analysis section. 2025/09/24
[News] Blog updated with TIS Analysis section. 2025/08/22
[News] slime has integrated TIS. [GitHub]
[News] VeRL has integrated TIS. [GitHub][Example]
[News] OAT has verified and implemented TIS. [GitHub] [Tweet from OAT]
[News] SkyRL has integrated TIS. [GitHub] [Tweet from SkyRL]
[News] REINFORCE++ verified TIS in the Tool-Integrated Reasoning (TIR) setting. [Blog]
[News] OpenRLHF has integrated TIS. [GitHub]
</aside>
For simplicity, we use the REINFORCE algorithm as an example, which, in principle, updates the policy $\pi$ (an LLM parameterized by $\theta$) via:
$$ \theta \gets \theta + \mu \cdot \mathbb{E}_{\underbrace{a \sim \pi(\theta)}_{\text{rollout}}} \big[ R(a) \cdot \underbrace{\nabla_\theta \log \pi(a, \theta)}_{\text{training}} \big]. $$
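To make this update concrete, here is a minimal PyTorch-style sketch of a single REINFORCE step. The helpers `sample_response` and `reward_fn`, as well as the `policy.log_prob` method, are hypothetical placeholders rather than the API of any specific framework.

```python
import torch

def reinforce_step(policy, optimizer, prompt, sample_response, reward_fn):
    """One REINFORCE update: roll out with pi(theta), then ascend R(a) * grad log pi(a, theta).

    `sample_response` and `reward_fn` are hypothetical helpers; `policy` is assumed
    to expose per-token log-probabilities of a generated response.
    """
    # Rollout: a ~ pi(theta)
    with torch.no_grad():
        response_ids = sample_response(policy, prompt)

    reward = reward_fn(prompt, response_ids)            # scalar R(a)

    # Training: log pi(a, theta), recomputed with gradients enabled
    log_probs = policy.log_prob(prompt, response_ids)   # per-token log pi(a_t | prefix)
    loss = -(reward * log_probs.sum())                  # negate for gradient descent

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```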
In practice, rollout generation is expensive, and modern RL frameworks (e.g., VeRL) typically employ highly optimized inference engines (e.g., vLLM, SGLang) to boost throughput, while using a separate backend (e.g., FSDP, Megatron) for model training. Such a hybrid design turns the update into:
$$ \theta \gets \theta + \mu \cdot \mathbb{E}_{a \sim \textcolor{red}{\pi_{\text{sampler}}}(\theta)} \big[ R(a) \cdot \nabla_\theta \log \textcolor{blue}{\pi_{\text{learner}}}(a, \theta) \big]. $$
Here, we use $\pi_{\rm sampler}$ to represent the model loaded with the inference engine (e.g., vLLM, SGLang) and $\pi_{\rm learner}$ to denote the same model instantiated with the training backend (e.g., FSDP, Megatron). Unless otherwise specified, our experiments use vLLM and FSDP as the sampler and learner backends.
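Below is a minimal sketch of this hybrid design, assuming vLLM as the sampler and a plain HuggingFace causal LM standing in for the FSDP-wrapped learner. The model name and `reward_fn` are placeholders, and weight synchronization between the two backends is omitted.

```python
import torch
from vllm import LLM, SamplingParams
from transformers import AutoModelForCausalLM

MODEL = "Qwen/Qwen2.5-1.5B"  # example model; the blog's experiments use Qwen2.5-32B

sampler = LLM(model=MODEL)                              # pi_sampler (inference engine)
learner = AutoModelForCausalLM.from_pretrained(MODEL)   # pi_learner (FSDP-wrapped in practice)

def hybrid_reinforce_loss(prompt: str, reward_fn) -> torch.Tensor:
    # Rollout: a ~ pi_sampler(theta), generated by the inference engine
    out = sampler.generate([prompt], SamplingParams(temperature=1.0, max_tokens=256))[0]
    prompt_ids = list(out.prompt_token_ids)
    response_ids = list(out.outputs[0].token_ids)
    reward = reward_fn(prompt, response_ids)            # scalar R(a); hypothetical helper

    # Training: log pi_learner(a, theta), recomputed by the training backend
    input_ids = torch.tensor([prompt_ids + response_ids])
    logits = learner(input_ids).logits[0, :-1]          # position t predicts token t+1
    logp = torch.log_softmax(logits, dim=-1)
    pos = list(range(len(prompt_ids) - 1, len(prompt_ids) + len(response_ids) - 1))
    token_logps = logp[pos, response_ids]               # log pi_learner(a_t | prefix)

    return -(reward * token_logps.sum())
```

Note that in this sketch the samples come from `sampler` while the gradient flows only through `learner`, which is exactly the structure of the update above.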
**An unexpected rollout-training mismatch is observed.** As shown in Figure 1, despite $\textcolor{blue}{\pi_{\text{fsdp}}}$ and $\textcolor{red}{\pi_{\text{vllm}}}$ sharing the same model parameters $\theta$, they can produce significantly different token probabilities. For certain tokens $a$, they even yield contradictory predictions, i.e., $\textcolor{red}{\pi_{\text{vllm}}}(a, \theta)\!=\!1$ and $\textcolor{blue}{\pi_{\text{fsdp}}}(a, \theta)\!=\!0$. This unexpected behavior implicitly breaks the on-policy assumption, silently turning the RL training off-policy.
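One rough way to surface this gap (not the authors' measurement code) is to request the sampled tokens' log-probabilities from vLLM via the `logprobs` sampling parameter and compare them against learner-side log-probabilities recomputed as in the sketch above:

```python
import torch
from vllm import SamplingParams

# Ask vLLM to report the log-prob of each chosen token alongside the generation.
params = SamplingParams(temperature=1.0, max_tokens=256, logprobs=0)

def prob_gap(request_output, learner_logps: torch.Tensor) -> torch.Tensor:
    """Per-token |pi_sampler - pi_learner| for one sampled response.

    `request_output` is a vLLM RequestOutput generated with the params above;
    `learner_logps` holds the same tokens' log-probs under the training backend.
    """
    completion = request_output.outputs[0]
    sampler_logps = torch.tensor(
        [step[tok].logprob for step, tok in zip(completion.logprobs, completion.token_ids)]
    )
    return (sampler_logps.exp() - learner_logps.detach().exp()).abs()
```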
**Does a higher-precision vLLM help?** We first hypothesized that vLLM was the root cause, and thus patched vLLM to address two commonly suspected contributors to the mismatch problem.
**Inaccessible true sampling probabilities:** The vLLM v1 engine does not support directly returning the adjusted probabilities used for sampling, which introduces an additional gap.
→ Our patch forces vLLM to return the actual probabilities used for sampling [upstreamed].
**Backend numerical differences:** The precision of vLLM's lm_head does not match that of HuggingFace transformers, as also noted in the MiniMax-M1 technical report.
→ Our patch provides an option to force vLLM to cast lm_head to fp32 (a rough illustration follows below).
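For intuition about the fp32 `lm_head` option, here is a rough HuggingFace-side analogue (not the vLLM patch itself, which lives inside the engine); the model name is only an example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-1.5B"  # example model name

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16).eval()

input_ids = tok("The mismatch problem", return_tensors="pt").input_ids

with torch.no_grad():
    hidden = model.model(input_ids).last_hidden_state                 # final hidden states (bf16)

    logits_bf16 = hidden @ model.lm_head.weight.T                     # default-precision head
    logits_fp32 = hidden.float() @ model.lm_head.weight.float().T     # fp32 head, as in the patch option

    probs_bf16 = torch.softmax(logits_bf16.float(), dim=-1)
    probs_fp32 = torch.softmax(logits_fp32, dim=-1)
    print("max per-token prob shift from fp32 head:",
          (probs_bf16 - probs_fp32).abs().max().item())
```

Up-casting only the output projection is relatively cheap compared with the rest of the forward pass, which is why it is a natural first knob to try.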