Liyuan Liu*, Feng Yao*, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, Jianfeng Gao
*: Equal Contributions (Work in Progress)
First published on August 11, 2025 | GitHub: https://github.com/yaof20/Flash-RL | https://pypi.org/project/flash-llm-rl/
<aside>
TL;DR
Rollout generation is a primary bottleneck in RL training, taking up ~70% of the total training time in DAPO-32B. FlashRL provides the first open-sourced, working RL recipe that applies quantized rollout generation while preserving downstream performance via the TIS technique. It can be installed via `pip install flash-llm-rl` and supports both INT8 and FP8 quantization, on both the latest GPUs (H100) and older ones (A100).
</aside>
Figure 1. Left: Throughput speedup ratio. FP8 results are measured on H100; INT8 results are measured on both H100 and A100. Results are obtained with various response lengths. Right: AIME accuracy of the Qwen2.5-32B model with BF16 rollouts and INT8 rollouts. All runs use the BF16 FSDP training backend. “TIS” denotes the truncated importance sampling technique we propose. [wandb]
As shown by the “$\textcolor{grey}{\small\cdot\cdot\cdot}$” lines in Figures 1 & 2, employing rollout quantization (FP8, INT8) without TIS incurs a significant performance drop compared to BF16 rollouts.
This is expected, as it amplifies the rollout–training mismatch: rollouts are sampled from the quantized policy $\textcolor{red}{\pi_{\text{int8}}}$, but gradients are computed under the high-precision policy $\textcolor{blue}{\pi_{\text{bf16}}}$:
$$ \small{ \underbrace{\mathbb{E}_{a\sim\textcolor{blue}{\pi_{\text{bf16}}}(\theta_{\mathrm{old}})}}_{\text{int8 rollout: }\;\textcolor{blue}{\pi_{\text{bf16}}} \to \textcolor{red}{\pi_{\text{int8}}}} \Bigl[ \nabla_\theta \min\Bigl( \frac{\textcolor{blue}{\pi_{\text{bf16}}}(a, \theta)}{\textcolor{blue}{\pi_{\text{bf16}}}(a, \theta_{\mathrm{old}})}\,\hat A, \;\mathrm{clip}\bigl(\frac{\textcolor{blue}{\pi_{\text{bf16}}}(a, \theta)}{\textcolor{blue}{\pi_{\text{bf16}}}(a, \theta_{\mathrm{old}})},\,1-\epsilon,\,1+\epsilon\bigr)\,\hat A \Bigr) \Bigr] }. $$
This mismatch makes RL more off-policy, undermining the effectiveness of RL training.
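To make the correction concrete, below is a minimal sketch of a PPO-style surrogate loss with a truncated importance-sampling (TIS) weight that re-weights tokens sampled from the quantized rollout policy back toward the high-precision training policy. The function name, argument layout, and the `tis_cap` default are illustrative assumptions, not the actual FlashRL implementation.

```python
import torch

def tis_policy_loss(logp_train,      # log pi_bf16(a; theta): current training policy
                    logp_train_old,  # log pi_bf16(a; theta_old): training policy at rollout time
                    logp_rollout,    # log pi_int8(a): quantized rollout policy (e.g., from vLLM)
                    advantages,      # per-token advantage estimates \hat{A}
                    clip_eps=0.2,
                    tis_cap=2.0):    # truncation threshold C (hypothetical default)
    """PPO-style surrogate with a truncated importance-sampling correction.
    A sketch of the idea, not the exact FlashRL code."""
    # Standard PPO ratio, computed entirely under the high-precision policy.
    ratio = torch.exp(logp_train - logp_train_old)
    surrogate = torch.minimum(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    )
    # TIS weight: corrects for tokens being sampled from pi_int8 rather than
    # pi_bf16; truncation at tis_cap bounds the variance of the estimator.
    tis_weight = torch.exp(logp_train_old - logp_rollout).clamp(max=tis_cap).detach()
    return -(tis_weight * surrogate).mean()
```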
To the best of our knowledge, FlashRL provides the first open-sourced & working RL recipe that employs quantized rollout without sacrificing downstream performance.
What is the secret sauce?
Our implementation is released as the Flash-LLM-RL package, which patches vLLM to support quantized rollout generation.

Figure 2. Left & Middle: GSM8K accuracy of RL LLM training with quantized rollout generation. Note that TIS is crucial for mitigating the distribution gap. Right: KL divergence between $\pi_{\text{fsdp}}$ and $\pi_{\text{vllm}}$. Note that the KL divergence of INT8 rollouts is larger than that of FP8 rollouts. [int8][fp8]
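The right panel of Figure 2 tracks how far the rollout policy served by vLLM drifts from the FSDP training policy. A minimal sketch of such a diagnostic is below, estimating the KL from per-token log-probabilities of sampled rollout tokens; the KL direction and estimator are assumptions and may differ from what the figure actually reports.

```python
import torch

def rollout_kl(logp_vllm: torch.Tensor, logp_fsdp: torch.Tensor) -> torch.Tensor:
    """Monte-Carlo estimate of KL(pi_vllm || pi_fsdp) over rollout tokens.
    Tokens are sampled from the vLLM rollout policy, so averaging
    log pi_vllm - log pi_fsdp over them estimates the forward KL."""
    return (logp_vllm - logp_fsdp).mean()
```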