Liyuan Liu*, Feng Yao*, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, Jianfeng Gao

*: Equal Contributions (Work in Progress)

First published on August 11, 2025 | GitHub: https://github.com/yaof20/Flash-RL | https://pypi.org/project/flash-llm-rl/

<aside>

TL;DR

Rollout generation is a primary bottleneck in RL training, accounting for ~70% of total training time in DAPO-32B. FlashRL provides the first open-sourced & working RL recipe that applies quantized rollout generation while preserving downstream performance via truncated importance sampling (TIS). It can be installed via `pip install flash-llm-rl` and supports both INT8 and FP8 quantization, on both the latest GPUs (H100) and older ones (A100).

</aside>

![Figure 1](dapo_32b_w_speedup.png)

Figure 1. Left: Throughput speedup ratio. FP8 results are measured on H100; INT8 results are measured on both H100 and A100. Results are obtained with various response lengths. Right: AIME accuracy of the Qwen2.5-32B model with BF16 rollouts and INT8 rollouts. All runs use the BF16 FSDP training backend. “TIS” denotes the truncated importance sampling technique we propose. [wandb]

Rollout Quantization May Hurt Performance

As shown by the “$\textcolor{grey}{\small\cdot\cdot\cdot}$” lines in Figures 1 & 2, employing rollout quantization (FP8, INT8) without TIS incurs a significant performance drop compared to BF16 rollouts.

This is expected, as quantization amplifies the **rollout–training mismatch**: rollouts are sampled from the quantized policy $\textcolor{red}{\pi_{\text{int8}}}$, but gradients are computed using the high-precision policy $\textcolor{blue}{\pi_{\text{bf16}}}$:

$$ \small{ \underbrace{\mathbb{E}_{a\sim\textcolor{blue}{\pi_{\text{bf16}}}(\theta_{\mathrm{old}})}}_{\text{int8 rollout: }\textcolor{blue}{\pi_{\text{bf16}}} \to \textcolor{red}{\pi_{\text{int8}}}} \Bigl[ \nabla_\theta \min\Bigl( \frac{\textcolor{blue}{\pi_{\text{bf16}}}(a, \theta)}{\textcolor{blue}{\pi_{\text{bf16}}}(a, \theta_{\mathrm{old}})}\,\hat A, \;\mathrm{clip}\bigl(\frac{\textcolor{blue}{\pi_{\text{bf16}}}(a, \theta)}{\textcolor{blue}{\pi_{\text{bf16}}}(a, \theta_{\mathrm{old}})},\,1-\epsilon,\,1+\epsilon\bigr)\,\hat A \Bigr) \Bigr] }. $$

This mismatch makes RL more off-policy, undermining the effectiveness of RL training.
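
For intuition, the standard importance-sampling identity shows how sampling from $\textcolor{red}{\pi_{\text{int8}}}$ can be corrected back to an expectation under $\textcolor{blue}{\pi_{\text{bf16}}}$: for any function $f$,

$$ \mathbb{E}_{a\sim\textcolor{red}{\pi_{\text{int8}}}}\Bigl[\frac{\textcolor{blue}{\pi_{\text{bf16}}}(a)}{\textcolor{red}{\pi_{\text{int8}}}(a)}\, f(a)\Bigr] \;=\; \mathbb{E}_{a\sim\textcolor{blue}{\pi_{\text{bf16}}}}\bigl[f(a)\bigr]. $$

The raw ratio, however, can have unbounded variance, which motivates the truncated variant described next.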

Secret Sauce of FlashRL

To the best of our knowledge, FlashRL provides the first open-sourced & working RL recipe that employs quantized rollout without sacrificing downstream performance.

What is the secret sauce?

  1. Rollout-Training Mismatch Fix. We apply *truncated importance sampling* (TIS) to mitigate the gap between rollout and training (see the sketch after this list). As shown by the solid lines in Figures 1 & 2, TIS pushes quantized-rollout training to the same performance level as BF16 rollout training with TIS, and even surpasses naive BF16 rollout training without TIS.
  2. Online Quantization Support. Existing inference engines such as vLLM are optimized for LLM serving and offer limited support for model quantization with on-the-fly parameter updates. We provide the Flash-LLM-RL package, which patches vLLM to support this functionality.
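
Below is a minimal sketch of a TIS-weighted, PPO-style per-token loss, assuming per-token log-probs are available from both the trainer ($\pi_{\text{bf16}}$) and the rollout engine ($\pi_{\text{int8}}$). The function name, tensor layout, and threshold value `C` are illustrative assumptions, not the exact FlashRL implementation.

```python
import torch

def tis_pg_loss(logp_train, logp_train_old, logp_rollout, advantages, eps=0.2, C=2.0):
    """PPO-style per-token loss with truncated importance sampling (TIS).

    logp_train:     log pi_bf16(a; theta)      -- current policy, requires grad
    logp_train_old: log pi_bf16(a; theta_old)  -- frozen policy, no grad
    logp_rollout:   log pi_int8(a; theta_old)  -- quantized rollout policy, no grad
    advantages:     per-token advantage estimates (A-hat)
    C:              truncation threshold for the TIS weight (illustrative value)
    """
    # Standard PPO clipped surrogate, computed entirely in training precision.
    ratio = torch.exp(logp_train - logp_train_old)
    surrogate = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages,
    )

    # TIS weight: corrects for sampling from pi_int8 instead of pi_bf16,
    # truncated at C to bound variance; no gradient flows through it.
    with torch.no_grad():
        tis_weight = torch.exp(logp_train_old - logp_rollout).clamp(max=C)

    # Maximize the weighted surrogate, i.e., minimize its negation.
    return -(tis_weight * surrogate).mean()
```

Because the weight is computed under `torch.no_grad()` from frozen $\theta_{\mathrm{old}}$ log-probs, the correction is a pure reweighting of samples; truncating at $C$ trades a small bias for bounded variance.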

![Figure 2](gsm8k_int8_fp8_TIS.png)

Figure 2. Left & Middle: GSM8K accuracy for RL training of LLMs with quantized rollout generation. Note that TIS is crucial for mitigating the distribution gap. Right: KL divergence between $\pi_{\text{fsdp}}$ and $\pi_{\text{vllm}}$. Note that the KL divergence of INT8 rollouts is larger than that of FP8 rollouts. [int8][fp8]
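
As an aside, a curve like the KL panel can be produced with a simple Monte Carlo estimator over the rollout tokens, using per-token log-probs from both backends. A minimal sketch, assuming tokens are sampled from $\pi_{\text{vllm}}$ (the function name is hypothetical, not the measurement code used here):

```python
import torch

def mc_kl_estimate(logp_vllm: torch.Tensor, logp_fsdp: torch.Tensor) -> torch.Tensor:
    """Monte Carlo estimate of KL(pi_vllm || pi_fsdp).

    For tokens a ~ pi_vllm, E[log pi_vllm(a) - log pi_fsdp(a)] equals the KL,
    so averaging the per-token log-prob gap over rollouts estimates it.
    """
    return (logp_vllm - logp_fsdp).mean()
```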

How Fast & Well Can FlashRL Go?