Part I is a quick walkthrough of policy gradients at both the token level and the sequence level, and of how the two formulations connect.
When applying Reinforcement Learning (RL) to Large Language Models (LLMs), a fundamental design choice is whether to apply policy gradient updates at the token level or at the sequence level.
For the REINFORCE algorithm, the two are identical. For more advanced methods (e.g., TRPO, PPO, GRPO, GSPO), they diverge in ways that have significant practical implications, even though both the token-level and sequence-level formulations remain reasonable.
The classic entry point to policy gradients is the REINFORCE algorithm. The goal is to maximize the expected reward by adjusting the policy's parameters, $\theta$. For a language model, the policy $\pi_\theta$ generates a sequence $y$ of $T$ tokens (actions): $y = (y_1, y_2, ..., y_T)$.
The objective function $J(\theta)$ for a sequence-level reward $R(y)$ is:
$$ \begin{equation} J(\theta) = \mathbb{E}_{y \sim \pi_\theta} [R(y)] = \sum_y R(y) \cdot \pi_\theta(y) \end{equation} $$
Applying the log-derivative trick, $\nabla_\theta \pi_\theta(y) = \pi_\theta(y)\,\nabla_\theta \log \pi_\theta(y)$, the gradient of this sequence-level objective is:
$$ \begin{equation} \nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta} \left[ R(y) \cdot \frac{\nabla_\theta \pi_\theta(y)}{\pi_\theta(y)} \right] = \mathbb{E}_{y \sim \pi_\theta} \left[ R(y) \cdot \nabla_\theta \log \pi_\theta(y) \right] \end{equation} $$
Since the probability of a sequence is the product of the probabilities of its tokens, $\pi_\theta(y) = \prod_{t=1}^T \pi_\theta(y_t \mid y_{<t})$, its log-probability decomposes as $\log \pi_\theta(y) = \sum_{t=1}^T \log \pi_\theta(y_t \mid y_{<t})$. The sequence-level gradient (Equation 2) can then be rewritten as:
$$ \begin{align} \nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta} \left[ \sum_{t=1}^T R(y) \cdot \nabla_\theta \log \pi_\theta(y_t \mid y_{<t}) \right]. \end{align} $$
This is exactly the gradient you would obtain from a token-level objective in which the full sequence reward $R(y)$ is applied to every token of the sequence.
<aside> 💡
Key Takeaway: For the on-policy REINFORCE algorithm, the sequence-level and token-level formulations are mathematically equivalent.
</aside>
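To make the equivalence concrete, here is a minimal PyTorch sketch. The model, reward, and sequence are made up for illustration (a one-step Markov "LM" stands in for a full autoregressive policy); it checks that the sequence-level surrogate $R(y)\,\log \pi_\theta(y)$ and the token-level surrogate $\sum_t R(y)\,\log \pi_\theta(y_t \mid y_{<t})$ yield identical gradients for a single sampled sequence:

```python
# Toy numerical check that the sequence-level and token-level REINFORCE
# surrogates give identical gradients for one sampled sequence y.
# The model, reward, and sequence below are invented for illustration.
import torch

torch.manual_seed(0)
vocab_size, hidden = 5, 8
emb = torch.nn.Embedding(vocab_size + 1, hidden)   # extra row serves as a BOS token
head = torch.nn.Linear(hidden, vocab_size)
params = list(emb.parameters()) + list(head.parameters())

def token_logprobs(y):
    """Per-token log-probs of y under a toy Markov model (stand-in for pi_theta(y_t | y_<t))."""
    bos = vocab_size
    inputs = torch.tensor([bos] + y[:-1])          # each step conditions on the previous token only
    logits = head(emb(inputs))                     # (T, vocab_size)
    logps = torch.log_softmax(logits, dim=-1)
    return logps[torch.arange(len(y)), torch.tensor(y)]

y = [2, 0, 4, 1]   # a "sampled" sequence (toy)
R = 1.7            # its scalar sequence-level reward (toy)

# Sequence level: R(y) * log pi_theta(y)
seq_obj = R * token_logprobs(y).sum()
seq_grads = torch.autograd.grad(seq_obj, params)

# Token level: sum_t R(y) * log pi_theta(y_t | y_<t), with the same R(y) at every token
tok_obj = (R * token_logprobs(y)).sum()
tok_grads = torch.autograd.grad(tok_obj, params)

print(all(torch.allclose(a, b) for a, b in zip(seq_grads, tok_grads)))  # True
```

The two surrogates are the same scalar up to the order of summation, so their gradients match exactly; the divergence between token-level and sequence-level formulations only appears once importance ratios or clipping enter the objective, as in the methods below.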
While vanilla REINFORCE directly optimizes the expected reward, trust region methods (e.g., TRPO, PPO, …) instead optimize a surrogate objective:
$$ \small\begin{equation} L(\theta) = \mathbb{E}_{s \sim d^{\pi_{\theta_{\rm old}}},\; a \sim \pi_{\theta_{\rm old}}(a|s)} \left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_{\rm old}}(a|s)} \cdot A^{\pi_{\theta_{\rm old}}}(s,a) \right],\; \text{s.t.}\; D_{\rm KL}(\pi_\theta \,\|\, \pi_{\theta_{\rm old}}) \leq C. \end{equation} $$
Here: