Part II discusses the impact of the learner-sampler gap on policy gradient methods and how to adapt classical RL results accordingly. The discussion and notation follow Part I.
The learner-sampler mismatch is distinct from conventional off-policy RL because it arises from system differences rather than parameter differences. This discrepancy means some classical RL results cannot be applied directly.
The core of the problem is that standard policy gradient objectives assume the data is sampled from the same policy that appears in the learning objective. When it is not, the usual derivations break down.
In the following discussion, we focus on the token-level or multi-turn sequence-level objective, which more closely resembles a Markov Decision Process (MDP).
The ideal/conventional objective, which forms the basis for algorithms like TRPO and PPO, uses samples generated by $\pi^{\rm learner}_{\theta_{\rm old}}$ to update $\pi^{\rm learner}_{\theta}$.
$$ \begin{align}\small \mathbb{E}_{s \sim d^{\pi^{\rm learner}_{\theta_{\rm old}}},~a \sim \pi^{\rm learner}_{\theta_{\rm old}}(a|s)} \left[ \frac{\pi^{\rm learner}_{\theta}(a|s)}{\pi^{\rm learner}_{\theta_{\rm old}}(a|s)} A^{\pi^{\rm learner}_{\theta_{\rm old}}}(s,a) \right],\;\text{s.t.} \; D_{KL}(\pi^{\rm learner}_{\theta}\|\pi^{\rm learner}_{\theta_{\rm old}}) \leq C. \end{align} $$
As discussed in Part I, in the conventional RL setting (i.e., no learner-sampler mismatch), this objective is not equivalent to $J(\theta)$, but instead is a lower bound for $J(\theta) - J(\theta_{\rm old})$.
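For concreteness, here is a minimal PyTorch-style sketch of this surrogate in the no-mismatch setting, using the clipped PPO form in place of the explicit KL constraint. The function and tensor names (`ppo_surrogate`, `logp_new`, `logp_old`, `advantages`) are illustrative rather than from any particular library; all quantities are per-token and come from the learner.

```python
import torch

def ppo_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Conventional clipped surrogate: every input comes from the learner.

    logp_new   : log pi^learner_theta(a|s) under the current parameters
    logp_old   : log pi^learner_theta_old(a|s), recorded at sampling time
    advantages : estimates of A^{pi^learner_theta_old}(s, a)
    """
    ratio = torch.exp(logp_new - logp_old)  # pi_theta / pi_theta_old, per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negated because optimizers minimize while the surrogate is maximized.
    return -torch.min(unclipped, clipped).mean()
```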
Using a specialized sampling backend can greatly accelerate rollout generation. A naive approach is to plug data generated by the sampler policy into the same objective, while gradient computation is still carried out with the learner policy:
$$ \begin{align}\small \mathbb{E}_{s \sim d^{\textcolor{red}{\pi^{\rm sampler}_{\theta_{\rm old}}}},\;a \sim \textcolor{red}{\pi^{\rm sampler}_{\theta_{\rm old}}}(a|s)} \left[ \frac{\pi^{\rm learner}_{\theta}(a|s)}{\pi^{\rm learner}_{\theta_{\rm old}}(a|s)} A^{\textcolor{red}{\pi^{\rm sampler}_{\theta_{\rm old}}}}(s,a) \right],\;\text{s.t.} \; D_{KL}(\pi^{\rm learner}_{\theta}\|\pi^{\rm learner}_{\theta_{\rm old}}) \leq C \end{align} $$
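In code, the naive approach computes exactly the same loss as the sketch above; only the provenance of the inputs changes. A hypothetical sketch, with the same illustrative naming as before:

```python
import torch

def naive_surrogate(logp_new_learner, logp_old_learner, adv_sampler, clip_eps=0.2):
    """Same loss as above, but the inputs now come from the sampler backend:
    states and actions were drawn from pi^sampler_theta_old (the red terms),
    and adv_sampler estimates A^{pi^sampler_theta_old}(s, a), yet the ratio
    below still uses pi^learner_theta_old in the denominator.
    """
    ratio = torch.exp(logp_new_learner - logp_old_learner)  # does not re-weight sampler draws
    unclipped = ratio * adv_sampler
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv_sampler
    return -torch.min(unclipped, clipped).mean()
```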
Similar to the bandit formulation, the ratio $\frac{\pi^{\rm learner}_{\theta}}{\pi^{\rm learner}_{\theta_{\rm old}}}$ does not correctly re-weight samples drawn from $\textcolor{red}{\pi^{\rm sampler}_{\theta_{\rm old}}}$; this can be fixed via importance sampling. Unlike the bandit formulation, however, the learner-sampler mismatch creates two additional disparities: the state distribution $d^{\textcolor{red}{\pi^{\rm sampler}_{\theta_{\rm old}}}}$ differs from $d^{\pi^{\rm learner}_{\theta_{\rm old}}}$, and the advantage $A^{\textcolor{red}{\pi^{\rm sampler}_{\theta_{\rm old}}}}(s,a)$ differs from $A^{\pi^{\rm learner}_{\theta_{\rm old}}}(s,a)$.
In certain cases, when an estimate of $A^{\textcolor{red}{\pi^{\rm sampler}_{\theta_{\rm old}}}}(s,a)$ is also an unbiased estimate of $A^{\pi^{\rm learner}_{\theta_{\rm old}}}(s,a)$, these two disparities can be fixed by a term similar to the trajectory-level importance ratio. In general, however, they cannot be corrected by importance sampling, which invalidates existing theoretical analyses.
That said, the naive approach works well when the mismatch is small, as we discuss later.
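To make the partial fix concrete, the sketch below applies only the per-token importance-sampling correction $\pi^{\rm learner}_{\theta_{\rm old}}/\pi^{\rm sampler}_{\theta_{\rm old}}$ discussed above. Whether to clip the combined ratio or only the learner ratio is a design choice that varies across implementations; names remain illustrative, and the state-distribution and advantage disparities are left uncorrected, as noted.

```python
import torch

def is_corrected_surrogate(logp_new_learner, logp_old_learner,
                           logp_old_sampler, adv_sampler, clip_eps=0.2):
    """Apply only the per-token action-sampling correction.

    The weight pi^learner_theta_old / pi^sampler_theta_old re-weights actions
    drawn from the sampler back to the learner's old policy. The state
    distribution d^{pi^sampler_theta_old} and the advantage
    A^{pi^sampler_theta_old} are still the sampler's and remain uncorrected.
    """
    # Correction weight; the old log-probs carry no gradient w.r.t. theta.
    correction = torch.exp(logp_old_learner - logp_old_sampler).detach()
    ratio = torch.exp(logp_new_learner - logp_old_learner)
    unclipped = correction * ratio * adv_sampler
    clipped = correction * torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv_sampler
    return -torch.min(unclipped, clipped).mean()
```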
<aside> 💡
Key Takeaway: Importance sampling can neutralize learner-sampler mismatch in the single-turn sequence-level formulation but not in the token-level or multi-turn sequence-level formulation.
</aside>