<aside> 🧠
TL;DR: We propose a hybrid post-training method that blends teacher and student logits at every decoding step, producing rollouts that are neither fully on-policy nor fully off-policy. Combined with importance-sampling correction and a decaying mixing coefficient, this logit fusion strategy lets smaller models absorb expert reasoning while retaining their own exploratory capabilities.
</aside>
<aside> 👥
Juzheng Zhang$^1$ · Abhimanyu Hans$^1$ · John Kirchenbauer$^1$ · Micah Goldblum$^2$ · Ashwinee Panda$^1$ · Tom Goldstein$^1$
$^1$University of Maryland · $^2$Columbia University
GitHub link | Feb 25, 2026
</aside>

Large language models (LLMs) have made remarkable strides in complex reasoning, yet training them effectively remains a persistent challenge. The two dominant paradigms each carry significant drawbacks:
<aside> ⚠️
The SFT-then-RL Dilemma
The dominant post-training pipeline applies SFT before RL in a two-stage fashion. Yet this paradigm does not consistently outperform pure RL. SFT can disrupt established patterns and induce overfitting, undermining the benefits of subsequent RL.
</aside>
Incorporating off-policy expert data into the on-policy RL loop is a promising strategy: it preserves RL's exploratory drive while injecting expert knowledge to unlock new reasoning abilities. Several families of approaches have emerged:
<aside> 💡
Key Insight. None of the existing methods fundamentally modify the data distribution. We address the on-policy / off-policy mixing challenge directly at the data level.
</aside>
By constructing a mixed behavior policy that blends student and teacher distributions and sampling from this mixture, we learn from data that reflects both the student's exploration and the teacher's expertise.
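The per-step mechanics can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the linear decay schedule, and the initial mixing weight `alpha0` are all assumptions. At each decoding step we blend the teacher's and student's next-token distributions, sample from the mixture, and record an importance weight that corrects the policy-gradient update for sampling off the student's own policy.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def mixed_policy_step(student_logits, teacher_logits, alpha, rng):
    """Sample one token from the mixed behavior policy
    pi_mix = alpha * pi_teacher + (1 - alpha) * pi_student,
    and return the importance weight pi_student(token) / pi_mix(token)."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    p_mix = alpha * p_t + (1.0 - alpha) * p_s   # blend at every decoding step
    token = rng.choice(len(p_mix), p=p_mix)
    # Importance-sampling correction: the rollout was drawn from p_mix,
    # but the gradient targets the student policy p_s.
    w = p_s[token] / p_mix[token]
    return token, w

def alpha_schedule(step, total_steps, alpha0=0.5):
    """Hypothetical linear decay of the mixing coefficient: rollouts start
    teacher-guided and become fully on-policy as alpha reaches zero."""
    return alpha0 * max(0.0, 1.0 - step / total_steps)
```

With `alpha = 0` this reduces to ordinary on-policy sampling (all importance weights equal 1), and with `alpha = 1` to sampling from the teacher alone, which is why the decaying schedule interpolates between the two regimes discussed above.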
<aside> ❌
Why not fully off-policy?
That would essentially become SFT with a reward function, preventing the target policy from bootstrapping itself and actively exploring.
</aside>
<aside> ❌
Why not fully on-policy?
That would restrict the model to its base capabilities and prevent it from learning new abilities. It also gets stuck on hard problems where all rollouts receive zero reward.
</aside>
We propose a hybrid post-training approach that incorporates offline expert knowledge into RL training through a logit fusion technique: