<aside> 🧠

TL;DR: We propose a hybrid post-training method that blends teacher and student logits at every decoding step, producing rollouts that are neither fully on-policy nor fully off-policy. Combined with importance-sampling correction and a decaying mixing coefficient, this logit fusion strategy lets smaller models absorb expert reasoning while retaining their own exploratory capabilities.

</aside>

<aside> 👥

Juzheng Zhang$^1$ · Abhimanyu Hans$^1$ · John Kirchenbauer$^1$ · Micah Goldblum$^2$ · Ashwinee Panda$^1$ · Tom Goldstein$^1$

$^1$University of Maryland · $^2$Columbia University

GitHub link | Feb 25, 2026

</aside>



Motivation

Large language models (LLMs) have made remarkable strides in complex reasoning, yet training them effectively remains a persistent challenge. The two dominant paradigms each carry significant drawbacks:

<aside> ⚠️

The SFT-then-RL Dilemma

The dominant post-training pipeline applies SFT before RL in a two-stage fashion. Yet this paradigm does not consistently outperform pure RL: SFT can disrupt the base model's established reasoning patterns and induce overfitting, undermining the benefits of subsequent RL.

</aside>


The Landscape of Hybrid Approaches

Incorporating off-policy expert data into the on-policy RL loop is a promising strategy: it preserves RL's exploratory drive while injecting expert knowledge to unlock new reasoning abilities. Several families of approaches have emerged.


Our Approach: Logit Fusion at the Data Level

<aside> 💡

Key Insight. None of the existing methods fundamentally modifies the data distribution from which rollouts are drawn. We address the on-policy / off-policy mixing challenge directly at the data level.

</aside>

By constructing a mixed behavior policy that blends student and teacher distributions and sampling from this mixture, we learn from data that reflects both the student's exploration and the teacher's expertise.
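As a minimal sketch of what sampling from such a mixed behavior policy looks like at one decoding step (function names and the convex-combination form are illustrative assumptions, not the paper's exact fusion rule):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fused_sample(student_logits, teacher_logits, alpha, rng):
    """Sample one token from a mixture of teacher and student distributions.

    alpha weights the teacher; alpha = 0 recovers pure on-policy (student)
    sampling, alpha = 1 recovers pure off-policy (teacher) sampling.
    """
    p_mix = alpha * softmax(teacher_logits) + (1.0 - alpha) * softmax(student_logits)
    token = rng.choice(len(p_mix), p=p_mix)
    return token, p_mix
```

Because the behavior policy is this mixture rather than the student itself, rollouts contain both student-like exploratory continuations and teacher-like expert continuations within a single trajectory.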

<aside> ❌

Why not fully off-policy?

That would essentially become SFT with a reward function, preventing the target policy from bootstrapping itself and actively exploring.

</aside>

<aside> ❌

Why not fully on-policy?

That would restrict the model to its base capabilities and prevent it from learning new abilities. It also gets stuck on hard problems where all rollouts receive zero reward.

</aside>

We propose a hybrid post-training approach that incorporates offline expert knowledge into RL training through a logit fusion technique:
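The TL;DR names two further ingredients: an importance-sampling correction (rollouts come from the mixture, but the update targets the student policy) and a mixing coefficient that decays over training. A rough sketch of both, where the linear schedule and the specific constants are illustrative assumptions rather than the paper's choices:

```python
import math

def alpha_schedule(step, total_steps, alpha0=0.5):
    # Decay the teacher mixing weight from alpha0 toward 0, so training
    # shifts from teacher-guided toward fully on-policy rollouts.
    # (Linear decay is an illustrative choice.)
    return alpha0 * max(0.0, 1.0 - step / total_steps)

def importance_weight(logp_student, logp_mix):
    # Per-token ratio pi_student(a|s) / pi_mix(a|s), correcting for the
    # fact that the token was sampled from the mixed behavior policy.
    return math.exp(logp_student - logp_mix)
```

In practice such ratios are typically clipped or otherwise bounded for variance control; the exact stabilization used here is described in the method section.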