<aside> 🧠

TL;DR: We propose a hybrid post-training method that blends teacher and student logits at every decoding step, producing rollouts that are neither fully on-policy nor fully off-policy. Combined with importance-sampling correction and a decaying mixing coefficient, this logit fusion strategy lets smaller models absorb expert reasoning while retaining their own exploratory capabilities.

</aside>

<aside> 👥

Juzheng Zhang$^1$ · Abhimanyu Hans$^1$ · John Kirchenbauer$^1$ · Micah Goldblum$^2$ · Ashwinee Panda$^1$ · Tom Goldstein$^1$

$^1$University of Maryland · $^2$Columbia University

GitHub link | Feb 25, 2026

</aside>


Motivation

Large language models (LLMs) have made remarkable strides in complex reasoning, yet training them effectively remains a persistent challenge. The two dominant paradigms each carry significant drawbacks:

<aside> ⚠️

The SFT-then-RL Dilemma

The dominant post-training pipeline applies SFT and then RL in two stages, yet it does not consistently outperform pure RL. SFT can disrupt the model's established patterns and induce overfitting, undermining the benefits of the subsequent RL stage.

</aside>

The Landscape of Hybrid Approaches

Incorporating off-policy expert data into the on-policy RL loop is a promising strategy: it preserves RL's exploratory drive while injecting expert knowledge to unlock new reasoning abilities. Several families of approaches have emerged:

Our Approach: Logit Fusion at the Data Level

<aside> 💡

Key Insight. None of the existing methods fundamentally modify the data distribution. We address the on-policy / off-policy mixing challenge directly at the data level.

</aside>

By constructing a mixed behavior policy that blends student and teacher distributions and sampling from this mixture, we learn from data that reflects both the student's exploration and the teacher's expertise.

<aside> ❌

Why not fully off-policy?

Fully off-policy training essentially reduces to SFT with a reward function, which prevents the target policy from bootstrapping itself and actively exploring.

</aside>

<aside> ❌

Why not fully on-policy?

Fully on-policy training restricts the model to its base capabilities and prevents it from learning new abilities. It also stalls on hard problems where every rollout receives zero reward, so there is no reward signal to learn from.

</aside>

We propose a hybrid post-training approach that incorporates offline expert knowledge into RL training through a logit fusion technique:

We form a fused-logit behavior policy by linearly interpolating the teacher and student logits at each decoding step, then sample the next token from the mixed distribution.
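As a minimal sketch of the fusion step described above, the Python snippet below interpolates student and teacher logits for a single decoding step and samples the next token from the resulting distribution. The function names, the convention that `alpha` weights the teacher, the toy logits, and the form of the importance ratio (student probability divided by fused-policy probability) are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(student_logits, teacher_logits, alpha):
    """Linearly interpolate logits; alpha is the teacher weight (assumed convention)."""
    return [alpha * t + (1.0 - alpha) * s
            for s, t in zip(student_logits, teacher_logits)]


def sample_fused_token(student_logits, teacher_logits, alpha, rng):
    """Sample one token from the fused behavior policy.

    Returns the token id and a per-token importance ratio
    pi_student(token) / mu_fused(token) (assumed form of the correction).
    """
    mu = softmax(fuse_logits(student_logits, teacher_logits, alpha))
    pi = softmax(student_logits)
    token = rng.choices(range(len(mu)), weights=mu, k=1)[0]
    return token, pi[token] / mu[token]


# Toy vocabulary of 4 tokens with made-up logits for one decoding step.
student = [2.0, 0.5, 0.1, -1.0]
teacher = [0.0, 3.0, 0.2, -0.5]
rng = random.Random(0)
token, ratio = sample_fused_token(student, teacher, alpha=0.5, rng=rng)
```

In this sketch, `alpha=0` recovers pure on-policy sampling from the student (the ratio is exactly 1), while `alpha=1` samples from the teacher alone. Note that interpolating logits yields a normalized geometric mixture of the two distributions, not an arithmetic average of their probabilities.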

Sampling from the fused distribution yields coherent sequences that distill knowledge from the teacher without straying too far off-policy: the rollouts remain semi-on-policy and still draw on the student's exploratory capabilities. Our method thus unifies external supervision with self-exploration and enables a smooth transition from supervised tuning to fully autonomous reinforcement learning.
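One way to picture the transition toward fully autonomous RL is a mixing coefficient that shrinks over training. The sketch below assumes a linear decay of the teacher weight to zero. The schedule shape, the function name `mixing_coefficient`, and the starting value are hypothetical choices for illustration, since the exact schedule is not specified here.

```python
def mixing_coefficient(step, total_steps, alpha_start=0.5):
    """Hypothetical linear decay of the teacher weight from alpha_start to 0.

    step:        current training step (values past total_steps are clamped)
    total_steps: number of steps over which the teacher weight decays
    alpha_start: assumed initial teacher weight, not a value from the source
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return alpha_start * (1.0 - progress)


# Teacher influence fades as training progresses.
schedule = [mixing_coefficient(s, 100) for s in (0, 50, 100)]
```

Once the coefficient reaches zero, the behavior policy coincides with the student and the rollouts are fully on-policy.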

Off-policy training is inherently challenging: prior work reports optimization difficulties and often limits off-policy traces to just one rollout in a group of eight.