Hybrid Policy Distillation: The New Way to Make Small AI Models Punch Above Their Weight
What if small AI models could absorb the intelligence of their giant counterparts — without the usual trade-offs?
A new paper introduces Hybrid Policy Distillation (HPD), a technique that rethinks how knowledge is transferred from large language models to smaller ones. The research team, led by Wenhong Zhu, tackles a long-standing dilemma in knowledge distillation: forward KL divergence is mode-covering, spreading the student's probability mass broadly over the teacher's behavior at the cost of precision, while reverse KL divergence is mode-seeking, homing in on the teacher's high-probability outputs but training unstably and typically requiring costly on-policy sampling from the student.
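For readers who want the contrast made precise: with teacher distribution p and student q_θ, the two directions are defined as follows (these are the standard definitions, not notation taken from the paper).

```latex
% Forward KL: mode-covering. The expectation is over teacher samples,
% so the student is penalized anywhere the teacher puts mass it misses.
D_{\mathrm{KL}}(p \,\|\, q_\theta)
  = \mathbb{E}_{y \sim p}\!\left[\log \tfrac{p(y)}{q_\theta(y)}\right]

% Reverse KL: mode-seeking. The expectation is over student samples,
% so optimizing it typically needs on-policy sampling, which is
% expensive and can destabilize training.
D_{\mathrm{KL}}(q_\theta \,\|\, p)
  = \mathbb{E}_{y \sim q_\theta}\!\left[\log \tfrac{q_\theta(y)}{p(y)}\right]
```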
HPD merges both approaches into a unified framework. The key insight is that existing distillation methods can all be rewritten as token-level reweighted log-likelihood objectives; once in this common form, the hidden connections between them become visible, and a hybrid objective follows naturally.
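Schematically, such an objective looks like the following; the specific per-token weights are the paper's contribution, so the form here is only illustrative.

```latex
% Token-level reweighted log-likelihood: the choice of per-token
% weight w_t, together with the distribution the sequences y are
% drawn from, determines which divergence (forward, reverse, or a
% blend) is effectively being optimized.
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{y}\!\left[\sum_{t} w_t \log q_\theta(y_t \mid y_{<t})\right]
```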
The framework introduces three innovations. First, a hybrid KL objective that balances mode-covering and mode-seeking behaviors with a masking mechanism for fine-grained control. Second, a lightweight sampling strategy that combines off-policy data with approximate on-policy sampling, dramatically reducing computational costs. Third, a unified theoretical view that connects divergence direction, optimization strategy, and data regime.
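As an illustration of the first ingredient, here is a minimal PyTorch sketch of a per-token hybrid KL with an optional mask. The function name, the mixing coefficient `alpha`, and the masking rule are placeholders for exposition, not the paper's exact formulation.

```python
# Minimal sketch of a hybrid token-level KL loss, assuming logits of
# shape [batch, seq_len, vocab] from both teacher and student.
import torch
import torch.nn.functional as F

def hybrid_kl_loss(student_logits, teacher_logits, alpha=0.5, mask=None):
    """Blend forward KL (mode-covering) and reverse KL (mode-seeking)
    per token, with an optional mask for fine-grained control.

    mask: optional [batch, seq_len] float tensor (1 = keep token).
    """
    log_q = F.log_softmax(student_logits, dim=-1)  # student log-probs
    log_p = F.log_softmax(teacher_logits, dim=-1)  # teacher log-probs
    p, q = log_p.exp(), log_q.exp()

    # Forward KL per token: sum over vocab of p * (log p - log q).
    fwd = (p * (log_p - log_q)).sum(-1)
    # Reverse KL per token: sum over vocab of q * (log q - log p).
    rev = (q * (log_q - log_p)).sum(-1)

    per_token = alpha * fwd + (1.0 - alpha) * rev  # hybrid objective
    if mask is not None:
        per_token = per_token * mask
        return per_token.sum() / mask.sum().clamp(min=1.0)
    return per_token.mean()
```

Setting `alpha=1.0` recovers pure forward KL and `alpha=0.0` pure reverse KL on the given tokens; the mask is where token-level control of the trade-off would plug in.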
The results are compelling: HPD outperforms existing distillation methods on long-generation math reasoning, short-generation dialogue, and code generation, and it does so consistently across different model families and scales.
The implications for industry are significant. As companies race to deploy AI on edge devices and cut inference costs, HPD offers a path to compact models that retain far more of their teacher's capability. The code is publicly available on GitHub.
📄 Source
HuggingFace Daily Papers