Thinking Machines' New Paper "On-Policy Distillation" Tackles Core Challenges in LLM Post-Training
The "on-policy distillation" technology proposed by the Thinking Machine team in their latest paper On-Policy Distillation has delivered a groundbreaking achievement in the field of large language model (LLM) post-training. Its core value lies in establishing a new post-training paradigm that balances "scenario adaptability" and "computational efficiency," effectively addressing key pain points of traditional methods. This technology innovatively combines the advantages of reinforcement learning (on-policy training) and off-policy distillation: it enables the student model to sample and generate sequences (trajectories) from itself, while a high-performance teacher model scores each token step-by-step to provide dense feedback. This not only ensures that the learned content is highly adaptable to the student model's actual application scenarios—avoiding the issue in off-policy distillation where "the student only learns scenarios familiar to the teacher and is prone to compound errors"—but also overcomes the low computational efficiency caused by sparse feedback in reinforcement learning, achieving a balance between LLM post-training performance and cost.
On the implementation side, the research adopts the reverse Kullback-Leibler (KL) divergence as the core loss function, the key underpinning of on-policy distillation's efficiency. It measures, token by token, the discrepancy between the student's and the teacher's next-token distributions; minimizing this loss drives the student to align with the teacher's high-quality behavior across all of the states the student itself visits. The loss has three important properties: it is "unhackable," so low KL values consistently correspond to behavior the teacher genuinely endorses, preventing the feedback from being gamed; it is "mode-seeking," steering the student toward the teacher's core optimal strategy rather than spreading probability over suboptimal options; and it is computationally friendly, supporting training on partial trajectories and requiring only a single forward pass of the teacher per sequence to compute the feedback, which greatly reduces resource consumption and adds to the technique's practical value.
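In symbols (our notation, not necessarily the paper's), with student policy $\pi_\theta$ and fixed teacher $\pi_T$, the objective minimized over trajectories sampled from the student is

$$
\mathcal{L}(\theta)
= \mathbb{E}_{x \sim \pi_\theta}\!\left[\sum_{t} D_{\mathrm{KL}}\!\bigl(\pi_\theta(\cdot \mid x_{<t}) \,\|\, \pi_T(\cdot \mid x_{<t})\bigr)\right],
\qquad
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{v \in \mathcal{V}} p(v) \log \frac{p(v)}{q(v)},
$$

where $\mathcal{V}$ is the shared vocabulary. Because the expectation is taken over the student's own samples, the loss penalizes probability mass the student places where the teacher places little, which is what makes it mode-seeking rather than mode-covering.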
In terms of performance validation, the new research demonstrates significant advantages across multiple experimental scenarios. On mathematical reasoning tasks, on-policy distillation reaches higher benchmark scores with less training data and fewer GPU hours than traditional off-policy distillation or reinforcement learning, with roughly 9 to 30 times their computational efficiency. In scenario-specific model training (e.g., building an internal corporate assistant), it addresses the "catastrophic forgetting" common in conventional fine-tuning, where a model loses its original instruction-following ability while learning new domain knowledge. On-policy distillation restores the model's original behavioral capabilities while retaining, and in some cases improving, the newly acquired domain knowledge, tackling the industry-wide problem of LLMs forgetting old skills when learning new ones and confirming the technique's effectiveness in practical applications.
The research also yields several broader benefits, opening new paths for deploying and continuously iterating on LLMs. On efficiency, its dense feedback mechanism improves computational efficiency by 50 to 100 times relative to traditional reinforcement learning; it also supports reusing a single training prompt many times, avoiding the data-memorization issue common in reinforcement learning and sharply reducing the amount of training data required. On continual learning, unlike off-policy distillation, whose performance tends to degrade over successive training iterations, on-policy distillation stays aligned with a fixed teacher model throughout. This enables an alternating training scheme of "learn new knowledge → distill to restore old capabilities," letting an LLM keep absorbing new information without losing existing abilities and paving the way for lifelong model updates.
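As a rough sketch of such an alternating schedule (the callables and their names are hypothetical placeholders, not an API from the paper), one could reuse a distillation step like the one sketched earlier:

```python
from typing import Callable, Sequence

def continual_update(
    learn_new_knowledge: Callable[[], None],   # e.g. supervised fine-tuning on fresh domain data
    distill_step: Callable[[str], float],      # e.g. the on_policy_distillation_step sketched above
    prompts: Sequence[str],                    # prompts covering the original capabilities to restore
    rounds: int = 3,
) -> None:
    """Alternate between absorbing new knowledge and distilling on-policy
    against a fixed teacher to restore the student's original behaviors."""
    for _ in range(rounds):
        learn_new_knowledge()        # phase 1: learn the new domain
        for prompt in prompts:       # phase 2: distill back toward the fixed teacher
            distill_step(prompt)
```

The point of the scaffolding is that the teacher checkpoint inside `distill_step` stays fixed, so each distillation phase pulls the student back toward a stable reference instead of drifting with every update.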
Furthermore, the research plays an important role in advancing the industrial adoption of LLMs, in particular by removing barriers to specialized, scenario-specific use of small models. Small models offer advantages such as local deployment (protecting privacy and security) and low update costs, yet traditional post-training methods struggle to give them professional capabilities efficiently. On-policy distillation lets small models quickly learn professional knowledge in vertical fields (e.g., healthcare, internal corporate services) while retaining basic capabilities such as instruction following. This substantially lowers the barrier to applying small models in industrial settings, provides key technical support for LLMs to reach more niche fields and achieve low-cost, large-scale deployment, and demonstrates Thinking Machines' cutting-edge strength in LLM research and development.