Tuesday, October 28, 2025

Ant Group Publicly Releases Training Details of Ling 2.0: Unveiling MoE Architecture, FP8 Training, and Multi-Stage Strategies
 
On October 25, 2025, Ant Group released a technical report on arXiv, publicly disclosing the training details of its large language model, Ling 2.0, for the first time and providing the industry with a reference that combines technical depth with practical value. Architecturally, Ling 2.0 adopts a unified Mixture of Experts (MoE) design configured with 256 routed experts, 8 active experts, and 1 shared expert, keeping the overall activation rate at approximately 3.5% to balance model performance and efficiency. The architecture also integrates an aux-loss-free load-balancing strategy and Multi-Token Prediction (MTP) to further improve the model's stability and prediction accuracy, while the Ling scaling law enables precise extrapolation and scale-up, giving the model flexible support for adaptation across different scenarios.
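The expert configuration above can be sketched in a few lines. This is an illustrative toy router, not Ant Group's implementation: it picks the top-8 of 256 routed experts per token (the shared expert is always on) and shows that the quoted ~3.5% activation rate follows directly from counting active expert slots.

```python
import math
import random

NUM_ROUTED = 256   # routed experts (from the report)
TOP_K = 8          # active routed experts per token
NUM_SHARED = 1     # always-on shared expert

def route(logits, k=TOP_K):
    """Pick the top-k routed experts and softmax-normalize their weights."""
    topk = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in topk]
    z = sum(exps)
    return {i: e / z for i, e in zip(topk, exps)}

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(NUM_ROUTED)]
weights = route(logits)
assert len(weights) == TOP_K and abs(sum(weights.values()) - 1.0) < 1e-9

# Activation ratio: (shared + top-k) active slots out of (shared + routed).
active_fraction = (NUM_SHARED + TOP_K) / (NUM_SHARED + NUM_ROUTED)
print(f"{active_fraction:.1%} of expert slots active per token")  # 3.5%
```

Note that 9 active experts out of 257 total gives 3.50%, matching the activation rate cited in the report.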
 
The pre-training phase is crucial for building Ling 2.0's core capabilities. It relies on a high-quality dataset of over 20T tokens covering common sense, code, mathematics, and multilingual content, ensuring the comprehensiveness and diversity of the model's knowledge. Training follows a multi-stage strategy: general pre-training on a large-scale corpus first consolidates the model's basic language understanding and generation capabilities; mid-training on a medium-scale, task-oriented corpus then enhances the model's adaptability to specific scenarios in a targeted way; and the context length is extended to 128K, significantly improving the model's ability to process long texts. In addition, the team introduced Chain-of-Thought (CoT) data to activate the model's logical reasoning ability early, laying a foundation for subsequent task optimization. A WSM (Warmup-Stable-Merge) scheduler, which replaces traditional learning-rate (LR) decay with checkpoint merging, makes the training process more efficient and stable.
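The "merge" step of a WSM-style schedule can be illustrated with a minimal sketch. This is an assumption-laden simplification of the idea described above, not the report's actual procedure: instead of decaying the learning rate, training continues at a stable LR and several late checkpoints are averaged to get the final weights. Flat dicts of floats stand in for real parameter tensors.

```python
def merge_checkpoints(checkpoints):
    """Uniformly average parameters from several late-stage checkpoints.

    Stand-in for the 'Merge' step of a WSM-style schedule: rather than
    annealing the learning rate, average the last few checkpoints taken
    during the stable-LR phase. Dicts of floats stand in for tensors.
    """
    n = len(checkpoints)
    keys = checkpoints[0].keys()
    return {k: sum(c[k] for c in checkpoints) / n for k in keys}

# Toy checkpoints from three consecutive training steps.
ckpts = [{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 1.0}, {"w": 2.0, "b": 2.0}]
merged = merge_checkpoints(ckpts)
print(merged)  # {'w': 2.0, 'b': 1.0}
```

Uniform averaging is only one merging rule; weighted or exponential-moving-average variants are equally plausible under this scheme.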
 
In the post-training phase, Ling 2.0 continuously optimizes performance through methods such as decoupled supervised fine-tuning and evolutionary reasoning. In the reinforcement learning stage in particular, an innovative LPO (Linguistics-Unit Policy Optimization, LingPO) method is proposed. It performs policy optimization at sentence granularity, which better matches the expression logic of natural language and effectively improves the coherence and soundness of the model's output. The team also designed a hybrid "Grammar-Function-Aesthetics" reward mechanism that guides the model along the dimensions of linguistic correctness, task utility, and expressive fluency, ensuring the output not only meets functional requirements but also reaches a higher standard of language quality.
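The sentence-granularity scoring idea can be sketched as follows. Everything here is hypothetical scaffolding: the three scorers and their weights are placeholders for the report's "Grammar-Function-Aesthetics" reward, and the sentence splitter is a naive regex. The point is only to show rewards being assigned per sentence, the linguistic unit LPO optimizes over, rather than per token.

```python
import re

def hybrid_reward(sentence, scorers, weights=(0.3, 0.5, 0.2)):
    """Weighted sum of per-dimension scores (weights are illustrative)."""
    return sum(w * s(sentence) for w, s in zip(weights, scorers))

def sentence_rewards(text, scorers):
    """Score at sentence granularity, as in an LPO-style setup where the
    optimization unit is a linguistic unit rather than a single token."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [(s, hybrid_reward(s, scorers)) for s in sentences]

# Toy scorers: grammar checks capitalization and end punctuation,
# function checks for a task keyword, aesthetics rewards brevity.
grammar = lambda s: 1.0 if s[0].isupper() and s[-1] in ".!?" else 0.0
function = lambda s: 1.0 if "answer" in s.lower() else 0.0
aesthetics = lambda s: max(0.0, 1.0 - len(s.split()) / 40)

for sent, r in sentence_rewards("Here is the answer. it lacks a capital",
                                [grammar, function, aesthetics]):
    print(f"{r:.2f}  {sent}")
```

Running this scores the well-formed first sentence higher than the ungrammatical second one, which is the kind of signal a sentence-level policy update would act on.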
 
Notably, Ling 2.0 also achieves important breakthroughs in training infrastructure. It uses FP8 mixed-precision training throughout, making it the largest known foundation model trained with FP8. This choice brings significant advantages: it greatly reduces memory usage, making the training of larger-scale models feasible; it supports more flexible parallel-partitioning strategies, improving hardware utilization; and it delivers an end-to-end speedup of over 15%, providing a new technical path for efficient large-model training and valuable practical experience for the industry.
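The memory argument for FP8 is simple back-of-the-envelope arithmetic. This sketch uses an illustrative 100B-parameter model (not Ling 2.0's actual size) and counts only weight storage; optimizer state, gradients, and activations add substantially more, which is where real FP8 recipes differ in what they keep at higher precision.

```python
def param_bytes(num_params, bits):
    """Bytes needed to store num_params parameters at a given bit width."""
    return num_params * bits // 8

# Illustrative: weight storage for a 100B-parameter model at BF16 vs FP8.
n = int(100e9)
bf16 = param_bytes(n, 16)
fp8 = param_bytes(n, 8)
print(f"BF16: {bf16/2**30:.0f} GiB, FP8: {fp8/2**30:.0f} GiB, "
      f"saving {1 - fp8/bf16:.0%} on weight memory")
```

Halving the bytes per parameter also halves the bandwidth each tensor move consumes, which is one source of the end-to-end speedup the report cites.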
