Thursday, October 23, 2025

Alibaba Cloud's Aegaeon Solution Makes a Sensational Debut: Slashing NVIDIA GPU Demand by 82%, Heralding a Revolution in Large Model Computing Efficiency
 
On October 18, 2025, at the 31st Symposium on Operating Systems Principles (SOSP) in Seoul, South Korea, the paper on "Aegaeon", a GPU pooling solution jointly developed by Alibaba Cloud and Peking University, was accepted and presented. This breakthrough opens new possibilities for optimizing computing resources for large AI models.
 
GPU resource waste has long been a common problem in AI model serving: large amounts of computing power sit idle or underutilized, and hardware costs remain stubbornly high, particularly when running models with tens of billions of parameters. The Aegaeon solution directly addresses this challenge. In tests on Alibaba Cloud's Model Market, the results were striking: when serving dozens of large models with up to 72 billion parameters, the number of NVIDIA H20 GPUs required fell from 1,192 to 213, an 82% cut in GPU usage that substantially reduces the hardware cost of large-model deployment.
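The headline saving is easy to verify from the reported GPU counts (1,192 before, 213 after):

```python
# Sanity check of the reported GPU savings from the article's figures.
before, after = 1192, 213
reduction = (before - after) / before
print(f"{reduction:.0%}")  # → 82%
```

The raw ratio is about 82.1%, which the article rounds to 82%.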
 
Aegaeon's exceptional performance comes from four core techniques working together:

1. Token-level auto-scaling. Instead of the traditional fixed binding of "one model tied to one GPU", the scheduler can decide after each generated token whether to switch models, assigning fine-grained slices of work to a shared GPU pool for highly precise resource scheduling.

2. Component reuse. Identical components across different models are shared and loaded once rather than redundantly, raising resource utilization.

3. Fine-grained memory management. Intelligent GPU memory allocation and reclamation prevents memory waste and further unlocks GPU capacity.

4. KV cache synchronization optimization. This is the critical accelerator: it cuts model switching time from 26.9 seconds to 0.8 seconds (a 97% reduction in switching overhead), keeping responses fast under multi-model concurrency and ultimately delivering up to a 9x increase in effective throughput.
 
The Aegaeon computing pooling solution has now been officially deployed on Alibaba Cloud's Bailian Platform. This not only gives Alibaba Cloud users more efficient, cost-effective support for large-model services, but also offers a replicable technical blueprint for computing-efficiency optimization across the AI industry, moving large models from high-cost operation toward high-efficiency deployment.
 
