As large-scale model training enters the “tens of thousands of GPUs” era, traditional Scale-Out clusters face increasing challenges in chip coordination, such as high network latency and insufficient bandwidth, which often lead to low GPU utilization (sometimes only 30–40%).
The SuperPod architecture addresses these issues by leveraging high-speed internal interconnects (such as buses or switches) to create a low-latency, high-bandwidth environment. This enables faster data synchronization and parameter exchange between GPUs, significantly shortening training cycles.
At the same time, SuperPods simplify large-scale cluster network deployment and operations, reduce costs, and have emerged as the preferred solution for meeting the explosive demand for AI computing power.
A SuperPod is a highly integrated computing architecture that uses high-speed interconnect technology to combine many computing chips (such as GPUs or NPUs) into a single, logically unified computing entity. As a building block for large-scale computing clusters, its core purpose is to solve the compute-coordination and efficiency problems of AI large-model training. The architecture was first proposed by NVIDIA to optimize large-scale parallel computing workloads.
The rise of supernodes is not a spontaneous technological innovation but an inevitable result of AI large-model development. As model parameter counts leap from the billions to the trillions (and potentially to the tens of trillions), traditional computing architectures expose two fatal problems:
Insufficient bandwidth: Large-model training requires frequent data exchange between GPUs, yet the PCIe 4.0 bus that traditional architectures rely on offers only 16 GT/s per lane (roughly 32 GB/s for a full x16 link). Data transfer becomes like “walking on rural paths”: no matter how powerful the chips are, they end up throttled by data congestion;
Excessive latency: In multi-device collaborative computing, even a 1-millisecond increase in inter-device communication delay, accumulated across millions of synchronization operations, can stretch the training cycle from a few weeks to several months, seriously hurting R&D efficiency. A rough back-of-the-envelope calculation is sketched below.
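To make the two bottlenecks concrete, here is a rough back-of-the-envelope sketch in Python. The payload size, the NVLink-class bandwidth figure, the collective count, and the step count are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope numbers for the two bottlenecks above.
# All workload figures are illustrative assumptions.

# PCIe 4.0: 16 GT/s per lane with 128b/130b encoding -> ~2 GB/s per lane.
pcie4_x16 = 16 * (128 / 130) / 8 * 16       # ~31.5 GB/s per direction
nvlink_class = 900.0                        # assumed NVLink-class GB/s per GPU

# Assume a gradient exchange of ~140 GB per step
# (e.g. a 70B-parameter model with 2-byte gradients).
payload_gb = 140.0
print(f"PCIe 4.0 x16 : {payload_gb / pcie4_x16:.1f} s per exchange")
print(f"NVLink-class : {payload_gb / nvlink_class:.2f} s per exchange")

# Latency: a 1 ms penalty on each of ~1,000 collective calls per step,
# over a million steps, adds roughly 11-12 days of wall-clock time.
extra_days = 1e-3 * 1_000 * 1_000_000 / 86_400
print(f"Accumulated latency overhead: ~{extra_days:.0f} extra days")
```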
In AI network architecture, vertical scaling (Scale Up) and horizontal scaling (Scale Out) are two core paths for performance enhancement. They target different performance bottlenecks and are not mutually exclusive or substitutive, but rather two directions that develop synergistically.
Scale Out (Horizontal Scaling):
The focus is on scaling GPU clusters by increasing the number of nodes to boost overall computing power. Its core goal is to achieve more flexible, reliable, and elastic resource scheduling. Currently, Scale Out networks primarily adopt communication solutions based on RDMA (Remote Direct Memory Access) technology, such as InfiniBand and RoCEv2, to enable efficient inter-node interconnections.
Scale Up (Vertical Scaling):
It primarily addresses the demands for ultra-high bandwidth and extremely low latency in AI scenarios. Its characteristics include extremely high internal node interconnection bandwidth and extremely low latency, making it suitable for strong scaling scenarios in large model training. Currently, Scale Up networks often employ proprietary high-speed interconnection protocols similar to NVLink.
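As a simplified illustration of how the two paths coexist in one job, the sketch below uses PyTorch with the NCCL backend: launched across several machines (for example with torchrun), NCCL carries intra-node traffic over the Scale-Up fabric (NVLink or similar) and inter-node traffic over the Scale-Out fabric (InfiniBand or RoCEv2 via RDMA), without any change to the application code. The launch command and tensor size are assumptions.

```python
# hybrid_allreduce.py -- a minimal sketch; assumes a launch such as:
#   torchrun --nnodes=<N> --nproc-per-node=8 hybrid_allreduce.py
import os

import torch
import torch.distributed as dist


def main():
    # NCCL chooses the transport per pair of ranks: NVLink / shared memory
    # inside a node (Scale Up), RDMA verbs across nodes (Scale Out).
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # A single collective call spans both fabrics transparently.
    grad = torch.randn(64 * 1024 * 1024, device="cuda")  # ~256 MB of fp32
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    if dist.get_rank() == 0:
        print(f"all_reduce completed across {dist.get_world_size()} GPUs")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```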
Unlike the traditional cloud computing era, the AI computing system is undergoing a shift from being “computation-centric” to “network-centric.”
The bottlenecks in traditional cloud computing are mostly concentrated at the computation level, while in the AI era, network performance has become a key factor determining the overall computing power utilization rate.
Current AI accelerators (such as GPUs and TPUs) are extremely powerful, but if network bandwidth is insufficient or latency is too high, compute resources sit idle and overall efficiency suffers.
Therefore: Scale Up is the key to unleashing the potential of single-node computing power; Scale Out is the inevitable path to building large-scale AI clusters and achieving computing power expansion. The synergistic development of the two forms the foundation of modern AI supercomputing network architectures.
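One way to express this synergy directly in code is a 2-D device mesh: the inner dimension maps to GPUs within a node (the Scale-Up fabric, suited to bandwidth-hungry tensor parallelism), while the outer dimension maps to nodes (the Scale-Out network, suited to latency-tolerant data parallelism). The sketch below uses PyTorch's init_device_mesh; the 4-node, 8-GPU-per-node shape is an assumption.

```python
# device_mesh_sketch.py -- a minimal sketch; assumes 4 nodes x 8 GPUs and a
# launch such as: torchrun --nnodes=4 --nproc-per-node=8 device_mesh_sketch.py
from torch.distributed.device_mesh import init_device_mesh

NODES, GPUS_PER_NODE = 4, 8  # assumed cluster shape

# Outer dim ("dp") crosses nodes over the Scale-Out network;
# inner dim ("tp") stays inside a node on the Scale-Up fabric.
mesh = init_device_mesh(
    "cuda", (NODES, GPUS_PER_NODE), mesh_dim_names=("dp", "tp")
)

tp_group = mesh["tp"].get_group()  # for tensor-parallel collectives
dp_group = mesh["dp"].get_group()  # for data-parallel gradient averaging

# A framework then issues bandwidth-hungry collectives on tp_group and
# latency-tolerant gradient all-reduces on dp_group.
```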
The concept of the Super Node represents the convergence of vertical (Scale-Up) and horizontal (Scale-Out) scaling. In the past, AI clusters primarily focused on Scale-Out expansion, but today, the Super Node architecture combines the strengths of both approaches to create a more efficient hybrid network structure.
Early Scale-Up architectures often relied on proprietary, closed protocols such as NVLink. Now, new interconnect standards — including UCIe, SIE, and OSA — are driving open and standardized development, enhancing cross-vendor compatibility and fostering a more collaborative ecosystem.
By embedding computing capabilities into the network (e.g., for data aggregation and distribution optimization), communication overhead can be significantly reduced, improving overall compute efficiency. This is emerging as one of the key directions for future AI cluster optimization.
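The effect can be sketched with a simple alpha-beta communication model: a ring all-reduce pays a latency cost on the order of 2(N-1) steps, while a switch that aggregates data in the network (in the spirit of SHARP-style reductions) keeps the step count roughly constant. The constants and the idealized in-network formula below are assumptions for illustration, not benchmark results.

```python
# Simplified alpha-beta cost model for an all-reduce of S bytes over N GPUs.
# alpha = per-hop latency, beta = seconds per byte; both values are assumed.
alpha = 5e-6        # 5 microseconds per hop
beta = 1 / 400e9    # 400 GB/s effective link bandwidth


def ring_allreduce(n: int, s: float) -> float:
    # 2*(n-1) steps, each moving s/n bytes.
    return 2 * (n - 1) * (alpha + (s / n) * beta)


def in_network_allreduce(n: int, s: float) -> float:
    # Idealized: send data up to the switch, receive the reduced result;
    # the number of latency-bound steps no longer grows with n.
    return 2 * (alpha + s * beta)


S = 256 * 1024 * 1024  # 256 MB gradient bucket (assumed)
for n in (8, 64, 512):
    print(
        f"N={n:4d}  ring={ring_allreduce(n, S) * 1e3:7.2f} ms  "
        f"in-network={in_network_allreduce(n, S) * 1e3:6.2f} ms"
    )
# The advantage of in-network aggregation grows with the GPU count,
# where the ring's latency term dominates.
```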
To overcome the bandwidth and distance limitations of copper cables, AI networks are increasingly adopting optical-electrical hybrid interconnects. Optical links are used for long-distance, high-bandwidth connections, while copper interconnects remain ideal for short-range, low-latency communication—striking a balance between performance and cost.
SuperPods, as highly integrated computing architectures, excel in large-scale AI training and inference. By tightly interconnecting multiple accelerators (such as GPUs or NPUs) to form a logically unified computing entity, they address the bottlenecks of traditional clusters. The following are their main advantages:
High interconnect bandwidth and low latency: Internally, supernodes use high-speed buses (such as NVLink or proprietary interconnects) to deliver very large bandwidth (on the order of several TB/s) and latency far below that of a traditional Scale-Out cluster network, accelerating data synchronization and parameter exchange. In communication-intensive workloads such as MoE models, this can improve performance by more than 3x and lift chip utilization from 30-40% toward near-linear scaling.
High compute density: Packing hundreds or thousands of chips into a single logical node yields much higher compute density and supports parallel training at the ten-thousand-card scale. For example, Huawei's Atlas 950 supernode significantly increases compute, memory capacity, and memory access speed, greatly improving inference throughput.
Overall system efficiency: By breaking through single-machine boundaries, a supernode operates like one "AI super server," and in large-scale clusters this can improve performance by 2-3x.
Lower Deployment and Maintenance Costs:
SuperPods simplify network topology by reducing the number of switches and cables required. This decreases deployment complexity and improves operational efficiency. Compared with traditional clusters, they can save up to 30% in rack space and overall costs.
Reduced Global Bandwidth Investment:
By focusing on localized high-bandwidth optimization, SuperPods avoid the massive expenses of large-scale network upgrades while achieving higher overall efficiency. In terms of networking, this architecture offers significant advantages in scalability and performance balance.
Bridging the Chip Process Gap:
For domestic AI chips, SuperPod architectures can enhance overall system performance through system-level optimization, rather than relying solely on advanced semiconductor processes. This provides an effective short-term strategy for improving competitiveness and compute efficiency.
High Energy Efficiency:
The integrated design of SuperPods reduces redundant components and protocol conversions, significantly lowering power consumption. A typical SuperPod configuration (e.g., with 32 GPUs) requires only around 7 kW of power while still delivering high computational performance.
Green Computing Support:
By leveraging optical interconnects (such as CPO) and advanced cooling technologies, SuperPods optimize energy utilization, making them ideal for high-density data centers and promoting the sustainable evolution of AI infrastructure.
Flexible Expansion:
SuperPods support dynamic resource allocation modes (such as Beast, Freestyle, or Swarm), enabling multi-user sharing and seamless scaling into larger superclusters. The larger the deployment, the more pronounced the performance benefits—but it also requires careful balance in architectural design.
System-Level Optimization:
A SuperPod is not a simple stack of nodes, but a fully optimized full-stack architecture. Companies like Huawei, for example, have introduced innovative interconnect technologies that exemplify this systematic approach and set a new paradigm for AI infrastructure.
Overall, SuperPods have become the new cornerstone of AI computing, especially under the momentum of large models such as ChatGPT. Their combined advantages in performance, cost, and efficiency are driving global tech giants like NVIDIA and Huawei to accelerate deployment. However, their benefits should be evaluated based on specific application scenarios—for smaller workloads, traditional clusters may still offer greater flexibility.
NVIDIA’s GB200 NVL72 is a liquid-cooled, rack-scale AI supernode that integrates 36 Grace CPUs and 72 Blackwell GPUs, forming a single massive GPU domain via fifth-generation NVLink and providing over 1 PB/s of bandwidth and 240 TB of fast memory. Optimized for trillion-parameter model training and real-time inference, it delivers up to 30x faster inference along with up to 25x lower TCO and energy consumption. The system targets large-scale AI workloads and supports exascale computing.
Huawei’s CloudMatrix 384 is a high-density AI computing supernode that integrates 384 Ascend 910C NPUs and 192 Kunpeng CPUs, using a supernode architecture with ultra-high-speed chip interconnects built on 6,912 400G SiPh LPO optical modules. Designed for large-scale AI model training and inference, it delivers about 300 PFLOPS of BF16 compute, positioning it as a competitor to NVIDIA’s GB200 NVL72, and is aimed at enterprises and research institutions. The system emphasizes domestic self-reliance in meeting AI-era computing needs.
Alibaba’s Panjiu 128 (Panjiu AI Infra 2.0) is a cloud-native supernode server that integrates 128 AI chips per rack, built on self-developed CIPU 2.0 chips and EIC/MOC high-performance network cards, and supporting open architectures and the HPN 8.0 network. Optimized for cloud-native infrastructure, it improves inference performance by 50% over traditional architectures and targets large-scale GPU clusters requiring ultra-low latency and high reliability. The system is the core of Alibaba Cloud’s AI Infra 2.0 and supports its AI-native infrastructure roadmap.
The emergence of SuperPods marks a new phase in the evolution of AI computing. Beyond a hardware innovation, it represents a shift from incremental scaling to system-level optimization. With their integrated design—offering ultra-high bandwidth, low latency, and energy efficiency—SuperPods provide unmatched scalability and cost-effectiveness for large-scale model training and inference.
Looking ahead, as optical–electrical convergence, in-network computing, and open interconnect standards continue to advance, SuperPods will stand at the core of next-generation AI infrastructure—driving the transition from centralized computing to intelligent, interconnected compute networks.