Building the Case for UALink™: A Dedicated Scale-Up Memory Semantic Fabric

Chris Petersen, Fellow Technology & Ecosystems

Understanding the Landscape

As AI models rapidly evolve and demand more FLOPs and larger memory footprints, AI clusters must grow to keep pace. As clusters scale from 25,000 GPUs[1] to 100,000 and then 200,000 GPUs[2], the interconnect between those GPUs must not only provide connectivity but also be efficient, high-bandwidth, low-latency, and reliable. Because the models no longer fit within a small number of GPUs, they must be partitioned to distribute the workload across all of the GPUs in the cluster.

There are a variety of evolving methods to accomplish this partitioning, and it’s important to account for these approaches as part of the AI Fabric architecture. In general, they take advantage of the forms of parallelism inherent in AI models, such as tensor, data, pipeline, and expert parallelism. Cluster performance and efficiency can be maximized by architecting the AI Fabric as both a scale-up and a scale-out fabric that each align with these forms of parallelism. In other words, the AI Fabric is purpose-built with a hardware-software co-designed approach.
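As a rough illustration of how these parallelism dimensions can map onto the two fabrics, the sketch below factors a cluster into scale-up pods (tensor/expert parallelism inside a pod) and scale-out groups (data and pipeline parallelism across pods). The pod size, GPU counts, and dimension names are assumptions for illustration only, not part of the UALink specification or this post.

```python
# Hypothetical sketch: factoring a GPU cluster into parallelism dimensions.
# Pod size, cluster size, and the dimension-to-fabric mapping are illustrative
# assumptions, not taken from the UALink specification.

def build_mesh(total_gpus: int, pod_size: int, data_parallel: int):
    """Assign each GPU a (tensor, pipeline, data) coordinate.

    Tensor (and expert) parallelism stays inside a scale-up pod, where
    memory semantic, low-latency links are available; data and pipeline
    parallelism span pods over the scale-out fabric.
    """
    assert total_gpus % (pod_size * data_parallel) == 0
    pipeline = total_gpus // (pod_size * data_parallel)
    mesh = {}
    for gpu in range(total_gpus):
        pod, lane = divmod(gpu, pod_size)   # lane = rank inside the scale-up pod
        dp, pp = divmod(pod, pipeline)      # pod's place in the scale-out mesh
        mesh[gpu] = {"tensor": lane, "pipeline": pp, "data": dp}
    return mesh

# Example: 1,024 GPUs, 64-GPU scale-up pods, 4-way data parallelism
# -> 4 pipeline stages of 64-way tensor parallelism per data-parallel replica.
mesh = build_mesh(total_gpus=1024, pod_size=64, data_parallel=4)
print(mesh[0], mesh[1023])
```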

Source: https://atscaleconference.com/videos/scheduler-and-sharding-considerations-for-network-efficiency/

Why Scale-Up Fabric Matters

When it comes to AI Fabrics, let’s start by clarifying what scale-up and scale-out mean in this context. A scale-up fabric provides low-latency, very high-bandwidth connections between 10s to 100s of GPUs. It needs to be a memory semantic fabric so that each GPU can read and write directly to any other GPU’s memory within the scale-up domain. Software can then view this pod, or group of GPUs connected via a scale-up fabric, as one giant GPU. The model can be partitioned across the pod using tensor or expert parallel methods, which require a very high rate of interaction and therefore low latency and high bandwidth. On the other hand, a scale-out fabric is necessary to connect these scale-up pods together to form clusters of 10,000s to 100,000s of GPUs. Achieving this level of scale requires multiple layers of switching, routable and re-routable connections, and congestion management. Scale-out is, however, more latency tolerant and can be designed with bandwidth oversubscription in mind to align with the data parallelism in AI models.
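To make the distinction more concrete, here is a back-of-the-envelope sketch; the hidden size, sequence length, layer count, parameter count, and group sizes are all illustrative assumptions, not figures from this post. Tensor-parallel all-reduces sit on the critical path of every layer, so they are frequent and latency-sensitive, while data-parallel gradient all-reduces are large but happen once per step and can overlap with compute, which is why they tolerate a scale-out fabric with oversubscription.

```python
# Back-of-the-envelope sketch (all model and batch numbers are assumptions):
# why tensor parallelism wants a scale-up fabric while data parallelism can
# live on the scale-out fabric.

def ring_allreduce_bytes(payload_bytes: float, ranks: int) -> float:
    """Bytes each rank sends in a ring all-reduce of `payload_bytes`."""
    return 2 * (ranks - 1) / ranks * payload_bytes

# Assumed example: hidden size 16k, sequence 4k, fp16 activations,
# 100 transformer layers, two all-reduces per layer under 64-way tensor
# parallelism -> many latency-sensitive messages on every forward/backward pass.
act_bytes = 16_384 * 4_096 * 2             # one layer's activation tensor (~128 MiB)
tp_msgs_per_step = 100 * 2
tp_bytes = tp_msgs_per_step * ring_allreduce_bytes(act_bytes, ranks=64)

# Assumed example: 1e12 fp16 gradients all-reduced once per training step
# across 4 data-parallel replicas -- large but infrequent, so latency-tolerant.
dp_bytes = ring_allreduce_bytes(1e12 * 2, ranks=4)

print(f"tensor-parallel traffic per step : {tp_bytes / 1e9:.1f} GB in {tp_msgs_per_step} messages")
print(f"data-parallel traffic per step   : {dp_bytes / 1e9:.1f} GB in 1 message")
```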


Finally, we often think about AI clusters in the context of training, but inference needs to be considered as well. Inference is an inherently latency-sensitive workload: servers are actively handling user requests, whereas training can generally happen offline. Training clusters continue to grow to train the largest models, and although inference deployments are a fraction of that size, they are growing too. Today’s inference servers are often limited to the compute and memory capabilities of eight GPUs in a scale-up domain. As inference models continue to grow, for example driven by reasoning models, the scale-up domain needs to increase to 10s of GPUs to keep up while still maintaining fast response times (i.e., keeping latency low).
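A rough sizing exercise shows why eight GPUs can become the limiting factor; the per-GPU HBM capacity, model size, and KV-cache size below are illustrative assumptions, not figures from this post.

```python
# Rough sizing sketch (GPU memory, model size, and cache size are assumptions):
# why inference is outgrowing an 8-GPU scale-up domain.
hbm_per_gpu_gb = 192                   # assumed HBM capacity per GPU
weights_gb     = 700e9 * 2 / 1e9       # assumed 700B-parameter model at fp16
kv_cache_gb    = 300                   # assumed KV cache for long-context serving

for gpus in (8, 16, 32):
    capacity = gpus * hbm_per_gpu_gb
    needed = weights_gb + kv_cache_gb
    fits = capacity >= needed
    print(f"{gpus:2d} GPUs -> {capacity:5.0f} GB HBM vs {needed:4.0f} GB needed "
          f"-> {'fits' if fits else 'does not fit'}")
```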

The Case for UALink’s Dedicated Fabric

While scale-out AI Fabrics have InfiniBand and Ethernet options, the industry lacks an open, standardized option for a scale-up, memory semantic AI Fabric. Enter Ultra Accelerator Link™, or UALink™ for short. UALink is an open industry standard for a memory semantic, scale-up fabric protocol developed by the UALink Consortium. The specification enables load, store, and atomic operations between 100s of GPUs while optimizing the protocol stack to minimize end-to-end latency, reduce valuable die area on GPUs and switches, and lower interconnect and switching power consumption. UALink will support state-of-the-art speeds of up to 200 Gbps per lane (equivalent to Ethernet) to provide the high bandwidth required between GPUs while keeping latency in the 100s of nanoseconds (vs. multiple microseconds for Ethernet).
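Some quick arithmetic on those link numbers: only the 200 Gbps-per-lane figure comes from this post; the 4-lane port width and the specific latency values below are illustrative assumptions used to show why sub-microsecond fabric latency dominates small, frequent transfers.

```python
# Quick arithmetic on the link numbers quoted above. Only the 200 Gbps/lane
# figure is from the post; the port width and latency values are assumptions.
lane_gbps      = 200
lanes_per_port = 4                               # assumed port width
port_gbps      = lane_gbps * lanes_per_port      # 800 Gbps raw per direction
port_gbytes    = port_gbps / 8                   # ~100 GB/s per direction

# For small, frequent transfers, end-to-end latency dominates over bandwidth:
msg_bytes = 4096
for latency_ns in (300, 3000):                   # ~scale-up vs ~scale-out class latency
    serialize_ns = msg_bytes * 8 / port_gbps     # ns, since Gbps == bits/ns
    total_ns = latency_ns + serialize_ns
    print(f"latency {latency_ns:4d} ns + serialize {serialize_ns:.0f} ns = {total_ns:.0f} ns")
```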

The UALink standard was developed from AMD’s proven Infinity Fabric protocol and taps into the deep AI accelerator development and deployment experience of the Promoter member companies. By creating not only a specification for a scale-up AI Fabric but also the surrounding ecosystem, the Consortium allows the industry to build the silicon IP, validation, software, security, and management solutions required to deploy at scale. The ecosystem will benefit from the pooled innovation and validation experience.

Application Scenarios for UALink

UALink as a scale-up AI Fabric can be deployed for both AI training and AI inference solutions to support a broad spectrum of AI models. For AI training, UALink will enable the scale-up domain to grow to 100s of GPUs to support the needs of future LLMs and Transformer models. The higher bandwidth and lower latency of UALink will not only enable the continued scaling of GPU training performance for large foundation models but also allow smaller models to train much faster and therefore more frequently. Finally, the power efficiency improvements of up to 40% that UALink enables provide an opportunity to direct more of the available datacenter power toward GPU computation and reduce the energy required to train new models.

The sub-microsecond latency of UALink will not only enable higher AI inference performance to minimize response times, but also allow inference servers to scale up beyond eight GPUs without requiring the workload to be partitioned across multiple servers, further improving efficiency. As the variety of deployed AI models continues to increase, AI inference architectures will need to evolve to improve AI server TCO and help providers maximize their ROI. The TCO efficiency improvements that UALink brings will directly benefit the ROI of LLM and recommendation system deployments.

Conclusion

Scorpio X-Series Fabric Switches for scale-up clusters are architected to deliver the highest backend GPU-to-GPU bandwidth while harnessing a software-defined architecture for platform-specific customization.


Driven by industry-leading hyperscalers and AI infrastructure providers, UALink is an open, industry-standard interconnect purpose-built for GPU-to-GPU communication. It provides a memory semantic, low-latency, high-bandwidth scale-up AI Fabric that can efficiently connect 100s of GPUs together. As a UALink Consortium Promoter delivering silicon-based connectivity solutions for AI and Cloud, Astera Labs is collaborating with other industry leaders to deliver the UALink specifications and create an optimized scale-up ecosystem. We encourage the industry to join the UALink Consortium and get involved in developing the future of AI scale-up fabrics!

[1] https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/

[2] https://nvidianews.nvidia.com/news/spectrum-x-ethernet-networking-xai-colossus

[3] https://atscaleconference.com/videos/scheduler-and-sharding-considerations-for-network-efficiency/