The incredible pace of hardware and software innovations across CPUs, accelerators (such as GPUs), and AI training and inference models is yielding ever-expanding benefits across virtually all sectors of society. However, the disjointed pace of these separate evolutionary paths introduces three main challenges:
- Mixed-generation components can cause bandwidth mismatches and inefficiencies.
- Reliability shortfalls extend training time because of multiple job restarts.
- The AI ecosystem is evolving as newer models use increasingly diverse CPU:GPU:NIC ratios.
Let’s explore these challenges in more detail and then examine a possible solution.
The Rising Need for Mixed Generation Support
Legacy CPU, GPU, and NIC development followed a two- to two-and-a-half-year cadence. Today’s development is accelerating to keep pace with new AI training and inference requirements across a broadening spectrum of applications and use cases. Modern platforms must therefore support mixed technologies, such as mixed PCIe® generations, to integrate products on different production timelines.
Components that use the same PCIe generation are naturally bandwidth matched. For example, Figure 1 shows a single 400 Gbps PCIe Gen 5×16 NIC bandwidth-matched to a single PCIe Gen 5×16 GPU.
Figure 1: Bandwidth-matched NIC and GPU
Today’s GPUs have growing amounts of High Bandwidth Memory (HBM) and ever-faster data ingestion bandwidth, while today’s fastest NICs still operate at 400 Gbps. As a result, a PCIe Gen 5×16 400 Gbps NIC is bandwidth-mismatched with the 1024 Gbps ingest rate of a PCIe Gen 6×16 GPU, leaving costly GPUs underutilized and slowing AI training and inference.
Figure 2: Bandwidth-mismatched NIC and GPU
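To make the mismatch concrete, here is a minimal back-of-the-envelope sketch in Python. It uses raw PCIe transfer rates and ignores encoding and protocol overhead; the 400 Gbps NIC figure comes from the text, and the function names are purely illustrative.

```python
# A minimal back-of-the-envelope sketch (not vendor data): compare a NIC's line
# rate to a GPU's PCIe ingest bandwidth to spot generation mismatches.
# Raw link rates only; PCIe encoding and protocol overhead are ignored.

PCIE_GTS_PER_LANE = {3: 8, 4: 16, 5: 32, 6: 64}  # GT/s per lane by generation

def pcie_raw_gbps(gen: int, lanes: int = 16) -> float:
    """Raw PCIe link bandwidth in Gbps (1 GT/s ~= 1 Gbps per lane, pre-overhead)."""
    return PCIE_GTS_PER_LANE[gen] * lanes

def utilization(nic_gbps: float, gpu_gen: int, gpu_lanes: int = 16) -> float:
    """Fraction of the GPU's PCIe ingest bandwidth a single NIC can feed."""
    return min(1.0, nic_gbps / pcie_raw_gbps(gpu_gen, gpu_lanes))

if __name__ == "__main__":
    nic = 400  # Gbps, e.g. a 400G NIC on a PCIe Gen 5 x16 slot
    for gpu_gen in (5, 6):
        gpu_bw = pcie_raw_gbps(gpu_gen)
        print(f"Gen {gpu_gen} x16 GPU ingest {gpu_bw:.0f} Gbps, "
              f"single 400G NIC feeds {utilization(nic, gpu_gen):.0%}")
```

Under these simplifying assumptions, a single 400G NIC can feed roughly 78% of a Gen 5×16 GPU's raw ingest bandwidth but only about 39% of a Gen 6×16 GPU's, which is the gap Figure 2 illustrates.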
Eliminating Reliability Shortfalls to Optimize Training Time
Larger GPU clusters can process larger AI training and inference models faster than ever, but even a single failure, such as losing one GPU or network connection, can force the cluster to stop and restart. Meta’s Llama 3 paper, published in July 2024, describes training a Llama 3 model with 405 billion parameters. The run encountered numerous problems during its 54-day pre-training period. NIC and switch/cable problems (which could be as simple as a bumped or unplugged component) accounted for 42 interruptions, or around 10% of all problems encountered.
Figure 3: Meta Llama 3 405B AI cluster failures by type
Interruptions can have many effects. Most require the training platform to pause, pull checkpoint data, and then restart. The 42 network-related problems mentioned above work out to roughly one failure every 31 hours over the 54-day run.
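As a quick sanity check, that interval follows directly from the figures quoted above (a 54-day run with 42 NIC and switch/cable interruptions):

```python
# A quick sanity check on the interval between network-related interruptions,
# using only the figures quoted above (54-day run, 42 NIC/switch/cable events).

training_days = 54
network_interruptions = 42

hours = training_days * 24
mean_hours_between = hours / network_interruptions
print(f"{hours} training hours / {network_interruptions} events "
      f"~= one network-related interruption every {mean_hours_between:.0f} hours")
```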
The tensor-style parallelism in these models uses GPUs in a massively parallel fashion that requires all of them to operate as a single logical device. A network failure that cuts off one GPU therefore affects not just that GPU but the entire cluster, which may need to pause or even restart the job.
If a GPU cluster consists of 4 or 8 GPUs, a network card or cable failure requires those GPUs to restart from the most recent checkpoint, which is both inconvenient and time consuming. In a cluster with hundreds of thousands of GPUs, the same failure could impact the entire cluster and force a restart with catastrophic effects. Reliability is crucial and will become a dominant concern in future AI factories. This is where redundant components can help reduce the impact of these failures.
Figure 4: Network failures can force cluster restarts
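The sketch below is a toy probability model, not measured data: it assumes a hypothetical per-NIC failure probability per checkpoint interval and treats failures as independent, simply to illustrate why a single network link becomes a cluster-wide liability at scale and how dual-NIC redundancy changes the picture.

```python
# A toy model (illustrative assumptions, not measured data): in synchronous
# training, a single lost NIC link stalls the whole job, so the probability that
# a checkpoint interval completes cleanly shrinks rapidly with cluster size.

def clean_interval_probability(num_gpus: int,
                               p_link_fail: float,
                               redundant_nics: bool) -> float:
    """Probability that no GPU loses network connectivity during one interval.

    p_link_fail: assumed per-NIC failure probability for the interval.
    redundant_nics: if True, a GPU is cut off only if both of its NICs fail
    (failures treated as independent for simplicity).
    """
    p_gpu_cut_off = p_link_fail ** 2 if redundant_nics else p_link_fail
    return (1.0 - p_gpu_cut_off) ** num_gpus

if __name__ == "__main__":
    p = 1e-5  # hypothetical per-NIC failure probability per checkpoint interval
    for gpus in (8, 1_000, 100_000):
        single = clean_interval_probability(gpus, p, redundant_nics=False)
        dual = clean_interval_probability(gpus, p, redundant_nics=True)
        print(f"{gpus:>7} GPUs: clean-interval odds {single:.4f} (single NIC) "
              f"vs {dual:.6f} (dual NIC)")
```

With these assumed numbers, an 8-GPU node almost never sees a network-induced restart, while a 100,000-GPU cluster completes a clean interval only about a third of the time on a single NIC per GPU; doubling up the NICs pushes that figure back toward certainty.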
Evolving Ecosystems Must Support Diverse Component Ratios
Traditional Host Interface Boards (HIBs) are built around fixed CPU:GPU ratios. This is one reason the Open Compute Project (OCP) Accelerator Module (OAM) specification, which defined standard form factors and shapes, facilitated scalability so well.
Emerging AI platforms support increasingly diverse CPU:GPU:NIC ratios. Meanwhile, hyperscalers and system integrators are deploying AI clusters at unprecedented speed. Examples include systems with two GPUs and one CPU, four GPUs and one CPU, or four, six, or eight GPUs without a CPU. This proliferation of shapes and ratios makes it harder to build systems with the adaptability required to support current and future AI evolution, and the problem is only getting worse.
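One way to picture the adaptability problem is to treat platform shape as data rather than as a property baked into the board design. The sketch below is illustrative only; the shapes listed are the examples from the text, and the per-GPU NIC counts are assumptions.

```python
# A sketch of how diverse platform shapes might be described as data rather than
# hard-wired into a board design. The configurations below are illustrative
# examples from the text, not a product catalog; NIC counts are assumed.

from dataclasses import dataclass

@dataclass
class PlatformShape:
    name: str
    cpus: int
    gpus: int
    nics_per_gpu: int

SHAPES = [
    PlatformShape("2 GPU : 1 CPU", cpus=1, gpus=2, nics_per_gpu=1),
    PlatformShape("4 GPU : 1 CPU", cpus=1, gpus=4, nics_per_gpu=1),
    PlatformShape("8 GPU, headless", cpus=0, gpus=8, nics_per_gpu=2),
]

for s in SHAPES:
    print(f"{s.name:16} -> {s.cpus} CPU(s), {s.gpus} GPU(s), "
          f"{s.gpus * s.nics_per_gpu} NIC(s) total")
```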
Toward Modular AI Servers
Modular HIBs address all three of these challenges:
- Atomic data ingestion boards align to the GPU and can scale across different GPU:CPU:NIC ratios to address training and/or inference.
- Platforms with PCIe 6 GPUs can utilize existing PCIe 5 NICs for faster time to market.
- Network layer redundancy reduces job restarts by effectively eliminating a single point of failure in existing architectures.
Figure 5: Sample HIB configurations
Figure 6: A modular dual-NIC HIB bandwidth-matches PCIe 5 NICs to a PCIe 6 GPU while adding redundancy
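A quick check of the raw link math behind Figure 6 (encoding and protocol overhead again ignored): two PCIe Gen 5 x16 links aggregate to the same raw bandwidth as a single PCIe Gen 6 x16 link.

```python
# A simple check (raw rates, overhead ignored) that two PCIe Gen 5 x16 links
# aggregate to the same raw bandwidth as one PCIe Gen 6 x16 link.

GEN5_X16_GBPS = 32 * 16   # 512 Gbps raw
GEN6_X16_GBPS = 64 * 16   # 1024 Gbps raw

dual_nic = 2 * GEN5_X16_GBPS
print(f"2 x Gen5 x16 = {dual_nic} Gbps vs Gen6 x16 = {GEN6_X16_GBPS} Gbps")
assert dual_nic == GEN6_X16_GBPS
```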
Creating a modular architecture around data ingestion for the GPU or accelerator allows the system integrator to:
- Future-proof the platform architecture.
- Reuse the platform across different inference and training use cases.
- Leverage mixed-rate and mixed-generation products for rapid deployment.
- Provide critical redundancy that reduces job restarts, potentially recovering millions of dollars in value.
Scorpio Smart Fabric Switches deliver predictable, high-performance AI dataflows and are purpose-built to address the needs of modular AI infrastructure. Scorpio P-Series Fabric Switches are architected for mixed-traffic head-node connectivity across a diverse ecosystem of PCIe hosts and endpoints. Thanks to this modular architecture, each Scorpio P-Series Fabric Switch dedicates ingest and scale-out resources to each GPU. This avoids the risk of traffic to or from a neighboring GPU consuming an uneven share of the switch core, delivering maximum predictable performance with increased efficiency and GPU utilization.
Scorpio comes with Astera Labs’ COSMOS software suite, which delivers smart, customizable fleet management and forms the connectivity backbone of the data center, providing unprecedented observability, enhanced security, and extensive management capabilities. Combined, Scorpio and COSMOS can prevent job restarts through NIC redundancy and can pinpoint and repair failed links, restoring the platform to full bandwidth without costly data center technician callouts.
Check out additional resources for more information: