The Long and Short of AI: Building Scalable Data Centers in the PCIe 6.x Era

By Abhishek Wadhwa, Senior Field Applications Engineer 

The rise of artificial intelligence (AI) and Generative AI is transforming how we interact with technology. From healthcare to business efficiency to groundbreaking research, AI and Generative AI are making waves. These AI marvels rely on vast amounts of hardware and infrastructure to function. As a result, data centers are undergoing a revolution driven by the ever-growing demands of the workloads required to train AI and Generative AI models.

As shown on the left of Figure 1 below, early data center architectures consisted of racks of servers, each with a fixed amount of processing power and memory. As data center workload requirements shifted, this rigid setup led to resource stranding, where some servers sat with idle resources while others were maxed out [1]. In the past decade, servers were partially disaggregated to optimize for changing workloads. This allowed data center operators to separate compute, memory, and storage resources into individual building blocks, as shown in the center rack of Figure 1. These blocks are dynamically allocated to workloads, maximizing utilization and efficiency. Disaggregation empowered data centers to adapt and scale seamlessly with changing workloads. Now, the AI-based systems shown on the right of Figure 1 are being deployed to run AI workloads efficiently.

Figure 1: Evolution of rack architecture from blade systems to AI/ML systems

PCIe Connectivity in an AI System

PCI Express® (PCIe®) technology is the workhorse of data center communication, offering high-bandwidth, low-latency, efficient data exchange between CPUs, GPUs, and other components. Current AI architectures are deployed with PCIe 5.0, which runs at 32 GT/s with a Nyquist frequency of 16 GHz. At 16 GHz, however, channel loss is high, limiting PCIe 5.0 signal reach to 10-12 inches on an ultra-low-loss PCB material [2]. To ensure smooth communication in a PCIe 5.0 architecture, system designers leverage Retimers to compensate for considerable channel insertion loss and extend signal reach.
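To see why Retimers are needed at these rates, the short Python sketch below tallies insertion loss for a hypothetical head-node-to-GPU path against the roughly 36 dB PCIe 5.0 channel budget at 16 GHz. Every per-segment loss value here is an assumed, illustrative number, not a measured or specification figure.

# Illustrative PCIe 5.0 loss-budget check (32 GT/s NRZ, 16 GHz Nyquist).
# All per-segment losses are assumed, round figures for illustration only.

BUDGET_DB = 36.0  # approximate PCIe 5.0 end-to-end insertion-loss budget at 16 GHz

def total_loss(segments):
    """Sum per-segment insertion loss (dB) for one electrical channel."""
    return sum(segments.values())

# Hypothetical head-node-to-GPU-baseboard path with no Retimer in the middle
end_to_end = {
    "head-node board": 12.0,
    "connector": 1.5,
    "cable to interface board": 15.0,
    "interface board + GPU baseboard routing": 14.0,
}
print(f"No Retimer: {total_loss(end_to_end):.1f} dB against a {BUDGET_DB:.0f} dB budget")

# With a Retimer on the interface board, each leg is a fresh channel with its own budget
leg1 = {"head-node board": 12.0, "connector": 1.5, "cable to interface board": 15.0}
leg2 = {"interface board + GPU baseboard routing": 14.0}
for name, leg in (("Leg 1 (head node to Retimer)", leg1), ("Leg 2 (Retimer to GPU)", leg2)):
    loss = total_loss(leg)
    print(f"{name}: {loss:.1f} dB -> within budget: {loss <= BUDGET_DB}")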

This architectural shift changed PCIe connectivity from short to long reach. Figure 2 below represents a typical AI system. These systems vary in design, but they typically connect a head node server directly via PCIe to one or more "Just a Bunch of GPUs" (JBOG)/AI Accelerator baseboards, each holding up to eight GPUs. The head node shown in Figure 2 is a two-socket server utilizing Aries 5 PCIe 5.0/CXL® Retimers for box-to-box server communication. A Host Interface Board (HIB) connects the head node with the JBOG, and another set of Aries 5 Retimers facilitates robust communication between the HIB and the JBOG.

Figure 2: AI system

The Long Reach – Scaling up GPU Clusters

Scaling up AI model training requires distributing the workload across thousands of GPUs. This technique, called data parallelism, necessitates efficient communication between GPUs for exchanging data. While PCIe interconnectivity within a JBOG enables data exchange within a single box, power limitations come into play when scaling up the GPU cluster.
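To make that traffic pattern concrete, here is a minimal, toy sketch of data-parallel training in Python: each "GPU" is just an array of gradients, and the all-reduce step that averages them is the data exchange the interconnect must carry. The GPU count and model size are arbitrary stand-ins, not figures from any real deployment.

import numpy as np

NUM_GPUS = 8
MODEL_PARAMS = 1_000_000  # toy model size; production models run to billions of parameters

# Each GPU computes gradients on its own data shard (random stand-ins here)
local_grads = [np.random.randn(MODEL_PARAMS).astype(np.float32) for _ in range(NUM_GPUS)]

# All-reduce: every GPU must end up with the same averaged gradient each step
avg_grad = np.mean(local_grads, axis=0)

# The bytes every GPU sends and receives per step scale with model size,
# which is why GPU-to-GPU link bandwidth matters so much at scale.
bytes_per_step = MODEL_PARAMS * 4  # float32
print(f"~{bytes_per_step / 1e6:.0f} MB of gradients exchanged per GPU per training step")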

A single current-generation AI system can draw around 10 kW, limiting the number of AI systems that can be deployed in a rack with a typical power density of 15-20 kW [3]. To work within these power limits, one approach distributes AI systems across multiple racks, spreading the power load more evenly (Figure 3). However, this creates a new challenge: efficiently connecting the distributed GPUs for scale-up data exchange.
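A quick back-of-envelope calculation with the numbers above shows why multi-rack deployment follows almost immediately. The figures are the approximate ones cited in this article, not measurements of any specific deployment.

SYSTEM_POWER_KW = 10.0           # approximate draw of one AI system (head node + JBOG)
RACK_BUDGETS_KW = (15.0, 20.0)   # typical rack power density range cited above

for budget in RACK_BUDGETS_KW:
    systems_per_rack = int(budget // SYSTEM_POWER_KW)
    print(f"{budget:.0f} kW rack -> room for {systems_per_rack} AI system(s)")

# At one or two systems per rack, even a modest cluster of four systems already
# spans multiple racks, which is what pushes PCIe links between racks.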

PCIe connections over passive cables have a limited reach of approximately three meters for PCIe 5.0 (Figure 3), and their large bend radius further constrains how cables can be routed within the rack. Aries PCIe/CXL Smart Cable Modules (SCMs) with built-in Retimers extend the reach up to seven meters for PCIe 5.0. These thinner, more flexible cables with a tighter bend radius also improve the rack's airflow. The reach and flexibility enabled by Aries SCMs open up connections across multiple racks, tying GPU clusters and AI Accelerators together to reduce power density.

Figure 3: Multi-rack GPU clustering
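As a rough illustration of those reach limits, the sketch below checks a few hypothetical cross-rack cable runs against the approximate three-meter passive and seven-meter SCM reach figures mentioned above; the run lengths themselves are made-up examples.

PASSIVE_REACH_M = 3.0   # approximate passive-cable reach at PCIe 5.0
SCM_REACH_M = 7.0       # approximate reach with Aries Smart Cable Modules

def cable_options(run_length_m):
    """Return which copper options can cover a given PCIe 5.0 run (illustrative only)."""
    options = []
    if run_length_m <= PASSIVE_REACH_M:
        options.append("passive cable")
    if run_length_m <= SCM_REACH_M:
        options.append("Aries SCM")
    return options or ["beyond copper reach"]

for run_m in (2.0, 5.0, 8.0):   # e.g. same rack, adjacent rack, several racks away
    print(f"{run_m:.0f} m run: {', '.join(cable_options(run_m))}")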

The Short Reach – Doubling Bandwidth with PCIe 6.0

As AI models double in computation every six months, the bandwidth for their collective training process needs an upgrade too [4]. PCIe 6.0 offers double the bandwidth of PCIe 5.0, pushing data rates up to 64 GT/s per lane. This leap presents even greater signal integrity (SI) challenges than PCIe 5.0. Figure 4 shows the NRZ signaling (left) used by PCIe 5.0 and the PAM-4 signaling (right) used by PCIe 6.0. PCIe 6.0 retains the 16 GHz Nyquist frequency and encodes two bits per unit interval (UI) to double the data rate. PAM-4 has four voltage levels and three eyes, compared to two levels and one eye in NRZ signaling.

Since the overall voltage swing remains fixed, each eye in the PAM-4 system offers only one-third of the voltage available in NRZ, as shown in the image below. This makes the signal far more susceptible to noise during data transmission. The smaller PAM-4 eyes translate to an SNR degradation of roughly 9.5 dB compared to NRZ, further stressing SI [5]. The complex signaling and higher data rate of PCIe 6.0 also make jitter a bigger concern, degrading signal quality and leading to errors; accordingly, the jitter budget for common clock architectures was trimmed from 0.25 ps RMS in PCIe 5.0 to 0.15 ps RMS in PCIe 6.0.

Figure 4: PCIe 6.0 implements PAM4 signaling resulting in tougher signal integrity challenges
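The arithmetic behind the doubling, and the SNR penalty that comes with it, fits in a few lines of Python. This is the textbook PAM-4 relationship, not anything specific to a particular PCIe 6.0 implementation.

import math

NYQUIST_GHZ = 16.0
symbol_rate_gbaud = 2 * NYQUIST_GHZ        # 32 Gbaud for both generations

nrz_rate = symbol_rate_gbaud * 1           # 1 bit per UI  -> 32 GT/s (PCIe 5.0, NRZ)
pam4_rate = symbol_rate_gbaud * 2          # 2 bits per UI -> 64 GT/s (PCIe 6.0, PAM-4)
print(f"NRZ: {nrz_rate:.0f} GT/s, PAM-4: {pam4_rate:.0f} GT/s at the same 16 GHz Nyquist")

# Splitting the same swing into three eyes leaves each eye 1/3 the amplitude,
# so the ideal SNR penalty is 20*log10(3), i.e. roughly 9.5 dB.
snr_penalty_db = 20 * math.log10(3)
print(f"PAM-4 SNR penalty vs NRZ: {snr_penalty_db:.1f} dB")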

The higher PCIe 6.0 data rate also comes with a reduced channel loss budget, making it more difficult to maintain signal integrity over longer distances. The channel loss budget for PCIe 6.0 is 32 dB, compared to 36 dB for PCIe 5.0, forcing designers to be more diligent about managing signal loss in their systems. Board design complexity increases due to tighter rise-time requirements and smaller transition amplitudes, requiring more sophisticated equalization and clock recovery. Much like in the PCIe 5.0 server architecture shown above, the Aries 6 Retimer helps overcome these channel loss hurdles, serving as a powerful ally in dealing with board design complexity.

Aries Retimers – Extending PCIe 6.0 Reach

Figure 5 shows PCIe 6.0 signal reach for a traditional add-in-card topology with and without a Retimer. Without a Retimer, PCIe 6.0 system board reach is limited to 4.3 inches on low-loss PCB material and 7.1 inches on ultra-low-loss PCB material [6]. With a Retimer, this reach more than doubles to 9.9 inches and 15.1 inches for low-loss and ultra-low-loss board materials, respectively.


Figure 5: Adding a Retimer to the system increases signal reach, enabling additional design form factors
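A naive loss-budget model reproduces the ballpark of Figure 5: assume the system-board routing gets some fixed slice of the 32 dB PCIe 6.0 budget, divide by a per-inch trace loss, and note that a Retimer mid-channel restarts that budget. The board budget share and the dB-per-inch values below are assumptions picked for illustration, not characterized material data; the published figures come from full channel analysis.

BOARD_BUDGET_DB = 9.0   # hypothetical slice of the 32 dB budget left for system-board
                        # routing after packages, connector, and add-in card take their share

loss_per_inch_db = {
    "low-loss PCB": 2.1,         # assumed dB/inch at 16 GHz
    "ultra-low-loss PCB": 1.27,  # assumed dB/inch at 16 GHz
}

for material, db_per_in in loss_per_inch_db.items():
    reach_in = BOARD_BUDGET_DB / db_per_in
    # A Retimer terminates and re-drives the link mid-channel, so each side
    # gets the board budget again, roughly doubling total reach.
    print(f"{material}: ~{reach_in:.1f} in without a Retimer, ~{2 * reach_in:.1f} in with one")

# Figure 5's simulated values (9.9 in and 15.1 in with a Retimer) come out a bit
# better than this naive 2x estimate.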

Aries 6 Retimers are the gold-standard solution for extending signal reach, offering the lowest-risk path to market. The third-generation Aries 6 represents the culmination of years of learning from cloud-scale deployments with all major hyperscalers and AI platform providers. Aries 6 provides wide interoperability support, robust signal integrity, and SerDes and DSP optimized for demanding AI server channels. It also features enhanced diagnostics and telemetry via COSMOS (COnnectivity System Management and Optimization Software). Further enhancing its adaptability, Aries' COSMOS software and firmware eliminate the need for silicon re-spins, offering customizability for each platform. This makes Aries 6 the ideal choice for those seeking a reliable, efficient, and interoperable Retimer solution.

As next-generation data centers disaggregate AI systems to support high power densities, connectivity becomes a critical bottleneck. While PCIe 6.0 technology offers double the bandwidth, it also introduces significant SI challenges. Overcoming these hurdles requires innovative solutions like Aries 6 Retimers and Aries Smart Cable Modules. As AI continues to evolve, ensuring efficient communication across distributed AI Accelerator baseboards will be crucial to unlocking its full potential. The future of data centers hinges on robust, scalable communication architectures that can keep pace with the ever-increasing demands of AI workloads.

References:

  1. Lin, Y. Cheng, M. D. Andrade, L. Wosinska and J. Chen, "Disaggregated Data Centers: Challenges and Trade-offs," IEEE Communications Magazine, vol. 58, no. 2, pp. 20-26, February 2020, doi: 10.1109/MCOM.001.1900612.
  2. https://www.asteralabs.com/smart-retimers/simulating-with-retimers-for-pcie-5-0/
  3. https://www.electronicdesign.com/blogs/the-briefing/article/21278725/electronic-design-can-silicon-supply-enough-power-for-the-future-of-ai-silicon
  4. https://www.visualcapitalist.com/cp/charted-history-exponential-growth-in-ai-computation
  5. Intel, AN 835: PAM4 Signaling Fundamentals.
  6. C. Morrison, "T3S11: Design Considerations for PCIe 6.0 Retimers," presented at a PCI-SIG conference.