Optimizing Deep Learning Recommendation Models with AMD EPYC™ 9005 & Leo CXL Smart Memory Controllers

By Michael Ocampo, Ecosystem Alliance Manager, Astera Labs, and Hasan Maruf, Server Systems Architect, AMD

Introduction

Recommendation systems power many of today’s popular applications, including video streaming, social media, advertisement delivery, and other online services, where they enhance the user experience, drive engagement, and increase sales. This widespread adoption is largely due to the significant advancements in, and effectiveness of, Deep Learning Recommendation Models – or DLRMs – a machine learning architecture for large-scale recommendation systems. Deep Learning (DL) models, in particular, are often complex, with numerous layers processing millions of parameters.

To support the growing adoption of these models, the industry must overcome several challenges: securing the computational resources needed to run complex algorithms, ensuring sufficient memory capacity to process large datasets, achieving low-latency and high-bandwidth performance, maintaining accuracy and reliability, and, not least, keeping the solution cost effective.

DLRMs utilize neural networks to predict user-item interactions within large datasets. While high-end GPUs are required to deliver the computational power needed for Large Language Models, they are more power-hungry and expensive than CPUs. CPUs offer a cost-effective, power-efficient inference solution with sufficient compute and memory capacity for DLRMs. With AI inferencing acceleration integrated directly into the CPU, DLRMs can be deployed in compact form factors, making them ideal for edge data centers, where they reduce round-trip latency and enhance the user experience.

This blog highlights how AMD’s EPYC CPU, Astera Labs’ Leo CXL Smart Memory Controller, and DDR5 memory can offer a comprehensive end-to-end CXL-based memory solution that addresses the challenges mentioned above, meeting the ever-growing demands of these DL models.

DLRM Overview

DLRMs combine principles from collaborative filtering and predictive analytics, allowing them to perform efficiently at production scale and deliver state-of-the-art results. They are designed to handle both continuous (dense) and categorical (sparse) features that describe users and products. The primary components of a DLRM are listed below; a minimal code sketch follows Figure 1:

  1. Embedding Layers (Handling Sparse Features)
    • Examples of inputs include categorical data such as user IDs or product IDs
    • Each category is assigned an entry in an embedding table, which maps it to a dense vector so the model can efficiently process and compare categories
  2. Dense Features and Bottom MLP (Handling Numerical Data)
    • Examples of features include numerical data such as age, clicks, counts, or time spent on a website
    • These dense features are processed through a bottom MLP (Multilayer Perceptron), which applies a series of fully connected neural network layers
  3. Feature Interaction Module (Capturing Relationships)
    • The core of DLRM is its ability to model interactions between features
    • These interactions are captured using dot product operations between the dense vectors produced by the embedding layers and the bottom MLP
  4. Top MLP (Making Predictions)
    • After computing feature interactions, the outputs are fed into the top MLP
    • This network processes the combined interactions and generates the final recommendation score or probability

Figure 1: DLRM Schematic Representation
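
To make the four components above concrete, here is a minimal, illustrative DLRM-style forward pass in PyTorch. The layer sizes, table counts, and feature dimensions are assumptions chosen for readability, not the configuration used in the benchmark discussed later.

```python
# Minimal DLRM-style forward pass (illustrative sketch; sizes are assumptions).
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    def __init__(self, num_rows=1000, emb_dim=16, num_tables=3, num_dense=4):
        super().__init__()
        # 1. Embedding tables: map sparse categorical IDs to dense vectors.
        self.tables = nn.ModuleList(
            nn.EmbeddingBag(num_rows, emb_dim, mode="sum") for _ in range(num_tables)
        )
        # 2. Bottom MLP: project dense numerical features to the same dimension.
        self.bottom_mlp = nn.Sequential(nn.Linear(num_dense, emb_dim), nn.ReLU())
        # 4. Top MLP: turn the concatenated interactions into a probability.
        num_vectors = num_tables + 1
        num_pairs = num_vectors * (num_vectors - 1) // 2
        self.top_mlp = nn.Sequential(nn.Linear(num_pairs + emb_dim, 1), nn.Sigmoid())

    def forward(self, dense_x, sparse_ids):
        d = self.bottom_mlp(dense_x)                          # [B, emb_dim]
        embs = [table(ids) for table, ids in zip(self.tables, sparse_ids)]
        vectors = torch.stack([d] + embs, dim=1)              # [B, T+1, emb_dim]
        # 3. Feature interaction: pairwise dot products between all vectors.
        inter = torch.bmm(vectors, vectors.transpose(1, 2))   # [B, T+1, T+1]
        i, j = torch.triu_indices(vectors.size(1), vectors.size(1), offset=1)
        return self.top_mlp(torch.cat([d, inter[:, i, j]], dim=1))  # [B, 1]

batch = 8
dense = torch.randn(batch, 4)
sparse = [torch.randint(0, 1000, (batch, 2)) for _ in range(3)]
print(TinyDLRM()(dense, sparse).shape)  # torch.Size([8, 1])
```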

Challenges with CPU-based Inferencing for DLRMs

Complexity

As shown in Figure 1 above, the DLRM benchmark includes compute-dominated MLPs and memory capacity-limited embeddings. The embedding layers map each category to a dense representation before feeding it into the MLPs. The interaction layer processes these sparse and dense features by computing dot products over all the embedding vector inputs. The interactions are then fed into a top-level MLP to compute the likelihood of interaction for a user-item pair. For example, “How does a specific user interact with a specific TV show? How long does the user watch the show? Did the user click on other similar shows?”

This level of complexity requires processors with higher core counts to handle data parallelism, improving MLP performance and enabling faster, more complex computations.

Capacity constraints

A significant portion of the need for increased memory capacity at high bandwidth comes from sparse features stored as indices into embedding tables. Dense features and the interaction layer require high bandwidth but low capacity. Addressing the memory capacity required to efficiently run the dot product operations therefore becomes crucial as these models scale. Insufficient memory capacity leads to slow recommendations, because embeddings must be pulled in from disk or, in some cases, over the network, resulting in a poor user experience and, potentially, lost revenue.
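
A quick back-of-the-envelope calculation shows why embeddings dominate memory sizing. The row counts and vector dimensions below are illustrative assumptions, not the tables used in the benchmark.

```python
# Rough embedding-table footprint (row count and dimension are assumptions).
def embedding_gib(num_rows: int, emb_dim: int, bytes_per_elem: int = 4) -> float:
    """Size of one embedding table in GiB, assuming fp32 entries."""
    return num_rows * emb_dim * bytes_per_elem / 2**30

# One table with 100M category IDs and 128-dim fp32 vectors:
print(f"{embedding_gib(100_000_000, 128):.1f} GiB")  # ~47.7 GiB
# A model with 26 such tables needs roughly 1.2 TiB for embeddings alone,
# far beyond what dense features and the interaction layer require.
```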

Cost effectiveness

One of the primary challenges in supporting DLRMs is affordability, due to the high cost of processors and their associated memory. As AI models demand more memory, the cost per gigabyte ($/GB) increases significantly with next-generation memory architectures such as 3DS DIMMs. An alternative is to use storage such as SSDs to reduce the total cost of ownership (TCO), but this approach can significantly impact performance, as data must be loaded from the I/O subsystem into main memory.

Overcoming the Challenges with CXL®

Compute Express Link® (CXL) Type 3 devices provide an additional memory tier beyond HBM and CPU-attached DRAM that can be leveraged for recommender systems. This enables the memory expansion that is critical for DLRMs processing huge datasets. As DLRM sizes grow to improve accuracy, CXL memory helps reduce inferencing time by providing a faster tier than flash, at a much lower cost than adding costly 3DS DIMMs or additional processors simply to attach more memory channels populated with standard monolithic-die DDR5 modules.

In terms of performance, CXL memory delivers the same latency as a NUMA hop and offers a tight latency profile that meets the requirements of inference engines. Using CXL-based memory for read and write operations when staging large datasets significantly reduces the processor execution time.
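
On Linux, CXL-attached memory from a Type 3 device typically surfaces as a CPU-less NUMA node. The sketch below walks sysfs to distinguish such nodes from local DRAM; the sysfs layout is the standard kernel convention, but node numbering, and whether a CPU-less node is actually CXL memory, depend on the platform.

```python
# Hedged sketch: list NUMA nodes and flag CPU-less ones (often CXL expansion).
from pathlib import Path

def list_numa_nodes(sysfs: str = "/sys/devices/system/node") -> None:
    for node in sorted(Path(sysfs).glob("node[0-9]*")):
        cpulist = (node / "cpulist").read_text().strip()
        meminfo = (node / "meminfo").read_text()
        mem_kb = int(meminfo.split("MemTotal:")[1].split()[0])
        kind = "local DRAM" if cpulist else "CPU-less (possibly CXL expansion)"
        print(f"{node.name}: cpus=[{cpulist or 'none'}] "
              f"mem={mem_kb / 2**20:.1f} GiB -> {kind}")

if __name__ == "__main__":
    list_numa_nodes()
```

Once the CXL node is identified, standard NUMA tooling (for example, numactl interleave or membind policies) or application-level allocators can be used to stage large embedding tables in that tier.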

CXL requires hardware support from both the CPU and devices. The latest AMD 5th Gen EPYC Processor coupled with Leo CXL Smart Memory Controllers provides an optimal combination of CXL 2.0-based memory expansion to address the memory capacity and bandwidth challenges faced by DLRM systems. Leo expands memory using 96GB DDR5-5600MT/s DIMMs, effectively doubling the system memory, which enhances both the speed and accuracy of AI models. AMD and Astera Labs have worked closely together to ensure Leo CXL Memory Controllers seamlessly interoperate with AMD’s EPYC 9005 platform and are optimized for AI inferencing to accelerate recommendation systems.

Performance Results

The test system ran DLRM traces with the Zen Deep Neural Network (ZenDNN) library developed by AMD, on an AMD 5th Gen EPYC processor equipped with Leo CXL Smart Memory Controllers. ZenDNN is a deep neural network inference acceleration library comprising a set of fundamental building blocks and APIs that enhance AI inferencing applications.
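
As a rough illustration of how such a library is typically wired into a CPU inference path, the sketch below assumes the zentorch PyTorch plugin that accompanies ZenDNN and its torch.compile backend; the package name, backend string, and toy model here are assumptions to verify against AMD’s ZenDNN documentation rather than a description of the benchmark setup.

```python
# Hedged sketch: routing CPU inference through ZenDNN via the zentorch plugin.
# The "zentorch" import and backend string are assumptions to confirm against
# AMD's ZenDNN/zentorch documentation.
import torch
import torch.nn as nn
import zentorch  # assumed to register a "zentorch" backend with torch.compile

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1)).eval()
compiled = torch.compile(model, backend="zentorch")

with torch.no_grad():
    scores = compiled(torch.randn(32, 16))  # inference served by CPU kernels
```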

Here are the key components of the test system:

Component | Description | QTY
System | AS-4125GS-TNRT, 4U Supermicro System, supporting up to 8 DW AICs | 1
CPU | AMD 5th Gen EPYC Processor | 1
Memory | 64GB DDR5-5600 Memory Modules | 12
CXL Memory Controller | Leo A1000 CXL Smart Memory Add-in Card | 4
CXL-attached Memory | 64GB DDR5-5600 Memory Modules | 8

With this hardware, two threading configurations were tested to optimize performance: batch threading and table threading. In batch threading, each core accesses multiple tables by parallelizing across the batches; in contrast, table threading pins each table to a specific core (a simplified sketch of both strategies follows). Together, this hardware and these test configurations demonstrate the benefits of CXL for the DLRM workload.
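
The sketch below illustrates the two strategies in simplified form, using PyTorch embedding lookups as stand-ins for the benchmark’s tables. Table sizes, counts, and worker counts are assumptions, and true core pinning would additionally require setting thread affinity (for example with os.sched_setaffinity), which is omitted here.

```python
# Hedged sketch: batch threading vs. table threading for embedding lookups.
from concurrent.futures import ThreadPoolExecutor
import torch
import torch.nn as nn

tables = [nn.EmbeddingBag(10_000, 64, mode="sum") for _ in range(8)]
ids = [torch.randint(0, 10_000, (4096, 4)) for _ in tables]  # one batch per table

def batch_threading(num_workers: int = 4):
    # Each worker handles a slice of the batch across *all* tables.
    def work(lo, hi):
        return [table(batch_ids[lo:hi]) for table, batch_ids in zip(tables, ids)]
    step = ids[0].size(0) // num_workers
    with ThreadPoolExecutor(num_workers) as ex:
        return list(ex.map(lambda k: work(k * step, (k + 1) * step), range(num_workers)))

def table_threading():
    # Each table gets its own worker, which processes the full batch for that table.
    with ThreadPoolExecutor(len(tables)) as ex:
        return list(ex.map(lambda pair: pair[0](pair[1]), zip(tables, ids)))

batch_threading()
table_threading()
```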

The results show:

  • Up to 73% boost in AI inferencing performance with CXL
  • Up to 66% more memory bandwidth
  • Plug-and-play support for CXL interleaving

CXL enables system designers to deliver high-performance, high-capacity solutions with low TCO and RAS features that are critical for hyperscale and enterprise deployments. Leo CXL Memory Controllers, in conjunction with the AMD EPYC Processor and DDR5 memory, have undergone intensive feature validation, thorough interoperability testing, and rigorous application-level testing, making them ready for cloud-scale deployment.  Furthermore, developing a CXL-memory-aware dynamic page allocation policy could yield additional improvements in the efficacy of using CXL for DL models.

Conclusion

Leo CXL Memory Controllers expanded memory capacity and delivered up to 66% more memory bandwidth to supercharge DLRMs, boosting AI inferencing performance by up to 73%. In an AI market where data-intensive applications demand faster processing and larger memory pools, Astera Labs and AMD have developed a compelling solution to enhance DLRM efficiency. This improvement enables higher-quality recommendations for advertisements, product placements, social media updates, and beyond.

About Michael Ocampo, Ecosystem Alliance Manager

Michael is an evangelist for open ecosystems to accelerate hybrid cloud, Enterprise and AI solutions. With over a decade in x86 system integration, IaaS, PaaS, and SaaS, he offers valuable customer insights to cloud and system architects designing high-speed connectivity solutions for AI Training, Inferencing, Cloud, and Edge Computing Infrastructure. At Astera Labs, he leads ecosystem alliances and owns the Cloud-Scale Interop Lab to drive seamless interoperability of HW and SW solutions to optimize TCO and performance of infrastructure services.
