By Michael Ocampo, Ecosystem Alliance Manager, Astera Labs, and Hasan Maruf, Server Systems Architect, AMD
Introduction
Recommendation systems are widely used in today’s popular applications, such as video streaming, social media, advertisement delivery, and other online services, to enhance the user experience, drive engagement, and increase sales. This widespread adoption is largely due to the significant advancements and effectiveness of Deep Learning Recommendation Models (DLRMs), a machine learning architecture for large-scale recommendation systems. Deep Learning (DL) models, in particular, are often complex, with numerous layers processing millions of parameters.
To support the growing adoption of these models, the industry must overcome several challenges, including securing the computational resources needed to run complex algorithms, ensuring sufficient memory capacity to process large datasets, achieving low-latency and high-bandwidth performance, maintaining accuracy and reliability, and, last but not least, keeping the solution cost effective.
DLRMs utilize neural networks to predict user-item interactions within large datasets. While high-end GPUs are required to deliver the computational processing power needed for Large Language Models, they are more power-hungry and expensive than CPUs. CPUs offer a cost-effective, power-efficient inference solution with sufficient compute and memory capacity for DLRMs. With AI inferencing acceleration integrated directly into the CPU, DLRMs can be deployed in compact form factors, making them ideal for edge data centers that reduce round-trip latency and enhance the user experience.
This blog highlights how AMD’s EPYC CPU, Astera Labs’ Leo CXL Smart Memory Controller, and DDR5 memory can offer a comprehensive end-to-end CXL-based memory solution that addresses the challenges mentioned above, meeting the ever-growing demands of these DL models.
DLRM Overview
DLRMs combine principles from both collaborative filtering and predictive analytics-based approaches, allowing them to perform efficiently at production scale and provide state-of-the-art results. They are designed to handle both continuous (dense) and categorical (sparse) features that describe users and products. The primary components of DLRM include the following (a minimal code sketch of this flow appears after Figure 1):
- Embedding Layers (Handling Sparse Features)
  - Examples of inputs include categorical data such as user IDs or product IDs
  - Each category is mapped to a row in an embedding table, which allows the model to efficiently process and compare categories
- Dense Features and Bottom MLP (Handling Numerical Data)
  - Examples of features include numerical data such as age, clicks, counts, or time spent on a website
  - These dense features are processed through a bottom MLP (Multilayer Perceptron), which applies a series of fully connected neural network layers
- Feature Interaction Module (Capturing Relationships)
  - The core of DLRM is its ability to model interactions between features
  - These interactions are captured using dot product operations between the dense vectors produced by the embedding layers and the bottom MLP
- Top MLP (Making Predictions)
  - After computing feature interactions, the outputs are fed into the top MLP
  - This network processes the combined interactions and generates the final recommendation score or probability
Figure 1: DLRM Schematic Representation
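To make this flow concrete, here is a minimal, illustrative sketch of a DLRM-style forward pass in PyTorch. The layer widths, table sizes, and batch shapes are arbitrary assumptions chosen for readability; they do not reflect production models or the benchmark configuration discussed later.

```python
# Minimal DLRM-style forward pass (illustrative sizes only).
import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    def __init__(self, table_sizes, emb_dim=16, num_dense=4):
        super().__init__()
        # One embedding table per categorical (sparse) feature.
        self.tables = nn.ModuleList(
            [nn.EmbeddingBag(n, emb_dim, mode="sum") for n in table_sizes]
        )
        # Bottom MLP projects dense features into the same space as the embeddings.
        self.bottom_mlp = nn.Sequential(
            nn.Linear(num_dense, 64), nn.ReLU(), nn.Linear(64, emb_dim), nn.ReLU()
        )
        # Top MLP turns the pairwise interactions into a recommendation score.
        num_vectors = len(table_sizes) + 1                # embeddings + dense vector
        num_pairs = num_vectors * (num_vectors - 1) // 2  # pairwise dot products
        self.top_mlp = nn.Sequential(
            nn.Linear(num_pairs + emb_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, dense_x, sparse_ids):
        # dense_x: [batch, num_dense]; sparse_ids: one [batch] index tensor per table.
        d = self.bottom_mlp(dense_x)                              # [batch, emb_dim]
        embs = [t(ids.unsqueeze(1)) for t, ids in zip(self.tables, sparse_ids)]
        vecs = torch.stack([d] + embs, dim=1)                     # [batch, V, emb_dim]
        inter = torch.bmm(vecs, vecs.transpose(1, 2))             # all pairwise dot products
        iu, ju = torch.triu_indices(vecs.size(1), vecs.size(1), offset=1)
        flat = inter[:, iu, ju]                                   # [batch, num_pairs]
        return torch.sigmoid(self.top_mlp(torch.cat([d, flat], dim=1)))

model = TinyDLRM(table_sizes=[1000, 500, 200])
scores = model(torch.rand(8, 4), [torch.randint(0, n, (8,)) for n in (1000, 500, 200)])
```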
Challenges with CPU-based Inferencing for DLRMs
Complexity
As shown in Figure 2 above, the DLRM benchmark includes compute-dominated MLPs and memory-capacity-limited embeddings. The embedding layers map each category to a dense representation before feeding it into the MLPs. The interaction layer processes these sparse and dense features by computing interactions over all of the embedding vector inputs. The interactions are then fed into a top-level MLP to compute the likelihood of interaction for a user-item pair. For example, “How does a specific user interact with a specific TV show? How long does the user watch the show? Did the user click on other similar shows?”
This level of complexity requires processors with higher core counts to handle data parallelism, improving MLP performance and enabling faster, more complex computations.
Capacity constraints
A significant portion of the need for increased memory capacity at high bandwidth comes from sparse features stored as indices into embedding tables. Dense features and the interaction layer require high bandwidth but low capacity. Therefore, addressing the memory capacity requirements to efficiently run dot product operations becomes crucial as these models scale. Insufficient memory capacity leads to slow recommendations, because embeddings must be pulled in from disk or, in some cases, over the network, resulting in a poor user experience and, potentially, lost revenue.
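As a back-of-envelope illustration of why capacity dominates, the snippet below estimates the footprint of a hypothetical set of embedding tables; the row counts, vector width, and table count are assumptions chosen only to show the arithmetic, not figures from any production model.

```python
# Rough embedding-table footprint estimate (all numbers are illustrative assumptions).
rows_per_table = 100_000_000   # e.g., user or item IDs in a large catalog
emb_dim        = 128           # width of each embedding vector
bytes_per_elem = 4             # fp32
num_tables     = 8             # one table per categorical feature

total_gib = rows_per_table * emb_dim * bytes_per_elem * num_tables / 2**30
print(f"Embedding tables alone: ~{total_gib:.0f} GiB")   # ~381 GiB at these sizes
```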
Cost effectiveness
One of the primary challenges in supporting DLRMs is affordability, due to the high cost of processors and associated memory. As AI models demand more memory, the cost per gigabyte ($/GB) is increasing significantly with next-generation memory architectures, such as 3DS DIMMs. An alternative is to use storage such as SSDs to reduce the total cost of ownership (TCO), but this approach can significantly impact performance, as data must be loaded from the I/O subsystem into main memory.
Overcoming the Challenges with CXL®
Compute Express Link® (CXL) Type 3 devices provide an additional memory tier beyond HBM and CPU-attached DRAM that can be leveraged for recommender systems. This enables the memory expansion that is critical for DLRMs processing huge datasets. As DLRM sizes grow to improve accuracy, CXL memory helps reduce inferencing time by providing a faster tier than flash, and at a much lower cost than adding costly 3DS DIMMs or additional processors simply to attach more memory modules built with monolithic DDR5 dies.
In terms of performance, CXL memory delivers the same latency as a NUMA hop and offers a tight latency profile that meets the requirements of inference engines. Using CXL-based memory for read and write operations when staging large datasets significantly reduces the processor execution time.
CXL requires hardware support from both the CPU and attached devices. The latest AMD 5th Gen EPYC processors, coupled with Leo CXL Smart Memory Controllers, provide an optimal combination of CXL 2.0-based memory expansion to address the memory capacity and bandwidth challenges faced by DLRM systems. Leo expands memory using 96GB DDR5-5600 MT/s DIMMs, effectively doubling the system memory, which enhances both the speed and accuracy of AI models. AMD and Astera Labs have worked closely together to ensure Leo CXL Smart Memory Controllers seamlessly interoperate with AMD’s EPYC 9005 platform and are optimized for AI inferencing to accelerate recommendation systems.
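In practice, CXL-attached memory typically appears to Linux as a CPU-less NUMA node. The sketch below shows one simple way to steer an inference job across local and CXL memory; the node IDs and script name are placeholder assumptions, so check `numactl --hardware` on the actual system before using them.

```python
# Hedged sketch: interleave allocations across the local DRAM node and the
# CXL-attached node while launching an inference job. Node IDs (0 = local DRAM,
# 1 = CXL memory) and the script name are placeholders for illustration.
import subprocess

LOCAL_NODE = "0"
CXL_NODE = "1"

subprocess.run(
    ["numactl", f"--interleave={LOCAL_NODE},{CXL_NODE}",
     "python", "run_dlrm_inference.py"],
    check=True,
)
```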
Performance Results
The test ran DLRM traces with the Zen Deep Neural Network (ZenDNN) library developed by AMD on a system equipped with an AMD 5th Gen EPYC processor and Leo CXL Smart Memory Controllers. ZenDNN is a deep neural network inference acceleration library comprised of a set of fundamental building blocks and APIs that enhance AI inferencing applications.
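For context, the sketch below shows how a PyTorch model is commonly routed through ZenDNN via AMD’s zentorch plugin; the package name and backend string are assumptions based on AMD’s public ZenDNN documentation and may differ from the exact setup used in this test.

```python
# Illustrative only: compiling a PyTorch model with the ZenDNN backend through the
# zentorch plugin (package and backend names assumed; verify against the installed
# ZenDNN release).
import torch
import zentorch  # registers the "zentorch" backend with torch.compile

model = torch.nn.Linear(128, 1).eval()           # stand-in for the DLRM model
compiled = torch.compile(model, backend="zentorch")

with torch.no_grad():
    out = compiled(torch.rand(32, 128))
```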
Here are the key components of the test system:
| Component | Description | QTY |
| --- | --- | --- |
| System | AS-4125GS-TNRT, 4U Supermicro system, supporting up to 8 DW AICs | 1 |
| CPU | AMD 5th Gen EPYC Processor | 1 |
| Memory | 64GB DDR5-5600 memory modules | 12 |
| CXL Memory Controller | Leo A1000 CXL Smart Memory Add-in Card | 4 |
| CXL-attached Memory | 64GB DDR5-5600 memory modules | 8 |
With this hardware, two configurations were tested to optimize performance: batch threading and table threading. In batch threading, each core accesses multiple tables, with the work parallelized across batches. In contrast, table threading pins each table to a specific core. These hardware and threading configurations helped demonstrate the benefits of CXL for the DLRM workload; a simplified sketch of the two strategies follows.
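To illustrate the distinction in simplified form, the sketch below parallelizes embedding lookups both ways using a plain thread pool; unlike the actual benchmark, it does not pin threads to physical cores.

```python
# Simplified illustration of the two parallelization strategies (no core pinning).
from concurrent.futures import ThreadPoolExecutor

def lookup(table, indices):
    # Stand-in for an embedding-table lookup.
    return [table[i] for i in indices]

def batch_threading(tables, batches, workers=8):
    # Each worker takes one batch and walks over every table.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda batch: [lookup(t, batch) for t in tables], batches))

def table_threading(tables, indices, workers=8):
    # Each worker is dedicated to a single table for the whole batch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda t: lookup(t, indices), tables))

tables = [list(range(1000)) for _ in range(4)]
print(batch_threading(tables, batches=[[1, 2, 3], [4, 5, 6]])[0][0])  # [1, 2, 3]
print(table_threading(tables, indices=[7, 8, 9])[0])                  # [7, 8, 9]
```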
The results show:
- Up to 73% boost in AI inferencing performance with CXL
- Up to 66% more memory bandwidth
- Plug-and-play support for CXL interleaving
CXL enables system designers to deliver high-performance, high-capacity solutions with low TCO and RAS features that are critical for hyperscale and enterprise deployments. Leo CXL Memory Controllers, in conjunction with the AMD EPYC Processor and DDR5 memory, have undergone intensive feature validation, thorough interoperability testing, and rigorous application-level testing, making them ready for cloud-scale deployment. Furthermore, developing a CXL-memory-aware dynamic page allocation policy could yield additional improvements in the efficacy of using CXL for DL models.
Conclusion
Leo CXL Memory Controllers increased memory capacity and bandwidth by up to 66% to supercharge DLRMs, thereby boosting AI inferencing performance by up to 73%. In the AI market, where data-intensive applications demand faster processing and larger memory pools, Astera Labs and AMD have developed a compelling solution to enhance DLRM efficiency. This improvement enables higher-quality recommendations for advertisements, product placements, social media updates, and beyond.