Large server deployments for Artificial Intelligence (AI) and general-purpose computing in hyperscale data centers provide enormous benefits in terms of raw compute power, efficiency, and cost amortization. The on-demand nature and low up-front cost of cloud computing are attractive to an increasing number of enterprises. However, managing such a large fleet of systems presents complex challenges in observability, data collection, and fault isolation.
Astera Labs’ Intelligent Connectivity Platform, which includes the COnnectivity System Management and Optimization Software (COSMOS) suite, addresses those challenges by providing link management, fleet management, and reliability/availability/serviceability (RAS) features across the entire product portfolio.
Leveraging a software-defined architecture, the Intelligent Connectivity Platform consists of semiconductor-based high-speed connectivity integrated circuits (ICs) and the COSMOS suite operating in on-chip microcontrollers and system baseboard/system management controllers (BMCs/SMCs). Working in concert, these solutions provide an array of customizable diagnostics and telemetry features useful for managing and optimizing a large fleet of systems.
COSMOS: Extensive monitoring and diagnostics capabilities
The COnnectivity System Management and Optimization Software (COSMOS) suite comprises:
- Software components running in the system software stack (Platform APIs)
- Software (firmware) modules running on integrated microcontrollers in Astera Labs’ ICs
COSMOS provides three distinct capabilities:
- Link Management: COSMOS ensures secure and robust data link communications between devices in a computing system.
- Fleet Management: COSMOS allows real-time monitoring of and predictive maintenance for server fleets. Data from device-level software/firmware components is communicated to platform-level software to provide fleet-level insights and analytics.
- Reliability, Availability, Serviceability (RAS): COSMOS enables RAS capabilities by detecting, reporting, and testing the data, network, and memory links supported by Astera Labs’ products. Software components running on the ICs automatically detect and correct errors and simultaneously report telemetry data for performance and error monitoring.
Learn more: Request the white paper today
Advanced resource monitoring and fleet management have become critical in cloud environments that run AI workloads requiring parallel processing across thousands of servers. For these parallel workloads, overall efficiency may degrade to the level of the single server running at the lowest utilization, which has a significant impact on total cost of ownership (TCO) and return on investment (ROI) for the entire infrastructure.
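To illustrate why one underperforming server matters, here is a minimal sketch of the effect under a simplifying assumption: a fully synchronous parallel job in which every step waits for the slowest server. The function names and the fleet numbers are hypothetical and are not part of COSMOS or any Astera Labs API.

```python
def effective_utilization(per_server_utilization):
    """Effective utilization of a synchronous parallel job.

    Assumption: every server waits for the slowest one at each step,
    so the whole job runs at the minimum utilization in the fleet.
    """
    if not per_server_utilization:
        raise ValueError("need at least one server")
    return min(per_server_utilization)


def stranded_capacity(per_server_utilization):
    """Capacity (in server-equivalents) left idle while waiting on the slowest server."""
    floor = effective_utilization(per_server_utilization)
    return sum(u - floor for u in per_server_utilization)


# Hypothetical fleet: 999 healthy servers at 95% utilization, one degraded to 60%.
fleet = [0.95] * 999 + [0.60]

effective = effective_utilization(fleet)
stranded = stranded_capacity(fleet)

print(f"Effective job utilization: {effective:.0%}")
print(f"Stranded capacity: about {stranded:.0f} server-equivalents")
```

In this toy model, a single degraded server pulls the whole job down to its own utilization level, which is why identifying and isolating that server quickly has an outsized effect on TCO and ROI.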
Request the "Cloud Infrastructure Fleet Management Made Easy with COSMOS" white paper today to learn how COSMOS plays an important role in monitoring AI data center infrastructure at cloud scale.