Schneider Electric Warns That Existing Datacenters Aren’t Buff Enough For AI – Slashdot

[ad_1]

The infrastructure behind popular AI workloads is so demanding that Schneider Electric has suggested it may be time to reevaluate the way we build datacenters. The Register reports: In a recent white paper [PDF], the French multinational broke down several of the factors that make accommodating AI workloads so challenging and offered its guidance for how future datacenters could be optimized for them. The bad news is some of the recommendations may not make sense for existing facilities. The problem boils down to the fact that AI workloads often require low-latency, high-bandwidth networking to operate efficiently, which forces densification of racks, and ultimately puts pressure on existing datacenters’ power delivery and thermal management systems.

Today it’s not uncommon for GPUs to consume upwards of 700W and servers to exceed 10kW. Hundreds of these systems may be required to train a large language model in a reasonable timescale. According to Schneider, this is already at odds with what most datacenters can manage at 10-20kW per rack. This problem is exacerbated by the fact that training workloads benefit heavily from maximizing the number of systems per rack as it reduces network latency and costs associated with optics. In other words, spreading the systems out can reduce the load on each rack, but if doing so requires using slower optics, bottlenecks can be introduced that negatively affect cluster performance.

The situation isn’t nearly as dire for inferencing — the act of putting trained models to work generating text, images, or analyzing mountains of unstructured data — as fewer AI accelerators per task are required compared to training. Then how do you safely and reliably deliver adequate power to these dense 20-plus kilowatt racks and how do you efficiently reject the heat generated in the process? “These challenges are not insurmountable but operators should proceed with a full understanding of the requirements, not only with respect to IT, but to physical infrastructure, especially existing datacenter facilities,” the report’s authors write. The whitepaper highlights several changes to datacenter power, cooling, rack configuration, and software management that operators can implement to mitigate the demands of widespread AI adoption.

[ad_2]

Source link