Next-generation datacenters require highly efficient network load balancing to manage the growing scale of artificial intelligence (AI) training and general datacenter traffic. However, existing Ethernet-based solutions, such as Equal Cost Multi-Path (ECMP) and oblivious packet spraying (OPS), struggle to maintain high network utilization due to both increasing traffic demands and the expanding scale of datacenter topologies, which also exacerbate network failures. To address these limitations, we propose REPS, a lightweight decentralized per-packet adaptive load balancing algorithm designed to optimize network utilization while ensuring rapid recovery from link failures. REPS adapts to network conditions by caching good-performing paths. In case of a network failure, REPS re-routes traffic away from it in less than 100 microseconds. REPS is designed to be deployed with next-generation out-of-order transports, such as Ultra Ethernet, and uses less than 25 bytes of per-connection state regardless of the topology size. We extensively evaluate REPS in large-scale simulations and FPGA-based NICs.
REPS: Recycled Entropy Packet Spraying for Adaptive Load Balancing and Failure Mitigation / Bonato, Tommaso; Kabbani, Abdul; Ghalayini, Ahmad; Michael Papamichael, And; Dohadwala, Mohammad; Gianinazzi, Lukas; Khalilov, Mikhail; Achermann, Elias; De Sensi, Daniele; Hoefler, Torsten. - (2026). ( Eurosys Conference Edinburgh ).
REPS: Recycled Entropy Packet Spraying for Adaptive Load Balancing and Failure Mitigation
Daniele De Sensi;
2026
Abstract
Next-generation datacenters require highly efficient network load balancing to manage the growing scale of artificial intelligence (AI) training and general datacenter traffic. However, existing Ethernet-based solutions, such as Equal Cost Multi-Path (ECMP) and oblivious packet spraying (OPS), struggle to maintain high network utilization due to both increasing traffic demands and the expanding scale of datacenter topologies, which also exacerbate network failures. To address these limitations, we propose REPS, a lightweight decentralized per-packet adaptive load balancing algorithm designed to optimize network utilization while ensuring rapid recovery from link failures. REPS adapts to network conditions by caching good-performing paths. In case of a network failure, REPS re-routes traffic away from it in less than 100 microseconds. REPS is designed to be deployed with next-generation out-of-order transports, such as Ultra Ethernet, and uses less than 25 bytes of per-connection state regardless of the topology size. We extensively evaluate REPS in large-scale simulations and FPGA-based NICs.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.


