As high-performance computing (HPC) systems grow, optimizing communication locality becomes essential for performance. HPC networks are often oversubscribed, consisting of fully connected groups that are sparsely connected. We introduce Binomial Negabinary (Bine) trees, a novel approach to enhance collective operations by reducing inter-group communication. They minimize the distance between communicating ranks, reducing traffic on global links and alleviating congestion. Unlike traditional hierarchical algorithms, Bine trees are topology-agnostic and do not assume a uniform partition of ranks, making them ideal for production supercomputers with irregular process allocations. We design algorithms for eight collectives, achieving up to 5x speedups and 33% less global traffic on four supercomputers with four different topologies. Our results emphasize their effectiveness in improving performance while reducing the load on global links.
Bine Trees: Enhancing Collective Operations by Optimizing Communication Locality / De Sensi, Daniele; Pasqualoni, Saverio; Piarulli, Lorenzo; Bonato, Tommaso; Ba, Seydou; Turisini, Matteo; Domke, Jens; Hoefler, Torsten. - (2025). (Paper presented at the International Conference for High Performance Computing, Networking, Storage and Analysis (was Supercomputing Conference), held in St. Louis, USA).


