High-Performance Network Fabrics and Libraries

Supun Kamburugamuve
3 min read · May 2, 2024

Not long ago, high-performance network fabrics like InfiniBand, HPE Slingshot, or Intel Omni-Path (now developed by Cornelis Networks) were primarily found in supercomputers or high-performance computing (HPC) clusters aimed at scientific applications. These fabrics were crucial for achieving lower latencies and higher bandwidth, particularly for applications that need to scale across thousands of compute nodes.

With the rise of GPU computing and Artificial Intelligence (AI) driven by deep neural networks, high-performance network fabrics have become an essential part of modern clusters. As with any low-level network programming, programming these fabrics directly is a tedious, error-prone task that requires deep knowledge of each specific fabric. Unlike Ethernet and TCP, which offer the familiar socket programming interface, these networks expose different capabilities and protocols that make them even harder to program.

About ten years ago, the only way to access these networks without programming them directly through low-level libraries was to use a Message Passing Interface (MPI) implementation such as Open MPI or MPICH. MPI implementations come with assumptions and constraints, such as owning process management, that make them unsuitable for some applications. Today, there are libraries for programming these networks that give the same benefits as MPI but without some of its constraints. The following is a list of such libraries.

OpenUCX

UCX (Unified Communication X) is a communication library developed for high-performance computing (HPC) and data-centric applications. It offers a unified API for communication operations across the network fabrics it supports, such as InfiniBand, RoCE (RDMA over Converged Ethernet), and Ethernet.

UCX supports NVIDIA GPUs through CUDA and AMD GPUs through ROCm. This allows users to utilize hardware such as NVIDIA NVLink to transfer messages directly between GPUs.

Further, UCX supports shared-memory transports. This means a single UCX API can transfer data between processes over shared memory within a compute node, directly between GPUs when the data resides in GPU memory, and over InfiniBand when the transfer spans separate compute nodes.

UCX supports different messaging semantics, such as stream-oriented send/receive, tag-matched send/receive, remote memory access, and remote atomic operations.
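To give a flavor of the API, below is a minimal sketch of a tag-matched send using UCX's UCP layer. It assumes the worker and endpoint have already been created and connected (peers exchange worker addresses out of band); the wait_for helper and the tag value are illustrative, and error handling (e.g., checking UCS_PTR_IS_ERR on the returned request) is omitted.

```cpp
#include <ucp/api/ucp.h>
#include <cstring>

// Illustrative helper: block until a UCX request completes on the worker.
static void wait_for(ucp_worker_h worker, void *request) {
    if (request == nullptr) return;                  // completed immediately
    while (ucp_request_check_status(request) == UCS_INPROGRESS)
        ucp_worker_progress(worker);                 // drive communication
    ucp_request_free(request);
}

// Send a buffer with tag 42 over an already-connected endpoint.
void send_tagged(ucp_worker_h worker, ucp_ep_h ep,
                 const void *buf, size_t len) {
    ucp_request_param_t param;
    std::memset(&param, 0, sizeof(param));           // default parameters
    void *request = ucp_tag_send_nbx(ep, buf, len, /*tag=*/42, &param);
    wait_for(worker, request);
}
```

The receiving side would post a matching ucp_tag_recv_nbx on its worker with the same tag.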

All these features make UCX an attractive library for high-performance application developers.

Libfabric

Libfabric is similar to UCX in its APIs and capabilities. It supports more high-performance network fabrics than UCX. Some of the supported fabrics and transports are InfiniBand and RoCE (through the verbs provider), AWS Elastic Fabric Adapter (EFA), HPE Slingshot (CXI), Intel Omni-Path, TCP, and shared memory.
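As a small illustration, a Libfabric program typically starts by asking fi_getinfo which providers are available on the machine. The sketch below simply enumerates them; compile it against libfabric (e.g., with -lfabric).

```cpp
#include <rdma/fabric.h>
#include <cstdio>

// List the fabric providers (verbs, tcp, shm, efa, ...) that libfabric
// finds on this machine. fi_getinfo() returns a linked list of fi_info
// entries describing each provider/fabric combination.
int main() {
    struct fi_info *info = nullptr;
    int ret = fi_getinfo(FI_VERSION(1, 9), nullptr, nullptr, 0,
                         nullptr, &info);
    if (ret != 0) {
        std::fprintf(stderr, "fi_getinfo failed: %d\n", ret);
        return 1;
    }
    for (struct fi_info *cur = info; cur != nullptr; cur = cur->next)
        std::printf("provider: %s, fabric: %s\n",
                    cur->fabric_attr->prov_name, cur->fabric_attr->name);
    fi_freeinfo(info);
    return 0;
}
```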

NCCL

NVIDIA Collective Communications Library (NCCL, pronounced “Nickel”) is designed specifically for machine learning on GPUs. It provides the distributed communication operations used by ML applications, implemented on top of high-performance network fabrics and NVIDIA GPUs.

The operations it supports are the collectives found in MPI implementations, such as AllReduce, AllGather, and Broadcast, which are essential for ML applications.
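Below is a condensed sketch, modeled on the single-process example in NCCL's documentation, that sums a float buffer in place across all GPUs visible to the process. The buffer size is arbitrary and error checking is omitted.

```cpp
#include <nccl.h>
#include <cuda_runtime.h>
#include <numeric>
#include <vector>

// In-place AllReduce (sum) across every GPU in a single process.
int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    // One communicator per device; devs = {0, 1, ..., ndev-1}.
    std::vector<int> devs(ndev);
    std::iota(devs.begin(), devs.end(), 0);
    std::vector<ncclComm_t> comms(ndev);
    ncclCommInitAll(comms.data(), ndev, devs.data());

    const size_t count = 1 << 20;                    // 1M floats per GPU
    std::vector<float*> buf(ndev);
    std::vector<cudaStream_t> streams(ndev);
    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaMemset(buf[i], 0, count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // Group the per-device calls so NCCL can launch them together.
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);           // wait for the reduction
        cudaFree(buf[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```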

RCCL

RCCL (ROCm Communication Collectives Library) is a library developed for AMD GPUs. RCCL is similar to NCCL and supports the collective communication operations vital to ML applications. Its API mirrors NCCL's, so code written against NCCL can usually be ported to RCCL with little change.

Facebook Gloo

GitHub — https://github.com/facebookincubator/gloo

Gloo is similar to NCCL in its operations but more generic, since it is not tied to NVIDIA GPUs. It supports TCP, InfiniBand, and RoCE. Gloo can use NCCL underneath for some of its operations when the data resides on GPUs.
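The sketch below is modeled on the example in Gloo's repository: peers rendezvous through a shared filesystem path and then run a ring allreduce over TCP. The network interface name, store path, and the rank/size arguments are assumptions for illustration.

```cpp
#include <memory>
#include <vector>

#include "gloo/allreduce_ring.h"
#include "gloo/rendezvous/context.h"
#include "gloo/rendezvous/file_store.h"
#include "gloo/transport/tcp/device.h"

// Ring allreduce over TCP; rank and size come from the launcher.
void run(int rank, int size) {
    gloo::transport::tcp::attr attr;
    attr.iface = "eth0";                             // assumed NIC name
    auto dev = gloo::transport::tcp::CreateDevice(attr);

    // Peers discover each other through a shared filesystem path.
    auto store = gloo::rendezvous::FileStore("/tmp/gloo");
    auto context = std::make_shared<gloo::rendezvous::Context>(rank, size);
    context->connectFullMesh(store, dev);

    std::vector<float> data(1024, 1.0f);
    std::vector<float*> ptrs = {data.data()};
    gloo::AllreduceRing<float> allreduce(context, ptrs, data.size());
    allreduce.run();                                 // data now holds the sum
}
```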

OpenUCX UCC

Unified Collective Communication (UCC) is a library built by the OpenUCX community. It provides collective operations similar to those of MPI. It is a newer library compared to the others discussed in this article.

It supports collective operations built using libraries such as UCX, SHARP (NVIDIA's in-network hardware collectives), CUDA, NCCL, and RCCL.

Summary

If you are building applications that need point-to-point communication, UCX and Libfabric are the most versatile and generic libraries available. UCX has native support for NVIDIA fabrics (InfiniBand) and GPUs, while Libfabric supports a wider range of fabrics.

If you are building ML applications and need collective operations, NCCL, RCCL, UCC, or Gloo can be your choice. If you are targeting only NVIDIA GPUs, NCCL is the popular approach, and RCCL is its counterpart for AMD GPUs. If you are looking for a GPU-agnostic solution, UCC can be a good choice.
