InfiniBand Is Not Magic. Here's What It Actually Does.
The Network You've Never Needed. Until Now.
If you've spent your career building web services, InfiniBand sounds like something from a different universe. You know TCP/IP. You know Ethernet. You probably know enough about BGP to have opinions. InfiniBand? That's the thing HPC people use. Not your problem.
Except now it is. AI and HPC-style workloads are everywhere: distributed model training, large-scale simulation, financial risk calculations. These workloads need latency measured in microseconds, not milliseconds, and standard Ethernet can't deliver that.
I've worked with high-speed HPC fabrics across three organisations — InfiniBand partition automation at GreenQloud, Intel Omni-Path Virtual Fabric multi-tenancy at Advania's HPCFLOW, and InfiniBand driver support as a product feature at Canonical's MAAS. Omni-Path is InfiniBand-adjacent but not InfiniBand, and that distinction ended up mattering more than I expected. This is the explanation I'd give a competent network engineer on their first day managing an HPC cluster.
The Numbers That Matter
InfiniBand is a network interconnect optimised for throughput and low latency between a relatively small number of endpoints. It's not a replacement for Ethernet. Different problem.
| Metric | 100 GbE Ethernet | HDR InfiniBand (200 Gb/s) | NDR InfiniBand (400 Gb/s) |
|--------|------------------|---------------------------|---------------------------|
| Bandwidth | 100 Gb/s | 200 Gb/s | 400 Gb/s |
| Latency | 10–50 µs | 0.6–1.5 µs | 0.5–1.0 µs |
| CPU overhead | High (kernel stack) | Near-zero (RDMA) | Near-zero (RDMA) |
That 10-50x latency difference matters when your distributed training job synchronises gradients across 256 GPUs every few hundred milliseconds. Each synchronisation round-trip adds to total training time. Over thousands of iterations, the gap between Ethernet and InfiniBand translates to hours or days of wall clock time.
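A back-of-envelope calculation makes the scale concrete. All the numbers below are illustrative assumptions, not measurements: a ring allreduce across 256 GPUs takes 2×(N−1) latency-bound steps, and a long training run might involve 500,000 iterations.

```python
# Back-of-envelope: cumulative latency cost of gradient synchronisation.
# All figures are illustrative assumptions, not measurements.

iterations = 500_000       # assumed training steps for a long run
rounds_per_sync = 510      # ring allreduce, 256 GPUs: 2 * (256 - 1) steps

def total_sync_latency(per_round_latency_s: float) -> float:
    """Total time spent purely waiting on per-round network latency."""
    return iterations * rounds_per_sync * per_round_latency_s

ethernet_hours = total_sync_latency(30e-6) / 3600    # ~30 µs per round
infiniband_hours = total_sync_latency(1e-6) / 3600   # ~1 µs per round

print(f"Ethernet latency overhead:   {ethernet_hours:.2f} h")
print(f"InfiniBand latency overhead: {infiniband_hours:.2f} h")
```

Under these assumptions, latency alone costs a couple of hours on Ethernet versus minutes on InfiniBand; bandwidth and CPU overhead widen the real-world gap further.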
But latency isn't even the most important number in that table. It's CPU overhead.
RDMA: Why InfiniBand Exists
RDMA — Remote Direct Memory Access — is the actual breakthrough. Not the bandwidth. Not the latency. The kernel bypass.
In a normal TCP/IP flow, sending data involves six steps: application writes to socket buffer, kernel copies to network stack, NIC DMAs from kernel memory, remote NIC receives, remote kernel copies to socket buffer, remote application reads. Two kernel copies. Two context switches. CPU involvement on both sides.
With RDMA: application registers a memory region with the HCA (Host Channel Adapter — InfiniBand's equivalent of a NIC). The HCA reads directly from application memory, writes directly to remote application memory. Done.
No kernel involvement. No data copies. No context switches. The HCA handles everything in hardware. This is why InfiniBand latency approaches a microsecond: the entire kernel network stack is bypassed on the data path.
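At the API level this is the verbs interface (libibverbs). The sketch below is pseudocode, not runnable code; real verbs programs also need queue-pair state transitions and an out-of-band address exchange, which are omitted here. The `ibv_*` names are real libibverbs entry points:

```
# One-sided RDMA write, verbs-style pseudocode
ctx = ibv_open_device(hca)
pd  = ibv_alloc_pd(ctx)
mr  = ibv_reg_mr(pd, buf, len, LOCAL_WRITE | REMOTE_WRITE)  # pin + register memory
qp  = create_and_connect_qp(pd)          # QP setup + address exchange (omitted)
ibv_post_send(qp, RDMA_WRITE,
              local=mr, remote=(remote_addr, rkey))  # HCA moves the bytes
poll_completion(qp)                      # no kernel call on the data path
```

Note that the kernel is involved exactly once, at `ibv_reg_mr` time, to pin the pages; every send after that is a doorbell write straight to the HCA.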
For MPI workloads, this means MPI_Send and MPI_Recv become near-zero-overhead memory transfers. A distributed simulation exchanging boundary data between hundreds of nodes every timestep benefits enormously. The CPU cycles that would go to network processing go to actual computation instead.
RDMA also works over Ethernet (RoCE — RDMA over Converged Ethernet), but the latency is higher and you need lossless Ethernet configuration (PFC/ECN) because RDMA's loss-sensitive nature conflicts with Ethernet's lossy design. That adds operational complexity. Use RoCE if you can't justify InfiniBand's cost. Use InfiniBand if latency matters.
The Subnet Manager: InfiniBand's Centralised Brain
Here's where InfiniBand feels alien if you're from the Ethernet world. InfiniBand networks need a centralised Subnet Manager (SM) to configure routes, assign addresses (LIDs), and manage fabric topology.
This is fundamentally different from Ethernet, where switches operate independently with spanning tree or ECMP. In InfiniBand, the SM discovers all devices, computes optimal routing, and pushes routing tables to every switch and HCA. It's a single point of coordination.
If the SM goes down, traffic continues on existing routes, but no new connections can be established and topology changes go unhandled. Always run a standby SM on a second device.
OpenSM is the reference implementation. NVIDIA ships their own (UFM) with better routing algorithms for large fabrics and a management UI. For small fabrics under 50 nodes, OpenSM is fine. Beyond that, the routing quality matters and UFM earns its licence fee.
Multi-Tenant InfiniBand: Harder Than It Sounds
InfiniBand partitions (P_Keys) are the multi-tenancy mechanism — conceptually similar to VLANs but implemented at the HCA level. Each partition is a 16-bit key restricting which endpoints can communicate.
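The membership rule fits in a few lines. The top bit of the P_Key distinguishes full from limited membership: two ports share a partition when the low 15 bits match, and at least one side must be a full member (two limited members cannot talk to each other). A minimal sketch:

```python
FULL_MEMBER_BIT = 0x8000  # top bit of the 16-bit P_Key

def same_partition(pkey_a: int, pkey_b: int) -> bool:
    """True if the low 15 bits (the partition number) match."""
    return (pkey_a & 0x7FFF) == (pkey_b & 0x7FFF)

def can_communicate(pkey_a: int, pkey_b: int) -> bool:
    """Ports must share a partition, and at least one must be a full member."""
    if not same_partition(pkey_a, pkey_b):
        return False
    return bool((pkey_a | pkey_b) & FULL_MEMBER_BIT)

# 0xFFFF: default partition, full member; 0x7FFF: default partition, limited
print(can_communicate(0xFFFF, 0x7FFF))  # full + limited: True
print(can_communicate(0x7FFF, 0x7FFF))  # limited + limited: False
```

The limited-membership trick is how you let many tenants reach a shared storage target (a full member) without being able to reach each other.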
At GreenQloud in 2015, I built automated partition management that integrated with our CloudStack deployment. When a customer provisioned bare metal nodes, the system automatically configured InfiniBand switch partitions to isolate their traffic. Multi-tenant InfiniBand wasn't a common pattern, and the automation tooling didn't exist. We wrote it from scratch.
The subtlety most people miss: partition keys interact with QoS. InfiniBand supports up to 16 virtual lanes on a single physical link, providing traffic isolation. This matters when you run MPI traffic and storage traffic over the same fabric. The latency-sensitive MPI traffic shouldn't be blocked by bulk data transfers. Getting the QoS policy right is the difference between a fabric that works and one that works until it doesn't.
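On an OpenSM-managed fabric, both partition membership and the service level (which the SL-to-VL mapping turns into a virtual lane) are declared in partitions.conf. A sketch of the idea (the GUIDs, names, and SL numbers here are made up; check the opensm man page for the exact syntax your version accepts):

```
# partitions.conf sketch (illustrative values only)
Default=0x7fff, ipoib, defmember=full : ALL;

# Tenant compute partition, low-latency SL for MPI traffic
tenant_a=0x8001, ipoib, sl=0 : 0x0002c9030045f180=full, 0x0002c9030045f181=full;

# Storage partition on a separate SL so bulk I/O rides a different virtual lane
storage=0x8002, sl=4 : ALL=limited, 0x0002c9030045f1f0=full;
```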
Intel Omni-Path: A Cautionary Tale in Vendor Risk
When I was building HPCFLOW at Advania (2018-2021), we used Intel Omni-Path Architecture (OPA) — Intel's competitor to InfiniBand. The technology was solid: 100 Gb/s per port, good latency, and Intel's fabric management was simpler for our use case.
The most technically challenging work was building Omni-Path Virtual Fabric multi-tenancy support. Virtual Fabric virtualises the fabric into isolated segments, similar to InfiniBand partitions but with Intel's own management model. I integrated this into OpenStack Neutron's ML2 plugin, so that when a tenant provisioned bare metal nodes through Ironic, the Omni-Path Virtual Fabrics were automatically configured.
Then Intel exited the HPC networking business, selling Omni Path to Cornelis Networks.
The lesson is straightforward. If you're choosing an HPC fabric today, NVIDIA InfiniBand is the safe bet. They dominate the market. They own the GPU interconnect story — NVLink for intra-node, InfiniBand for inter-node. They're investing in higher speeds: NDR at 400 Gb/s, XDR at 800 Gb/s. Cornelis/Omni-Path still works for existing deployments, but the ecosystem momentum is with NVIDIA.
Vendor lock-in in HPC networking is real and expensive. A fabric replacement is a forklift upgrade. Choose accordingly.
When You Actually Need InfiniBand
Use InfiniBand when:
MPI at scale. Distributed simulations, CFD, molecular dynamics. If your workload calls MPI_Allreduce thousands of times per second across dozens of nodes, latency dominates runtime. InfiniBand is effectively the only option.
Distributed AI training. Multi-node GPU training with PyTorch or TensorFlow. Gradient synchronisation is an all-reduce operation — exactly the pattern InfiniBand excels at.
Parallel file systems. Lustre, GPFS, and BeeGFS use RDMA for high-bandwidth storage access. A cluster with InfiniBand for compute and Ethernet for storage will bottleneck on the storage network.
Ethernet is fine when:
Single-node GPU workloads. If training fits on one node, GPUs communicate over NVLink. The network is irrelevant.
Embarrassingly parallel jobs. Parameter sweeps, independent simulations, batch inference. No inter-node communication means no latency sensitivity.
Small clusters. Under ~16 nodes, the latency difference often doesn't justify the cost and operational complexity.
Kubernetes services. Web APIs, microservices, databases. Not latency-sensitive at the microsecond level.
What Production Taught Me
Cable management matters more than you think. InfiniBand cables are thick, stiff, and expensive. A 2-metre QSFP56 cable costs $50-200 depending on type (DAC vs active optical). Plan your rack layout around cable lengths. I've seen teams buy 100 active optical cables at $200 each because their switch was at the wrong end of the row. That's $20,000 in avoidable cost.
A single bad cable degrades the entire fabric. InfiniBand switches export counters for port errors, congestion, and link flaps. Feed these into Prometheus. One cable causing retransmissions affects every job on the fabric, not just the nodes connected to that cable. Finding it without monitoring is a needle-in-a-haystack exercise.
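A minimal scrape of those counters can read straight from sysfs. Paths follow the standard Linux RDMA layout (`/sys/class/infiniband/<hca>/ports/<port>/counters/`), but the exact counter names vary by driver and firmware, so treat the list below as an example rather than a canonical set:

```python
from pathlib import Path

# Error counters worth alerting on; names vary slightly by driver.
WATCH = ["symbol_error", "link_downed", "port_rcv_errors", "port_xmit_discards"]

def read_port_counters(root: str = "/sys/class/infiniband") -> dict:
    """Return {(hca, port, counter): value} for every local HCA port."""
    out = {}
    for counter_file in Path(root).glob("*/ports/*/counters/*"):
        if counter_file.name in WATCH:
            hca = counter_file.parts[-5]   # .../<hca>/ports/<port>/counters/<name>
            port = counter_file.parts[-3]
            out[(hca, port, counter_file.name)] = int(counter_file.read_text())
    return out

if __name__ == "__main__":
    for key, value in sorted(read_port_counters().items()):
        print(*key, value)
```

Wrap something like this in a Prometheus exporter (or use an existing one) and alert on deltas, not absolutes — most of these counters only ever increase until reset.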
Test with real workloads, not benchmarks. ib_read_bw and ib_send_lat validate hardware, but they don't predict application performance. The improvement varies dramatically by workload communication pattern. I've seen workloads with 10x speedup and workloads with 1.2x speedup on the same fabric. Run your actual MPI codes and measure wall clock time.
Don't mix fabrics for latency-sensitive workloads. If MPI traffic goes over InfiniBand but your parallel file system goes over Ethernet, storage access becomes the bottleneck. Either put storage on InfiniBand too, or accept that I/O-heavy phases will be slow. Mixing fabrics is a compromise that looks sensible in the architecture diagram and hurts in production.
InfiniBand isn't as mysterious as the terminology makes it seem. It's a high-performance interconnect that matters for a specific class of workloads. The convergence of cloud-native and HPC infrastructure means more software engineers will encounter it. The learning curve is mostly vocabulary — the networking concepts underneath are the same ones you already know, just with less kernel and more hardware.