They Said Multi-Tenant InfiniBand Was Impossible. We Did It Anyway.
"Multi-tenant HPC? That's impossible. You can't isolate InfiniBand."
I heard some version of this on almost every sales call when I was building HPCFLOW at Advania. Vendors said it. Academics said it. Experienced HPC engineers — people I respected — said it. The orthodoxy was settled: HPC clusters are trusted environments, shared by known users, managed by a single organisation. Multi-tenancy meant VMs and performance penalties. Real HPC didn't work that way.
I disagreed. And eventually we proved it, in production, at scale.
Why the Orthodoxy Made Sense
Traditional HPC clusters are not designed with adversarial tenants in mind. Users typically have SSH access to compute nodes. Jobs run with minimal isolation — the assumption is that everyone on the cluster is on the same team. The network fabric, whether InfiniBand or Omni-Path, is a flat shared resource: any node can talk to any other node, and the hardware is tuned for raw throughput rather than access control.
This model works when your hundred users are researchers at the same institution. It completely falls apart when those hundred users represent competing pharmaceutical companies, or when a financial services client demands the same isolation guarantees they'd get in a dedicated data centre.
The core problems were specific: network fabric isolation, RDMA memory protection, GPU partitioning, and doing all of it without touching the performance envelope. Any one of those would have been a difficult engineering problem. All four together, simultaneously, is why everyone said it was impossible.
The Fabric Problem
InfiniBand was designed for a single-tenant world. Queue pairs communicate freely across the fabric. There's no native concept of tenant isolation at the hardware level — you'd need software enforcement, which is both a performance hit and a trust boundary problem. Software can be misconfigured or bypassed. When the isolation is in the NIC itself, it isn't.
The breakthrough, when it came, arrived from an unexpected direction.
Intel's Omni-Path Architecture introduced Virtual Fabrics: hardware-enforced fabric partitioning, a feature whose commercial implications I don't think Intel fully understood when they shipped it. Each Virtual Fabric is a logically isolated network, tagged at the packet level by the hardware and invisible to other Virtual Fabrics. A node in Virtual Fabric A cannot communicate with a node in Virtual Fabric B. The isolation isn't enforced by a software rule or a firewall; it's enforced by the NIC on every packet.
I was the first to integrate this into a production multi-tenant environment, wiring it into OpenStack Neutron so that tenant provisioning automatically allocated a Virtual Fabric. When a customer job started, their nodes came up already in a hardware-isolated fabric. When the job ended, the fabric was released. From the customer's perspective it looked like a dedicated cluster. From the fabric's perspective, there was no cross-tenant communication possible, by construction.
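The allocation lifecycle described above can be sketched roughly as follows. This is an illustrative model, not the actual plugin code; the class and method names (`VirtualFabricPool`, `allocate`, `release`) are hypothetical, and a real implementation would persist state and drive the fabric manager rather than an in-memory set.

```python
# Hypothetical sketch of per-tenant Virtual Fabric allocation:
# a fabric ID is claimed when a tenant's first job starts and
# returned to the pool when their last job ends.

class FabricExhausted(Exception):
    """Raised when no Virtual Fabric IDs remain in the pool."""

class VirtualFabricPool:
    def __init__(self, vf_ids):
        self._free = set(vf_ids)   # unallocated fabric IDs
        self._by_tenant = {}       # tenant -> allocated vf_id

    def allocate(self, tenant_id):
        # Idempotent: a tenant with jobs in flight keeps its fabric.
        if tenant_id in self._by_tenant:
            return self._by_tenant[tenant_id]
        if not self._free:
            raise FabricExhausted("no Virtual Fabric IDs available")
        vf_id = self._free.pop()
        self._by_tenant[tenant_id] = vf_id
        return vf_id

    def release(self, tenant_id):
        # Called when the tenant's last job ends.
        vf_id = self._by_tenant.pop(tenant_id)
        self._free.add(vf_id)

pool = VirtualFabricPool(range(1, 17))
vf = pool.allocate("tenant-a")
assert pool.allocate("tenant-a") == vf   # same fabric while jobs run
pool.release("tenant-a")
```

The important property is the one the prose claims: a tenant's nodes come up already inside an allocated fabric, and releasing it is an explicit lifecycle event rather than a firewall rule that might be forgotten.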
The performance impact was essentially zero. Hardware doesn't lie about this the way software does. The isolation happens at the silicon level, not in a kernel path that adds microseconds.
RDMA and the Memory Protection Problem
Solving the fabric was necessary but not sufficient. RDMA — Remote Direct Memory Access — is the reason InfiniBand matters for HPC. MPI applications use RDMA to transfer data between nodes directly, bypassing the CPU and OS kernel on both sides. The latency numbers are real: around 1 microsecond for a small message, versus 20-30 microseconds with TCP. HPC performance depends on this.
The problem is that RDMA, by design, bypasses the normal memory protection boundaries that operating systems enforce. A process registers a memory region with the NIC, gets a protection key back, and passes that key to remote nodes. Those remote nodes can then read or write that memory region directly. If you have two tenants on the same fabric and one of them gets hold of the other's protection key, or if the system is misconfigured, the hardware will happily perform the RDMA transfer. The kernel never sees it.
The solution was hardware memory windows, combined with Virtual Fabric enforcement. Each tenant's RDMA regions are registered with a protection domain that is scoped to their Virtual Fabric. The NIC validates the Virtual Fabric tag on every RDMA operation against the protection domain. An RDMA request that crosses Virtual Fabric boundaries is rejected at the hardware level before it touches memory.
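The check the NIC performs can be modeled as a toy sketch. This is a conceptual illustration of the two-factor validation described above, not real verbs code; the names (`MemoryRegion`, `rdma_allowed`) are mine, and in hardware both checks happen in silicon before any memory access occurs.

```python
# Toy model of the NIC-level check: an RDMA operation carries the
# Virtual Fabric tag of its source, and a valid protection key alone
# is not enough -- the tag must also match the protection domain the
# target region was registered under.

from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryRegion:
    rkey: int    # protection key handed out to remote peers
    vf_id: int   # Virtual Fabric the protection domain is scoped to

def rdma_allowed(source_vf: int, region: MemoryRegion, rkey: int) -> bool:
    # A stolen key from another tenant fails the second check:
    # the attacker's packets carry their own fabric's tag.
    return rkey == region.rkey and source_vf == region.vf_id

region = MemoryRegion(rkey=0xBEEF, vf_id=7)
assert rdma_allowed(source_vf=7, region=region, rkey=0xBEEF)      # same fabric, valid key
assert not rdma_allowed(source_vf=3, region=region, rkey=0xBEEF)  # stolen key, wrong fabric
```

This is why a leaked protection key stops being a cross-tenant vulnerability: the key is only half of what the hardware validates.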
The end-to-end latency we measured in production was 1.002 microseconds per MPI message versus a single-tenant baseline of 1.000 microseconds — a 0.2% increase. That number held up under sustained load across multi-month production benchmarks. It wasn't theoretical. The isolation was real; the cost was negligible.
GPU Partitioning
GPU virtualisation has historically been the other performance disaster in multi-tenant HPC. Time-slicing GPU contexts adds 20-30 microseconds per context switch — fine for web serving, fatal for tight MPI loops that synchronise across thousands of cores.
NVIDIA's Multi-Instance GPU (MIG), available on A100 and later hardware, solved this with spatial rather than temporal partitioning. Each MIG instance gets dedicated streaming multiprocessors, dedicated memory bandwidth, and a dedicated slice of the L2 cache. There is no time-slicing. There is no context switching penalty. Two tenants on the same physical GPU are operating on physically distinct silicon.
The orchestration challenge was mapping MIG instances to jobs in a way that was efficient. MIG instances come in fixed sizes — a seventh, a third, a half of an A100, depending on configuration — and job GPU requirements don't always divide cleanly. We tracked available MIG capacity per node in SLURM's GRES (Generic Resource) accounting and built placement logic that tried to fill A100s efficiently before opening a new card. It was not a solved problem, and it's still not a solved problem in the general case. But for the workload profile we were serving, we got GPU utilisation consistently above 80%, which was better than most dedicated-cluster deployments were achieving.
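The placement logic can be sketched as a first-fit-decreasing packer over compute slices. This is a simplification under stated assumptions: an A100 exposes 7 compute slices, and real MIG geometry also constrains which slice combinations are valid on one card, which this sketch ignores. The function name and interface are illustrative, not the SLURM GRES code.

```python
# Illustrative first-fit-decreasing placement of MIG requests onto
# GPUs: fill existing cards before opening a new one. Requests are
# expressed in compute slices out of 7 per A100.

def place_jobs(requests, slices_per_gpu=7):
    """Map each job (name -> slices needed) to a GPU index.
    Returns (placement, free_slices_per_gpu)."""
    gpus = []        # remaining free slices on each opened card
    placement = {}
    # Largest requests first improves packing (first-fit decreasing).
    for job, need in sorted(requests.items(), key=lambda kv: -kv[1]):
        for i, free in enumerate(gpus):
            if free >= need:
                gpus[i] -= need
                placement[job] = i
                break
        else:
            # No existing card has room: open a new one.
            gpus.append(slices_per_gpu - need)
            placement[job] = len(gpus) - 1
    return placement, gpus

placement, gpus = place_jobs({"a": 4, "b": 3, "c": 2, "d": 1})
# Jobs a and b fill one card exactly; c and d share a second card.
```

A real scheduler has to do better than this, because MIG instances must align to valid hardware profiles and jobs arrive over time rather than in a batch, which is exactly why the text calls this an unsolved problem in the general case.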
The Architecture in Practice
The security model had four layers, each doing real work.
The hardware layer was the foundation. Virtual Fabric isolation on the fabric. MIG partitioning on GPUs. Intel TXT (Trusted Execution Technology) for measured boot, so we could attest that the firmware and kernel on each node hadn't been tampered with. None of this was software — if the hardware said a boundary existed, the boundary existed.
The kernel layer handled what hardware couldn't. SELinux mandatory access controls on the compute nodes. Seccomp filters on container runtimes to limit syscall surface. Linux namespaces for filesystem and process isolation. We ran bare metal, no hypervisor, because the hypervisor layer would have added 5-10% overhead and we'd already solved the isolation problem at the hardware level. A hypervisor on top would have been security theatre at that point.
The network layer added an encrypted overlay for traffic that needed to traverse infrastructure outside the tenant's Virtual Fabric — management traffic, storage traffic, anything that touched shared infrastructure. IPsec with per-tenant keys. The overhead here was measurable but acceptable: the data paths that mattered for performance (MPI, RDMA-based storage) ran inside the Virtual Fabric and weren't affected.
The audit layer logged everything. Every SLURM job submission, every API call, every node provisioning event. Not as an afterthought — the audit trail was part of the compliance story, and the compliance story was part of why financial services and healthcare customers were willing to pay for what we'd built.
What I Would Do Differently
The Virtual Fabric integration with OpenStack Neutron was the right architectural decision, but the implementation was painful in ways I didn't anticipate.
Neutron's network model is designed around Ethernet — VLANs, VXLANs, OVS bridges. Mapping that model to fabric virtualisation concepts required building translation layers that had no upstream precedent. We were reading Intel driver code and Neutron internals simultaneously, trying to figure out what the right abstraction even was. The plugin I ended up writing worked, but it was brittle in ways that only showed up during fabric topology changes. If I were doing it again, I'd push harder on a clean separation — Virtual Fabric lifecycle managed by a purpose-built service that Neutron called into, rather than a Neutron ML2 plugin trying to speak both languages at once.
The MIG utilisation problem also deserved more engineering investment than we gave it. We treated it as a scheduling concern and handed it to SLURM's GRES accounting. SLURM is not designed for heterogeneous sub-device scheduling. The result was that job placement worked, but bin-packing efficiency was lower than it should have been. A purpose-built MIG scheduler, or at least a smarter backfill algorithm that understood MIG geometry, would have meaningfully improved GPU utilisation.
The hardest part was none of the above. The hardest part was convincing the first customer that the isolation was real. We could show them the architecture. We could run penetration tests. We could walk them through the hardware attestation chain. The trust problem was not a technical problem — it was a credibility problem. We solved it by publishing the architecture, running an external security audit, and eventually offering a dedicated-cluster trial period alongside the shared environment so customers could compare the performance characteristics themselves.
That's the thing nobody told me going in. The engineering was difficult. The trust was harder.
What We Actually Achieved
Zero security breaches across three years of production operation. SOC 2 Type 2 certification. HIPAA and PCI DSS compliance. MPI latency overhead of 0.1-0.2% versus single-tenant baseline. GPU performance with MIG that matched dedicated allocation, because it was dedicated allocation — spatially rather than temporally.
Financial services customers ran risk models that they couldn't have run cost-effectively on dedicated hardware. Biotech customers ran simulations that would have required months on a university cluster or significant spend on AWS. The economics worked because the isolation was real enough that customers trusted shared infrastructure.
Multi-tenant HPC isn't impossible. It requires hardware that actually supports isolation — not software approximations of it — and an architecture that starts from the hardware features rather than bolting security on top. It requires accepting that some engineering problems don't have clean upstream answers, and that you'll be reading driver source code more than you'd like.
It also requires being willing to ignore the orthodoxy long enough to find out whether it's actually right.
It wasn't.