
From CloudStack Bare Metal to Containers: Thirteen Years of HPC Evolution

11 min read · HPC Infrastructure
Tags: hpc, kubernetes, bare-metal, infiniband, containers, cloud-native

Introduction

In 2011, the idea of running high-performance workloads in containers was laughable. "The overhead!" critics would say. "You'll never get InfiniBand working!" And they had a point — the tooling wasn't there, the abstractions were leaky, and anyone who tried paid for it in performance regressions and operational complexity.

I was at GreenQloud at the time, an Icelandic cloud startup built on Apache CloudStack. We were running bare metal compute nodes connected over InfiniBand, handling workloads that lived and died by microsecond-level latency. The idea of putting a container runtime between our MPI jobs and the hardware would have seemed like self-sabotage. You didn't add abstraction layers. You removed them.

Thirteen years later, containers are standard at Millennium Management and Kubernetes is part of the picture — but not in the way the industry assumed it would develop. The bare metal is still there. The InfiniBand is still there. And the SLURM control plane and compute nodes are still on dedicated hardware. What's changed is everything around them.

This is the story of how that changed — what drove it, what the hype got wrong, and what I'd do differently if I were starting fresh.

The Bare Metal Era and Why It Made Sense

The bare metal discipline in HPC wasn't stubbornness. It was physics.

InfiniBand switch latency sits around 100 nanoseconds. Ethernet is roughly 230 nanoseconds on good hardware. That gap matters enormously when you're running tightly coupled MPI jobs across 512 nodes and every message-passing round trip compounds. The overhead that virtualisation introduces isn't abstract — it's measurable, it's cumulative, and on the wrong workload it makes the cluster worthless.
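To make "cumulative" concrete, here is a back-of-envelope model of how per-hop switch latency adds up over a job's lifetime. The numbers are illustrative assumptions, not measurements: the per-hop figures from above, a hypothetical 3-hop fat-tree path, and a job that synchronises a million times.

```python
# Back-of-envelope: cumulative switch latency over many MPI rounds.
# All figures are illustrative assumptions, not benchmarks.

IB_HOP_NS = 100      # ~InfiniBand per-hop switch latency
ETH_HOP_NS = 230     # ~Ethernet per-hop latency on good hardware
HOPS = 3             # assumed switch hops between two nodes in a fat tree
ROUNDS = 1_000_000   # message-passing rounds over the job's lifetime

def added_latency_ms(per_hop_ns: float) -> float:
    """Total switch latency, in milliseconds, accumulated across all rounds."""
    return ROUNDS * HOPS * per_hop_ns / 1e6

ib = added_latency_ms(IB_HOP_NS)
eth = added_latency_ms(ETH_HOP_NS)
print(f"InfiniBand: {ib:.0f} ms, Ethernet: {eth:.0f} ms, gap: {eth - ib:.0f} ms")
```

Even this toy model shows hundreds of milliseconds of pure switch-hop difference, and real collectives across 512 nodes chain many such paths per round, so the real gap is larger still.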

At GreenQloud, I was extending Apache CloudStack with custom Java code to provision bare metal nodes, integrating it with our InfiniBand fabric. There were no Stack Overflow answers for this. The documentation assumed you were virtualising things. We weren't. The work was slow, painful, and frequently broke in ways that required understanding the CloudStack internals rather than just the configuration surface.

The utilisation problem was real even then. A cluster sitting at 40-50% utilisation is expensive waste — hardware that cost significant capital budget doing nothing for half its operational life. But the answer wasn't to add abstraction. It was to schedule better, allocate smarter, and accept that some idle capacity was the price of performance guarantees.

That logic held until the workload mix started changing.

Why VMs Never Worked in HPC

The enterprise world virtualised everything between 2008 and 2015. HPC centres tried to follow, and it was almost universally a bad decision.

The problem is structural. Virtual machine hypervisors introduce indirection at exactly the layers that HPC workloads are most sensitive to: CPU scheduling, memory access patterns, and especially networking. Virtualised networking costs you microseconds on every packet. When your MPI job is passing messages thousands of times per second across hundreds of nodes, that overhead compounds catastrophically. Early tests on AWS cluster compute instances showed 30-40% performance loss versus equivalent bare metal — and that was for workloads that weren't particularly latency-sensitive.

NUMA topology made it worse. Physical HPC nodes are architected carefully: memory is local to specific processor sockets, and a well-tuned MPI job knows which memory it's touching and where. A hypervisor obscures that topology. The job can't see the hardware it's actually running on, so it can't optimise for it.

Most HPC centres drew the correct conclusion: virtualisation was the wrong tool. The seeds of what came next were planted elsewhere.

Containers Changed the Right Things

Docker emerged around 2013, and my initial reaction was scepticism. Another abstraction layer. More overhead. We'd been through this.

I was wrong about the overhead. Containers are not virtualisation — they share the host kernel, which means the networking, CPU scheduling, and memory access patterns are essentially identical to running directly on bare metal. The containerisation overhead for CPU-bound workloads is negligible. For network-bound MPI jobs, it depends heavily on how you handle the network layer, but it's solvable in ways that VM networking fundamentally isn't.

What containers actually changed was dependency management and reproducibility. HPC clusters had always been nightmares to keep consistent. A researcher needs CUDA 11.2, OpenMPI 4.0, and a specific version of HDF5 compiled with particular flags. Another researcher needs a different version of OpenMPI because their code has a subtle dependency on an older API. Keeping these environments from destroying each other on shared bare metal required either module systems with careful version management or separate clusters — both painful in different ways.

Containers solved this cleanly. Package the entire software stack, run it in isolation, get reproducible results. The environment the researcher tested on their workstation is the same environment that runs on the cluster. This sounds obvious now. In 2015, it felt like a genuine breakthrough.
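The packaging model looks something like the following sketch — image tags, package versions, and the `solver` binary are illustrative assumptions matching the hypothetical stack above, not a vetted production recipe:

```dockerfile
# Illustrative only: pin the exact CUDA / MPI / HDF5 stack the code was
# tested against, so the cluster runs the same environment as the workstation.
FROM nvidia/cuda:11.2.2-devel-ubuntu20.04

# Build OpenMPI at a pinned version so every run sees the same MPI ABI.
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential wget libhdf5-dev && \
    wget -q https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.7.tar.gz && \
    tar xzf openmpi-4.0.7.tar.gz && cd openmpi-4.0.7 && \
    ./configure --prefix=/opt/openmpi && make -j"$(nproc)" && make install

ENV PATH=/opt/openmpi/bin:$PATH
COPY ./solver /opt/solver   # the researcher's own code, built against this stack
```

A second researcher who needs a different OpenMPI just builds a different image; the two stacks never touch each other on the shared hardware.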

The Kubernetes Decision in 2016

In 2016, when I was at Advania building HPCFLOW — the platform I designed and built from scratch as the sole engineer initially — I started looking seriously at Kubernetes, then barely past v1.0, for orchestration. Kubernetes for HPC was not a mainstream idea. It was designed for stateless web services: long-running pods, rolling deployments, load balancers in front of HTTP APIs.

HPC workloads are different in almost every way. Jobs are batch, not long-running. They're tightly coupled across many nodes. They need hardware locality — you can't have half your MPI job on one rack and half on another and expect good performance. They have specialised hardware requirements: GPUs, InfiniBand, high-memory nodes.

I ran it anyway. The scheduling model was wrong for our workloads, but the control plane — the ability to describe desired state declaratively, have it reconciled automatically, and observe what was happening through a single API — was genuinely valuable. We ended up running development and batch processing workloads on Kubernetes while keeping the latency-sensitive MPI jobs on SLURM-managed bare metal. Two schedulers. One cluster. It was messier than I'd like to admit, but it worked.

That two-tier approach turned out to be the right intuition, even if the implementation was rough.

Where Kubernetes Actually Fits

By the time I joined Millennium Management, the Kubernetes ecosystem had matured considerably for HPC-adjacent workloads: device plugins for GPU allocation, SR-IOV for near-native network performance, topology-aware scheduling. The tooling to run SLURM on Kubernetes — or to replace SLURM's control plane with Kubernetes — existed if you wanted it.
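What that maturity looks like in practice is roughly the following pod sketch. The network attachment name and the SR-IOV resource name are site-specific assumptions (they depend on how your device plugins are configured); only the general shape is the point:

```yaml
# Sketch of the device-plugin / Multus model. "sriov-ib" and the
# example.com/sriov_ib resource name are hypothetical, site-specific values.
apiVersion: v1
kind: Pod
metadata:
  name: mpi-worker
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-ib   # Multus: attach an SR-IOV VF
spec:
  containers:
  - name: worker
    image: registry.example.com/mpi-app:latest
    resources:
      limits:
        nvidia.com/gpu: "1"          # GPU via the NVIDIA device plugin
        example.com/sriov_ib: "1"    # VF via the SR-IOV device plugin
```

The pod asks for a GPU and a virtual function on the InfiniBand fabric the same way it asks for CPU and memory, and the scheduler places it on a node that can satisfy both.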

I don't use it that way, and I'm sceptical that the added complexity is worth it for most production environments.

The production SLURM stack at Millennium runs on dedicated bare metal. The control plane and compute nodes aren't Kubernetes-managed. Part of that is performance — dedicated clusters eliminate an entire category of variable when something behaves unexpectedly. But the bigger reason is operational clarity. SLURM's internal state model is already complex: fair-share accounting, backfill scheduling, topology-aware placement, job preemption, partition policies. When a job fails across hundreds of nodes at 2am, you want one system to reason through, not two. Kubernetes as an additional layer in that diagnostic path adds cognitive overhead that doesn't pay for itself.

Where Kubernetes earns its place is the surrounding infrastructure. Test environments, spinning up new cluster configurations quickly before they touch production, CI/CD pipelines, monitoring and tooling that needs to run independently of the SLURM scheduler. For those workloads it's genuinely excellent: declarative, fast to deploy, and I don't have to hand-manage lifecycle. But that's a different argument from "Kubernetes should manage your production SLURM cluster."

The distinction matters. Kubernetes in an HPC environment doesn't need to own the compute nodes to provide value. It's a good platform for the tooling layer and the test layer. SLURM manages what SLURM is designed to manage. That separation — rather than a layered hybrid where one orchestrates the other — is where I've landed after trying the alternatives.

The Cultural Shift That Actually Mattered

The biggest change wasn't technical. I didn't expect that.

In the bare metal era, researchers described their requirements in hardware terms: "I need 64 nodes with InfiniBand for two weeks." This is how they had to think, because the resource model was physically fixed. You got nodes, you ran your job, you released the nodes.

The shift to containers introduced a different abstraction: "I need 1000 cores and 10 terabytes of memory." The researcher stops caring which physical nodes they get, which rack they're in, whether the hardware is being reused from a finished job. They describe compute requirements, and the platform figures out placement.
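The shift is visible in how a job request is written. As a sketch, using standard SLURM batch directives with the figures from this section (the exact flags a site uses will vary):

```shell
# Old habit: ask for physical nodes and hold them.
#SBATCH --nodes=64
#SBATCH --time=14-00:00:00      # two weeks

# Resource-shaped request: describe what the job needs, let the
# scheduler decide placement.
#SBATCH --ntasks=1000
#SBATCH --mem-per-cpu=10G       # ~10 TB total across 1000 tasks
```

The second form gives the scheduler freedom the first form forbids: it can place tasks wherever capacity exists rather than fencing off named hardware.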

This sounds like a small thing. It isn't. It unlocks a completely different utilisation model. When jobs describe abstract resource requirements rather than physical node allocations, the scheduler can pack them more densely, backfill more aggressively, and reuse capacity that would otherwise sit idle between allocations. That improvement is SLURM getting better at scheduling and users getting better at expressing what they actually need — not a Kubernetes contribution. The container model for reproducible environments removed a different bottleneck: researchers stopped waiting for HPC engineers to install software stacks.

The second cultural shift was self-service. Under bare metal SLURM, setting up a new researcher's environment meant an HPC engineer installing software, configuring modules, sometimes compiling libraries from source. Under containers, the researcher builds their own image, or uses a base image from a registry, and the platform runs it. The engineering team stops being a bottleneck.

This is where the productivity numbers that get cited in container-era HPC papers come from. The 3x developer productivity figure is real in my experience, but it's not because containers made computation faster. It's because researchers stopped waiting for infrastructure engineers to set things up.

What I Got Wrong

I oversold Kubernetes for HPC in 2016. I believed the scheduling model would mature faster than it did, and I underestimated how long it would take for the ecosystem to solve the networking problem properly.

SLURM is still the right scheduler for tightly coupled MPI workloads. The job preemption model, the fair-share accounting, the backfill scheduling — these are tuned for exactly the kind of batch HPC workloads that Kubernetes still struggles with. When I was pushing Kubernetes as a unified control plane for everything, I was wrong. It works brilliantly for some things and poorly for others, and the mistake was not being precise about which.

I also underestimated how hard the networking problem would be. SR-IOV for near-native RDMA performance over InfiniBand took years to become operationally mature in Kubernetes. The Multus and SRIOV Device Plugin ecosystem got there eventually, but anyone who tried to run serious MPI workloads over Kubernetes networking in 2017 or 2018 paid for it. I'd seen the potential and extrapolated too fast.

The Kubernetes hype cycle for HPC peaked around 2019-2021 with a lot of vendor claims about running TOP500-class workloads on Kubernetes natively. Most of those claims were true under narrow conditions — single-node GPU training, embarrassingly parallel batch processing — and didn't generalise to the tightly coupled MPI workloads that define HPC. The correction has been healthy. Most HPC teams now use Kubernetes for what it's genuinely good at and SLURM for what SLURM is genuinely good at, rather than trying to force one model to cover everything.

The hybrid approach isn't a compromise. It's the right architecture. I should have said that more clearly in 2016 instead of implying Kubernetes would replace SLURM.

What Actually Holds Up

The container model for dependency management and reproducibility. Completely vindicated. I have not met a serious HPC team in the last five years that doesn't use containers at some level, even if they're using Singularity rather than Docker because of the rootless execution model that HPC multi-tenancy requires.

The resource abstraction — describing compute in terms of cores, memory, and accelerators rather than physical nodes — made clusters genuinely more usable. The utilisation improvements are real.

Kubernetes for the surrounding infrastructure — test environments, tooling, CI/CD, spinning up new configurations before they touch production — genuinely useful in those roles. The mistake is treating that utility as a reason to run your SLURM control plane or compute nodes on Kubernetes. A dedicated SLURM cluster has operational clarity that only becomes apparent when you're debugging a large job under pressure. One system to reason about is worth something.

And bare metal is not going anywhere. The physics haven't changed. InfiniBand latency is still 100 nanoseconds. The workloads that need it still need it. Anyone who told you in 2020 that cloud instances would replace bare metal for latency-sensitive HPC was selling something.

Thirteen years in, my position is simpler than it used to be: containers solved the right problem, SLURM runs production, Kubernetes is useful at the edges. That's less architecturally satisfying than a clean layered model, but it's accurate — and accuracy usually wins over elegance in infrastructure that has to run under pressure.