
Cloud Engineers Can't Do HPC. HPC Engineers Can't Do Cloud. AI Needs Both.


Two Worlds That Pretend the Other Doesn't Exist

There are two kinds of infrastructure engineers, and they don't talk to each other.

Cloud-native engineers think in Kubernetes pods, Terraform modules, and CI/CD pipelines. They deploy stateless microservices, autoscale horizontally, and treat servers as disposable cattle. If a node dies, the scheduler moves the workload. The physical hardware is someone else's problem.

HPC engineers think in SLURM partitions, MPI ranks, and InfiniBand fabrics. They deploy tightly coupled simulations across hundreds of bare metal nodes with hand-tuned network topologies. If a node dies mid-job, the entire 500-node simulation restarts from the last checkpoint. Hardware is not someone else's problem — it is the problem. The difference between DDR4 and DDR5 memory bandwidth matters. The BIOS power management settings on each individual node matter.

I've lived in both worlds for over a decade. I built OpenStack-based HPC platforms with Kubernetes at v1.0 in 2016, when nobody in HPC was running it. I was Product Manager at Canonical where Ubuntu held 14% of HPC cluster deployments. Now I'm building a SLURM platform at Millennium Management with Go APIs, React frontends, and GitOps-managed monitoring — cloud-native tooling wrapped around bare metal infrastructure.

The gap between these communities is real. And AI is forcing them to converge whether they like it or not.

AI Broke the Wall

For twenty years, HPC and cloud existed in parallel universes. HPC lived in national labs and research universities. Cloud lived in startups and enterprise IT. The workloads were different, the users were different, the budgets were different. Nobody needed to cross the boundary.

Then large-scale AI training happened.

Training a frontier model requires exactly what HPC has been building for decades: bare metal GPU nodes, high-bandwidth low-latency interconnects, parallel filesystems, sophisticated job scheduling, topology-aware workload placement. But the teams building AI products come from the cloud-native world. They know PyTorch and Kubernetes. They don't know SLURM and InfiniBand.

The result is a skills gap costing real money. I've seen cloud teams try to run distributed training on Kubernetes and wonder why their GPU utilisation is 30%. The answer is usually that they're running collective operations over TCP/IP Ethernet instead of RDMA over InfiniBand, their scheduler doesn't understand GPU topology, and their storage layer can't feed data fast enough because they're using NFS instead of Lustre.
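A first diagnostic step is often to make NCCL log which transport it actually selected for its collectives. The knobs below are real NCCL environment variables, but the interface and HCA names are illustrative assumptions; check yours with `ibstat` and `ip link`:

```python
import os

# Make NCCL log the transport it picks for each ring (look for "NET/IB"
# vs "NET/Socket" in the output) -- if you see Socket, you're on TCP.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# Illustrative values: point NCCL at the InfiniBand HCA and interface
# on this hypothetical node, rather than letting it fall back to Ethernet.
os.environ.setdefault("NCCL_IB_HCA", "mlx5")
os.environ.setdefault("NCCL_SOCKET_IFNAME", "ib0")
```

These must be set before the process initialises NCCL, which is why launchers typically inject them into the job environment rather than the training script.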

I've also seen HPC teams try to build self-service platforms and deliver bash scripts behind a wiki page. No API, no monitoring dashboards, no infrastructure-as-code, no CI/CD. They manage clusters with SSH and Ansible playbooks run from someone's laptop. It works until the team grows past five people or the cluster grows past fifty nodes.

Both communities have real expertise. Both have blind spots.

What Cloud Engineers Get Wrong About Bare Metal

If you come from the cloud-native world, here's what you're probably missing.

Bare metal is not "cloud without the abstraction." When I was building HPCFLOW on OpenStack Ironic, I spent half the first year on firmware: BIOS misconfigurations, IPMI edge cases, NIC firmware version mismatches, drives that report different capabilities depending on which PCIe slot they're in. None of this exists in the cloud because AWS already solved it for you. On bare metal, it's your problem.

The kernel matters. The CPU governor setting matters. NUMA topology matters. When your workload spans 64 cores across two sockets, whether you pin processes to the right NUMA domain determines whether you get 90% efficiency or 60%. That's not a marginal difference. That's a third of your compute wasted.
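On Linux, pinning is a one-liner once you know the topology. A minimal sketch using `sched_setaffinity`; the core set for NUMA node 0 is an assumption here, and in practice you would read the real topology from `lscpu` or `numactl --hardware`:

```python
import os

def pin_to_numa_node(cpus):
    """Pin the calling process to the given CPU set (Linux only).
    Derive `cpus` from the machine's real NUMA topology
    (`lscpu`, `numactl --hardware`) rather than hard-coding it."""
    os.sched_setaffinity(0, cpus)        # 0 = the calling process
    return os.sched_getaffinity(0)

# Illustrative usage, assuming NUMA node 0 owns cores 0-31 on this box:
# pin_to_numa_node(set(range(32)))
```

MPI launchers and SLURM's `--cpu-bind` options do the same thing declaratively; the point is that someone has to do it, or the OS will happily schedule your ranks across sockets.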

MPI workloads break cloud assumptions. Cloud architectures assume loose coupling. Microservices communicate over HTTP. If a request takes 5ms instead of 1ms, nobody notices.

MPI is tightly coupled. Hundreds of processes synchronise state every few milliseconds. An all-reduce operation across 256 GPUs waits for the slowest participant. If one node has a marginal InfiniBand link adding 50 microseconds of latency, every GPU in the cluster pays that penalty on every synchronisation step. Over a training run that takes days, those microseconds compound into hours.
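The compounding is easy to see with back-of-the-envelope numbers; the step count and all-reduce frequency below are illustrative assumptions, not measurements:

```python
extra_latency = 50e-6      # marginal IB link penalty per all-reduce, seconds
steps = 1_000_000          # optimiser steps in the run (illustrative)
syncs_per_step = 100       # gradient-bucket all-reduces per step (illustrative)

# Every GPU in the cluster pays the penalty on every synchronisation.
wasted = extra_latency * steps * syncs_per_step   # seconds lost to one bad link
print(f"{wasted / 3600:.1f} hours")               # → 1.4 hours
```

One marginal cable, over a hundred million synchronisations, and the whole cluster idles for over an hour waiting on it.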

Cloud engineers hear "microseconds" and think it doesn't matter. In MPI workloads, it's everything.

Job scheduling is not container orchestration. Kubernetes schedules pods that run for seconds to hours, mostly independently. SLURM schedules jobs that run for hours to days, with complex inter-dependencies, fair-share scheduling across research groups, preemption with checkpoint/restart, backfill that packs small jobs into gaps left by large reservations, and topology-aware placement ensuring MPI ranks are on nodes connected through the fewest switch hops.

Kubernetes is getting better at batch workloads — Kueue, Volcano. But these are bolted onto a system designed for services. SLURM was built for batch from the ground up with two decades of production hardening behind it.

The network is the computer. In cloud, you get 25 Gbps Ethernet and it's fine. In HPC, the interconnect is the single most important infrastructure decision you'll make. InfiniBand delivers 400 Gb/s at sub-microsecond latency with RDMA — bypassing the kernel network stack entirely. The difference between distributed training over Ethernet and InfiniBand can be 3-5x in total training time.

I've deployed InfiniBand fabrics at three organisations and driven InfiniBand support into Canonical's MAAS product. It's a different world from TCP/IP, with its own management model (Subnet Manager), its own isolation mechanism (partition keys), and its own failure modes. Cloud engineers who've never touched it will struggle.

What HPC Engineers Get Wrong About Operations

The traffic goes both ways. HPC infrastructure management is stuck in the 2010s at many organisations.

Infrastructure as code is not optional. I've worked with HPC teams that manage cluster configuration through wiki pages, tribal knowledge, and Ansible playbooks on one person's laptop. When that person goes on holiday, nobody can make changes.

At Millennium, every piece of infrastructure is defined in code, reviewed in pull requests, and deployed through automation. Not because we're "doing cloud things." Because it's the right way to manage any complex system.

APIs beat CLIs for tooling. The SLURM ecosystem has been built on CLI parsing for decades. squeue -o "%i %j %u %T", parse stdout, hope the output format hasn't changed between versions. Every monitoring tool, every dashboard, every automation script starts with subprocess.run(["squeue", ...]).
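The brittle pattern looks something like this; a sketch whose field order simply mirrors the format string above:

```python
FIELDS = ("id", "name", "user", "state")

def parse_squeue(stdout):
    """Parse `squeue -h -o "%i %j %u %T"` output (-h drops the header).
    Fragile by design: a changed format string or an extra column
    silently produces wrong dictionaries, not an error."""
    return [dict(zip(FIELDS, line.split(None, 3)))
            for line in stdout.splitlines() if line.strip()]

# Typically fed from:
# subprocess.run(["squeue", "-h", "-o", "%i %j %u %T"], ...).stdout
```

Every tool built this way carries its own private copy of this parser, each one coupled to a format string and a SLURM version.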

SLURM's REST API (slurmrestd) changes this. I've built four tools on it — a Go SDK, a terminal UI, a Prometheus exporter, and an Airflow provider. None parse CLI output. None need SLURM binaries installed.

The REST API is the bridge between HPC and cloud-native tooling. It turns SLURM from a system you manage with SSH into a service you integrate with programmatically. HPC engineers who aren't using it yet are missing the biggest modernisation opportunity in the SLURM ecosystem.
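For comparison, a minimal client sketch against slurmrestd using only the standard library. The host, port, and API version are assumptions to adapt to your deployment, and the shape of some response fields varies across API versions:

```python
import json
import urllib.request
from collections import Counter

SLURMRESTD = "http://slurm-head:6820"   # hypothetical slurmrestd address
API_VER = "v0.0.40"                     # pin to your slurmrestd build

def get_jobs(user, token):
    """Fetch the job list as structured JSON: no stdout parsing,
    no SLURM binaries on the client."""
    req = urllib.request.Request(
        f"{SLURMRESTD}/slurm/{API_VER}/jobs",
        headers={"X-SLURM-USER-NAME": user, "X-SLURM-USER-TOKEN": token},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["jobs"]

def state_counts(jobs):
    """Summarise job states from the structured response."""
    counts = Counter()
    for job in jobs:
        state = job["job_state"]
        # job_state is a plain string in older API versions, a list in newer
        if isinstance(state, list):
            state = state[0]
        counts[state] += 1
    return counts
```

Authentication goes through JWT tokens in headers rather than SSH keys, which is exactly what lets dashboards, exporters, and pipelines integrate with the cluster as a service.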

Observability is more than Nagios. Many HPC clusters are still monitored with Nagios checks and email alerts. Or worse — someone SSHs in and runs squeue manually to see if things look okay.

The slurm-exporter I built exposes 50+ Prometheus metrics from SLURM — job states, node utilisation, partition health, queue wait times. Grafana dashboards make cluster status visible to everyone, not just the two people with SSH access to the head node. This is table stakes in the cloud-native world. HPC needs to catch up.
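The exporter's actual metric names aren't reproduced here, but the output shape is just the Prometheus text exposition format. A sketch with a hypothetical metric name:

```python
def to_prometheus(counts, metric="slurm_jobs_total"):
    """Render job-state counts in the Prometheus text exposition format.
    The metric name is illustrative, not the exporter's real naming."""
    lines = [f"# TYPE {metric} gauge"]
    for state, n in sorted(counts.items()):
        lines.append(f'{metric}{{state="{state.lower()}"}} {n}')
    return "\n".join(lines)
```

Anything that can emit this format over HTTP is scrapeable by Prometheus, which is why the bar for getting SLURM data into Grafana is lower than most HPC teams assume.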

The Career Opportunity at the Intersection

Here's the part nobody talks about: the intersection of cloud-native and HPC skills is one of the best career positions in infrastructure engineering right now.

The supply of people who understand both Kubernetes and SLURM, both Terraform and InfiniBand, both Go APIs and MPI, is vanishingly small. I can count on two hands the number of engineers I've met who can design a GitOps pipeline for SLURM cluster configuration and also debug an InfiniBand fabric performance issue.

The demand is enormous and growing. Every major cloud provider is building HPC offerings — AWS ParallelCluster, Azure CycleCloud, GCP HPC Toolkit. Every AI lab needs infrastructure engineers who understand distributed training at the hardware level. Every financial institution running quantitative workloads needs the same hybrid expertise.

When I was at Canonical, the HPC customers who struggled most were the ones trying to hire. They needed people who could set up SLURM and also build CI/CD pipelines. People who could tune InfiniBand and also write Terraform. Those people didn't exist in the job market because the two career paths had been completely separate for twenty years.

That's changing. And if you're in either camp, investing in the other side's skillset is one of the highest-leverage career moves you can make.

What I'd Do Differently

Looking back at my own path across the boundary, I wish I'd learned Kubernetes earlier and SLURM's REST API earlier. I was an early adopter of Kubernetes in 2016, but I treated it as separate from my HPC work for years. The moment I started applying cloud-native patterns to HPC infrastructure — GitOps, structured APIs, Prometheus observability — the quality of the systems I built jumped dramatically.

I also wish I'd been louder about the convergence sooner. I was building at the intersection for years before I started writing about it. The SLURM ecosystem needs modern tooling — client libraries, exporters, UIs, automation frameworks. Every tool anyone builds in this space moves the needle because the space is so underserved.

The Convergence Is Inevitable

The line between HPC and cloud infrastructure is disappearing. Not because one side won, but because the workloads demanded it. AI training needs bare metal performance with cloud-native operations. Scientific computing needs reproducible environments with HPC scheduling. Financial computing needs InfiniBand-class networking with GitOps-managed configuration.

The engineers who bridge both worlds will build the next generation of infrastructure. The ones who stay in their lane will find that lane getting narrower every year.

I've spent a decade crossing this boundary. It's the single best investment I've made in my career.