Four HPC Platforms from Scratch. Here's What Actually Matters.
Four Blank Canvases
I've built four HPC platforms from nothing. Not "inherited a working system and added features" — staring at empty racks, writing the first line of automation, making every foundational decision that everything else would build on.
GreenQloud. HPCFLOW. Canonical's HPC portfolio. Millennium's quantitative computing platform.
Each one taught me something the previous one couldn't. And each time, I made mistakes I swore I'd never repeat — then repeated them anyway, in more sophisticated ways.
Platform 1: GreenQloud — Where Bare Metal Bit Back
2013-2016. Reykjavik, Iceland. Apache CloudStack + bare metal + InfiniBand.
GreenQloud was an Icelandic cloud company running Apache CloudStack. My job: turn it into something that could provision bare metal HPC clusters for enterprise customers — automotive CFD simulations, research workloads, the kind of jobs that laugh at virtual machines.
The first lesson hit immediately. Bare metal provisioning is nothing like VM provisioning. With VMs, you have an API, a hypervisor, and a clean abstraction layer. With bare metal, you have IPMI that hangs, firmware that lies about its version, and InfiniBand fabrics that need careful subnet manager configuration before a single packet flows.
I wrote bare metal provisioning for CloudStack in Java, including automated Ethernet and InfiniBand switch configuration for multi-tenant environments. Multi-tenant InfiniBand sounds straightforward until you realise the fabric has its own ideas about partition keys and QoS levels.
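The partition-key work lives in the subnet manager's configuration. With opensm, tenant isolation comes down to a partitions.conf along these lines — the PKeys and port GUIDs here are invented for illustration, and exact flags vary by opensm version:

```
# Default partition: every port gets limited membership for fabric management
Default=0x7fff, ipoib : ALL=limited, SELF=full;

# One partition per tenant: only that tenant's HCA port GUIDs, full membership
tenant_a=0x0002, ipoib : 0x0002c90300a1b2c1=full, 0x0002c90300a1b2c2=full;
```

A port absent from a tenant's member list can't exchange traffic on that PKey — which is the property that makes the fabric multi-tenant at all, and the thing the automation has to get right on every provisioning run.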
What I got right: Automating the network fabric from day one. Manual switch configuration doesn't scale past three customers.
What I got wrong: Underestimating firmware. Half the bugs in that first year were BIOS misconfigurations, IPMI edge cases, and driver version mismatches. I thought I was building a cloud platform. I was actually building a hardware compatibility database with a provisioning system bolted on.
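That "hardware compatibility database" isn't a joke — it ends up as a gate in the provisioning path. A hypothetical sketch of such a gate (model names and firmware versions here are invented for illustration):

```python
# Hypothetical pre-provisioning gate: refuse to image a node whose firmware
# doesn't match a validated combination. Model and version strings invented.

KNOWN_GOOD = {
    # model -> {component: the exact version we have validated}
    "SM-X9DRT": {"bios": "3.2", "bmc": "1.60", "mlx4_fw": "2.42.5000"},
}

def provisioning_blockers(model: str, inventory: dict) -> list[str]:
    """Return a list of mismatches; an empty list means safe to provision."""
    expected = KNOWN_GOOD.get(model)
    if expected is None:
        return [f"model {model!r} has no validated firmware profile"]
    return [
        f"{component}: have {inventory.get(component)!r}, validated {version!r}"
        for component, version in expected.items()
        if inventory.get(component) != version
    ]

# A node reporting a BMC one version behind the validated one gets flagged
blockers = provisioning_blockers(
    "SM-X9DRT", {"bios": "3.2", "bmc": "1.58", "mlx4_fw": "2.42.5000"}
)
```

The point is the shape, not the table: mismatches become explicit, loggable reasons to halt, instead of mystery failures three hours into a customer's CFD run.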
GreenQloud was later acquired by NetApp. Building something that gets bought teaches you what scales and what doesn't — what enterprise buyers actually care about versus what engineers think they care about.
Platform 2: HPCFLOW — Sole Founder, Infinite Scope
2016-2021. Iceland to multi-region. OpenStack + Ironic + SLURM + Ceph.
HPCFLOW was the real 0-to-1 story. I was the sole founding engineer and architect, building a complete IaaS platform for HPC and AI workloads from an empty Git repository.
The technology stack was ambitious. OpenStack for the control plane. Ironic for bare metal provisioning. SLURM for workload scheduling. Ceph for distributed storage. I adopted Kubernetes at v1.0 in 2016 for internal services — genuinely early, before most teams were considering it for production.
The hardest technical problem was multi-tenant networking on Intel Omni-Path fabrics. I built vFabric multi-tenancy support, implementing fabric virtualisation integrated with OpenStack Neutron's port allocation model. I also wrote two custom switch implementations for OpenStack's networking-generic-switch ML2 plugin — one for HPE FlexFabric and one for Cumulus Linux. These integrations didn't exist because the Venn diagram of "understands OpenStack Neutron internals" and "understands Omni-Path fabric management" is approximately me and three other people on the planet.
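For readers who haven't seen networking-generic-switch: its device drivers are essentially classes of CLI command templates, filled in per VLAN and port. The base class below is a simplified stand-in, not the real plugin code, and the Cumulus-style commands are illustrative — exact syntax varies by NOS version:

```python
# Simplified stand-in for the pattern OpenStack's networking-generic-switch
# ML2 driver uses: each switch model is a class of CLI command templates,
# rendered per network/port. Not the real plugin; illustrative only.

class GenericSwitchDevice:
    ADD_NETWORK: tuple = ()
    PLUG_PORT_TO_NETWORK: tuple = ()

    def _render(self, templates, **kwargs):
        # Fill each command template with the concrete VLAN/port values
        return [t.format(**kwargs) for t in templates]

    def add_network(self, segmentation_id):
        return self._render(self.ADD_NETWORK, segmentation_id=segmentation_id)

    def plug_port(self, port, segmentation_id):
        return self._render(self.PLUG_PORT_TO_NETWORK,
                            port=port, segmentation_id=segmentation_id)

class CumulusStyleSwitch(GenericSwitchDevice):
    # Roughly what VLAN-aware-bridge configuration looks like in Cumulus
    # Linux's NCLU-era CLI; shown for the pattern, not as verbatim commands.
    ADD_NETWORK = (
        "net add bridge bridge vids {segmentation_id}",
    )
    PLUG_PORT_TO_NETWORK = (
        "net add interface {port} bridge access {segmentation_id}",
        "net commit",
    )

switch = CumulusStyleSwitch()
commands = switch.plug_port(port="swp7", segmentation_id=314)
```

Writing a new switch backend is mostly a matter of supplying the right command tuples — the hard part, as always, is knowing what the fabric actually does when those commands land.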
The platform scaled to multi-regional HPC-as-a-Service. We ran Community Ceph and Red Hat Ceph across multi-year operational cycles. We collaborated with Stanford University's Living Heart Project, providing infrastructure for cardiac simulations that resulted in published research.
What I got right: The storage architecture. Ceph was the right call. Investing early in understanding its failure modes — specifically the interaction between PG count, replication factor, and recovery I/O — saved us from data loss events that kill platforms. Also: building the automation layer before building the platform on top of it. Every manual step you skip "because it's faster right now" compounds.
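The PG-count interaction mentioned above has a classic rule of thumb: aim for roughly 100 placement groups per OSD after replication, rounded to a power of two. A sketch of one common version of that calculation — modern clusters can delegate this to the pg_autoscaler instead:

```python
# Classic Ceph placement-group sizing rule of thumb: ~100 PGs per OSD
# after replication, rounded up to a power of two. One common variant of
# the old pgcalc guidance; the pg_autoscaler automates this today.

def target_pg_num(osds: int, replicas: int, pgs_per_osd: int = 100) -> int:
    raw = osds * pgs_per_osd / replicas
    # Round up to the next power of two, which Ceph prefers for pg_num
    power = 1
    while power < raw:
        power *= 2
    return power

# 40 OSDs at 3x replication: raw target ~1333, rounded up to 2048
pg = target_pg_num(osds=40, replicas=3)
```

Get this badly wrong in either direction and you pay at recovery time: too few PGs and data distribution skews, too many and peering and recovery I/O hammer every OSD at once — exactly the failure mode worth understanding before a disk dies, not after.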
What I got wrong: Scope management as a solo founder. When you're the only engineer, everything is your problem. Networking, storage, compute, monitoring, customer onboarding, documentation — the list never shrinks. I should have pushed for a second engineer earlier. The product was good, but I was the single point of failure for too long.
The bus factor was one. I knew it. I lived with it longer than I should have.
Platform 3: Canonical — The Product Lens
2022-2023. Remote/global. Ubuntu HPC + MAAS.
The odd one out. I wasn't building infrastructure — I was Product Manager of HPC at Canonical, reporting to leadership. I also took the interim PM role for MAAS (Metal as a Service), because my HPC experience made me the obvious person to define features like diskless booting, InfiniBand support, and advanced cluster delivery.
The platform I "built" here was a product portfolio: an extended Ubuntu package ecosystem, automated SLURM-based cluster deployment, and reference architectures co-published with leading OEMs. The most impactful work was the NVIDIA partnership — pushing to improve InfiniBand and GPU driver availability on Ubuntu. Business and technical work intertwined.
What I got right: Bringing engineering context into product decisions. Most PMs in infrastructure can't argue about InfiniBand subnet manager configuration or the tradeoffs between diskless and disk-based HPC node provisioning. I could talk to customers and engineers in the same language. That made the reference architectures genuinely useful instead of marketing material.
What I got wrong: Underestimating how slow large organisations move. At HPCFLOW, I shipped features in a week. At Canonical, getting a five-year roadmap approved took months. Both speeds are valid for different contexts. The adjustment was rough.
The lesson I'd give every engineer: Spend time in a product role. Not because you'll become a PM, but because it changes how you evaluate engineering decisions. You start asking "who is this for?" before "what's the cleanest architecture?" That shift makes you a better builder.
Platform 4: Millennium — SLURM for Quant Finance
2023-present. London. SLURM + Go APIs + React + hybrid cloud.
Lead HPC Engineer at Millennium Management — founding engineer role — building a complete HPC platform from scratch for quantitative trading infrastructure. The users are trading teams, portfolio managers, quants, and data scientists operating at global scale.
This is the most technically ambitious of the four. SLURM v25.05.5 with MUNGE authentication and Enroot containerisation. A full-stack self-service interface in Go (APIs, CLIs) and React (web frontend). A custom AWS EC2 plugin with multi-role support for hybrid cloud bursting. An Ansible-based automation framework with custom APIs for provisioning, dynamic inventory, and service discovery.
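To make the self-service layer concrete: somewhere behind an API like this sits code that turns a validated request into a batch script. A hypothetical sketch — it assumes the pyxis plugin, which exposes Enroot containers through srun's --container-image flag, and the partition, image, and job names are invented:

```python
# Hypothetical sketch of what a self-service job API might render: a SLURM
# batch script from validated request fields. Assumes the pyxis plugin
# (srun --container-image runs the step inside an Enroot container).
# Partition, account and image names are invented for illustration.

def render_batch_script(job_name: str, partition: str, gpus: int,
                        image: str, command: str) -> str:
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --gpus={gpus}",
        f"srun --container-image={image} {command}",
    ]
    return "\n".join(lines) + "\n"

script = render_batch_script(
    job_name="backtest-eu", partition="quant", gpus=2,
    image="nvcr.io/nvidia/pytorch:24.05-py3",
    command="python run_backtest.py",
)
```

In a real system the interesting work is everything around this function — input validation, quota checks, audit logging — which is precisely where the compliance requirements below show up.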
The unique constraint is the environment. Quantitative finance adds compliance, security, and operational requirements that don't exist in research HPC. Every architecture decision gets reviewed by trading technology, compliance, and operations. Change management isn't optional. Operational runbooks aren't nice-to-have — they're mandatory documentation for every platform component.
What I got right: Building the self-service layer from day one. In finance, the researchers and quants who use the platform are not going to SSH into head nodes and write SLURM scripts. They need APIs and web interfaces. The Go API layer and React frontend weren't afterthoughts — they were core platform components from the first sprint.
What I got wrong: Too early to say. Ask me in two years. If I had to guess, it'll be something in the hybrid cloud layer — the interaction between on-premises SLURM partitions and elastic cloud capacity reveals edge cases slowly, over months of real usage.
The Three Patterns That Never Change
After four platforms and a decade of this work, certain patterns are undeniable.
Automation is the platform. Every time, the most valuable code wasn't the compute layer or the scheduler integration — it was the automation. Provisioning scripts, configuration management, deployment pipelines. If you can't rebuild the platform from automation alone, you don't have a platform. You have a collection of manually configured servers that will diverge until they fail.
Storage is always harder than compute. Compute nodes are relatively stateless. One dies? Reimage it, add it back. Storage is state, and state is the hardest problem in distributed systems. Every build, I've spent more time on storage architecture than planned. Ceph, Lustre, NFS, object storage — the technology changes but the lesson doesn't. Get storage right first.
Networking kills you quietly. The most insidious production issues I've debugged were networking problems. Not outages — those are obvious. Subtle performance degradation from misconfigured MTU sizes. InfiniBand partitions that silently prevent MPI communication between specific node pairs. VLAN misconfigurations that only manifest under certain traffic patterns. Network problems make you question your sanity before you find the root cause.
The Three Mistakes That Never Stop
Building too much before shipping. Every single time, I've spent too long building before putting the platform in front of users. The confidence from running tests is not the confidence from running production workloads. I know this. I still do it.
Underestimating operational burden. Building a platform takes months. Operating it takes years. Every feature you add is a feature you maintain, monitor, debug, and document. The operational runbooks I wrote at Millennium aren't bureaucratic overhead — they're lessons from HPCFLOW encoded into process. When you're the one carrying the pager, you write better documentation.
Hiring too late. At HPCFLOW, I was the sole engineer longer than I should have been. At every subsequent build, I've tried to bring in a second engineer earlier. The transition from solo builder to team lead is one of the hardest parts of the founding engineer role. Delaying it makes it harder, not easier.
What Actually Matters vs. What Doesn't
After a decade, strong opinions.
The most important technology decision in any platform build is the one hardest to change later. Choose your storage architecture carefully. Choose your network fabric carefully. Choose your automation framework carefully. Everything else can be swapped.
The specific scheduler version, the specific Linux distribution, the specific container runtime — these are interchangeable. The automation and operational layers around them determine whether your platform succeeds. I've seen teams agonise for months over SLURM vs PBS Pro while their provisioning is still done by hand. That's optimising the wrong thing.
The Real Lesson
Start with the users, not the architecture. Understand what they need to do before you decide how to build it.
Automate everything, even if it's slower upfront. Manual steps are technical debt with compound interest.
Ship something incomplete to real users faster than feels comfortable. The first real user will invalidate half your assumptions in a week. At GreenQloud, it was an automotive CFD customer whose job submission patterns broke my scheduler assumptions. At HPCFLOW, it was a customer whose storage access patterns overwhelmed the Ceph cluster I'd "sized correctly." At Millennium, it was a researcher whose workflow required capabilities I hadn't anticipated.
Write operational documentation from day one. Future you — or whoever replaces you — will need it at 3 AM.
Four platforms. A decade. No CS degree. The pattern recognition gets better each time. The mistakes don't stop. They just get more interesting.