
Self-Service HPC: The Gap Between the Dream and the Reality

9 min read · Platform Engineering

Tags: hpc, platform-engineering, self-service, hpcflow, millennium, slurm

It's Friday at 5pm. A researcher messages you: "Can I get a cluster with 64 nodes and GPU support by Monday morning? Standard ML stack." You know this request. You've handled dozens of them. The correct answer involves a ticket, two approvals, a weekend, and a prayer that provisioning doesn't break. If the researcher is lucky, they get their cluster by Wednesday.

I built self-service HPC platforms to stop that from happening. At HPCFLOW — starting from scratch at Advania and scaling it to multi-region HPC-as-a-Service at atNorth — the whole product was this problem. Then at Millennium, where the users aren't researchers but quantitative analysts and portfolio managers, I built it again for a different context with different stakes.

Here's what I actually learned. Not the conference-talk version. The version where you hit the wall repeatedly before something clicks.

The Ticket Prison Is Real, But the Numbers Aren't

Every self-service platform pitch I've seen leads with metrics: tickets per week before, tickets per week after, satisfaction scores, cost savings. I'm not going to do that.

I don't have clean before-and-after numbers from HPCFLOW. The data was scattered, the baseline moved while we built the platform, and anyone who tells you they went from 42% to 94% user satisfaction in 18 months is either running very well-designed surveys or making things up. In my experience, HPC operations don't track that data rigorously until after they've decided to change something — by which point the baseline is gone.

What I can tell you is what the ticket queue actually looked like. Most requests were repetitive and simple: more nodes for an existing job, a software package installed, access to a GPU partition, a job submitted because the user hadn't bothered learning sbatch syntax. Brilliant engineers were handling these. The engineers were miserable. The users were also miserable because even simple things took days.

That's the real problem: not the ticket count, but the fact that a two-minute task had a three-day queue in front of it. The friction was asymmetric. It cost the user almost nothing to file a ticket — it cost the platform team real hours to process it.

Self-service, when it works, inverts this. The user pays the friction cost upfront by learning the interface. The platform team stops paying it every time.

What HPCFLOW Actually Was

HPCFLOW at Advania started as a provisioning layer over OpenStack Ironic. Bare metal HPC — not virtual machines, not Kubernetes pods, actual physical nodes provisioned on demand for SLURM workloads. In 2016, when I was integrating Ironic with SLURM, there weren't many people doing this. There was no Stack Overflow answer for the specific failure modes we hit.

The self-service layer came later, and it was more modest than the architecture diagrams make it sound. We had a web portal, a SLURM-compatible CLI, and a provisioning API. The portal handled the common cases: pick a partition, specify your resources, submit. The API let power users script their own workflows. The CLI wrapped it all in something that felt familiar to anyone who'd used SLURM before.
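To make the shape of that provisioning API concrete, here's a minimal sketch of what a cluster request might look like as a payload. The field names and function are hypothetical, not HPCFLOW's actual schema; the point is that the Friday-evening request from the introduction reduces to a few structured fields.

```python
import json

def build_cluster_request(partition, nodes, walltime_hours, gpu=False):
    # Illustrative payload shape only -- not HPCFLOW's real schema.
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return {
        "partition": partition,
        "nodes": nodes,
        "walltime": f"{walltime_hours:02d}:00:00",
        "features": ["gpu"] if gpu else [],
    }

# The request from the introduction: 64 nodes, GPU support, a weekend deadline.
payload = build_cluster_request("ml", nodes=64, walltime_hours=48, gpu=True)
print(json.dumps(payload, indent=2))
```

The portal, the CLI, and the raw API can all bottom out in the same payload; the interface layers differ only in how much of it the user has to type.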

The decision to keep the SLURM interface — instead of abstracting it entirely — was deliberate. HPC users are opinionated about their tools. They've spent years writing job scripts. If your "self-service platform" means relearning how to submit jobs, you've added friction rather than removed it. Meet people where they are.

The stack underneath — OpenStack Ironic for provisioning and booting, Neutron for network setup, Ansible for configuration, custom orchestration glue — was invisible to users. That invisibility was the product. When it worked, users saw a cluster appear. When it didn't work, they saw an error message and filed a ticket anyway.
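The orchestration glue described above can be sketched as a linear pipeline: provision, network, configure, stop at the first failure. The stage functions below are stand-ins, not real Ironic, Neutron, or Ansible client calls; what matters is that the user sees one outcome, not the stack underneath.

```python
def bare_metal_boot(spec):
    # Stand-in for OpenStack Ironic node provisioning.
    return True, f"{spec['nodes']} nodes booted"

def attach_networks(spec):
    # Stand-in for Neutron network setup.
    return True, "networks attached"

def run_playbooks(spec):
    # Stand-in for Ansible configuration.
    return True, "configuration applied"

def provision_cluster(spec):
    # Run the stages in order; surface one user-facing result either way.
    stages = [
        ("provision", bare_metal_boot),
        ("network", attach_networks),
        ("configure", run_playbooks),
    ]
    for name, stage in stages:
        ok, detail = stage(spec)
        if not ok:
            # The user sees a single error message, not the plumbing.
            return {"status": "failed", "stage": name, "detail": detail}
    return {"status": "ready", "nodes": spec["nodes"]}

result = provision_cluster({"nodes": 4})
```

When every stage succeeds, the user "sees a cluster appear"; when one fails, the stage name is the only hint the error message gives them.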

The Millennium Context

Millennium is different from a research HPC environment in ways that matter for platform design.

The users are quants and portfolio managers, not researchers. They're not running week-long molecular dynamics simulations — they're running backtests, risk calculations, factor models. The workloads are often short, high-frequency, and latency-sensitive in a different way than traditional HPC. A quant researcher doesn't want to wait 45 minutes for provisioning before they can test a signal idea. The whole point of the signal is to act on it before it decays.

The compliance requirements are also categorically different. Every job has to be auditable. Data locality matters — you can't just spin up a cloud instance and send market data to it. The platform has to know what it's running and where.

This changes the self-service calculus significantly. Research HPC platforms can afford to let users experiment liberally with large allocations because the worst case is a failed experiment. In a trading infrastructure context, the guardrails aren't optional — they're the product. The self-service layer has to enforce them invisibly, without making users feel policed.
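One way to enforce guardrails without making users feel policed is to validate the whole request up front and return every violation at once, instead of failing piecemeal. This is a sketch under assumptions: the region names, field names, and rules are invented for illustration, not Millennium's actual policy.

```python
ALLOWED_REGIONS = {"colo-east", "colo-west"}  # hypothetical region names

def check_job(job):
    # Collect all violations rather than failing on the first one,
    # so the user gets one actionable message instead of a ping-pong.
    violations = []
    if job.get("region") not in ALLOWED_REGIONS:
        violations.append("market data must stay in an approved region")
    if not job.get("owner"):
        violations.append("every job needs an auditable owner")
    return violations

clean = check_job({"region": "colo-east", "owner": "desk-42"})
dirty = check_job({"region": "public-cloud"})
```

A clean job returns an empty list and provisioning proceeds; the checks run on every submission, which is what makes the guardrails invisible rather than a gate users have to ask someone to open.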

Getting that balance right is harder than building the provisioning engine. The provisioning engine has a spec. The policy layer has stakeholders.

The Tension: Defaults vs. Control

Here is the thing nobody puts in the conference talk: HPC users want sensible defaults until the moment they don't, and then they want absolute control with zero friction.

A new user submitting their first job on a self-service platform loves that they don't have to specify an interconnect type, choose a storage tier, or configure MPI — the defaults handle it. Six months later, that same user is angry because the default interconnect is InfiniBand and they need RoCE for their specific workload, and the UI makes them file a ticket to change it. The self-service platform has just recreated the ticket queue for advanced use cases.

The way I've seen this handled best — and it's still not perfect — is a layered interface with real exit hatches. The first layer handles 80% of jobs with zero configuration. The second layer exposes the scheduling and resource parameters that experienced users need. The third layer drops you into the raw SLURM interface with the platform-managed defaults pre-filled but editable. Each layer is a ratchet, not a ceiling.
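The layering above amounts to a simple merge order: platform defaults first, exposed parameters second, raw edits last, with later layers winning and nothing locked. A minimal sketch, with invented default values standing in for real platform configuration:

```python
# Layer 1: zero-config defaults the platform manages (values are illustrative).
PLATFORM_DEFAULTS = {
    "interconnect": "infiniband",
    "storage_tier": "scratch",
    "mpi": "openmpi",
}

def resolve_job_config(user_overrides=None, raw_edits=None):
    # Merge the three layers; each later layer wins. Nothing is a ceiling.
    config = dict(PLATFORM_DEFAULTS)      # layer 1: sensible defaults
    config.update(user_overrides or {})   # layer 2: exposed parameters
    config.update(raw_edits or {})        # layer 3: raw, fully editable
    return config

# The angry six-months-later user: same workflow, one field changed.
cfg = resolve_job_config(raw_edits={"interconnect": "roce"})
```

The RoCE user from the example above doesn't file a ticket; they override one key and keep every other default they never cared about.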

The failure mode is when platform teams treat the third layer as an admission of defeat. If reaching that raw interface requires a ticket, you've built a prison with nicer wallpaper.

Why Users Shadow-Build Around You

When the platform doesn't cover a use case, users don't wait. They provision their own cloud resources, buy credits, hack together workarounds. This is what gets called "shadow IT" in post-mortems, as though it's a behaviour problem rather than a platform failure.

Shadow usage is the platform team's most useful feedback signal, and most teams treat it as a compliance problem instead. At Millennium, the access controls make personal cloud accounts a non-starter. But I've seen users schedule workloads on GPU workstations with their own scripts — not because they were being reckless, but because it was easy and didn't require the rigour of the production environment. The interesting question isn't "how do we stop this?" It's "what about the production workflow was friction enough that a workstation felt easier?" The answer tells you more about your gaps than any user satisfaction survey.

The platforms that reduced shadow usage weren't the ones with stricter access controls. They were the ones that covered the use cases people were escaping to find. Obvious in retrospect. Painful in practice, because those use cases are always the awkward, high-maintenance ones that were left out of v1 for good reasons.

What I'd Do Differently

The hardest part wasn't building the provisioning engine. It was the social layer — getting users to trust the platform enough to change their behaviour.

At HPCFLOW, we launched with a technically solid provisioning API and almost no adoption for the first few months. The engineers who'd been handling tickets manually had relationships with their users. They understood the workloads. The platform had none of that. We had to build adoption the same way you build any tool adoption: find the power users who are willing to try things, make sure their experience is excellent, and let them drag everyone else over.

The mistake I made at HPCFLOW was treating the portal as the product. It wasn't. The product was the outcome — a cluster, a running job, a completed analysis. Users don't care about the portal. They care about getting their work done. The portal is just the least-friction path to that outcome. If users found a different path — a script, a CLI, a direct API call — that was fine. Adoption of the platform didn't have to mean adoption of the UI.

I also underestimated how long the transition period lasts. You don't flip from a ticket-based workflow to a self-service workflow overnight. Users and platform team both need time to adjust. During the transition, you're running both systems — the old ticket queue for edge cases and existing workflows, the new self-service layer for everything else. This is expensive and demoralising for the team, because it feels like building a new house while still maintaining the old one. But there's no shortcut. Cutting off the ticket queue before the self-service layer is ready to handle everything doesn't accelerate adoption — it creates a crisis.

The Metric That Actually Matters

If I had to track one thing to know whether a self-service HPC platform is working, it's this: how much time passes between when a user decides to run a workload and when the workload is actually running?

Not provisioning time. Not ticket resolution time. The full elapsed time including the user figuring out what to request, forming the request, waiting for it to be processed, getting access, and finally launching the job. That's the number that reflects the actual user experience.
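The metric itself is trivial to compute; the hard part is capturing the decision timestamp at all. A sketch, with the Friday/Monday scenario from the introduction plugged in as hypothetical timestamps:

```python
from datetime import datetime

def decision_to_running_minutes(decided_at, running_at):
    # Full elapsed time from "I want to run this" to "it is running".
    return (running_at - decided_at).total_seconds() / 60

minutes = decision_to_running_minutes(
    datetime(2024, 3, 1, 17, 0),   # Friday 5pm: the idea
    datetime(2024, 3, 4, 9, 30),   # Monday 9:30am: the job finally starts
)
# 64.5 hours of elapsed time, most of it queue and process, not compute.
```

Everything upstream of the job start, including the user deciding what to ask for, counts; that's the difference between this number and a provisioning-time dashboard.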

When that number goes from days to minutes, you've built something that changes how people work. They run more experiments. They test more ideas. They don't wait until Monday to try the thing they thought of on Friday.

That's the whole point. The platform isn't the achievement. The change in how people work is the achievement. Build toward that, measure that, and don't let the architecture get mistaken for the goal.