
HPCFLOW: Building an HPC-as-a-Service Platform from Zero

Advania / atNorth · 2016-2021 · Founding Engineer / CTO, HPCFLOW
Tags: Startup, HPC, Platform, Multi-tenancy, Innovation, Founding Engineer

Executive Summary

As the founding engineer of HPCFLOW, I built a multi-tenant HPC platform from the ground up at Advania, integrating OpenStack (with Ironic for bare metal), Packer, Slurm, and Ceph alongside custom components to deliver HPC clusters as a service. When the platform was adopted by Advania Data Centers (later atNorth), I transitioned to CTO and scaled it from a single-region solution to multi-regional HPC-as-a-Service serving enterprise clients.

The Problem

Traditional HPC infrastructure presented significant barriers to entry:

  • Capital Requirements: Millions in upfront hardware investment
  • Expertise Gap: Shortage of HPC specialists
  • Utilization Challenges: Resources sitting idle 60-80% of the time
  • Scaling Limitations: Fixed capacity couldn't meet variable demand
  • Security Concerns: Multi-tenancy considered impossible for HPC

Small to medium enterprises and research groups were effectively locked out of HPC capabilities.

The Vision

Create an "AWS for HPC" - a platform where anyone could access supercomputing resources on-demand, paying only for what they use, without compromising on performance or security.

The Innovation

Technical Breakthroughs

1. True Multi-Tenant HPC

Challenge: Traditional HPC assumes single-tenant, trusted environments

Innovation:

  • First-ever implementation of Omni-Path vFabric for tenant network isolation
  • Hardware-level security without a performance penalty
  • Complete tenant isolation at the fabric layer (a simplified sketch follows this list)
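
A minimal sketch of the bookkeeping behind this isolation model, in plain Python: each tenant gets its own partition key (PKey) and only that tenant's host ports are admitted to the partition. The rendered config format is a simplified stand-in, not the exact OpenSM or Omni-Path fabric manager syntax, and the class is illustrative rather than HPCFLOW code.

```python
"""Illustrative tenant-to-partition bookkeeping for fabric-level isolation.

The rendered config format is a simplified stand-in, not the exact OpenSM or
Omni-Path fabric manager syntax; the class is a sketch, not HPCFLOW code.
"""
from dataclasses import dataclass, field


@dataclass
class FabricPartitions:
    """Allocate one partition key (PKey) per tenant and track port membership."""

    next_pkey: int = 0x8001                    # 0x7fff is conventionally the default partition
    tenants: dict = field(default_factory=dict)

    def add_tenant(self, tenant_id: str) -> int:
        if tenant_id not in self.tenants:
            self.tenants[tenant_id] = {"pkey": self.next_pkey, "guids": set()}
            self.next_pkey += 1
        return self.tenants[tenant_id]["pkey"]

    def add_node(self, tenant_id: str, port_guid: str) -> None:
        """Admit a node's fabric port into its own tenant's partition only."""
        self.add_tenant(tenant_id)
        self.tenants[tenant_id]["guids"].add(port_guid)

    def render(self) -> str:
        """Emit one full-membership group per tenant in a simplified config format."""
        lines = []
        for name, part in sorted(self.tenants.items()):
            members = ", ".join(f"{g}=full" for g in sorted(part["guids"]))
            lines.append(f"{name}=0x{part['pkey']:04x} : {members};")
        return "\n".join(lines)


parts = FabricPartitions()
parts.add_node("tenant-a", "0x0011750101aa0001")
parts.add_node("tenant-b", "0x0011750101bb0002")
print(parts.render())
```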

2. Elastic Resource Management

Challenge: HPC workloads have unpredictable resource requirements

Innovation:

  • Dynamic resource allocation with sub-second provisioning
  • Intelligent scheduling that anticipated workload patterns
  • Automatic scaling based on queue depth and SLAs (a simplified scaling loop is sketched below)
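
A stripped-down sketch of that scaling loop, assuming only Slurm's standard `squeue` and `sinfo` CLIs: it reacts to pending jobs versus idle nodes, whereas the production scheduler also weighed SLAs and predicted workload patterns. The provisioning hooks themselves are omitted.

```python
"""Illustrative scaling loop driven by Slurm queue depth.

Simplified sketch only: the production scheduler also weighed SLAs and
predicted workload patterns, and the provision/release hooks are omitted.
"""
import subprocess


def pending_jobs() -> int:
    """Count jobs Slurm reports as pending (state PD)."""
    out = subprocess.run(
        ["squeue", "--noheader", "--states=PD", "--format=%i"],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(out.split())


def idle_nodes() -> int:
    """Sum the idle-node counts sinfo reports (one line per partition/state group)."""
    out = subprocess.run(
        ["sinfo", "--noheader", "--states=idle", "--format=%D"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(int(n) for n in out.split())


def scaling_decision(pending: int, idle: int, jobs_per_node: int = 1,
                     min_idle: int = 2) -> int:
    """Positive: bare-metal nodes to provision; negative: nodes that could be released."""
    needed = -(-pending // jobs_per_node)      # ceiling division
    if needed > idle:
        return needed - idle                   # scale out to absorb the queue
    return -(idle - max(needed, min_idle))     # scale in, but keep a warm buffer


if __name__ == "__main__":
    print(f"scale by {scaling_decision(pending_jobs(), idle_nodes()):+d} nodes")
```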

3. Bare Metal Provisioning

Challenge: Containers/VMs add unacceptable overhead for HPC

Innovation:

  • Custom bare-metal provisioning built on OpenStack Ironic
  • Two-minute deployment of fully configured HPC nodes
  • Zero-overhead multi-tenancy (see the provisioning sketch below)
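
A rough illustration of the provisioning path using the OpenStack SDK, assuming Ironic sits behind Nova as in the HPCFLOW stack; the cloud entry, flavor, image, and network names are placeholders, not real HPCFLOW identifiers.

```python
"""Illustrative bare-metal provisioning via the OpenStack SDK.

Assumes Ironic sits behind Nova (as in HPCFLOW's stack); the cloud entry,
flavor, image, and network names below are placeholders.
"""
import openstack


def provision_hpc_node(name: str, tenant_net: str = "tenant-hpc-net"):
    conn = openstack.connect(cloud="hpcflow")               # entry from clouds.yaml

    image = conn.compute.find_image("hpc-golden-image")
    flavor = conn.compute.find_flavor("baremetal.hpc")      # flavor mapped to Ironic nodes
    network = conn.network.find_network(tenant_net)

    server = conn.compute.create_server(
        name=name,
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": network.id}],
    )
    # Block until Ironic has written the golden image and the node is ACTIVE.
    server = conn.compute.wait_for_server(server, status="ACTIVE", wait=1800)
    print(f"{name} is active: {server.id}")
    return server


if __name__ == "__main__":
    provision_hpc_node("hpc-node-001")
```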

Platform Architecture

User Layer:       Web Portal | API | CLI
                         ↓
Control Plane:    Authentication | Billing | Monitoring
                         ↓
Orchestration:    Slurm | Custom Schedulers | OpenStack
                         ↓
Fabric Layer:     Omni-Path vFabric | InfiniBand Partitioning
                         ↓
Compute Layer:    Bare Metal Nodes | GPU Clusters | Storage

Phase 1: Founding (2016-2018) — Advania

As sole founding engineer, I designed and built HPCFLOW from a blank repository:

  • Platform stack: OpenStack (Ironic for bare metal), Packer for golden images, Slurm for workload management, Ceph for distributed storage
  • Bare metal provisioning: Automated HPC cluster deployment from golden images using HPE Cluster Management Utility (CMU); the golden-image build-and-register flow is sketched after this list
  • Early Kubernetes: Adopted Kubernetes v1.0 in 2016 for internal services and monitoring — one of the earliest production deployments
  • Customer delivery: Led pre-sales and solution architecture, enabling enterprise clients to migrate to IaaS-based HPC at on-premises performance levels
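
One plausible shape of the golden-image flow referenced above: build the image with Packer, then register the artifact in Glance so Ironic/Nova can deploy it to bare metal (CMU's own clone-based deployment is a separate path not shown). The template, artifact, and image names are placeholders.

```python
"""Illustrative golden-image pipeline: build with Packer, register in Glance.

Template, artifact, and image names are placeholders; CMU-based cloning is a
separate deployment path not shown here.
"""
import subprocess

import openstack


def build_and_register(template: str = "hpc-node.pkr.hcl",
                       artifact: str = "output/hpc-node.qcow2",
                       image_name: str = "hpc-golden-image"):
    # Build the golden image; Packer writes the artifact to disk.
    subprocess.run(["packer", "build", template], check=True)

    # Register the artifact in Glance so Ironic/Nova can deploy it to bare metal.
    conn = openstack.connect(cloud="hpcflow")
    image = conn.image.create_image(
        name=image_name,
        filename=artifact,
        disk_format="qcow2",
        container_format="bare",
        visibility="private",
    )
    print(f"registered {image.name} ({image.id})")
    return image
```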

Phase 2: Scale (2018-2021) — atNorth / Advania Data Centers

When HPCFLOW was adopted by Advania Data Centers (later atNorth), I became CTO and scaled the platform to multi-regional HPCaaS:

Technical Breakthroughs

Omni-Path vFabric Multi-tenancy — Pioneered tenant network isolation at the fabric layer, integrating vFabric support directly with OpenStack Neutron's port allocation model. This solved what much of the industry considered an unsolvable problem: true hardware-level tenant isolation without a performance penalty. The core idea is sketched below.
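
Reduced to its essence, the integration maps a Neutron network's segmentation ID onto a fabric partition key and admits a node's fabric port into that partition when the corresponding bare-metal port is bound. The standalone sketch below illustrates only that mapping; `FabricManagerStub` stands in for whatever pushes membership to the fabric manager and is not a real API.

```python
"""Standalone illustration of the vFabric/Neutron mapping idea.

FabricManagerStub is a placeholder, not a real fabric-manager API; in HPCFLOW
the equivalent logic hung off Neutron's port-binding path.
"""


class FabricManagerStub:
    """Stand-in for whatever pushes partition membership to the fabric manager."""

    def add_member(self, pkey: int, guid: str) -> None:
        print(f"admit {guid} into partition 0x{pkey:04x}")


def pkey_for_segment(segmentation_id: int) -> int:
    """Derive a 15-bit partition key from a Neutron network's segmentation ID."""
    if not 1 <= segmentation_id <= 0x7FFE:
        raise ValueError("segmentation ID out of PKey range")
    return 0x8000 | segmentation_id            # top bit marks full membership


def on_port_bound(fabric: FabricManagerStub, node_guid: str, segmentation_id: int) -> None:
    """When a tenant's bare-metal port is bound, admit that node into the tenant partition."""
    fabric.add_member(pkey=pkey_for_segment(segmentation_id), guid=node_guid)


on_port_bound(FabricManagerStub(), node_guid="0x0011750101aa0001", segmentation_id=101)
```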

Custom Switch Implementations — Developed two switch device drivers for OpenStack's networking-generic-switch ML2 plugin, enabling multi-tenant bare-metal networking for HPE FlexFabric and Cumulus Linux environments (an illustrative driver skeleton follows).
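
For orientation, such a device driver roughly takes the shape below, assuming the upstream networking-generic-switch layout of the time (a NetmikoSwitch subclass carrying CLI command templates keyed to Neutron events). The commands shown are generic placeholders, not the exact Comware or Cumulus syntax shipped in HPCFLOW's drivers.

```python
"""Illustrative shape of a networking-generic-switch device driver.

Assumes the upstream driver layout (a NetmikoSwitch base class with command
tuples); the CLI commands below are placeholders, not the exact Comware or
Cumulus syntax used in HPCFLOW's drivers.
"""
from networking_generic_switch.devices import netmiko_devices


class HpeFlexFabric(netmiko_devices.NetmikoSwitch):
    """Toy example: map Neutron VLAN networks onto switch ports via CLI templates."""

    ADD_NETWORK = (
        'vlan {segmentation_id}',
        'name neutron-{network_id}',
    )

    DELETE_NETWORK = (
        'undo vlan {segmentation_id}',
    )

    PLUG_PORT_TO_NETWORK = (
        'interface {port}',
        'port access vlan {segmentation_id}',
    )

    DELETE_PORT = (
        'interface {port}',
        'undo port access vlan',
    )
```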

Intel Select Solutions — Achieved Intel Select Solutions for High-Performance Computing certification, validating platform performance against Intel's reference architectures.

Operations

  • Architected and operated distributed storage on community Ceph and Red Hat Ceph Storage across multi-year operational cycles
  • Led pre-sales and solution architecture for enterprise HPC requirements
  • Presented at ISC High Performance (Hamburg) and Supercomputing (US) conferences
  • Multi-year collaboration with HPE HPC team in Grenoble (Centre of Excellence)

Notable Deployments

  • Stanford Living Heart Project (with UberCloud): Provided HPCFLOW infrastructure for Stanford University's breakthrough cardiac simulation research — enabling detailed finite-element models of the human heart. The project won Cloud HPC Awards from Intel, HPCwire, and Hyperion at SC17.
  • Human Brain Project (UberCloud Experiment #200): Led HPC infrastructure provision for personalized clinical treatment simulations for schizophrenia and Parkinson's disease research.

Technologies Mastered

  • Slurm workload management
  • InfiniBand / Omni-Path fabrics
  • OpenStack orchestration
  • Bare metal provisioning
  • Multi-tenant security
  • Usage-based billing systems
  • Distributed systems
  • Performance monitoring

HPCFLOW demonstrated that with the right technical innovation and business model, it's possible to democratize access to even the most complex computing infrastructure.