
HPCFLOW: Building an HPC-as-a-Service Platform from Zero

Advania / atNorth · 2016-2021 · Founding Engineer / CTO, HPCFLOW
Tags: Startup, HPC, Platform, Multi-tenancy, Innovation, Founding Engineer

Executive Summary

As the founding engineer of HPCFLOW, I built a multi-tenant HPC platform from the ground up at Advania, integrating OpenStack (with Ironic for bare metal), Packer, Slurm, and Ceph alongside custom components to deliver HPC clusters as a service. When the platform was adopted by Advania Data Centers (later atNorth), I transitioned to CTO and scaled it from a single-region solution to multi-regional HPC-as-a-Service serving enterprise clients.

The Problem

Traditional HPC infrastructure presented significant barriers to entry:

  • Capital Requirements: Millions in upfront hardware investment
  • Expertise Gap: Shortage of HPC specialists
  • Utilization Challenges: Resources sitting idle 60-80% of the time
  • Scaling Limitations: Fixed capacity couldn't meet variable demand
  • Security Concerns: Multi-tenancy considered impossible for HPC

Small to medium enterprises and research groups were effectively locked out of HPC capabilities.

The Vision

Create an "AWS for HPC" - a platform where anyone could access supercomputing resources on-demand, paying only for what they use, without compromising on performance or security.

The Innovation

Technical Breakthroughs

1. True Multi-Tenant HPC

Challenge: Traditional HPC assumes single-tenant, trusted environments

Innovation:

  • First-ever implementation of Omni-Path vFabric for tenant network isolation
  • Hardware-level security without a performance penalty
  • Complete tenant isolation at the fabric layer (a simplified sketch follows this list)
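
A minimal sketch of the bookkeeping behind this isolation model, in plain Python: each tenant gets its own partition key (PKey) and only that tenant's host ports are admitted to the partition. The rendered config format is a simplified stand-in, not the exact OpenSM or Omni-Path fabric manager syntax, and the class is illustrative rather than HPCFLOW code.

```python
"""Illustrative tenant-to-partition bookkeeping for fabric-level isolation.

The rendered config format is a simplified stand-in, not the exact OpenSM or
Omni-Path fabric manager syntax; the class is a sketch, not HPCFLOW code.
"""
from dataclasses import dataclass, field


@dataclass
class FabricPartitions:
    """Allocate one partition key (PKey) per tenant and track port membership."""

    next_pkey: int = 0x8001                    # 0x7fff is conventionally the default partition
    tenants: dict = field(default_factory=dict)

    def add_tenant(self, tenant_id: str) -> int:
        if tenant_id not in self.tenants:
            self.tenants[tenant_id] = {"pkey": self.next_pkey, "guids": set()}
            self.next_pkey += 1
        return self.tenants[tenant_id]["pkey"]

    def add_node(self, tenant_id: str, port_guid: str) -> None:
        """Admit a node's fabric port into its own tenant's partition only."""
        self.add_tenant(tenant_id)
        self.tenants[tenant_id]["guids"].add(port_guid)

    def render(self) -> str:
        """Emit one full-membership group per tenant in a simplified config format."""
        lines = []
        for name, part in sorted(self.tenants.items()):
            members = ", ".join(f"{g}=full" for g in sorted(part["guids"]))
            lines.append(f"{name}=0x{part['pkey']:04x} : {members};")
        return "\n".join(lines)


parts = FabricPartitions()
parts.add_node("tenant-a", "0x0011750101aa0001")
parts.add_node("tenant-b", "0x0011750101bb0002")
print(parts.render())
```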

2. Elastic Resource Management

Challenge: HPC workloads have unpredictable resource requirements

Innovation:

  • Dynamic resource allocation with sub-second provisioning
  • Intelligent scheduling that anticipated workload patterns
  • Automatic scaling based on queue depth and SLAs (a simplified scaling loop is sketched below)
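
A stripped-down sketch of that scaling loop, assuming only Slurm's standard `squeue` and `sinfo` CLIs: it reacts to pending jobs versus idle nodes, whereas the production scheduler also weighed SLAs and predicted workload patterns. The provisioning hooks themselves are omitted.

```python
"""Illustrative scaling loop driven by Slurm queue depth.

Simplified sketch only: the production scheduler also weighed SLAs and
predicted workload patterns, and the provision/release hooks are omitted.
"""
import subprocess


def pending_jobs() -> int:
    """Count jobs Slurm reports as pending (state PD)."""
    out = subprocess.run(
        ["squeue", "--noheader", "--states=PD", "--format=%i"],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(out.split())


def idle_nodes() -> int:
    """Sum the idle-node counts sinfo reports (one line per partition/state group)."""
    out = subprocess.run(
        ["sinfo", "--noheader", "--states=idle", "--format=%D"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(int(n) for n in out.split())


def scaling_decision(pending: int, idle: int, jobs_per_node: int = 1,
                     min_idle: int = 2) -> int:
    """Positive: bare-metal nodes to provision; negative: nodes that could be released."""
    needed = -(-pending // jobs_per_node)      # ceiling division
    if needed > idle:
        return needed - idle                   # scale out to absorb the queue
    return -(idle - max(needed, min_idle))     # scale in, but keep a warm buffer


if __name__ == "__main__":
    print(f"scale by {scaling_decision(pending_jobs(), idle_nodes()):+d} nodes")
```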

3. Bare Metal Provisioning

Challenge: Containers/VMs add unacceptable overhead for HPC

Innovation:

  • Custom bare-metal provisioning built on OpenStack Ironic
  • Two-minute deployment of fully configured HPC nodes
  • Zero-overhead multi-tenancy (see the provisioning sketch below)
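
A rough illustration of the provisioning path using the OpenStack SDK, assuming Ironic sits behind Nova as in the HPCFLOW stack; the cloud entry, flavor, image, and network names are placeholders, not real HPCFLOW identifiers.

```python
"""Illustrative bare-metal provisioning via the OpenStack SDK.

Assumes Ironic sits behind Nova (as in HPCFLOW's stack); the cloud entry,
flavor, image, and network names below are placeholders.
"""
import openstack


def provision_hpc_node(name: str, tenant_net: str = "tenant-hpc-net"):
    conn = openstack.connect(cloud="hpcflow")               # entry from clouds.yaml

    image = conn.compute.find_image("hpc-golden-image")
    flavor = conn.compute.find_flavor("baremetal.hpc")      # flavor mapped to Ironic nodes
    network = conn.network.find_network(tenant_net)

    server = conn.compute.create_server(
        name=name,
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": network.id}],
    )
    # Block until Ironic has written the golden image and the node is ACTIVE.
    server = conn.compute.wait_for_server(server, status="ACTIVE", wait=1800)
    print(f"{name} is active: {server.id}")
    return server


if __name__ == "__main__":
    provision_hpc_node("hpc-node-001")
```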

Platform Architecture

User Layer:       Web Portal | API | CLI
                         ↓
Control Plane:    Authentication | Billing | Monitoring
                         ↓
Orchestration:    Slurm | Custom Schedulers | OpenStack
                         ↓
Fabric Layer:     Omni-Path vFabric | InfiniBand Partitioning
                         ↓
Compute Layer:    Bare Metal Nodes | GPU Clusters | Storage

Phase 1: Founding (2016-2018) — Advania

As sole founding engineer, I designed and built HPCFLOW from a blank repository:

  • Platform stack: OpenStack (Ironic for bare metal), Packer for golden images, Slurm for workload management, Ceph for distributed storage
  • Bare metal provisioning: Automated HPC cluster deployment from golden images using HPE Cluster Management Utility (CMU); the golden-image build-and-register flow is sketched after this list
  • Early Kubernetes: Adopted Kubernetes v1.0 in 2016 for internal services and monitoring — one of the earliest production deployments
  • Customer delivery: Led pre-sales and solution architecture, enabling enterprise clients to migrate to IaaS-based HPC at on-premises performance levels
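
One plausible shape of the golden-image flow referenced above: build the image with Packer, then register the artifact in Glance so Ironic/Nova can deploy it to bare metal (CMU's own clone-based deployment is a separate path not shown). The template, artifact, and image names are placeholders.

```python
"""Illustrative golden-image pipeline: build with Packer, register in Glance.

Template, artifact, and image names are placeholders; CMU-based cloning is a
separate deployment path not shown here.
"""
import subprocess

import openstack


def build_and_register(template: str = "hpc-node.pkr.hcl",
                       artifact: str = "output/hpc-node.qcow2",
                       image_name: str = "hpc-golden-image"):
    # Build the golden image; Packer writes the artifact to disk.
    subprocess.run(["packer", "build", template], check=True)

    # Register the artifact in Glance so Ironic/Nova can deploy it to bare metal.
    conn = openstack.connect(cloud="hpcflow")
    image = conn.image.create_image(
        name=image_name,
        filename=artifact,
        disk_format="qcow2",
        container_format="bare",
        visibility="private",
    )
    print(f"registered {image.name} ({image.id})")
    return image
```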

Phase 2: Scale (2018-2021) — atNorth / Advania Data Centers

When HPCFLOW was adopted by Advania Data Centers (later atNorth), I became CTO and scaled the platform to multi-regional HPCaaS:

Technical Breakthroughs

Omni-Path vFabric Multi-tenancy — Pioneered tenant network isolation at the fabric layer, integrating vFabric support directly with OpenStack Neutron's port allocation model. This solved what much of the industry considered an unsolvable problem: true hardware-level tenant isolation without a performance penalty. The core idea is sketched below.
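
Reduced to its essence, the integration maps a Neutron network's segmentation ID onto a fabric partition key and admits a node's fabric port into that partition when the corresponding bare-metal port is bound. The standalone sketch below illustrates only that mapping; `FabricManagerStub` stands in for whatever pushes membership to the fabric manager and is not a real API.

```python
"""Standalone illustration of the vFabric/Neutron mapping idea.

FabricManagerStub is a placeholder, not a real fabric-manager API; in HPCFLOW
the equivalent logic hung off Neutron's port-binding path.
"""


class FabricManagerStub:
    """Stand-in for whatever pushes partition membership to the fabric manager."""

    def add_member(self, pkey: int, guid: str) -> None:
        print(f"admit {guid} into partition 0x{pkey:04x}")


def pkey_for_segment(segmentation_id: int) -> int:
    """Derive a 15-bit partition key from a Neutron network's segmentation ID."""
    if not 1 <= segmentation_id <= 0x7FFE:
        raise ValueError("segmentation ID out of PKey range")
    return 0x8000 | segmentation_id            # top bit marks full membership


def on_port_bound(fabric: FabricManagerStub, node_guid: str, segmentation_id: int) -> None:
    """When a tenant's bare-metal port is bound, admit that node into the tenant partition."""
    fabric.add_member(pkey=pkey_for_segment(segmentation_id), guid=node_guid)


on_port_bound(FabricManagerStub(), node_guid="0x0011750101aa0001", segmentation_id=101)
```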

Custom Switch Implementations — Developed two switch device drivers for OpenStack's networking-generic-switch ML2 plugin, enabling multi-tenant bare-metal networking for HPE FlexFabric and Cumulus Linux environments (an illustrative driver skeleton follows).
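
For orientation, such a device driver roughly takes the shape below, assuming the upstream networking-generic-switch layout of the time (a NetmikoSwitch subclass carrying CLI command templates keyed to Neutron events). The commands shown are generic placeholders, not the exact Comware or Cumulus syntax shipped in HPCFLOW's drivers.

```python
"""Illustrative shape of a networking-generic-switch device driver.

Assumes the upstream driver layout (a NetmikoSwitch base class with command
tuples); the CLI commands below are placeholders, not the exact Comware or
Cumulus syntax used in HPCFLOW's drivers.
"""
from networking_generic_switch.devices import netmiko_devices


class HpeFlexFabric(netmiko_devices.NetmikoSwitch):
    """Toy example: map Neutron VLAN networks onto switch ports via CLI templates."""

    ADD_NETWORK = (
        'vlan {segmentation_id}',
        'name neutron-{network_id}',
    )

    DELETE_NETWORK = (
        'undo vlan {segmentation_id}',
    )

    PLUG_PORT_TO_NETWORK = (
        'interface {port}',
        'port access vlan {segmentation_id}',
    )

    DELETE_PORT = (
        'interface {port}',
        'undo port access vlan',
    )
```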

Intel Select Solutions — Achieved Intel Select Solutions for High-Performance Computing certification, validating platform performance against Intel's reference architectures.

Operations

  • Architected and operated distributed storage on community Ceph and Red Hat Ceph Storage across multi-year operational cycles
  • Led pre-sales and solution architecture for enterprise HPC requirements
  • Presented at ISC High Performance (Hamburg) and Supercomputing (US) conferences
  • Multi-year collaboration with HPE HPC team in Grenoble (Centre of Excellence)

Notable Deployments

  • Stanford Living Heart Project (with UberCloud): Provided HPCFLOW infrastructure for Stanford University's breakthrough cardiac simulation research — enabling detailed finite-element models of the human heart. The project won Cloud HPC Awards from Intel, HPCwire, and Hyperion at SC17.
  • Human Brain Project (UberCloud Experiment #200): Led HPC infrastructure provision for personalized clinical treatment simulations for schizophrenia and Parkinson's disease research.

Technologies Mastered

  • Slurm workload management
  • InfiniBand / Omni-Path fabrics
  • OpenStack orchestration
  • Bare metal provisioning
  • Multi-tenant security
  • Usage-based billing systems
  • Distributed systems
  • Performance monitoring

HPCFLOW demonstrated that with the right technical innovation and business model, it's possible to democratize access to even the most complex computing infrastructure.