Executive Summary
As the founding engineer of HPCFLOW, I built a multi-tenant HPC platform from the ground up at Advania — integrating OpenStack, Ironic, Packer, Slurm, and Ceph with custom solutions to deliver HPC clusters as a service. When the platform was adopted by Advania Data Centers (later atNorth), I transitioned to CTO and scaled it from a single-region solution to a multi-region HPC-as-a-Service offering for enterprise clients.
The Problem
Traditional HPC infrastructure presented significant barriers to entry:
- Capital Requirements: Millions in upfront hardware investment
- Expertise Gap: Shortage of HPC specialists
- Utilization Challenges: Resources sitting idle 60-80% of the time
- Scaling Limitations: Fixed capacity couldn't meet variable demand
- Security Concerns: Multi-tenancy considered impossible for HPC
Small to medium enterprises and research groups were effectively locked out of HPC capabilities.
The Vision
Create an "AWS for HPC" - a platform where anyone could access supercomputing resources on-demand, paying only for what they use, without compromising on performance or security.
The Innovation
Technical Breakthroughs
1. True Multi-Tenant HPC
Challenge: Traditional HPC assumes single-tenant, trusted environments
Innovation:
- First-ever implementation of Intel Omni-Path virtual fabrics (vFabrics) for tenant network isolation
- Hardware-level security without a performance penalty
- Complete tenant isolation at the fabric layer (a partition-mapping sketch follows this list)
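The isolation model can be reduced to a simple idea: every tenant gets a dedicated fabric partition, and only that tenant's nodes are members of it. The sketch below illustrates that mapping in Python; the stanza format is a simplified stand-in for a fabric manager's virtual-fabric configuration (not the exact opafm.xml schema), and `assign_pkey`/`render_vfabric` are hypothetical helpers rather than HPCFLOW's actual code.

```python
# Illustrative sketch: map tenants to dedicated fabric partitions.
# The XML-ish stanza is a simplified stand-in for the fabric manager's
# virtual-fabric configuration, not its exact schema.
from itertools import count

# Partition keys are 15-bit values; starting tenant fabrics at 0x1000 is an
# assumption made for illustration only.
_pkey_counter = count(0x1000)

def assign_pkey(tenant_id: str, table: dict) -> int:
    """Give each tenant a stable, unique partition key."""
    if tenant_id not in table:
        table[tenant_id] = next(_pkey_counter)
    return table[tenant_id]

def render_vfabric(tenant_id: str, node_guids: list[str], pkey: int) -> str:
    """Render a per-tenant virtual-fabric stanza (illustrative format)."""
    members = "\n".join(f"    <Member>{guid}</Member>" for guid in node_guids)
    return (
        f"<VirtualFabric>\n"
        f"  <Name>tenant-{tenant_id}</Name>\n"
        f"  <PKey>0x{pkey:04x}</PKey>\n"
        f"{members}\n"
        f"</VirtualFabric>"
    )

if __name__ == "__main__":
    table: dict = {}
    pkey = assign_pkey("acme-research", table)
    print(render_vfabric("acme-research", ["0x0011750101710000"], pkey))
```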
2. Elastic Resource Management
Challenge: HPC workloads have unpredictable resource requirements
Innovation:
- Dynamic resource allocation with sub-second provisioning
- Intelligent scheduling predicting workload patterns
- Automatic scaling based on queue depth and SLAs (see the scaling sketch after this list)
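At its core, queue-depth scaling is a control loop over the scheduler's state. The sketch below polls standard Slurm tooling (`squeue`, `sinfo`) and hands scaling decisions to hypothetical `provision_nodes`/`release_nodes` hooks; the thresholds are illustrative and this is the shape of the logic, not the production scheduler.

```python
# Sketch of a queue-depth-driven scaling loop for a Slurm cluster.
# provision_nodes()/release_nodes() are hypothetical hooks into the
# bare-metal provisioning layer; thresholds are illustrative.
import subprocess
import time

PENDING_JOBS_PER_NODE = 4   # scale up when backlog exceeds this per idle node
IDLE_NODE_FLOOR = 2         # keep a small warm pool

def pending_jobs() -> int:
    out = subprocess.run(
        ["squeue", "-h", "-t", "PD", "-o", "%i"],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(out.splitlines())

def idle_nodes() -> int:
    out = subprocess.run(
        ["sinfo", "-h", "-t", "idle", "-o", "%D"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(int(line) for line in out.split())

def provision_nodes(n: int) -> None: ...   # hand off to the bare-metal layer
def release_nodes(n: int) -> None: ...     # drain and return idle capacity

def reconcile() -> None:
    backlog, idle = pending_jobs(), idle_nodes()
    if backlog > idle * PENDING_JOBS_PER_NODE:
        provision_nodes((backlog // PENDING_JOBS_PER_NODE) - idle)
    elif backlog == 0 and idle > IDLE_NODE_FLOOR:
        release_nodes(idle - IDLE_NODE_FLOOR)

if __name__ == "__main__":
    while True:
        reconcile()
        time.sleep(30)
```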
3. Bare Metal Provisioning
Challenge: Containers/VMs add unacceptable overhead for HPC
Innovation:
- Custom bare-metal provisioning integrated with OpenStack Ironic
- 2-minute deployment of fully configured HPC nodes
- Zero-overhead multi-tenancy (a provisioning sketch follows this list)
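With Ironic sitting behind the compute API, a bare-metal node is requested like any other server by using a bare-metal flavor, and the golden image makes it job-ready on first boot. A minimal sketch with openstacksdk follows; the cloud name, flavor, image, and network names are placeholders, not HPCFLOW's real configuration.

```python
# Minimal sketch: request a bare-metal HPC node through the OpenStack API.
# Cloud/flavor/image/network names are placeholders.
import openstack

conn = openstack.connect(cloud="hpcflow")  # reads credentials from clouds.yaml

flavor = conn.compute.find_flavor("bm.hpc.xeon-gold")   # bare-metal flavor
image = conn.compute.find_image("hpc-node-golden")      # Packer-built image
network = conn.network.find_network("tenant-fabric")

server = conn.compute.create_server(
    name="hpc-node-001",
    flavor_id=flavor.id,
    image_id=image.id,
    networks=[{"uuid": network.id}],
)

# Ironic writes the golden image to disk; the node joins Slurm on boot.
server = conn.compute.wait_for_server(server, wait=1200)
print(server.status)
```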
Platform Architecture
User Layer: Web Portal | API | CLI
↓
Control Plane: Authentication | Billing | Monitoring
↓
Orchestration: Slurm | Custom Schedulers | OpenStack
↓
Fabric Layer: Omni-Path vFabrics | InfiniBand Partitioning
↓
Compute Layer: Bare Metal Nodes | GPU Clusters | Storage
Phase 1: Founding (2016-2018) — Advania
As sole founding engineer, I designed and built HPCFLOW from a blank repository:
- Platform stack: OpenStack (Ironic for bare metal), Packer for golden images, Slurm for workload management, Ceph for distributed storage
- Bare metal provisioning: Automated HPC cluster deployment from golden images using HPE's Cluster Management Utility (CMU); the image pipeline is sketched after this list
- Early Kubernetes: Adopted Kubernetes v1.0 in 2016 for internal services and monitoring — one of the earliest production deployments
- Customer delivery: Led pre-sales and solution architecture, enabling enterprise clients to migrate to IaaS-based HPC at on-premises performance levels
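The image pipeline itself was a short chain: Packer builds the golden image, the result is registered with the image service, and nodes are booted from it. A condensed sketch using the standard CLIs is below; the template, image, flavor, and network names are illustrative placeholders, not the actual HPCFLOW artifacts.

```python
# Condensed sketch of the golden-image pipeline: Packer build -> image
# registration -> bare-metal server boot. Names are illustrative.
import subprocess

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Build the golden HPC node image (Slurm, fabric drivers, monitoring baked in).
run(["packer", "build", "hpc-node.json"])

# 2. Register the resulting image with the OpenStack image service.
run([
    "openstack", "image", "create",
    "--disk-format", "qcow2",
    "--container-format", "bare",
    "--file", "output/hpc-node.qcow2",
    "hpc-node-golden",
])

# 3. Boot a bare-metal node from it (the bare-metal flavor maps to Ironic nodes).
run([
    "openstack", "server", "create",
    "--flavor", "bm.hpc.xeon-gold",
    "--image", "hpc-node-golden",
    "--network", "tenant-fabric",
    "hpc-node-001",
])
```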
Phase 2: Scale (2018-2021) — atNorth / Advania Data Centers
As CTO when HPCFLOW was adopted by Advania Data Centers (later atNorth), I scaled the platform to multi-regional HPCaaS:
Technical Breakthroughs
Omni-Path vFabric Multi-tenancy — Pioneered network isolation at the Omni-Path fabric layer, integrating vFabric support directly with OpenStack Neutron's port allocation model. This solved what the industry considered an unsolvable problem: true hardware-level tenant isolation without a performance penalty.
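The Neutron side of this integration roughly follows the ML2 mechanism-driver pattern: when a port lands on a tenant network, the driver mirrors that change into fabric-level partition membership. The skeleton below assumes neutron_lib's MechanismDriver interface and a hypothetical fabric-manager client; it illustrates the shape of the integration, not HPCFLOW's shipped driver.

```python
# Skeleton of an ML2 mechanism driver that mirrors Neutron port bindings
# into fabric-level (vFabric/partition) membership. FabricManagerClient is
# a hypothetical client; the real integration points will differ.
from neutron_lib.plugins.ml2 import api


class FabricManagerClient:
    """Hypothetical client for the fabric manager's admin interface."""

    def add_member(self, partition: str, host: str) -> None: ...
    def remove_member(self, partition: str, host: str) -> None: ...


class OmniPathMechanismDriver(api.MechanismDriver):
    def initialize(self):
        self.fm = FabricManagerClient()

    def _partition_for(self, context):
        # One fabric partition per tenant network (illustrative naming).
        return "tenant-%s" % context.network.current["id"]

    def create_port_postcommit(self, context):
        host = context.current.get("binding:host_id")  # node hosting the port
        if host:
            # In practice the host would be resolved to its fabric adapter GUID.
            self.fm.add_member(self._partition_for(context), host)

    def delete_port_postcommit(self, context):
        host = context.current.get("binding:host_id")
        if host:
            self.fm.remove_member(self._partition_for(context), host)
```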
Custom Switch Implementations — Developed two switch device implementations for OpenStack's networking-generic-switch ML2 plugin, enabling multi-tenant bare-metal networking for HPE FlexFabric and Cumulus Linux environments.
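In networking-generic-switch, a switch implementation is essentially a device class that supplies the vendor CLI commands for VLAN and port operations; the plugin decides when to run them as Ironic ports are bound. The outline below follows that netmiko-device pattern with illustrative Comware-style commands for FlexFabric; attribute names and command strings vary by release, so treat it as a sketch rather than the shipped driver.

```python
# Outline of a networking-generic-switch device class. The class attributes
# follow the netmiko-device pattern used by the plugin; the CLI command
# strings are illustrative (Comware-style), not verbatim from the driver.
from networking_generic_switch.devices import netmiko_devices


class HpeFlexFabric(netmiko_devices.NetmikoSwitch):
    """Drive an HPE FlexFabric switch over its CLI for tenant VLANs."""

    ADD_NETWORK = (
        "vlan {segmentation_id}",
        "name ngs-{network_id}",
    )

    DELETE_NETWORK = (
        "undo vlan {segmentation_id}",
    )

    PLUG_PORT_TO_NETWORK = (
        "interface {port}",
        "port access vlan {segmentation_id}",
    )

    DELETE_PORT = (
        "interface {port}",
        "undo port access vlan",
    )
```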
Intel Select Solutions — Achieved Intel Select Solutions for High-Performance Computing certification, validating platform performance against Intel's reference architectures.
Operations
- Architected and managed distributed storage on community Ceph and Red Hat Ceph Storage across multi-year operational cycles
- Led pre-sales and solution architecture for enterprise HPC requirements
- Presented at ISC High Performance (Hamburg) and Supercomputing (US) conferences
- Multi-year collaboration with HPE HPC team in Grenoble (Centre of Excellence)
Notable Deployments
- Stanford Living Heart Project (with UberCloud): Provided HPCFLOW infrastructure for Stanford University's breakthrough cardiac simulation research — enabling detailed finite-element models of the human heart. The project won Cloud HPC Awards from Intel, HPCWire, and Hyperion at SC17.
- Human Brain Project (UberCloud Experiment #200): Led HPC infrastructure provision for personalized clinical treatment simulations for schizophrenia and Parkinson's disease research.
Technologies Mastered
- Slurm workload management
- InfiniBand and Omni-Path fabrics
- OpenStack orchestration
- Bare metal provisioning
- Multi-tenant security
- Usage-based billing systems (metering sketch below)
- Distributed systems
- Performance monitoring
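Usage-based billing in a Slurm environment largely reduces to metering the scheduler's accounting data. A small sketch follows, assuming standard `sacct` fields and an illustrative flat CPU-hour rate; it is not the platform's actual rate card or billing pipeline.

```python
# Sketch of usage metering from Slurm accounting: aggregate CPU-hours per
# account over a billing window. The rate is illustrative, not a real price.
import subprocess
from collections import defaultdict

RATE_PER_CPU_HOUR = 0.05  # illustrative flat rate

def cpu_hours_by_account(start: str, end: str) -> dict[str, float]:
    out = subprocess.run(
        ["sacct", "-a", "-X", "-n", "-P",
         "-S", start, "-E", end,
         "-o", "Account,ElapsedRaw,AllocCPUS"],
        capture_output=True, text=True, check=True,
    ).stdout
    usage: dict[str, float] = defaultdict(float)
    for line in out.splitlines():
        account, elapsed, cpus = line.split("|")
        if elapsed and cpus:
            usage[account] += int(elapsed) * int(cpus) / 3600.0
    return usage

if __name__ == "__main__":
    for account, hours in cpu_hours_by_account("2020-01-01", "2020-02-01").items():
        print(f"{account}: {hours:.1f} CPU-hours -> ${hours * RATE_PER_CPU_HOUR:.2f}")
```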
HPCFLOW demonstrated that with the right technical innovation and business model, it's possible to democratize access to even the most complex computing infrastructure.