HPC Platform for Quantitative Trading — Built from Zero

Executive Summary

As the founding engineer of Millennium Management's HPC platform, I led the architecture decisions and delivered the platform from concept to production. I defined the technical roadmap for quantitative trading infrastructure and coordinated the work across trading technology, compliance, and operations teams.

The Challenge

Quantitative research and trading workloads are compute-intensive and bursty. Portfolio managers and researchers need:

Instant Access: On-demand computational resources without infrastructure complexity
Massive Scale: Ability to run thousands of parallel simulations
Regulatory Compliance: Strict data residency and security requirements
Cost Efficiency: Optimal resource utilization across on-premises and cloud
24/7 Reliability: Zero-downtime operations for global markets

Traditional HPC approaches couldn't meet these demands while maintaining the agility required in fast-moving financial markets.

The Solution

Hybrid Cloud Architecture

We designed a hybrid cloud HPC platform that integrates:

On-Premise Core: Low-latency compute for time-sensitive trading algorithms
Cloud Burst: Elastic scaling to major cloud providers for research workloads
Unified Management: Single pane of glass for all computational resources
Self-Service Portal: Researchers provision resources without IT intervention

What Was Built from Scratch

Automation Framework

Built a 20-role Ansible automation framework covering the full HPC stack, from OS provisioning through SLURM and all auxiliary platform services. Wrote custom APIs for dynamic inventory management and service discovery, which replaced manual processes across the entire platform lifecycle.
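
As an illustration of the dynamic inventory idea, here is a minimal Python sketch of an Ansible inventory script. It is not the platform's actual code: the host records, role names, and site labels are hypothetical, and a real script would presumably query the inventory API instead of a hard-coded list. The only fixed part is the JSON layout Ansible expects from an inventory script called with `--list` (groups with a `hosts` list, plus `_meta.hostvars`).

```python
import json
from collections import defaultdict

# Hypothetical host records; a real script would fetch these from the
# platform's inventory API rather than hard-coding them.
HOSTS = [
    {"name": "ctl01", "role": "slurm_controller", "site": "onprem"},
    {"name": "cpu001", "role": "slurm_compute", "site": "onprem"},
    {"name": "gpu001", "role": "slurm_compute", "site": "aws"},
]


def build_inventory(hosts):
    """Return inventory in the JSON layout Ansible expects from `--list`."""
    groups = defaultdict(lambda: {"hosts": []})
    hostvars = {}
    for host in hosts:
        # Group each host by role and by site so playbooks can target either.
        groups[host["role"]]["hosts"].append(host["name"])
        groups[host["site"]]["hosts"].append(host["name"])
        hostvars[host["name"]] = {"site": host["site"]}
    inventory = dict(groups)
    # `_meta.hostvars` lets Ansible skip a separate `--host` call per host.
    inventory["_meta"] = {"hostvars": hostvars}
    return inventory


if __name__ == "__main__":
    print(json.dumps(build_inventory(HOSTS), indent=2, sort_keys=True))
```

Saved as an executable file, a script like this can be passed to `ansible-playbook -i`, so roles target groups such as `slurm_compute` or `aws` without a hand-maintained hosts file.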

SLURM Platform

Deployed SLURM v25.05.5 with munge authentication and enroot containerization for reproducible workloads. Architected multi-partition hybrid clusters integrating on-premises and public cloud resources — including a custom AWS EC2 plugin with multi-role support and elastic capacity management.
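
To make the hybrid partition idea concrete, the sketch below renders a `slurm.conf` fragment for a cloud-backed partition. It assumes Slurm's standard power-saving mechanism, where nodes declared with `State=CLOUD` are created and released through `ResumeProgram` and `SuspendProgram`; the custom EC2 plugin described above may work differently. The `CloudPartition` type, the node prefix, and all sizes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CloudPartition:
    """Hypothetical description of one cloud-backed partition."""
    name: str            # partition name
    node_prefix: str     # prefix for generated node names
    node_count: int      # maximum number of burst nodes
    cpus: int            # CPUs per node
    real_memory_mb: int  # usable memory per node, in MB
    feature: str         # free-form tag, e.g. the instance family


def render_partition(p: CloudPartition) -> str:
    """Render NodeName and PartitionName lines for slurm.conf."""
    if not 1 <= p.node_count <= 999:
        raise ValueError("node_count must be between 1 and 999")
    if p.node_count == 1:
        nodes = f"{p.node_prefix}001"
    else:
        # Slurm hostlist expression, e.g. aws-c[001-016].
        nodes = f"{p.node_prefix}[001-{p.node_count:03d}]"
    node_line = (
        f"NodeName={nodes} CPUs={p.cpus} RealMemory={p.real_memory_mb} "
        f"Features={p.feature} State=CLOUD"
    )
    part_line = f"PartitionName={p.name} Nodes={nodes} Default=NO State=UP"
    return "\n".join([node_line, part_line])


if __name__ == "__main__":
    print(render_partition(CloudPartition("burst", "aws-c", 16, 8, 30000, "c5")))
```

Generating these lines from structured data rather than editing them by hand keeps on-premises and cloud partitions consistent and lets an automation role template them.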

Self-Service Portal

Built a full-stack self-service platform: Go-based REST APIs and CLIs plus a React web interface that let portfolio managers and technical teams provision and configure federated SLURM clusters and partitions on demand. Eliminated the need for operations team involvement in routine cluster provisioning.
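
Self-service provisioning only works if requests are checked before they reach the cluster. The following is a rough sketch of that validation step, written in Python for brevity even though the portal's APIs are Go-based. The request fields, the naming rule, the `MAX_NODES` quota, and the site names are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical naming rule: lowercase, starts with a letter, 2 to 31 chars.
NAME_RE = re.compile(r"^[a-z][a-z0-9-]{1,30}$")
MAX_NODES = 64                 # hypothetical per-request quota
ALLOWED_SITES = ("onprem", "aws")


@dataclass(frozen=True)
class PartitionRequest:
    """Hypothetical payload for a 'create partition' API call."""
    cluster: str
    partition: str
    nodes: int
    site: str


def validate(req: PartitionRequest) -> list:
    """Return a list of human-readable errors; empty means acceptable."""
    errors = []
    for field in ("cluster", "partition"):
        if not NAME_RE.match(getattr(req, field)):
            errors.append(f"{field}: invalid name")
    if not 1 <= req.nodes <= MAX_NODES:
        errors.append(f"nodes: must be between 1 and {MAX_NODES}")
    if req.site not in ALLOWED_SITES:
        errors.append("site: must be 'onprem' or 'aws'")
    return errors


if __name__ == "__main__":
    print(validate(PartitionRequest("research", "backtest", 8, "aws")))
```

Returning every error at once, rather than failing on the first, suits a web form or CLI where the user fixes the whole request in one pass.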

Observability

Standardized the monitoring platform across all compute using node, process, cgroup, and DCGM exporters feeding Prometheus and Grafana. Wrote operational runbooks for all platform components (slurmctld, slurmdbd, slurmrestd, munge), backed by change management processes.
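
One way to keep scrape targets in step with an elastic cluster is Prometheus file-based service discovery. The sketch below generates a `file_sd` target list from host records. It assumes the exporters listen on their usual default ports (9100 for the node exporter, 9400 for the DCGM exporter) and that only GPU hosts run DCGM; the host records and the `exporter` label are hypothetical, and the process and cgroup exporters are left out because their ports depend on the deployment.

```python
import json

# Assumed default listen ports; adjust to the actual deployment.
EXPORTER_PORTS = {"node": 9100, "dcgm": 9400}


def file_sd_targets(hosts):
    """Build a Prometheus file_sd document: a list of target groups."""
    entries = []
    for exporter, port in EXPORTER_PORTS.items():
        # Every host runs the node exporter; only GPU hosts run DCGM.
        selected = [h for h in hosts if exporter != "dcgm" or h["gpu"]]
        if selected:
            entries.append({
                "targets": [f"{h['name']}:{port}" for h in selected],
                "labels": {"exporter": exporter},
            })
    return entries


if __name__ == "__main__":
    hosts = [
        {"name": "cpu001", "gpu": False},
        {"name": "gpu001", "gpu": True},
    ]
    print(json.dumps(file_sd_targets(hosts), indent=2))
```

Prometheus re-reads files referenced by `file_sd_configs` when they change, so rewriting this file as nodes join or leave updates the scrape targets without a restart.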

Technology Stack

Workload Management: SLURM v25.05.5, munge, enroot
Automation: Ansible (20-role framework), cloud-init
Cloud: AWS (custom EC2 plugin), hybrid on-prem/cloud
Self-Service: Go APIs/CLIs, React, TypeScript
Observability: Prometheus, Grafana, node/process/cgroup/DCGM exporters
Languages: Go, Python, Bash

Note: Specific performance metrics and architectural details have been generalized to respect confidentiality while demonstrating the scale and impact of the work.