Executive Summary
As the founding engineer of Millennium Management's HPC platform, I led its architecture decisions and delivered the platform from concept to production. I defined the technical roadmap for quantitative trading infrastructure and coordinated across trading technology, compliance, and operations teams.
The Challenge
Modern quantitative finance demands large amounts of compute, available on short notice. Portfolio managers and researchers need:
- Instant Access: On-demand computational resources without infrastructure complexity
- Massive Scale: Ability to run thousands of parallel simulations
- Regulatory Compliance: Strict data residency and security requirements
- Cost Efficiency: Optimal resource utilization across on-premises and cloud
- 24/7 Reliability: Zero-downtime operations for global markets
Traditional HPC approaches couldn't meet these demands while maintaining the agility required in fast-moving financial markets.
The Solution
Hybrid Cloud Architecture
We designed a hybrid cloud HPC platform that integrates:
- On-Premises Core: Low-latency compute for time-sensitive trading algorithms
- Cloud Burst: Elastic scaling to major cloud providers for research workloads
- Unified Management: Single pane of glass for all computational resources
- Self-Service Portal: Researchers provision resources without IT intervention
What Was Built from Scratch
Automation Framework
A 20-role Ansible automation framework covers the full HPC stack, from OS provisioning through SLURM and all auxiliary platform services. Custom-built APIs for dynamic inventory management and service discovery replace manual processes across the entire platform lifecycle.
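As an illustration of what dynamic inventory means in practice, the sketch below builds the JSON document that Ansible expects from an inventory script invoked with `--list`. It is not the platform's actual API: the node records, their keys (`name`, `role`, `address`), and the function name `build_inventory` are assumptions made for this example. Only the output layout (groups with a `hosts` list, plus `_meta.hostvars`) follows Ansible's documented inventory-script format.

```python
import json


def build_inventory(nodes):
    """Build an Ansible dynamic-inventory document from node records.

    Each record is a dict with hypothetical keys: "name", "role", "address".
    Hosts are grouped by role so playbooks can target a role by group name.
    """
    inventory = {"_meta": {"hostvars": {}}}
    for node in nodes:
        # Create the group for this role on first use, then add the host.
        group = inventory.setdefault(node["role"], {"hosts": []})
        group["hosts"].append(node["name"])
        # Per-host variables live under _meta.hostvars.
        inventory["_meta"]["hostvars"][node["name"]] = {
            "ansible_host": node["address"],
        }
    return inventory


if __name__ == "__main__":
    # Hypothetical nodes; a real source would be a CMDB or service registry.
    example_nodes = [
        {"name": "ctl01", "role": "slurmctld", "address": "10.0.0.10"},
        {"name": "cn001", "role": "compute", "address": "10.0.1.1"},
    ]
    print(json.dumps(build_inventory(example_nodes), indent=2))
```

Grouping hosts by role keeps the inventory aligned with a role-per-service layout, so adding a node to the source of truth is enough for the matching roles to apply to it.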
SLURM Platform
Deployed SLURM v25.05.5 with munge authentication and enroot containerization for reproducible workloads. Architected multi-partition hybrid clusters integrating on-premises and public cloud resources, including a custom AWS EC2 plugin with multi-role support and elastic capacity management.
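Elastic capacity management comes down to deciding how many cloud nodes to start when demand exceeds what the on-premises partitions can absorb. The sketch below shows one simple sizing rule under stated assumptions. It is not the logic of the custom EC2 plugin, which is not described here, and every name and parameter in it is hypothetical.

```python
import math


def cloud_nodes_to_resume(pending_cpus, idle_onprem_cpus, cpus_per_cloud_node,
                          running_cloud_nodes, max_cloud_nodes):
    """Return how many additional cloud nodes to start for pending demand.

    All parameters are hypothetical inputs for this sketch; a real system
    would derive them from scheduler state and cloud provider quotas.
    """
    if cpus_per_cloud_node <= 0:
        raise ValueError("cpus_per_cloud_node must be positive")
    # Demand that idle on-premises capacity cannot absorb.
    overflow = max(0, pending_cpus - idle_onprem_cpus)
    # Round up: a partially used node still has to be started.
    wanted = math.ceil(overflow / cpus_per_cloud_node)
    # Never exceed the configured ceiling on cloud capacity.
    headroom = max(0, max_cloud_nodes - running_cloud_nodes)
    return min(wanted, headroom)


if __name__ == "__main__":
    print(cloud_nodes_to_resume(pending_cpus=500, idle_onprem_cpus=100,
                                cpus_per_cloud_node=64,
                                running_cloud_nodes=2, max_cloud_nodes=10))
```

Capping the result at a fixed ceiling is a deliberate choice in this sketch: it bounds cloud spend even when the pending queue grows without limit.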
Self-Service Portal
Built a full-stack self-service platform: Go-based REST APIs and CLIs plus a React web interface enabling portfolio managers and technical teams to provision and configure federated SLURM clusters and partitions on demand. This eliminated the need for operations team involvement in routine cluster provisioning.
Observability
Standardized monitoring across all compute, with node, process, cgroup, and DCGM exporters feeding into Prometheus/Grafana. Wrote comprehensive operational runbooks for all platform components (slurmctld, slurmdbd, slurmrestd, munge), together with change management processes.
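One common way to keep Prometheus in step with a changing set of compute nodes is file-based service discovery, where a tool writes target lists that Prometheus reloads. The sketch below generates such a target group. Whether the platform uses this mechanism is an assumption, as are the function name, the label names, and the ports (9100 and 9400 are the usual defaults for node_exporter and dcgm-exporter, but a deployment may differ).

```python
import json

# Assumed exporter ports; adjust to match the actual deployment.
EXPORTER_PORTS = {"node": 9100, "dcgm": 9400}


def file_sd_targets(hosts, exporter, port, partition):
    """Return a Prometheus file_sd document: a list of target groups.

    The "exporter" and "partition" labels are hypothetical names chosen
    for this sketch; label values must be strings.
    """
    return [{
        "targets": [f"{host}:{port}" for host in hosts],
        "labels": {"exporter": exporter, "partition": partition},
    }]


if __name__ == "__main__":
    groups = file_sd_targets(["cn001", "cn002"], "node",
                             EXPORTER_PORTS["node"], "onprem")
    # Prometheus would read this JSON from a path listed under file_sd_configs.
    print(json.dumps(groups, indent=2))
```

Attaching a partition label at discovery time lets dashboards and alerts separate on-premises nodes from cloud nodes without relying on hostname patterns.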
Technology Stack
- Workload Management: SLURM v25.05.5, munge, enroot
- Automation: Ansible (20-role framework), cloud-init
- Cloud: AWS (custom EC2 plugin), hybrid on-prem/cloud
- Self-Service: Go APIs/CLIs, React, TypeScript
- Observability: Prometheus, Grafana, node/process/cgroup/DCGM exporters
- Languages: Go, Python, Bash
Note: Specific performance metrics and architectural details have been generalized to respect confidentiality while demonstrating the scale and impact of the work.