HPC Platform for Quantitative Trading — Built from Zero

Millennium Management | 2023-Present | Lead HPC/Grid Engineer (Founding Engineer)
HPC, SLURM, Go, React, Ansible, Hybrid Cloud, Founding Engineer

Executive Summary

As the founding engineer of Millennium Management's HPC platform, I led its architecture decisions and delivered the complete platform from concept to production. That included defining the technical roadmap for quantitative trading infrastructure and coordinating across trading technology, compliance, and operations teams.

The Challenge

Quantitative finance depends on large amounts of compute that must be available quickly. Portfolio managers and researchers need:

  • Instant Access: On-demand computational resources without infrastructure complexity
  • Massive Scale: Ability to run thousands of parallel simulations
  • Regulatory Compliance: Strict data residency and security requirements
  • Cost Efficiency: Optimal resource utilization across on-premise and cloud
  • 24/7 Reliability: Zero-downtime operations for global markets

Traditional HPC approaches couldn't meet these demands while maintaining the agility required in fast-moving financial markets.

The Solution

Hybrid Cloud Architecture

We designed a hybrid cloud HPC platform that integrates:

  • On-Premise Core: Low-latency compute for time-sensitive trading algorithms
  • Cloud Burst: Elastic scaling to major cloud providers for research workloads
  • Unified Management: Single pane of glass for all computational resources
  • Self-Service Portal: Researchers provision resources without IT intervention

What Was Built from Scratch

Automation Framework

Built a 20-role Ansible automation framework covering the full HPC stack, from OS provisioning through SLURM and all auxiliary platform services. Wrote custom APIs for dynamic inventory management and service discovery, replacing manual processes across the entire platform lifecycle.
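To illustrate what dynamic inventory can look like, here is a minimal sketch in Go. It is not the production code. It assumes a hypothetical `Node` record (name, role, address) coming from some inventory source, and turns a list of such records into the JSON shape that Ansible expects from an inventory script called with `--list`: one object per group with a `hosts` array, plus a `_meta.hostvars` object. The host names, roles, and addresses are invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"sort"
)

// Node is a hypothetical record, as an inventory API might return it.
type Node struct {
	Name string
	Role string // used as the Ansible group name, e.g. "compute"
	Addr string
}

// group mirrors the per-group object in Ansible's inventory script JSON.
type group struct {
	Hosts []string `json:"hosts"`
}

// BuildInventory groups nodes by role and records each node's address
// under _meta.hostvars as ansible_host.
func BuildInventory(nodes []Node) map[string]any {
	groups := map[string]*group{}
	hostvars := map[string]map[string]string{}
	for _, n := range nodes {
		g, ok := groups[n.Role]
		if !ok {
			g = &group{}
			groups[n.Role] = g
		}
		g.Hosts = append(g.Hosts, n.Name)
		hostvars[n.Name] = map[string]string{"ansible_host": n.Addr}
	}
	inv := map[string]any{
		"_meta": map[string]any{"hostvars": hostvars},
	}
	for role, g := range groups {
		sort.Strings(g.Hosts) // stable output regardless of input order
		inv[role] = g
	}
	return inv
}

func main() {
	// Example data only; a real script would query the inventory source.
	nodes := []Node{
		{Name: "ctl01", Role: "controller", Addr: "10.0.0.10"},
		{Name: "cn002", Role: "compute", Addr: "10.0.1.2"},
		{Name: "cn001", Role: "compute", Addr: "10.0.1.1"},
	}
	// Arguments are ignored here; the inventory is always printed in full.
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	if err := enc.Encode(BuildInventory(nodes)); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Because the output includes `_meta`, Ansible does not need to call the script again with `--host` for each node.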

SLURM Platform

Deployed SLURM v25.05.5 with munge authentication and enroot containerization for reproducible workloads. Architected multi-partition hybrid clusters integrating on-premises and public cloud resources — including a custom AWS EC2 plugin with multi-role support and elastic capacity management.
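Elastic capacity management comes down to deciding how many cloud nodes to start for the work that is waiting. The sketch below shows one simple way to size a burst. It is not the logic of the EC2 plugin described above. The `BurstInput` fields, and the policy of covering only the CPU deficit up to a fixed cap, are assumptions made for the example.

```go
package main

import "fmt"

// BurstInput holds hypothetical scheduler and policy values.
type BurstInput struct {
	PendingCPUs       int // CPUs requested by queued jobs
	IdleCPUs          int // CPUs already up and idle (on-prem or cloud)
	CPUsPerCloudNode  int // size of one cloud node
	RunningCloudNodes int // cloud nodes currently up
	MaxCloudNodes     int // policy cap on cloud nodes
}

// CloudNodesToStart returns how many extra cloud nodes would cover the
// CPU deficit, rounded up to whole nodes and limited by the cap.
func CloudNodesToStart(in BurstInput) int {
	if in.CPUsPerCloudNode <= 0 {
		return 0
	}
	deficit := in.PendingCPUs - in.IdleCPUs
	if deficit <= 0 {
		return 0 // existing idle capacity is enough
	}
	// Ceiling division: a partial node still needs a whole node.
	needed := (deficit + in.CPUsPerCloudNode - 1) / in.CPUsPerCloudNode
	headroom := in.MaxCloudNodes - in.RunningCloudNodes
	if headroom < 0 {
		headroom = 0
	}
	if needed > headroom {
		return headroom
	}
	return needed
}

func main() {
	n := CloudNodesToStart(BurstInput{
		PendingCPUs:       100,
		IdleCPUs:          36,
		CPUsPerCloudNode:  16,
		RunningCloudNodes: 0,
		MaxCloudNodes:     10,
	})
	fmt.Println("cloud nodes to start:", n)
}
```

A production policy would likely also weigh job priority, memory and GPU requests, and how long nodes take to boot.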

Self-Service Portal

Built a full-stack self-service platform: Go-based REST APIs and CLIs plus a React web interface that let portfolio managers and technical teams provision and configure federated SLURM clusters and partitions on demand. This removed the operations team from routine cluster provisioning.
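A self-service provisioning endpoint could be sketched in Go as follows. This is an illustration using only the standard library, not the platform's actual API: the `/partitions` path, the `PartitionRequest` fields, and the validation rules are all assumptions. The handler validates the request and answers 202 Accepted, on the assumption that provisioning runs asynchronously after the request is accepted.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"regexp"
	"strings"
)

// PartitionRequest is a hypothetical request body.
type PartitionRequest struct {
	Name     string `json:"name"`
	Nodes    int    `json:"nodes"`
	Location string `json:"location"` // "onprem" or "cloud"
}

var nameRE = regexp.MustCompile(`^[a-z][a-z0-9_-]{0,31}$`)

// Validate rejects requests that should never reach the scheduler.
func (r PartitionRequest) Validate() error {
	switch {
	case !nameRE.MatchString(r.Name):
		return errors.New("name must start with a lowercase letter and use only a-z, 0-9, '_' or '-' (max 32 chars)")
	case r.Nodes < 1:
		return errors.New("nodes must be at least 1")
	case r.Location != "onprem" && r.Location != "cloud":
		return errors.New(`location must be "onprem" or "cloud"`)
	}
	return nil
}

func handlePartitions(w http.ResponseWriter, req *http.Request) {
	if req.Method != http.MethodPost {
		w.Header().Set("Allow", http.MethodPost)
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	var pr PartitionRequest
	dec := json.NewDecoder(req.Body)
	dec.DisallowUnknownFields() // catch typos in field names early
	if err := dec.Decode(&pr); err != nil {
		http.Error(w, "invalid JSON body", http.StatusBadRequest)
		return
	}
	if err := pr.Validate(); err != nil {
		http.Error(w, err.Error(), http.StatusUnprocessableEntity)
		return
	}
	// A real service would enqueue the provisioning work here.
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusAccepted)
	json.NewEncoder(w).Encode(pr)
}

func main() {
	// Exercise the handler in-process instead of opening a socket.
	body := strings.NewReader(`{"name":"research","nodes":4,"location":"cloud"}`)
	req := httptest.NewRequest(http.MethodPost, "/partitions", body)
	rec := httptest.NewRecorder()
	handlePartitions(rec, req)
	fmt.Println(rec.Code, strings.TrimSpace(rec.Body.String()))
}
```

Keeping validation in a method separate from the handler means a CLI and the web interface could share the same checks.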

Observability

Standardized monitoring across all compute nodes using node, process, cgroup, and DCGM exporters feeding Prometheus and Grafana. Wrote operational runbooks for every platform component (slurmctld, slurmdbd, slurmrestd, munge), backed by change management processes.

Technology Stack

  • Workload Management: SLURM v25.05.5, munge, enroot
  • Automation: Ansible (20-role framework), cloud-init
  • Cloud: AWS (custom EC2 plugin), hybrid on-prem/cloud
  • Self-Service: Go APIs/CLIs, React, TypeScript
  • Observability: Prometheus, Grafana, node/process/cgroup/DCGM exporters
  • Languages: Go, Python, Bash

Note: Specific performance metrics and architectural details have been generalized to respect confidentiality while demonstrating the scale and impact of the work.