The First Thing Finance Taught Me About Infrastructure Is That Microseconds Are Real
The Moment It Clicked
We had workloads at Millennium that should have run in a few minutes. Parallelised across multiple cores, they should have been faster. Instead, they were slower — minutes slower. Multiply that by thousands of these jobs scheduled daily, and the aggregate impact was crushing throughput across the cluster.
The root cause was memory bandwidth contention combined with non-optimal C-state settings. Cores waking from deep power-saving states were adding latency on every context switch. Parallel threads sharing memory channels were fighting for bandwidth instead of getting it. Each individual overhead was small — microseconds here, milliseconds there — but on jobs that only run for a few minutes, those overheads become a significant fraction of total runtime. Scale that to thousands of jobs and you've turned a fast cluster into a slow one.
That investigation is where I understood — viscerally, not just intellectually — that I had moved into a different category of infrastructure problem. In research HPC, a 1% slowdown on a seventy-two-hour simulation is a rounding error. In finance, a few minutes added to a few-minute job, multiplied by thousands, is a platform-level failure.
What Research HPC Gets to Ignore
In research HPC environments, the relevant timescales are hours and days. A climate simulation runs for seventy-two hours. A genomics pipeline runs overnight. The question is whether the job completes correctly and whether the cluster is utilised efficiently. Tail latency — the p99 or p99.9 of your response distribution — is not something most research HPC teams spend time on.
At Millennium, the relevant timescale is microseconds, and the operative concept is not average latency but jitter. A trading system with 5 microsecond average latency but 50 microsecond p99 is worse than a system with consistent 10 microsecond latency. The worst case matters more than the mean, because the worst case is when the market has moved and your order arrives late.
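A toy calculation makes the distinction concrete. The samples below are invented for illustration: system A is fast on average but has a heavy tail, system B is consistently slower but predictable.

```python
import statistics

# Invented latency samples, in microseconds (illustration only):
# system A: 5 us most of the time, but 2% of sends hit a 50 us tail event
# system B: a steady 10 us with no tail
system_a = [5.0] * 980 + [50.0] * 20
system_b = [10.0] * 1000

def p99(samples):
    """99th percentile by nearest-rank on the sorted samples."""
    s = sorted(samples)
    return s[int(0.99 * (len(s) - 1))]

print(f"A: mean {statistics.mean(system_a):.1f} us, p99 {p99(system_a):.1f} us")
print(f"B: mean {statistics.mean(system_b):.1f} us, p99 {p99(system_b):.1f} us")
# A wins on the mean (5.9 vs 10.0) and loses badly on the tail (50.0 vs 10.0)
```

On a trading path, B is the system you want, even though A looks better on a dashboard that only reports averages.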
This was the first real conceptual shift I had to make. I already understood latency. I did not yet fully understand latency variance as the primary design constraint.
Hardware Knowledge Transfers. Completely.
The good news for anyone moving from research HPC to finance HPC: the hardware fundamentals are identical. The physics does not change because the money is different.
NUMA topology is NUMA topology. In research HPC, I spent years making sure compute jobs had memory affinity — that workloads running on cores in NUMA node 0 were not reaching across the interconnect to memory in NUMA node 1. Each NUMA hop adds roughly 30–40 nanoseconds of latency. In a simulation that allocates memory once and then runs for hours, this is often acceptable. In a trading path that executes thousands of times per second, it is not. The constraint is the same; the tolerance is different.
CPU affinity and isolation also transfer directly. Pinning critical threads to isolated CPU cores — cores that the operating system scheduler does not touch — eliminates scheduling jitter. This is standard practice in research HPC for MPI applications where you want predictable process placement. In finance, the same technique is applied to trading threads. The kernel does not interrupt your critical path. Nothing else runs on those cores.
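A minimal sketch of the same idea at the process level, using the Linux-specific `sched_setaffinity` API (guarded so the snippet degrades gracefully elsewhere; the choice of core here is arbitrary, not a recommendation):

```python
import os

# Pin the current process to a single core from its allowed set,
# as an HPC scheduler or a trading runtime would for a critical thread.
if hasattr(os, "sched_setaffinity"):
    allowed = os.sched_getaffinity(0)   # CPUs this process may run on
    target = min(allowed)               # arbitrary pick, for illustration
    os.sched_setaffinity(0, {target})   # pin: the scheduler keeps us here
    print("pinned to CPU", target)
else:
    print("CPU affinity API not available on this platform")
```

In production the pinned cores would also be isolated from the scheduler at boot, so nothing else is eligible to run there in the first place.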
Cache hierarchy awareness is something research HPC practitioners tend to think about in the context of memory bandwidth and vectorisation. In finance it becomes acute because the working set of a trading algorithm often fits in L1 or L2, and the cost of an L3 cache miss, which means a trip to main memory at roughly 80–100 nanoseconds on a modern CPU, can matter. In practice, the engineers who arrive from research backgrounds already have the mental model. What changes is the acceptable threshold.
What I did not need to re-learn: the relationship between CPU frequency, memory latency, and throughput. The Intel Architecture Reference. How to read a numactl --hardware output and make decisions from it. These transferred immediately.
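As a sketch of that last point, here is one way to pull the node-distance (SLIT) matrix out of numactl --hardware output. The sample text below is illustrative, not captured from any specific machine:

```python
# Illustrative excerpt of `numactl --hardware` output (invented sample)
SAMPLE = """\
node distances:
node   0   1
  0:  10  21
  1:  21  10
"""

def parse_distances(text):
    """Parse the SLIT distance matrix into {node: [distances...]}."""
    lines = text.strip().splitlines()
    rows = lines[2:]  # skip the banner line and the column-header row
    matrix = {}
    for row in rows:
        parts = row.replace(":", "").split()
        matrix[int(parts[0])] = [int(x) for x in parts[1:]]
    return matrix

dist = parse_distances(SAMPLE)
# SLIT convention: 10 means local access; larger values are the
# relative cost of remote access (21 here, i.e. roughly 2.1x local)
print(dist)  # {0: [10, 21], 1: [21, 10]}
```

The decision that follows from a matrix like this is the one described above: keep a workload's threads and its memory on the same node.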
What Did Not Transfer: The Networking Model
In research HPC, I was used to InfiniBand. We ran it everywhere at Advania (HDR, and EDR before that), and the latency profile was one of the primary reasons. InfiniBand latency between hosts is around 1–2 microseconds, versus roughly 10–15 microseconds for standard Ethernet with a kernel network stack.
Finance pushed this further in a direction I had not worked with directly before: kernel bypass networking and RDMA used not for MPI message passing between compute nodes but for trading infrastructure — market data feeds, order management systems, risk engines. The principle is the same as InfiniBand RDMA, but the application layer is entirely different. Instead of moving large buffers between MPI ranks, you are moving small, latency-sensitive messages as fast as physically possible.
The kernel network stack adds overhead because every packet travels through multiple layers of software, crosses the user/kernel boundary twice, and involves context switches. Kernel bypass eliminates this. With RDMA-capable NICs and a framework like DPDK or Solarflare's OpenOnload, you can get network latency below 1 microsecond end-to-end within a data centre. The order of magnitude difference versus a standard TCP socket is real, and the engineering required to achieve it is significant.
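The boundary cost is easy to feel even from Python. This sketch times a minimal write syscall against a pure user-space store; the absolute numbers vary wildly by machine and include interpreter overhead, so the point is only the gap between the two:

```python
import os
import time

N = 100_000
fd = os.open(os.devnull, os.O_WRONLY)

# Crossing the user/kernel boundary on every iteration
t0 = time.perf_counter()
for _ in range(N):
    os.write(fd, b"x")
t_syscall = (time.perf_counter() - t0) / N

# Staying entirely in user space
buf = bytearray(1)
t0 = time.perf_counter()
for _ in range(N):
    buf[0] = 120
t_user = (time.perf_counter() - t0) / N

os.close(fd)
print(f"syscall path ~{t_syscall * 1e9:.0f} ns, user-space ~{t_user * 1e9:.0f} ns")
```

Kernel bypass is, in essence, the engineering required to make the network path look like the second loop rather than the first.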
Coming from HPC, I understood the InfiniBand side of this. The specific tooling and vendor ecosystem in finance was different — different NIC vendors, different tuning parameters, different operational assumptions. That took time to learn.
Why Physical Topology Still Matters
Something I expected to be less relevant in a cloud-friendly era turned out to matter enormously: where things are in the room.
Light travels roughly 30 centimetres per nanosecond in a vacuum, and about two-thirds of that (roughly 20 centimetres per nanosecond) through fibre. At the microsecond timescale, physical cable length between systems is measurable. A switch hop adds latency — typically 200–500 nanoseconds for a modern low-latency switch. Eliminating a switch hop by placing two systems in the same rack and connecting them directly with a DAC cable saves real time.
This is physics. You cannot optimise your way around it. For the most latency-sensitive paths, colocation in the same rack with direct-attach cabling is not a luxury — it is an architectural requirement.
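A back-of-envelope budget makes the point. Propagation speed is assumed at about two-thirds of c (roughly what fibre gives you; DAC copper is in the same range), and the per-hop cost is taken from the middle of the range quoted above:

```python
# Assumed propagation speed: ~20 cm per nanosecond (about 2/3 c)
C_FIBRE_M_PER_NS = 0.20

def path_latency_ns(cable_m, switch_hops, hop_ns=300):
    """One-way latency budget: cable propagation plus switch hops."""
    propagation = cable_m / C_FIBRE_M_PER_NS
    return propagation + switch_hops * hop_ns

# Same rack, 3 m DAC, no switch in the path
print(path_latency_ns(3, 0))    # 15.0 ns
# Across the room: 60 m of cable through two switches
print(path_latency_ns(60, 2))   # 900.0 ns
```

Two switch hops and a longer run cost more than fifty times the direct path, before any software has touched the packet.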
I had thought about physical topology in research HPC in terms of network hops for MPI communication. The same reasoning applies, just with tighter tolerances.
The Real Learning: Predictability Over Speed
The deepest thing finance HPC taught me is the distinction between fast and predictably fast.
In research HPC, if a parallel job takes 10% longer on a given run because of OS noise — interrupts, kernel daemons, occasional scheduling hiccups — it is usually acceptable. The job finishes. The results are correct. Nobody tracks the 10% variance across runs.
In a trading system, OS noise is a latency spike, and a latency spike at the wrong moment means a missed opportunity or a bad execution. The engineering effort is not just to make the system fast on average. It is to eliminate the sources of variance that cause the system to sometimes be slow. This includes:
- Disabling CPU frequency scaling (no more power-saving states that take microseconds to exit)
- Disabling hyperthreading on cores used for critical paths (logical threads sharing physical resources create interference)
- Tuning the kernel to reduce interrupt coalescing on network interfaces
- Isolating critical CPU cores from the scheduler entirely with isolcpus
- Configuring huge pages to eliminate TLB misses from the critical path
Several of these I had done in research HPC contexts — huge pages and CPU affinity are standard practice for MPI applications that need predictable performance. Others were new to me, or I had encountered them as optional optimisations rather than non-negotiable requirements.
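Several of these controls end up on the kernel command line. As an illustrative sketch only — the core ranges and page counts below are invented, and the right values are entirely workload- and hardware-specific:

```
isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7
default_hugepagesz=1G hugepagesz=1G hugepages=16
intel_pstate=disable idle=poll
```

isolcpus, nohz_full, and rcu_nocbs keep the scheduler, the timer tick, and RCU callbacks off the critical cores; the huge-page parameters pre-reserve 1 GiB pages at boot; idle=poll prevents deep C-state exit latency, at a significant cost in power.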
The specific numbers matter less than the order of magnitude. The "$X million per microsecond" calculations that get cited in this context are illustrative; the actual relationship between latency and profitability depends entirely on trading strategy, market conditions, and volume. But the directional claim is real: at the scale Millennium operates, infrastructure decisions have financial consequences that are measurable and meaningful. That is why a two-hour post-mortem for a 40 microsecond regression is not theatre. It is the appropriate response.
What Surprised Me
I expected the engineering culture to feel different from research HPC — more pressure, more urgency. That is true. What I did not expect was how much of the knowledge transferred directly.
The SLURM side of the work was immediately familiar. Job scheduling, partition management, resource allocation, fair-share accounting — these are identical whether you are scheduling molecular dynamics simulations or quantitative research workloads. The users are different (portfolio managers and quants rather than researchers), the SLAs are different, but the scheduler does not know or care.
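The partition definitions look the same on either side. An illustrative slurm.conf fragment — node names, time limits, and weights all invented here:

```
PartitionName=simulation Nodes=cn[001-064] MaxTime=72:00:00 State=UP
PartitionName=quant Nodes=qn[01-16] MaxTime=04:00:00 PriorityTier=10 State=UP
PriorityType=priority/multifactor
PriorityWeightFairshare=10000
```

Nothing in that fragment reveals whether the cluster runs molecular dynamics or backtests; only the names and the SLAs change.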
The HPC storage work — parallel filesystems, IOPS requirements, data locality — also transferred cleanly. The scale is different but the problems are recognisable.
What required the most adjustment was internalising the latency culture. Not just understanding that microseconds matter, but operating in an environment where they are the primary metric, where every architectural decision is evaluated against a latency budget, and where the acceptable answer to "is this fast enough?" is almost always "no, make it faster."
That adjustment took longer than learning any specific technology.
Skills That HPC Gives You for Free
If you are coming from research HPC and considering finance infrastructure, here is what you are walking in with that will matter immediately:
NUMA awareness. Most engineers who have not worked in HPC have never looked at a NUMA topology diagram. HPC practitioners think about it constantly.
The ability to read hardware specifications for what they actually mean. Memory bandwidth, cache sizes, interconnect latency — these numbers mean something concrete to an HPC engineer. In finance, they mean something too.
Comfort with low-level Linux tuning. cgroups, CPU isolation, huge pages, NUMA binding — standard HPC tooling that turns out to be directly applicable.
An intuition for where time goes. The question "what is actually slow and why?" is the same question in both environments. The acceptable answer threshold is different.
What finance added, for me, was the sharpness that comes from a tighter feedback loop. In research HPC, you find out a job was slow after it finishes. In finance, you find out in the post-mortem. The stakes make you more precise.