An inside look at the architecture, engineering trade-offs, and infrastructure decisions behind building a low-latency order execution system for high-volume financial markets.
Team Sahi
In financial markets, speed isn't a feature; it's the product. Every millisecond of delay is a missed opportunity, a wider spread, or a worse fill. When we set out to build our order execution platform, our target was single-digit millisecond latency for server-side order processing. This post walks through the architecture and engineering decisions that got us there.
Our platform processes orders across multiple exchange segments like equities and derivatives, each with its own protocols and risk rules. On a busy trading day, thousands of users are placing, modifying, and cancelling orders simultaneously.
When we talk about order execution latency, we mean the full server-side journey: from the moment the request hits our servers to the moment the order is successfully placed on the exchange. That latency budget has to cover request parsing, pre-trade validation, risk checks, and exchange submission. There's no room for chatty protocols, lock contention, or garbage collection pauses.
We chose Rust for the entire order execution stack. Not because it's trendy, but because the guarantees it provides directly map to our requirements:
No garbage collector. We control exactly when memory is allocated and freed. There are no GC pauses sneaking into our p99 latencies.
Zero-cost abstractions. Generics, traits, and enums compile down to monomorphic code with static dispatch — no vtable overhead on the hot path.
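To make the static-dispatch point concrete, here is a minimal illustration (the `Validator` trait and `MaxQtyCheck` type are hypothetical, not our actual API): because `run_check` is generic, the compiler emits a specialized copy per concrete type, and the call is resolved at compile time.

```rust
// Hypothetical example: a pre-trade check expressed as a trait.
// Because `run_check` is generic (no `dyn`), the compiler
// monomorphizes it per concrete validator type -- the call is
// statically dispatched, with no vtable lookup on the hot path.
trait Validator {
    fn validate(&self, qty: u64) -> bool;
}

struct MaxQtyCheck {
    limit: u64,
}

impl Validator for MaxQtyCheck {
    fn validate(&self, qty: u64) -> bool {
        qty <= self.limit
    }
}

// Compiles to a direct call to MaxQtyCheck::validate for V = MaxQtyCheck.
fn run_check<V: Validator>(v: &V, qty: u64) -> bool {
    v.validate(qty)
}
```

The same code written with `&dyn Validator` would pay an indirect call per check; the generic form costs nothing over hand-written code.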
Fearless concurrency. The type system catches data races at compile time, which means we can use lock-free structures aggressively without the anxiety of subtle threading bugs.
Parts of our stack integrate with C++ libraries via FFI (using the cxx crate). Rust's FFI story made this clean — we get native-speed marshalling without sacrificing safety in the rest of the codebase.
The critical path for a direct order is straightforward by design. Our order-writer service receives the HTTP request, validates it, and submits the order to the OMS — all within a single request lifecycle. Fewer hops, fewer serialization boundaries, fewer things that can add latency.
Here's what happens in that narrow latency window:
Request parsing and authentication — JWT validation and request deserialization via Axum.
Pre-trade validation — kill switch checks, market hour validation, quantity limits, and instrument eligibility. All backed by in-memory lookup tables. Microseconds.
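As a rough sketch of why this stage costs only microseconds (the field names and checks here are illustrative assumptions, not our exact rule set): every check is an in-memory read, so the whole pass is a few hash lookups and atomic loads.

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical sketch of a pre-trade validation pass. Every check is
// an in-memory read: an atomic flag for the kill switch, a pre-loaded
// set for instrument eligibility, and a plain comparison for limits.
struct PreTradeChecks {
    kill_switch: AtomicBool,            // flipped by ops, read lock-free
    eligible_instruments: HashSet<u32>, // pre-loaded at startup
    max_qty: u64,
}

impl PreTradeChecks {
    fn validate(&self, instrument: u32, qty: u64) -> Result<(), &'static str> {
        if self.kill_switch.load(Ordering::Relaxed) {
            return Err("kill switch engaged");
        }
        if !self.eligible_instruments.contains(&instrument) {
            return Err("instrument not eligible");
        }
        if qty == 0 || qty > self.max_qty {
            return Err("quantity out of bounds");
        }
        Ok(())
    }
}
```

Nothing in this path allocates or touches the network, which is what keeps the stage's latency both small and predictable.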
FFI handoff to the OMS — the order is marshalled across the FFI boundary into the OMS client library. This is a thin, non-blocking translation layer — the caller hands off the order and returns immediately without waiting for network I/O.
Order validation inside the OMS — the OMS performs its own validation pass: instrument token verification, order syntax checks (valid price ticks, quantity lots), circuit breaker limit enforcement, and settlement type constraints. These are all in-memory checks against pre-loaded exchange reference data.
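The syntax checks reduce to modular arithmetic against reference data. A simplified sketch (assuming prices are kept as integers, e.g. paise, to avoid floating-point tick comparisons; the struct is hypothetical):

```rust
// Hypothetical sketch of exchange order-syntax checks. With prices
// held as integers (paise), tick and lot validation is plain modular
// arithmetic against pre-loaded reference data -- no I/O involved.
struct InstrumentRef {
    tick_size: u64, // minimum price increment, in paise
    lot_size: u64,  // minimum tradable quantity
}

fn check_order_syntax(r: &InstrumentRef, price: u64, qty: u64) -> Result<(), &'static str> {
    if price == 0 || price % r.tick_size != 0 {
        return Err("price not on a valid tick");
    }
    if qty == 0 || qty % r.lot_size != 0 {
        return Err("quantity not a whole number of lots");
    }
    Ok(())
}
```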
Risk Management System (RMS) checks — the order passes through the integrated RMS, which runs margin sufficiency calculation (does the account have enough to cover this position?), position limit checks, and price band validations, all evaluated against pre-loaded risk parameters with no network calls or database lookups on the critical path.
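The margin-sufficiency piece, heavily simplified, is pure arithmetic over pre-loaded parameters. Real RMS logic is far richer than this sketch (which assumes a flat margin rate in basis points), but the shape is the point: no I/O on the critical path.

```rust
// Heavily simplified, hypothetical margin-sufficiency check:
// required margin = notional * margin rate, compared against the
// account's available margin. All inputs are pre-loaded in memory.
fn margin_sufficient(
    price_paise: u64,
    qty: u64,
    margin_rate_bps: u64, // e.g. 2000 bps = 20%
    available_paise: u64,
) -> bool {
    let notional = price_paise * qty;
    let required = notional * margin_rate_bps / 10_000;
    required <= available_paise
}
```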
Exchange routing and submission — once validation and risk checks clear, the order is routed to the appropriate exchange gateway, serialized into the exchange's wire protocol, and transmitted over dedicated leased lines directly to the exchange.
Our Rust layer keeps its own work minimal, validates fast, and hands off fast. The OMS + RMS layer is extraordinarily efficient, and keeping the path to it as direct as possible preserves that efficiency.
Locks are latency landmines. A single contended mutex in the hot path can blow past our latency budget. So we designed them out:
Lock-free reads on reference data using atomic swap primitives. Our instrument lookup tables, containing exchange metadata and instrument parameters, are loaded periodically and swapped in atomically. Readers never block, even during updates.
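The pattern looks roughly like this stdlib sketch. In production a crate such as `arc-swap` makes the load a single atomic pointer read; here readers take a short `RwLock` read guard only to clone the `Arc`, then work on their snapshot with no further coordination (the `InstrumentTable` type is illustrative).

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Stdlib sketch of the swap-on-update pattern for reference data.
struct InstrumentTable {
    current: RwLock<Arc<HashMap<u32, String>>>,
}

impl InstrumentTable {
    fn new(data: HashMap<u32, String>) -> Self {
        Self { current: RwLock::new(Arc::new(data)) }
    }

    // Readers grab an immutable snapshot; it stays valid even if a
    // new table is swapped in while they are still using it.
    fn snapshot(&self) -> Arc<HashMap<u32, String>> {
        Arc::clone(&self.current.read().unwrap())
    }

    // The periodic loader builds a fresh table off the hot path and
    // swaps it in; old snapshots are freed when their last reader drops.
    fn swap(&self, data: HashMap<u32, String>) {
        *self.current.write().unwrap() = Arc::new(data);
    }
}
```

The key property: a reader mid-lookup is never invalidated by an update, so updates can happen at any time without coordinating with the hot path.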
Lock-free event queues are pre-allocated at startup with generous capacity. No allocations, no locks, bounded memory. Producers never block, and the queues are sized to handle peak market activity with headroom to spare.
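A stdlib approximation of the producer side (a production system would use a pre-allocated lock-free ring buffer such as crossbeam's `ArrayQueue`, but the shape is the same): capacity is fixed up front, and a non-blocking `try_send` means producers never stall on a full queue.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

// Stdlib sketch of a bounded event queue with non-blocking producers.
// Capacity is fixed at creation; a full queue is an explicit
// back-pressure signal rather than a blocked producer thread.
fn demo() -> (usize, usize, usize) {
    let (tx, rx) = sync_channel::<u64>(4); // capacity fixed up front
    let mut accepted = 0;
    let mut dropped = 0;
    for event in 0..6u64 {
        match tx.try_send(event) {
            Ok(()) => accepted += 1,
            Err(TrySendError::Full(_)) => dropped += 1, // back-pressure, never block
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    drop(tx);
    let consumed = rx.iter().count();
    (accepted, dropped, consumed)
}
```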
Concurrent hash maps for request-response correlation across the FFI boundary. Multiple threads can insert and remove entries without coordination.
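As a stdlib illustration of the correlation-map idea (production code would use a concurrent map crate such as `dashmap`; sharding by key here approximates it, since threads touching different shards never contend on the same lock):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Sharded map sketch for request-response correlation across the FFI
// boundary: the request path inserts a pending entry, and the OMS
// callback thread removes it when the response arrives.
struct CorrelationMap {
    shards: Vec<Mutex<HashMap<u64, String>>>,
}

impl CorrelationMap {
    fn new(n: usize) -> Self {
        Self { shards: (0..n).map(|_| Mutex::new(HashMap::new())).collect() }
    }

    fn shard(&self, id: u64) -> &Mutex<HashMap<u64, String>> {
        &self.shards[(id as usize) % self.shards.len()]
    }

    // Called on the request path before crossing the FFI boundary.
    fn insert(&self, id: u64, pending: String) {
        self.shard(id).lock().unwrap().insert(id, pending);
    }

    // Called from the OMS callback thread when the response arrives;
    // removal hands ownership of the pending state back exactly once.
    fn complete(&self, id: u64) -> Option<String> {
        self.shard(id).lock().unwrap().remove(&id)
    }
}
```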
Atomic order ID generation. A single atomic fetch-and-add generates monotonically increasing IDs with zero allocation, no central ID service, and no network round-trip.
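The ID counter itself is a one-liner; this sketch shows the whole mechanism:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// One fetch_add per order: no allocation, no lock, no coordination
// beyond a single atomic instruction shared by all threads.
static NEXT_ORDER_ID: AtomicU64 = AtomicU64::new(1);

fn next_order_id() -> u64 {
    // Relaxed is sufficient here: we only need uniqueness and
    // monotonicity of the counter, not ordering against other memory.
    NEXT_ORDER_ID.fetch_add(1, Ordering::Relaxed)
}
```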
Once the OMS acknowledges an order, the post-trade machinery kicks in, but critically, none of it is in the order placement hot path. Order events, trade events, and position updates flow through a durable in-memory event stream and are consumed by a dedicated events-consumer service.
We use a durable in-memory datastore for both state and event streaming, giving us sub-millisecond access times with durability guarantees that survive node failures.
This gives us:
Ordered, durable event processing with consumer groups for at-least-once delivery.
Sub-millisecond stream latency — we track read/write latencies with histogram buckets starting at 1 microsecond.
Per-account event ordering — events for a given account are always processed in order, which is critical for maintaining position consistency.
Adaptive batching — single events are processed immediately, while bursts are batched efficiently to balance latency and throughput without adding artificial delay.
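The adaptive-batching behaviour can be sketched as a drain loop (an illustrative pattern, assuming an mpsc-style consumer, not the exact implementation): block only for the first event, then greedily take whatever is already queued.

```rust
use std::sync::mpsc::Receiver;

// Sketch of adaptive batching: wait for the first event (so a lone
// event is processed immediately, with no artificial delay), then
// drain whatever else has already arrived, up to a cap. A single
// event yields a batch of one; a burst yields one large batch.
fn next_batch(rx: &Receiver<u64>, max: usize) -> Vec<u64> {
    let mut batch = Vec::with_capacity(max);
    match rx.recv() {
        Ok(ev) => batch.push(ev),
        Err(_) => return batch, // channel closed
    }
    while batch.len() < max {
        match rx.try_recv() {
            Ok(ev) => batch.push(ev),
            Err(_) => break, // queue momentarily empty: ship what we have
        }
    }
    batch
}
```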
The separation is important: post-trade processing can take its time without affecting the next order placement. A separate reader service queries the same in-memory state to serve order book and position data to clients, and a WebSocket service pushes events to connected clients in real time, giving users immediate visibility into their order status without polling.
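One common way to get the per-account ordering described above (an assumption about the mechanism, not a description of our exact implementation) is to route all of an account's events to the same single-threaded consumer:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash each account ID to a fixed worker shard, so every event for a
// given account lands on the same single-threaded consumer and is
// therefore processed in arrival order, while different accounts
// still fan out across workers for throughput.
fn shard_for_account(account_id: &str, num_workers: usize) -> usize {
    let mut h = DefaultHasher::new();
    account_id.hash(&mut h);
    (h.finish() as usize) % num_workers
}
```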
Our order execution services run on dedicated instances, not Kubernetes. This is a deliberate choice. Container orchestration adds layers: scheduling, networking overlays, and sidecar proxies, each contributing microseconds of latency and unpredictability. Running on dedicated compute removes those layers.
Running on EC2 gives us:
Predictable performance. No noisy neighbors from pod co-scheduling, no container runtime overhead, and no CNI plugin latency.
Direct network access. No service mesh, no iptables rules, no kube-proxy NAT. The order writer talks to the OMS over a direct connection with minimal network hops.
Full control over the host. We tune kernel parameters, manage process affinity, and control exactly what runs on each machine.
Our non-latency-critical services (APIs, dashboards, analytics) do run on Kubernetes with all the orchestration benefits. But the order path stays on dedicated compute, where we control every layer of the stack.
Fast code means nothing if the order takes a detour through the public internet to reach the exchange. Our network path is designed to eliminate that entirely.
Orders travel from our AWS infrastructure over a private link into a data center, and from there over dedicated leased lines directly to the exchanges. No public internet hops touch the order path at any point.
This is an area where infrastructure investment directly translates to latency savings. A single hop through the public internet can add 10-50 ms of jitter, enough to blow through our entire budget on a bad day. By keeping the network path private and deterministic, we turn network latency from a variable into a near-constant.
Building a low-latency system isn't just about fast code. The genuinely hard challenges were:
In-memory state that has to be both fast and correct. Order state and position data live in a persistent in-memory datastore for speed, but consistency matters. A stale read can lead to incorrect validations and unexpected rejections. Keeping reads fast without sacrificing correctness requires careful design of the data model and use of atomic operations.
C++ FFI safety. The OMS integration uses a C++ library with callback-based event delivery. Wrapping this in safe Rust required careful management of the unsafe boundary and extensive testing to ensure thread safety guarantees held across the FFI layer.
Clock-sensitive ID generation. Distributed order IDs need to be unique, monotonic, and fast. We use instance-scoped atomic counters with date-prefixed formatting, but this means handling clock skew across instances and restart safety without a central coordination service.
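A simplified sketch of the date-prefixed scheme (illustrative only: stdlib has no calendar formatting, so days-since-epoch stands in for a date string, and `instance` is a hypothetical machine tag):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

// Instance-scoped, date-prefixed ID generation. The date prefix bounds
// the collision window to a single day, and the instance tag keeps IDs
// unique across machines without a central coordinator. Same-day
// restart safety and clock skew across instances still need handling
// (e.g. persisting or timestamp-seeding the counter) -- that is the
// hard part the prose above describes.
static COUNTER: AtomicU64 = AtomicU64::new(1);

fn next_id(instance: u16) -> String {
    let days = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before epoch")
        .as_secs()
        / 86_400;
    let seq = COUNTER.fetch_add(1, Ordering::Relaxed);
    format!("{days}-{instance}-{seq}")
}
```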
Keeping pre-trade validation fast as complexity grows. Every new product type, exchange segment, or regulatory requirement adds another check to the validation pipeline. The discipline is keeping these checks as in-memory lookups against pre-cached data, never as synchronous network calls. The moment a validation step reaches out over the network, the latency budget is at risk.
The real lesson from this journey isn't about any single optimization. It's that low latency is an architectural property, not a code-level trick. It comes from choosing the right data structures, eliminating locks on the critical path, running on dedicated infrastructure, and selecting a highly efficient OMS + RMS stack. Every layer of the stack has to agree on the same latency contract, from the choice to run on EC2 over Kubernetes all the way down to the network path to the exchange.
We're continuing to push the boundaries of what's possible in order execution latency, and we'll share more as we go.