Cutting provider-discovery latency from 50–60 ms to 3–5 ms on the critical path of every transaction - by splitting near-static reference data out of Postgres and into a NATS JetStream KV cache behind a new Go service.
Our platform runs payments through a cascade - a per-request, ordered list of eligible providers we try until one approves. The cascade is rebuilt from scratch on every order, with eligibility driven by terminal config, currency and method, tags, fee rules, deposit limits, and provider live state.
At our scale - ~400 TPS across ~200 providers, 10 countries, ~30 payment methods - that build sits squarely on the critical path of every transaction. The legacy implementation lived inside our Python cascade-core monolith as an in-process ProviderDiscoveryService. Each call fanned out into a series of small Postgres queries: external method joins, terminal blocks, accept/decline tags, fee rule scopes, per-provider deposit balances. End-to-end the build took 50–60 ms p50 before we ever sent a byte to a provider.
Providers take their own time to approve - every millisecond we add to the wrapper directly extends user-perceived latency and shortens the window in which we can retry.
The goal was to speed this up for cascade-core and pool-service callers without changes on their side. OpenTelemetry traces made the shape of the problem obvious: a single discover_providers span fanned out into 6–10 child DB spans, each only 3–8 ms but serialized. Most of those spans queried tables that change minutes-to-hours apart, not per-request - providers, terminals, fee rules, tags, blocks. The remaining real-time signals were provider liveness (rolling response-time metrics) and live deposit balances.
Two distinct workloads were tangled in one hot path. The root cause wasn't bad SQL or a missing index - it was that we were re-reading near-static reference data on every transaction.
I built cascade-builder, a new Go microservice that replaces the in-process discovery and exposes a single NATS request-reply subject (core.cascade.v2.discover-providers). The engine is a faithful Go port of the legacy ten-step pipeline - same filter order, same eligibility rules, same sort priority - so callers receive the same answers, just faster.
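To make "same filter order, same sort priority" concrete, an ordered pipeline can be modelled as a slice of filter stages applied in sequence, followed by a sort. The sketch below is a toy under stated assumptions, not the cascade-builder engine: the Provider and Request fields, the two example stages, and the TotalRate sort key are all hypothetical stand-ins for the real ten steps.

```go
package main

import (
	"fmt"
	"sort"
)

// Provider is a hypothetical, heavily simplified provider record.
type Provider struct {
	ID        string
	Currency  string
	Blocked   bool
	TotalRate float64 // hypothetical sort key; lower sorts first
}

// Request is a hypothetical discovery request.
type Request struct {
	Currency string
}

// Step is one named filter stage that keeps or drops a single provider.
type Step struct {
	Name string
	Keep func(req Request, p Provider) bool
}

// DefaultSteps returns two illustrative stages; the real pipeline has ten.
func DefaultSteps() []Step {
	return []Step{
		{Name: "currency", Keep: func(req Request, p Provider) bool { return p.Currency == req.Currency }},
		{Name: "not-blocked", Keep: func(_ Request, p Provider) bool { return !p.Blocked }},
	}
}

// Discover applies every step in slice order, then sorts the survivors.
func Discover(req Request, providers []Provider, steps []Step) []Provider {
	out := append([]Provider(nil), providers...) // copy so the caller's slice is untouched
	for _, s := range steps {
		kept := out[:0] // filter in place within our private copy
		for _, p := range out {
			if s.Keep(req, p) {
				kept = append(kept, p)
			}
		}
		out = kept
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].TotalRate < out[j].TotalRate })
	return out
}

func main() {
	providers := []Provider{
		{ID: "a", Currency: "EUR", TotalRate: 2.0},
		{ID: "b", Currency: "USD", TotalRate: 1.0},
		{ID: "c", Currency: "EUR", Blocked: true, TotalRate: 0.5},
		{ID: "d", Currency: "EUR", TotalRate: 1.5},
	}
	for _, p := range Discover(Request{Currency: "EUR"}, providers, DefaultSteps()) {
		fmt.Println(p.ID, p.TotalRate)
	}
}
```

Keeping the stages in a slice makes the order explicit data rather than control flow, which is one way to keep a port aligned with a legacy pipeline step by step.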
The architecture splits the workload by mutation rate:
Caller (cascade-core / pool-service)
               │  NATS request-reply
               ▼
┌──────────────────────────────┐
│   10-step filter pipeline    │
│       (in-memory, Go)        │
└──────┬──────────────┬────────┘
       ▼              ▼
 NATS KV cache   PostgreSQL
  (steps 1–8)   (step 9: live deposits)

The reference data in the KV cache is kept fresh by the core.entity.updated.* messages that cascade-core already emits on every write. A durable JetStream consumer in cascade-builder reloads the affected entity by ID - typically sub-second propagation. cascade-builder subscribes and keeps the snapshot in memory so the engine reads it in O(1).
One non-obvious tradeoff was worth surfacing explicitly:
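Returning to the in-memory snapshot described above: one common way to get O(1), lock-free reads in Go is to publish an immutable snapshot behind an atomic pointer and swap it on each update. This is a minimal sketch under that assumption, not the service's actual code; the Terminal type, the Cache API, and the copy-on-write strategy are hypothetical, and the NATS wiring that would call ApplyUpdate is deliberately left out.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Terminal is a hypothetical cached reference entity.
type Terminal struct {
	ID      string
	Blocked bool
}

// Snapshot is an immutable view of the reference data; it is never
// modified after being published, so readers need no lock.
type Snapshot struct {
	Terminals map[string]Terminal
}

// Cache holds the current snapshot behind an atomic pointer (Go 1.19+).
type Cache struct {
	mu  sync.Mutex // serializes writers only
	cur atomic.Pointer[Snapshot]
}

// NewCache returns a cache with an empty snapshot already published.
func NewCache() *Cache {
	c := &Cache{}
	c.cur.Store(&Snapshot{Terminals: map[string]Terminal{}})
	return c
}

// GetTerminal is the hot-path read: one atomic load plus one map lookup.
func (c *Cache) GetTerminal(id string) (Terminal, bool) {
	t, ok := c.cur.Load().Terminals[id]
	return t, ok
}

// ApplyUpdate is what an "entity updated" handler might call after
// reloading the entity by ID. It copies the map and swaps the pointer,
// so in-flight readers keep a consistent view of the old snapshot.
func (c *Cache) ApplyUpdate(t Terminal) {
	c.mu.Lock()
	defer c.mu.Unlock()
	old := c.cur.Load()
	next := make(map[string]Terminal, len(old.Terminals)+1)
	for k, v := range old.Terminals {
		next[k] = v
	}
	next[t.ID] = t
	c.cur.Store(&Snapshot{Terminals: next})
}

func main() {
	c := NewCache()
	c.ApplyUpdate(Terminal{ID: "t1"})
	c.ApplyUpdate(Terminal{ID: "t1", Blocked: true}) // later update replaces the entry
	t, ok := c.GetTerminal("t1")
	fmt.Println(t.ID, t.Blocked, ok)
}
```

Copying the map makes each write O(n), which is usually acceptable when the data changes minutes-to-hours apart and reads happen on every transaction.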
Shadow mode did double duty - proving both correctness and speed. For every production request, cascade-core kept calling the legacy path and additionally called cascade-builder, then logged a side-by-side comparison:
discovery_comparison: nats=2(14.2ms) legacy=2(91.7ms)
order_match=True missing_in_nats=0 extra_in_nats=0

We watched four signals in Grafana before flipping any traffic:
order_match - provider ordering identical between paths.
only_in_legacy / only_in_nats - set-difference of returned providers.
speedup_factor - legacy_ms / nats_ms.

Any divergence funneled back to either a documented known difference or a real bug to fix. Cutover was a config flag, with the legacy path one redeploy away.
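The comparison signals above are simple to compute from the two result lists and their timings. The following is a sketch of that calculation, assuming each path returns an ordered list of provider IDs plus a latency in milliseconds; the Comparison type and the Compare function are hypothetical names, not the harness that ran in production.

```go
package main

import "fmt"

// Comparison mirrors the signals named in the text.
type Comparison struct {
	OrderMatch    bool     // same providers in the same order
	OnlyInLegacy  []string // returned by the legacy path only
	OnlyInNATS    []string // returned by the new path only
	SpeedupFactor float64  // legacy_ms / nats_ms; 0 if nats_ms is not positive
}

// Compare builds the side-by-side comparison for one request.
func Compare(legacy, nats []string, legacyMs, natsMs float64) Comparison {
	c := Comparison{OrderMatch: len(legacy) == len(nats)}
	for i := 0; c.OrderMatch && i < len(legacy); i++ {
		c.OrderMatch = legacy[i] == nats[i]
	}
	inLegacy, inNATS := toSet(legacy), toSet(nats)
	for _, id := range legacy {
		if !inNATS[id] {
			c.OnlyInLegacy = append(c.OnlyInLegacy, id)
		}
	}
	for _, id := range nats {
		if !inLegacy[id] {
			c.OnlyInNATS = append(c.OnlyInNATS, id)
		}
	}
	if natsMs > 0 {
		c.SpeedupFactor = legacyMs / natsMs
	}
	return c
}

// toSet turns a list of IDs into a membership set.
func toSet(ids []string) map[string]bool {
	s := make(map[string]bool, len(ids))
	for _, id := range ids {
		s[id] = true
	}
	return s
}

func main() {
	// Timings taken from the sample log line; provider IDs are made up.
	c := Compare([]string{"p1", "p2"}, []string{"p1", "p2"}, 91.7, 14.2)
	fmt.Printf("order_match=%t only_in_legacy=%d only_in_nats=%d speedup_factor=%.1f\n",
		c.OrderMatch, len(c.OnlyInLegacy), len(c.OnlyInNATS), c.SpeedupFactor)
}
```

Tracking order and set membership separately is useful because the two fail differently: a sort-key change shows up as order_match=false with empty set differences, while a filter bug shows up in the set differences.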
Directionally: at ~400 TPS the old path spent ~20 cumulative DB-seconds per wall second on discovery alone. Most of that is reclaimed. The user-facing win is that the saved 40–55 ms now sits in our retry budget - when a provider declines, we can try the next one sooner without blowing past the client timeout.
Build took 3–4 weeks; the service has been in production for ~3 weeks with no rollbacks.
Two things shadow mode caught that I'd have missed otherwise: a couple of interval-boundary mismatches around fee-rule max_amount (closed vs. half-open), and an ordering difference because the new sorter ranks on total rate rather than provider fee alone. Both were intentional in the new design but would have looked like regressions in production without the side-by-side comparison log.
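The interval-boundary mismatch is easy to reproduce in isolation. The sketch below shows how a closed range and a half-open range agree everywhere except exactly at max_amount; the FeeRule type, its fields, and the amounts are hypothetical, and it makes no claim about which convention the legacy or the new engine uses.

```go
package main

import "fmt"

// FeeRule is a hypothetical fee rule with amounts in minor units.
type FeeRule struct {
	MinAmount int64
	MaxAmount int64
}

// MatchesClosed treats the range as [min, max]: MaxAmount itself matches.
func (r FeeRule) MatchesClosed(amount int64) bool {
	return amount >= r.MinAmount && amount <= r.MaxAmount
}

// MatchesHalfOpen treats the range as [min, max): MaxAmount itself does not match.
func (r FeeRule) MatchesHalfOpen(amount int64) bool {
	return amount >= r.MinAmount && amount < r.MaxAmount
}

func main() {
	r := FeeRule{MinAmount: 1000, MaxAmount: 5000}
	for _, amount := range []int64{4999, 5000, 5001} {
		fmt.Printf("amount=%d closed=%t half_open=%t\n",
			amount, r.MatchesClosed(amount), r.MatchesHalfOpen(amount))
	}
}
```

Only an order whose amount lands exactly on the boundary exposes the difference, which is why this kind of mismatch tends to surface in production traffic rather than in hand-written test cases.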
If I were starting over I'd build the comparison harness first, before any of the engine code. It was the single highest-leverage thing in the project and the only reason the cutover was boring.