Picunada

[ case study ]

Improving provider cascade performance

Cutting provider-discovery latency from 50–60 ms to 3–5 ms on the critical path of every transaction - by splitting near-static reference data out of Postgres and into a NATS JetStream KV cache behind a new Go service.

Company: Payments startup
Role: Backend Tech Lead
Period: 2026 · 3–4 weeks build · ~3 weeks in production
Stack: Go · Python · NATS JetStream · PostgreSQL · OpenTelemetry · Grafana

Problem

Our platform runs payments through a cascade - a per-request, ordered list of eligible providers we try until one approves. The cascade is rebuilt from scratch on every order, with eligibility driven by terminal config, currency and method, tags, fee rules, deposit limits, and provider live state.
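As an illustration of what "try until one approves" means, here is a minimal Python sketch. The provider names and the `attempt` callback are hypothetical stand-ins for a real provider call, and this is not the production code.

```python
from typing import Callable, Optional, Sequence


def run_cascade(providers: Sequence[str],
                attempt: Callable[[str], bool]) -> Optional[str]:
    """Try providers in cascade order; return the first that approves, else None."""
    for provider in providers:
        if attempt(provider):
            return provider
    return None


# Hypothetical example: only provider_b approves, so the cascade stops there.
approving = {"provider_b"}
winner = run_cascade(["provider_a", "provider_b", "provider_c"],
                     lambda p: p in approving)
```

Every declined attempt costs one provider round trip, which is why the time spent building the list matters before the first attempt even starts.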

At our scale - ~400 TPS across ~200 providers, 10 countries, ~30 payment methods - that build sits squarely on the critical path of every transaction. The legacy implementation lived inside our Python cascade-core monolith as an in-process ProviderDiscoveryService. Each call fanned out into a series of small Postgres queries: external method joins, terminal blocks, accept/decline tags, fee rule scopes, per-provider deposit balances. End to end, the build took 50–60 ms at p50 before we ever sent a byte to a provider.

Providers take their own time to approve, so every millisecond we add in the wrapper directly extends user-perceived latency and shortens the window in which we can retry.

Constraints

Drop-in replacement. Provider contracts and external API shape were fixed; the new service had to slot in for both cascade-core and pool-service callers without changes on their side.
Provably equivalent semantics. Filtering, ordering, fee resolution, and deposit checks all drive money movement; any silent drift surfaces as missing routes, mispriced orders, or lost revenue. The new service had to reproduce the ten-step pipeline exactly before traffic was cut over.
Compliance. No provider identity leaks to clients - provider data stays server-side.
Sign-off. Required from Ops (who would run the new service) and the CTO (for the architecture and the risk of replacing a load-bearing path).

Diagnosis

OpenTelemetry traces made the shape of the problem obvious: a single discover_providers span fanned out into 6–10 child DB spans, each only 3–8 ms but serialized. Most of those spans queried tables that change minutes-to-hours apart, not per-request - providers, terminals, fee rules, tags, blocks. The remaining real-time signals were provider liveness (rolling response-time metrics) and live deposit balances.

Two distinct workloads were tangled in one hot path. The root cause wasn't bad SQL or a missing index - it was that we were re-reading near-static reference data on every transaction.
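A quick back-of-the-envelope check, using only the span counts and durations quoted above, shows why serialization alone accounts for the observed latency. The helper name is illustrative.

```python
def serialized_total_ms(span_count: int, span_ms: float) -> float:
    """Total time when child spans run one after another, with no overlap."""
    return span_count * span_ms


best_case_ms = serialized_total_ms(6, 3)    # fewest spans, fastest queries
worst_case_ms = serialized_total_ms(10, 8)  # most spans, slowest queries
```

The 18–80 ms bracket this produces contains the measured 50–60 ms p50, so no single slow query is needed to explain the number.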

Solution

I built cascade-builder, a new Go microservice that replaces the in-process discovery and exposes a single NATS request-reply subject (core.cascade.v2.discover-providers). The engine is a faithful Go port of the legacy ten-step pipeline - same filter order, same eligibility rules, same sort priority - so, apart from a small set of documented differences, callers receive the same answers, just faster.
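The shape of such a pipeline can be sketched as an ordered list of pure functions over in-memory candidates. This is a Python illustration rather than the Go engine itself: the three steps, the `Provider` fields, and the tie-break on `id` are hypothetical examples, not the real ten steps.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Provider:
    id: str
    currencies: frozenset   # currencies this provider supports
    blocked: bool           # hypothetical stand-in for a terminal block
    total_rate: float       # hypothetical sort key


@dataclass(frozen=True)
class Request:
    currency: str


Step = Callable[[Request, List[Provider]], List[Provider]]


def by_currency(req: Request, candidates: List[Provider]) -> List[Provider]:
    return [p for p in candidates if req.currency in p.currencies]


def drop_blocked(req: Request, candidates: List[Provider]) -> List[Provider]:
    return [p for p in candidates if not p.blocked]


def sort_by_rate(req: Request, candidates: List[Provider]) -> List[Provider]:
    # Ties are broken by id so the output order is deterministic.
    return sorted(candidates, key=lambda p: (p.total_rate, p.id))


# Order matters: filters run first, the sort runs last.
PIPELINE: List[Step] = [by_currency, drop_blocked, sort_by_rate]


def discover(req: Request, candidates: List[Provider]) -> List[Provider]:
    for step in PIPELINE:
        candidates = step(req, candidates)
        if not candidates:
            break  # nothing left to filter or sort
    return candidates
```

Keeping each step a function of the request and the current candidate list makes the step order explicit, which is what a side-by-side comparison with a legacy implementation depends on.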

The architecture splits the workload by mutation rate:

Caller (cascade-core / pool-service)
        │  NATS request-reply
        ▼
┌──────────────────────────────┐
│   10-step filter pipeline    │
│   (in-memory, Go)            │
└──────┬──────────────┬────────┘
       ▼              ▼
 NATS KV cache    PostgreSQL
 (steps 1–8)      (step 9: live deposits)

Reference data (providers, terminals, external methods, fee rules, tags, blocks) is loaded into three NATS JetStream KV buckets at startup, with a 5-minute safety reload.
Real-time invalidation is driven by core.entity.updated.* messages that cascade-core already emits on every write. A durable JetStream consumer in cascade-builder reloads the affected entity by ID - typically sub-second propagation.
Provider response-time metrics (rolling 5-minute averages) are computed by our analytics service and published as periodic snapshots; cascade-builder subscribes and keeps the snapshot in memory so the engine reads it in O(1).
One DB query remains in the hot path - the live deposit balance check, which has to be authoritative.
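The caching behaviour described in the list above (load at startup, reload one entity on an update event, full reload as a safety net) can be sketched as follows. This is a Python illustration with a plain dict standing in for the KV buckets. The `load_all` / `load_one` callbacks, and the assumption that the last token of a core.entity.updated.* subject is the entity type, are hypothetical.

```python
import time
from typing import Callable, Dict, Iterable, Optional, Tuple

Key = Tuple[str, str]  # (entity_type, entity_id)


class ReferenceCache:
    def __init__(self,
                 load_all: Callable[[], Iterable[Tuple[Key, dict]]],
                 load_one: Callable[[str, str], Optional[dict]],
                 reload_interval_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load_all = load_all
        self._load_one = load_one
        self._interval = reload_interval_s
        self._clock = clock
        self._data: Dict[Key, dict] = {}
        self._loaded_at = 0.0
        self.reload_all()  # initial load at startup

    def reload_all(self) -> None:
        self._data = dict(self._load_all())
        self._loaded_at = self._clock()

    def on_entity_updated(self, subject: str, entity_id: str) -> None:
        # Assumption: the subject looks like "core.entity.updated.<entity_type>".
        entity_type = subject.rsplit(".", 1)[-1]
        fresh = self._load_one(entity_type, entity_id)
        if fresh is None:
            self._data.pop((entity_type, entity_id), None)  # entity was deleted
        else:
            self._data[(entity_type, entity_id)] = fresh

    def maybe_safety_reload(self) -> bool:
        """Full reload if the interval has elapsed; returns True when it ran."""
        if self._clock() - self._loaded_at >= self._interval:
            self.reload_all()
            return True
        return False

    def get(self, entity_type: str, entity_id: str) -> Optional[dict]:
        return self._data.get((entity_type, entity_id))
```

The safety reload exists to cover a missed or dropped event: in this sketch a stale entry lives for at most one interval even if its invalidation never arrives.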

One non-obvious tradeoff was worth surfacing explicitly:

Sub-second eventual consistency on reference data was acceptable for our use case, but had to be sold to ops as a deliberate choice rather than an artifact.

Rollout

Shadow mode did double duty - proving both correctness and speed. For every production request, cascade-core kept calling the legacy path and additionally called cascade-builder, then logged a side-by-side comparison:

discovery_comparison: nats=2(14.2ms) legacy=2(91.7ms)
                     order_match=True missing_in_nats=0 extra_in_nats=0
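A comparison like the one in that log line can be computed with a few set operations. The sketch below is illustrative Python, not the production harness. It assumes the numbers before the parentheses are provider counts, and the function names are hypothetical.

```python
from typing import Sequence


def compare_discovery(nats_ids: Sequence[str], legacy_ids: Sequence[str],
                      nats_ms: float, legacy_ms: float) -> dict:
    nats_set, legacy_set = set(nats_ids), set(legacy_ids)
    return {
        # Same providers in the same order.
        "order_match": list(nats_ids) == list(legacy_ids),
        # Providers the legacy path returned but the new path did not.
        "missing_in_nats": len(legacy_set - nats_set),
        # Providers the new path returned but the legacy path did not.
        "extra_in_nats": len(nats_set - legacy_set),
        "speedup_factor": legacy_ms / nats_ms if nats_ms > 0 else float("inf"),
    }


def format_comparison(nats_ids: Sequence[str], legacy_ids: Sequence[str],
                      nats_ms: float, legacy_ms: float) -> str:
    c = compare_discovery(nats_ids, legacy_ids, nats_ms, legacy_ms)
    return (
        f"discovery_comparison: nats={len(nats_ids)}({nats_ms:.1f}ms) "
        f"legacy={len(legacy_ids)}({legacy_ms:.1f}ms) "
        f"order_match={c['order_match']} "
        f"missing_in_nats={c['missing_in_nats']} "
        f"extra_in_nats={c['extra_in_nats']}"
    )
```

Checking order and set membership separately is deliberate: two paths can return the same providers in a different order, and that difference changes which provider is tried first.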

We watched four signals in Grafana before flipping any traffic:

order_match - provider ordering identical between paths.
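only_in_legacy / only_in_nats - set difference of returned providers in each direction (these appear to correspond to the missing_in_nats / extra_in_nats counts in the log line above).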
speedup_factor - legacy_ms / nats_ms.
Per-step latency, to confirm the cache path stayed flat under load.

Every divergence was triaged as either a documented known difference or a real bug to fix. Cutover was a config flag, with the legacy path one redeploy away.

Result

Provider-discovery latency (p50): 50–60 ms → 3–5 ms
Hot-path DB queries per request: 6–10 → 1 (live deposit only)
Speedup: ~15×

Directionally: at ~400 TPS the old path spent ~20 cumulative DB-seconds per wall second on discovery alone. Most of that is reclaimed. The user-facing win is that the saved 40–55 ms now sits in our retry budget - when a provider declines, we can try the next one sooner without blowing past the client timeout.
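The ~20 DB-seconds figure and the ~15× speedup both follow from numbers already quoted. A small check, treating the whole legacy build time as DB time (a simplification, taking the lower 50 ms bound):

```python
tps = 400        # requests per second
legacy_ms = 50   # lower bound of the legacy p50 build time

# Cumulative time spent in discovery per wall-clock second.
db_seconds_per_second = tps * legacy_ms / 1000

# Speedup bounds from the p50 ranges: 50–60 ms before, 3–5 ms after.
speedup_low = 50 / 5
speedup_high = 60 / 3
```

That gives 20 cumulative seconds per wall second, and a speedup between 10× and 20×, which is consistent with the ~15× headline.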

Build took 3–4 weeks; the service has been in production for ~3 weeks with no rollbacks.

Reflection

Two things shadow mode caught that I'd have missed otherwise: a couple of interval-boundary mismatches around fee-rule max_amount (closed vs. half-open), and an ordering difference because the new sorter ranks on total rate rather than provider fee alone. Both were intentional in the new design but would have looked like regressions in production without the side-by-side comparison log.
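Both kinds of difference are easy to reproduce in miniature. The Python sketch below is illustrative only: the function names, amounts, and offer fields are hypothetical, and it does not say which variant belongs to which implementation.

```python
from decimal import Decimal


def matches_closed(amount: Decimal, lo: Decimal, hi: Decimal) -> bool:
    """Closed interval [lo, hi]: max_amount itself is covered."""
    return lo <= amount <= hi


def matches_half_open(amount: Decimal, lo: Decimal, hi: Decimal) -> bool:
    """Half-open interval [lo, hi): max_amount itself is excluded."""
    return lo <= amount < hi


# An order exactly at max_amount is the only case where the two disagree.
at_boundary = Decimal("100.00")
closed_hit = matches_closed(at_boundary, Decimal("0"), Decimal("100.00"))
half_open_hit = matches_half_open(at_boundary, Decimal("0"), Decimal("100.00"))

# Sorting on provider fee alone vs. on total rate can flip the order.
offers = [
    {"id": "a", "provider_fee": Decimal("1.0"), "total_rate": Decimal("1.9")},
    {"id": "b", "provider_fee": Decimal("1.2"), "total_rate": Decimal("1.4")},
]
by_fee = [o["id"] for o in sorted(offers, key=lambda o: o["provider_fee"])]
by_total = [o["id"] for o in sorted(offers, key=lambda o: o["total_rate"])]
```

Both cases only show up for specific inputs (an amount exactly on a boundary, offers whose fee and total rate rank differently), which is why they tend to surface under real traffic rather than in hand-written tests.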

If I were starting over I'd build the comparison harness first, before any of the engine code. It was the single highest-leverage thing in the project and the only reason the cutover was boring.