Lifting withdrawal success from 10-20% to 50-60% by replacing single-shot provider attempts with a durable, FSM-driven cascade - orchestrated end-to-end through NATS JetStream events and delayed messages, with no long-lived processes.
When a user requests a withdrawal, our platform routes it to a payment provider that pushes the money out. Provider behavior varies wildly: some confirm in seconds, others take up to 15 minutes to settle and call back with a webhook. Some reject the order outright, some accept and then fail mid-flight, some never respond at all.
The old withdrawal path lived inside cascade-core and was effectively single-shot: build the eligibility cascade, take the cheapest provider off the top, create one withdrawal order, return. If that provider declined or timed out, the withdrawal failed - falling to manual handling or a user retry. We were leaving the rest of the cascade, sometimes a dozen otherwise-viable providers, on the floor on every failure.
Two compounding issues:
cascade-core's external API can't change just because withdrawal execution is being rewired underneath.
I built payment-service, a Python microservice that owns withdrawal execution end-to-end. cascade-core no longer attempts providers itself - it hands the full cascade to payment-service and walks away. The service drives the withdrawal through the cascade in its own time.
OrderQueue - one per withdrawal, containing an ordered list of OrderTasks (one per provider from the cascade).
OrderTask - one provider attempt, driven by a finite state machine:
PENDING → CREATING → EXECUTING → SUCCEEDED
    ↘ NEGATIVE_EVAL → SKIPPED
    ↘ RECHECKING → EXECUTING / FAILED
EXECUTING → MANUAL_CHECK → SUCCEEDED / FAILED
Queue rule: any task SUCCEEDED → queue COMPLETED, remaining tasks stop. All tasks terminal without a success → queue FAILED, balances unwound.
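The task FSM and the queue rule above can be written down as a small transition table. The following is a minimal sketch, not the service's actual code: the state names come from the diagram, but which state each ↘ branch leaves from (assumed here to be CREATING), the direct EXECUTING → FAILED edge, and the names `TRANSITIONS`, `advance`, `queue_status` and `IN_PROGRESS` are assumptions made for illustration.

```python
from enum import Enum


class TaskState(str, Enum):
    PENDING = "PENDING"
    CREATING = "CREATING"
    EXECUTING = "EXECUTING"
    NEGATIVE_EVAL = "NEGATIVE_EVAL"
    RECHECKING = "RECHECKING"
    MANUAL_CHECK = "MANUAL_CHECK"
    SUCCEEDED = "SUCCEEDED"
    SKIPPED = "SKIPPED"
    FAILED = "FAILED"


S = TaskState

# Assumed transition table: the branch origins are a guess from the diagram.
TRANSITIONS = {
    S.PENDING: {S.CREATING},
    S.CREATING: {S.EXECUTING, S.NEGATIVE_EVAL, S.RECHECKING},
    S.NEGATIVE_EVAL: {S.SKIPPED},
    S.RECHECKING: {S.EXECUTING, S.FAILED},
    S.EXECUTING: {S.SUCCEEDED, S.MANUAL_CHECK, S.FAILED},
    S.MANUAL_CHECK: {S.SUCCEEDED, S.FAILED},
}

TERMINAL = {S.SUCCEEDED, S.SKIPPED, S.FAILED}


def advance(current: TaskState, target: TaskState) -> TaskState:
    """Return the target state, or raise if the FSM does not allow the move."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def queue_status(task_states: list) -> str:
    """Queue rule: one success completes the queue; all-terminal without one fails it."""
    if S.SUCCEEDED in task_states:
        return "COMPLETED"
    if all(state in TERMINAL for state in task_states):
        return "FAILED"
    return "IN_PROGRESS"  # hypothetical name for a queue that is still running
```

Keeping the table as data means every handler can validate a move with one lookup, and an illegal transition fails loudly instead of silently corrupting a task.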
Nothing in the service holds a request open for 15 minutes. The state machine is driven entirely by NATS JetStream events and delayed messages (Nats-Delay) acting as durable timers:
order.task.recheck.tick - exponential-backoff polls when a provider returned an ambiguous error.
order.task.webhook.timeout - fires if the provider hasn't called back by the SLA deadline.
order.task.status.tick / order.task.status.timeout - optional status-polling fallback.
order.task.webhook.received - forwarded by provider-service after HMAC validation.
Each step is a stateless handler that reads the task from Postgres, advances the FSM, persists, and publishes the next event. Consumers are durable with explicit ack and MaxDeliver=N; anything that fails repeatedly republishes to order.task.dlq for alerting. A 15-minute provider wait is just one row in Postgres and one delayed message in JetStream - nothing pinned in memory.
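To make the read, advance, persist, publish cycle concrete, here is a minimal sketch of a recheck-tick handler. It is illustrative only: a dict stands in for Postgres, a `publish` callback stands in for the JetStream publisher, and the delay constants, the `Task` fields, the `provider_status` callback and its return values are all assumptions rather than the service's real interface.

```python
from dataclasses import dataclass

BASE_DELAY_S = 5       # assumed delay before the first recheck
MAX_DELAY_S = 300      # assumed cap on a single backoff step
MAX_RECHECKS = 6       # assumed recheck budget per task
WEBHOOK_SLA_S = 900    # assumed SLA, matching the 15-minute worst case


def backoff_delay(attempt: int) -> int:
    """Exponential backoff: doubles per attempt, capped at MAX_DELAY_S."""
    return min(BASE_DELAY_S * 2 ** attempt, MAX_DELAY_S)


@dataclass
class Task:
    task_id: str
    state: str
    recheck_attempts: int = 0


def handle_recheck_tick(task_id, store, publish, provider_status):
    """Stateless handler: everything it needs is read from the store."""
    task = store[task_id]                       # read
    if task.state != "RECHECKING":
        return                                  # stale or duplicate tick: drop it
    status = provider_status(task_id)
    next_event = None
    if status == "accepted":                    # advance the FSM
        task.state = "EXECUTING"
        next_event = ("order.task.webhook.timeout", WEBHOOK_SLA_S)
    elif status == "rejected" or task.recheck_attempts + 1 >= MAX_RECHECKS:
        task.state = "FAILED"
    else:                                       # still ambiguous: poll again later
        task.recheck_attempts += 1
        next_event = ("order.task.recheck.tick", backoff_delay(task.recheck_attempts))
    store[task_id] = task                       # persist
    if next_event is not None:                  # publish the next (delayed) event
        subject, delay_s = next_event
        publish(subject, {"task_id": task_id}, delay_s=delay_s)
```

Because the handler returns early when the task is no longer in RECHECKING, a redelivered tick is harmless: the second delivery finds the state already advanced and does nothing.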
Balance operations route through cascade-core as the single source of truth, keyed by task_id for idempotency:
freeze on CREATING
commit on SUCCEEDED (atomic: provider debit + terminal credit)
rollback on FAILED / SKIPPED / CANCELLED
Task state only advances after the balance call returns 200. If the balance API fails, the task doesn't move; the message retries. Idempotent balance calls, an order_task_events audit table, and an optimistic-locking version field on writes mean a duplicate delivery or partial crash cannot produce a double-spend.
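The interaction between the balance gate and the version field can be sketched as follows. This is a simplified illustration under stated assumptions: SQLite stands in for Postgres, `FakeLedger` stands in for cascade-core's balance API, and the `order_tasks` table layout and the function names are hypothetical.

```python
import sqlite3


class FakeLedger:
    """Stand-in for the balance API: idempotent on task_id (assumed contract)."""

    def __init__(self):
        self.frozen = {}

    def freeze(self, task_id: str, amount: int) -> int:
        self.frozen.setdefault(task_id, amount)  # a repeat call changes nothing
        return 200


def make_db() -> sqlite3.Connection:
    """Create an in-memory table with the assumed columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE order_tasks (task_id TEXT PRIMARY KEY, state TEXT, version INTEGER)"
    )
    return conn


def advance_with_balance(conn, task_id, expected_version, new_state, balance_call):
    """Advance a task only if the balance call succeeds and the row is unchanged."""
    if balance_call(task_id) != 200:
        return False  # task does not move; the message is left to retry
    cur = conn.execute(
        "UPDATE order_tasks SET state = ?, version = version + 1 "
        "WHERE task_id = ? AND version = ?",
        (new_state, task_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows means another delivery already advanced it


if __name__ == "__main__":
    conn, ledger = make_db(), FakeLedger()
    conn.execute("INSERT INTO order_tasks VALUES ('t1', 'PENDING', 0)")
    freeze = lambda tid: ledger.freeze(tid, 100)
    print(advance_with_balance(conn, "t1", 0, "CREATING", freeze))  # first delivery
    print(advance_with_balance(conn, "t1", 0, "CREATING", freeze))  # duplicate delivery
```

The two safeguards cover different failures: the idempotent balance call makes a repeated freeze a no-op, while the `version` predicate makes the second state write match zero rows, so a duplicate delivery neither freezes twice nor advances twice.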
MANUAL_CHECK catches EXECUTING tasks with no SLA or exhausted retries - support resolves them as positive (commit) or negative (rollback), and the decision itself is a typed event with a full audit trail.
Observability rests on metrics (including task_time_in_state, dlq_size, core_notifications_total) and structured logs with order_id / queue_id / task_id / trace_id for end-to-end tracing across services.
Direct cutover - no shadow mode, no percentage gating. The old single-shot path and the new orchestrated path produce fundamentally different outcomes (one attempt vs. many), so a per-request comparison wouldn't have been meaningful. We mitigated risk by leaning on idempotent balance operations, finance reconciliation on day one, and ops on standby.
A 3-5× lift in withdrawal success translates directly into customer-facing reliability and into revenue we were previously forfeiting by giving up after one provider. The low manual-check rate of 3-5% confirms that the automated paths (recheck, status polling, DLQ) handle the long tail without dumping work on support.
The bigger lesson from this project was about rollout. A direct cutover worked, but it worked because of idempotent balance operations and aggressive day-one monitoring - not because the cutover itself was safe. Next time, even when shadow comparison isn't meaningful, I'd ship behind a per-merchant or percentage flag so a regression is bounded.
The FSM and JetStream-timer-as-durable-state pattern, on the other hand, were the right calls and I'd reach for them again - they're what made a 15-minute external dependency tractable inside a stateless service.