Picunada

[ case study ]

Turning single-shot withdrawals into a durable cascade

Lifting withdrawal success from 10-20% to 50-60% by replacing single-shot provider attempts with a durable, FSM-driven cascade - orchestrated end-to-end through NATS JetStream events and delayed messages, with no long-lived processes.

Company: Payments startup
Role: Backend Tech Lead
Period: 2025 · ~1.5 months build · in production
Stack: Python (FastAPI + SQLAlchemy) · NATS JetStream · PostgreSQL · Prometheus · OpenTelemetry

Problem

When a user requests a withdrawal, our platform routes it to a payment provider that pushes the money out. Provider behavior varies wildly: some confirm in seconds, others take up to 15 minutes to settle and call back with a webhook. Some reject the order outright, some accept and then fail mid-flight, some never respond at all.

The old withdrawal path lived inside cascade-core and was effectively single-shot: build the eligibility cascade, take the cheapest provider off the top, create one withdrawal order, return. If that provider declined or timed out, the withdrawal failed - falling to manual handling or a user retry. We were leaving the rest of the cascade, sometimes a dozen otherwise-viable providers, on the floor on every failure.

Two compounding issues:

  • Coverage. A single attempt against the cheapest provider is a poor strategy when provider availability fluctuates minute-to-minute. The best-fee provider's approval rate was nowhere near 100%; every miss was a lost or delayed withdrawal. Successful withdrawals sat at just 10-20%.
  • Time. Trying providers sequentially when each can take up to 15 minutes to acknowledge means the workflow can't live inside a synchronous HTTP request, an in-memory worker, or anything bound to a single process's lifecycle. It had to be durable and resumable.

Constraints

  • Money safety is non-negotiable. Every withdrawal touches provider balances; partial state from a crash mid-flight cannot translate into double-spends or stuck frozen funds.
  • API shape preserved. cascade-core's external API can't change just because withdrawal execution is being rewired underneath.
  • Webhook integrity. Provider webhooks must be HMAC-validated before they touch state.
  • Sign-off. The design needed approval from ops (new service to operate, on-call rotation), finance (balance accounting model), and the CTO (architecture).
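The webhook-integrity constraint above can be illustrated with a minimal sketch. The case study only says webhooks are HMAC-validated; the SHA-256 digest, the hex encoding of the signature, and the function name is_valid_webhook are assumptions made for illustration, since real providers differ in how they sign payloads.

```python
import hashlib
import hmac


def is_valid_webhook(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Return True only if signature_hex is the HMAC-SHA256 of body under secret.

    Assumption: the provider sends a hex-encoded SHA-256 HMAC of the raw body.
    """
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest takes time independent of where the inputs differ,
    # so an attacker cannot probe the signature one character at a time.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

The check runs against the raw request bytes, before any JSON parsing, so a payload that fails validation never reaches task state.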

Solution

I built payment-service, a Python microservice that owns withdrawal execution end-to-end. cascade-core no longer attempts providers itself - it hands the full cascade to payment-service and walks away. The service drives the withdrawal through the cascade in its own time.

Core abstractions

  • OrderQueue - one per withdrawal, containing an ordered list of OrderTasks (one per provider from the cascade).
  • OrderTask - one provider attempt, driven by a finite state machine:
PENDING → CREATING → EXECUTING → SUCCEEDED
                ↘ NEGATIVE_EVAL → SKIPPED
                              ↘ RECHECKING → EXECUTING / FAILED
                EXECUTING → MANUAL_CHECK → SUCCEEDED / FAILED

Queue rule: any task SUCCEEDED → queue COMPLETED, remaining tasks stop. All tasks terminal without a success → queue FAILED, balances unwound.
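The state machine and queue rule above can be sketched as a transition table plus a small aggregation function. This is one reading of the diagram, not the production code: the edge set is transcribed from the arrows as drawn, CANCELLED is left out because the diagram does not show where it attaches, and the "IN_PROGRESS" queue status is a hypothetical name for the non-terminal case.

```python
from enum import Enum


class TaskState(str, Enum):
    PENDING = "PENDING"
    CREATING = "CREATING"
    EXECUTING = "EXECUTING"
    NEGATIVE_EVAL = "NEGATIVE_EVAL"
    RECHECKING = "RECHECKING"
    MANUAL_CHECK = "MANUAL_CHECK"
    SUCCEEDED = "SUCCEEDED"
    SKIPPED = "SKIPPED"
    FAILED = "FAILED"


S = TaskState
# Allowed edges, transcribed from the diagram (an interpretation).
TRANSITIONS: dict[TaskState, frozenset[TaskState]] = {
    S.PENDING: frozenset({S.CREATING}),
    S.CREATING: frozenset({S.EXECUTING, S.NEGATIVE_EVAL}),
    S.NEGATIVE_EVAL: frozenset({S.SKIPPED, S.RECHECKING}),
    S.RECHECKING: frozenset({S.EXECUTING, S.FAILED}),
    S.EXECUTING: frozenset({S.SUCCEEDED, S.MANUAL_CHECK}),
    S.MANUAL_CHECK: frozenset({S.SUCCEEDED, S.FAILED}),
}
TERMINAL = frozenset({S.SUCCEEDED, S.SKIPPED, S.FAILED})


def advance(current: TaskState, target: TaskState) -> TaskState:
    """Return target if the edge exists, otherwise refuse the move."""
    if target not in TRANSITIONS.get(current, frozenset()):
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def queue_status(task_states: list[TaskState]) -> str:
    """Apply the queue rule to the states of all tasks in one OrderQueue."""
    if any(s is S.SUCCEEDED for s in task_states):
        return "COMPLETED"  # one success ends the cascade
    if all(s in TERMINAL for s in task_states):
        return "FAILED"  # every provider exhausted; balances get unwound
    return "IN_PROGRESS"  # hypothetical label for "still working"
```

Keeping the edges in data rather than scattered across handlers means an illegal move fails loudly at a single choke point.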

How "long-running" works without a long-lived process

Nothing in the service holds a request open for 15 minutes. The state machine is driven entirely by NATS JetStream events and delayed messages (Nats-Delay) acting as durable timers:

  • order.task.recheck.tick - exponential-backoff polls when a provider returned an ambiguous error.
  • order.task.webhook.timeout - fires if the provider hasn't called back by the SLA deadline.
  • order.task.status.tick / order.task.status.timeout - optional status-polling fallback.
  • order.task.webhook.received - forwarded by provider-service after HMAC validation.

Each step is a stateless handler that reads the task from Postgres, advances the FSM, persists, and publishes the next event. Consumers are durable with explicit ack and MaxDeliver=N; any message that exhausts its deliveries is republished to order.task.dlq for alerting. A 15-minute provider wait is just one row in Postgres and one delayed message in JetStream - nothing pinned in memory.
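To make the read, advance, persist, publish loop concrete, here is a minimal sketch of a recheck-tick handler. Everything except the subject names is a stand-in: a dict plays the role of Postgres, the Outbox class records messages instead of publishing to JetStream, and the provider answer is passed in as an argument rather than fetched. The backoff base, cap, attempt limit, status strings, and the 900-second webhook deadline are illustrative assumptions, not the service's real values.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    state: str
    recheck_attempt: int = 0


@dataclass
class Outbox:
    """Stand-in for a JetStream publisher: records (subject, payload, delay)."""
    messages: list[tuple[str, dict, float]] = field(default_factory=list)

    def publish(self, subject: str, payload: dict, delay_s: float = 0.0) -> None:
        self.messages.append((subject, payload, delay_s))


def backoff_delay(attempt: int, base_s: float = 5.0, cap_s: float = 300.0) -> float:
    """Exponential backoff: base * 2^attempt, capped (values are assumptions)."""
    return min(cap_s, base_s * 2 ** attempt)


def handle_recheck_tick(store: dict[str, Task], outbox: Outbox, task_id: str,
                        provider_status: str, max_attempts: int = 5) -> None:
    task = store[task_id]  # 1. read current state
    if task.state != "RECHECKING":
        return  # stale or duplicate tick: the task already moved on
    payload = {"task_id": task_id}
    if provider_status == "accepted":
        # 2. advance, then arm the webhook timer as a delayed message
        task.state = "EXECUTING"
        outbox.publish("order.task.webhook.timeout", payload, delay_s=900.0)
    elif provider_status == "rejected" or task.recheck_attempt + 1 >= max_attempts:
        task.state = "FAILED"
    else:
        # still ambiguous: stay in RECHECKING and schedule the next poll
        task.recheck_attempt += 1
        outbox.publish("order.task.recheck.tick", payload,
                       delay_s=backoff_delay(task.recheck_attempt))
    # 3. a real handler would commit the row here, then ack the message
```

The handler keeps no state between invocations: the attempt counter lives on the row and the timer lives in the delayed message, so any replica can pick up the next tick.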

Money safety

Balance operations route through cascade-core as the single source of truth, keyed by task_id for idempotency:

  • freeze on CREATING
  • commit on SUCCEEDED (atomic: provider debit + terminal credit)
  • rollback on FAILED / SKIPPED / CANCELLED

Task state only advances after the balance call returns 200. If the balance API fails, the task doesn't move; the message retries. Idempotent balance calls, an order_task_events audit table, and an optimistic-locking version field on writes mean a duplicate delivery or partial crash cannot produce a double-spend.
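The interplay of idempotent balance calls and the version field can be sketched as follows. BalanceLedger is an in-memory stand-in for cascade-core's balance API, TaskRow stands in for the Postgres row, and all names, as well as the use of integer minor units for amounts, are assumptions made for illustration.

```python
from dataclasses import dataclass


class StaleWrite(Exception):
    """Raised when a writer holds an outdated version of the row."""


class BalanceLedger:
    """Stand-in balance API: each (task_id, operation) pair applies at most once."""

    def __init__(self, available: int) -> None:
        self.available = available
        self.frozen = 0
        self._applied: set[tuple[str, str]] = set()

    def freeze(self, task_id: str, amount: int) -> bool:
        key = (task_id, "freeze")
        if key in self._applied:
            return True  # replayed call: report success, change nothing
        if amount > self.available:
            return False  # insufficient funds: caller must not advance
        self.available -= amount
        self.frozen += amount
        self._applied.add(key)
        return True


@dataclass
class TaskRow:
    state: str
    version: int = 0


def save_state(row: TaskRow, seen_version: int, new_state: str) -> None:
    """Optimistic lock: write only if nobody else wrote since we read."""
    if row.version != seen_version:
        raise StaleWrite(f"expected v{seen_version}, found v{row.version}")
    row.state = new_state
    row.version += 1


def on_create(ledger: BalanceLedger, row: TaskRow, task_id: str,
              amount: int, seen_version: int) -> bool:
    # Balance call first; the task moves only after it succeeds.
    if not ledger.freeze(task_id, amount):
        return False  # leave the task untouched so the message retries
    save_state(row, seen_version, "CREATING")
    return True
```

A duplicate delivery that read the row before the first write replays the freeze harmlessly and then loses the version check, so funds are frozen exactly once.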

Operational surface

  • Manual check path for stuck EXECUTING tasks with no SLA or exhausted retries - support resolves them as positive (commit) or negative (rollback), and the decision itself is a typed event with full audit trail.
  • Cancellation at task or queue granularity, with balance rollback on every active task.
  • Observability: Prometheus metrics (task_time_in_state, dlq_size, core_notifications_total) and structured logs with order_id / queue_id / task_id / trace_id for end-to-end tracing across services.

Rollout

Direct cutover - no shadow mode, no percentage gating. The old single-shot path and the new orchestrated path produce fundamentally different outcomes (one attempt vs. many), so a per-request comparison wouldn't have been meaningful. We mitigated risk by leaning on idempotent balance operations, finance reconciliation on day one, and ops on standby.

Result

Withdrawal success rate: 10-20% → 50-60%
Manual-check rate: 3-5%
Provider attempts per withdrawal: 1 → full cascade until success
Balance discrepancies post-rollout: 0 (finance reconciled)

A 3-5× lift in withdrawal success translates directly into customer-facing reliability and into revenue we were previously forfeiting by giving up after one provider. The low 3-5% manual-check rate indicates that the automated paths (recheck, status polling, DLQ) handle the long tail without dumping work on support.

Reflection

The bigger lesson from this project was about rollout. A direct cutover worked, but it worked because of idempotent balance operations and aggressive day-one monitoring - not because the cutover itself was safe. Next time, even when shadow comparison isn't meaningful, I'd ship behind a per-merchant or percentage flag so a regression is bounded.
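A per-merchant percentage flag of the kind described above could look roughly like this. It is a sketch of one common approach (stable hash bucketing), not something the project shipped; the function name, the salt, and the use of SHA-256 are all assumptions.

```python
import hashlib


def in_rollout(merchant_id: str, percent: int,
               salt: str = "withdrawal-cascade") -> bool:
    """Deterministically place a merchant in one of 100 buckets.

    The same merchant always lands in the same bucket, so raising
    percent only ever adds merchants to the new path, never removes them.
    """
    digest = hashlib.sha256(f"{salt}:{merchant_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent
```

Routing would then pick the orchestrated path when in_rollout(merchant_id, percent) is true and the old single-shot path otherwise, which bounds a regression to the enabled share of merchants.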

The FSM and JetStream-timer-as-durable-state pattern, on the other hand, were the right calls and I'd reach for them again - they're what made a 15-minute external dependency tractable inside a stateless service.