Capacity proof

A payment benchmark you can argue with.

Most throughput numbers are vanity: a cheap endpoint, a closed-model load script, an undisclosed machine. This one is built to be attacked. The rig, the connector latency model, the failure injection, and the bottleneck analysis are all published — read it as a lower bound with its assumptions visible.

~190/s

sustained charges, Postgres-only topology, before the knee

~250/s

sustained charges with the Redis + Kafka layer enabled (+30%)

system errors in passing sweeps; injected provider faults are graceful declines

What is measured

One charge = create + confirm, under open-model load.

One measured unit is a full charge: POST /payment-intents (create) followed by POST /payment-intents/{id}/confirm — not a single cheap endpoint.

Load is open-model (k6 constant-arrival-rate): the offered rate does not back off when the system slows, so queueing and the true ceiling become visible instead of being hidden by coordinated omission.
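The difference between the two load models can be seen in a toy simulation. The sketch below is not the k6 scenario from the repo; it assumes a hypothetical single FIFO server with a fixed 10ms service time (so a 100/s ceiling) and compares a constant-arrival-rate driver with a closed loop of virtual users.

```python
import heapq


def open_model_latencies(rate_per_s, service_s, duration_s):
    """Constant arrival rate: requests arrive on schedule even if the server is behind."""
    latencies = []
    server_free_at = 0.0
    for i in range(int(rate_per_s * duration_s)):
        arrival = i / rate_per_s
        start = max(arrival, server_free_at)   # wait in the queue if the server is busy
        server_free_at = start + service_s
        latencies.append(server_free_at - arrival)
    return latencies


def closed_model_latencies(users, service_s, duration_s):
    """Closed loop: each virtual user sends again only after its previous response."""
    latencies = []
    server_free_at = 0.0
    ready = [(0.0, u) for u in range(users)]   # (time the user can send, user id)
    heapq.heapify(ready)
    while ready:
        send, user = heapq.heappop(ready)
        if send >= duration_s:
            continue                            # test window is over for this user
        start = max(send, server_free_at)
        server_free_at = start + service_s
        latencies.append(server_free_at - send)
        heapq.heappush(ready, (server_free_at, user))
    return latencies


# Hypothetical toy numbers: 10 ms service time means the server tops out at 100/s.
open_lat = open_model_latencies(rate_per_s=120, service_s=0.01, duration_s=10)
closed_lat = closed_model_latencies(users=10, service_s=0.01, duration_s=10)
print(f"open-model max latency:   {max(open_lat):.2f}s")
print(f"closed-model max latency: {max(closed_lat):.2f}s")
```

With the offered rate 20% above the ceiling, the open-model latency grows for as long as the run lasts, while the closed loop settles near users × service time and reports a flat, healthy-looking number. That flat number is the coordinated-omission trap.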

The in-process connector simulator injects log-normal provider latency fitted to p50 120ms / p99 380ms with a hard cap, plus ~0.7% injected timeouts/5xx — the system must absorb these as graceful declines, never 5xx.
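A log-normal distribution is fully determined by two quantiles, so the p50/p99 targets above are enough to recover its parameters. The following is a minimal sketch of that fit, not the simulator's actual code. The function names are illustrative, and `cap_ms` is a placeholder because the text does not state the cap value.

```python
import math
import random
from statistics import NormalDist


def fit_lognormal(p50_ms, p99_ms):
    """Return (mu, sigma) of a log-normal whose median and p99 match the targets."""
    z99 = NormalDist().inv_cdf(0.99)        # standard normal 99th percentile, about 2.326
    mu = math.log(p50_ms)                   # the median of a log-normal is exp(mu)
    sigma = (math.log(p99_ms) - mu) / z99   # solves exp(mu + z99 * sigma) = p99
    return mu, sigma


def sample_latency_ms(mu, sigma, cap_ms, rng):
    """Draw one provider latency and clamp it to the hard cap."""
    return min(rng.lognormvariate(mu, sigma), cap_ms)


mu, sigma = fit_lognormal(p50_ms=120, p99_ms=380)

# cap_ms=1000 is an assumed value for illustration only.
rng = random.Random(42)
samples = [sample_latency_ms(mu, sigma, cap_ms=1000, rng=rng) for _ in range(10_000)]
```

For these targets sigma comes out at roughly 0.5, which is what gives the distribution its long right tail relative to the 120ms median.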

Two latencies are reported: the merchant-view end-to-end time from k6, and the platform-added latency from backend metrics, isolating the platform from the simulated PSP tail.

Disclosed rig

Small on purpose, disclosed in full.

2 backend nodes (2 vCPU / 4 GB each) behind nginx, with an isolated Postgres (4 vCPU / 8 GB) as the financial source of truth. Infra-mode runs add capped Redis (0.5 vCPU / 512 MB) and Kafka (1 vCPU / 1 GB). 40 Hikari connections per backend node.

Current published numbers were captured with the load generator co-located on the same host, which contaminates absolute tail latency. The before/after deltas under identical conditions are the meaningful signal; clean off-box numbers over a wired link are the next milestone and will be published the same way.

Results

Throughput sweep (Postgres-only vs full infra)

Offered rate	Postgres-only create p99	Infra (Redis + Kafka) create p99	Verdict
100/s	44ms	46ms	both pass cleanly
150/s	—	232ms	infra clean
200/s	470ms (edge)	314ms	infra clean, pg-only at the edge
250/s	collapsed by 300/s	395ms	infra still clean
300/s	11.7s (collapse)	4.47s (collapse)	both saturated

Bottleneck analysis

The bottleneck analysis is the product

At saturation on the Postgres-only topology, the evidence pointed away from CPU: backend nodes at ~151% of a 200% cap, Postgres at ~272% of 400%, zero lock waits — but both Hikari pools fully exhausted with ~400 threads queued and 43 connections idle-in-transaction.

The ceiling matches Little’s law closely: 80 pooled connections (2 nodes × 40) ÷ ~0.42s mean connection hold (dominated by the simulated PSP call) ≈ 190 charges/s. Each request was holding a database connection for its full duration, including the 120–380ms connector call.
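The arithmetic behind that ceiling is short enough to write down. This sketch uses only the figures quoted in this section; the function name is illustrative.

```python
def pool_bound_throughput(connections, mean_hold_s):
    """Little's law rearranged: throughput = concurrency / time each unit is held."""
    return connections / mean_hold_s


nodes = 2
pool_per_node = 40     # Hikari connections per backend node
mean_hold_s = 0.42     # approximate mean time a request holds a connection

ceiling = pool_bound_throughput(nodes * pool_per_node, mean_hold_s)
print(f"pool-bound ceiling: {ceiling:.1f} charges/s")  # prints 190.5
```

The same formula shows why the hold time matters more than the pool size: halving the time a connection is held doubles the pool-bound ceiling without adding a single connection to Postgres.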

Raising the pool to 160 connections moved throughput to ~273/s but pegged Postgres CPU and collapsed p99. That confirms the pool was the binding constraint and shows that “more pool” is the wrong lever: it only moves the bottleneck into Postgres. The real levers are transaction-level pooling (PgBouncer) or eliminating the connection hold across the provider call.
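The second lever, not holding a connection across the provider call, can be illustrated with a stand-in pool that only counts checkouts. This is a hedged sketch of the general pattern, not the platform's code: `CountingPool`, `confirm_holding`, `confirm_split`, and the `"SUCCEEDED"` outcome are all hypothetical names.

```python
import contextlib


class CountingPool:
    """Stand-in for a connection pool; it only tracks how many connections are out."""

    def __init__(self):
        self.in_use = 0

    @contextlib.contextmanager
    def connection(self):
        self.in_use += 1
        try:
            yield self
        finally:
            self.in_use -= 1


def confirm_holding(pool, call_provider):
    """One transaction spans the whole request, including the slow provider call."""
    with pool.connection():
        return call_provider()


def confirm_split(pool, call_provider):
    """Two short transactions with the provider call between them."""
    with pool.connection():
        pass  # transaction 1: persist the in-flight state
    outcome = call_provider()  # no connection is checked out here
    with pool.connection():
        pass  # transaction 2: persist the provider outcome
    return outcome


pool = CountingPool()
held_during_call = []


def fake_provider():
    # Record how many connections are checked out while the provider is "running".
    held_during_call.append(pool.in_use)
    return "SUCCEEDED"


confirm_holding(pool, fake_provider)
confirm_split(pool, fake_provider)
print(held_during_call)  # [1, 0]
```

The split version trades one long transaction for two short ones, so a crash between them leaves a payment in an in-flight state that has to be recoverable. That recovery path is the cost of this lever and is outside the scope of the sketch.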

Counterintuitively, enabling the production Redis + Kafka layer raised capacity ~30%: Redis offloads hot-path reads and Kafka replaces the DB outbox poller, dropping per-charge connection pressure until backend CPU becomes the binding constraint instead of the pool.

Reading the numbers

What this proves, and what it does not.

These are directional lower-bound numbers from a deliberately constrained rig, not a marketing peak. The point is that the methodology, bottleneck, and scaling path are inspectable.

The 100 charges/s gate is a stress proof, not an expected merchant baseline — most merchants never sustain a fraction of it, which is exactly the headroom argument.

Everything needed to reproduce or attack the result lives in the open-source repo: the k6 scenarios, the simulator latency model, the compose topology, and the results log.

FAQ

Benchmark methodology questions

Why is an open-model load test required for a capacity claim?

Closed-model virtual users slow down when the system slows down, silently hiding saturation (coordinated omission). Constant-arrival-rate load keeps offering traffic, so queue growth and the real knee are visible.

Why measure create + confirm as one unit?

A real charge costs both calls, including idempotency reservation, state transitions, outbox writes, and the provider call. Benchmarking only the cheapest endpoint produces a number no production system can spend.

Do the injected provider failures count as errors?

No — they are stimulus. A provider timeout or 5xx must surface as a graceful FAILED payment, never a platform 5xx. Injected faults are tracked separately and do not consume the 0.1% system-error budget.
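The accounting rule above can be sketched as follows. The record shape and the `"SUCCEEDED"` state name are assumptions made for illustration; the 0.1% budget and the `FAILED` state come from the text.

```python
from dataclasses import dataclass

SYSTEM_ERROR_BUDGET = 0.001  # 0.1% of charges


@dataclass(frozen=True)
class ChargeResult:
    http_status: int      # status the platform returned to the caller
    payment_state: str    # "FAILED" for a decline; "SUCCEEDED" is an assumed name
    injected_fault: bool  # True if the simulator injected a timeout or 5xx


def summarize(results):
    """Count platform 5xx as system errors; injected faults that decline cleanly do not count."""
    total = len(results)
    system_errors = sum(1 for r in results if r.http_status >= 500)
    graceful_declines = sum(
        1
        for r in results
        if r.injected_fault and r.http_status < 500 and r.payment_state == "FAILED"
    )
    rate = system_errors / total if total else 0.0
    return {
        "system_error_rate": rate,
        "graceful_declines": graceful_declines,
        "within_budget": rate <= SYSTEM_ERROR_BUDGET,
    }


# Hypothetical sample: 993 successes and 7 injected faults that surfaced as declines.
results = [ChargeResult(200, "SUCCEEDED", False)] * 993
results += [ChargeResult(200, "FAILED", True)] * 7
summary = summarize(results)
```

An injected fault that leaks through as a platform 5xx is counted as a system error in this sketch, which matches the rule that provider failures must never surface that way.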

What would it take to go faster?

The published analysis names the levers: PgBouncer-style transaction pooling or removing the DB-connection hold across the provider call for the Postgres-only path, and more backend CPU for the infra path. Each lever is documented with the evidence that motivates it.

Reproduce it, or run the platform that produced it.

The k6 scenarios, simulator latency model, topology, and results log are in the open-source repo. The platform behind the numbers self-hosts for free or runs managed at $99/month flat.
