How Would I Build a Payment System That Doesn't Lose Money
Source: Dev.to
How Would I Build: A Payment System for 10 000 TPS
I take a real engineering problem, reason through it in plain language first, then attach the actual name for that reasoning at the end. Jargon comes last, not first.
The Problem
Goal: Build a payment system that can handle 10 000 transactions per second without losing a single one.
1. Concurrency – The “Suya” Analogy
Analogy: Two customers walk up to a suya stall at the same time, and both want the last stick. The vendor can hand it to only one of them. If he sells the same stick to both, one of them ends up empty‑handed.
In a payment system the same thing happens when two withdrawal requests hit the same wallet at the exact same moment:
- Both requests read the balance.
- Both see enough money.
- Both write a debit.
Result → double payout.
Term: Concurrency problem – specifically a race condition.
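To make the three steps concrete, here is a minimal sketch that replays the bad interleaving by hand instead of relying on thread timing. The wallet dictionary, the function names, and the amounts are all made up for illustration.

```python
# Deterministic replay of the race: both requests read before either writes.
balance = {"wallet": 100}

def read_balance():
    return balance["wallet"]

def debit_if_enough(seen_balance, amount):
    """Check against the balance read earlier, then write the new value."""
    if seen_balance >= amount:
        balance["wallet"] = seen_balance - amount
        return True
    return False

# Steps 1 and 2: both requests read, and both see 100.
seen_by_a = read_balance()
seen_by_b = read_balance()

# Step 3: both write a debit based on the stale value they read.
paid_a = debit_if_enough(seen_by_a, 100)
paid_b = debit_if_enough(seen_by_b, 100)

total_paid_out = 100 * (paid_a + paid_b)
print(paid_a, paid_b, balance["wallet"], total_paid_out)  # True True 0 200
```

The wallet held 100, yet 200 left the building. Nothing crashed, which is exactly why this kind of bug is dangerous.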
2. Locking – “Whoever Arrives First Locks the Wallet”
Solution: The first request acquires a lock on the wallet row in the database. The second request must wait, then re‑reads the balance after the lock is released.
Term: Pessimistic locking (or simply locking).
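Here is a small sketch of the same idea in process memory. A `threading.Lock` stands in for the database row lock (in PostgreSQL or MySQL you would typically get one with `SELECT ... FOR UPDATE` inside a transaction). The `Wallet` class is hypothetical.

```python
import threading

class Wallet:
    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()  # stand-in for a database row lock

    def withdraw(self, amount):
        # Only one request at a time may run the read-check-write sequence.
        with self._lock:
            if self.balance >= amount:
                self.balance -= amount
                return True
            return False

wallet = Wallet(100)
results = []

def request():
    results.append(wallet.withdraw(100))

threads = [threading.Thread(target=request) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results), wallet.balance)  # [False, True] 0
```

Whichever request arrives second re-reads the balance inside the lock, sees zero, and is refused.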
3. Vertical Scaling – “Bigger Grill, Same Chef”
Analogy: The vendor upgrades the grill, adds more charcoal, gets a faster stove. Orders go out faster, but there’s still only one chef, so customers still wait.
Term: Vertical scaling – adding more CPU, RAM, or faster storage to a single machine. It raises the ceiling but doesn’t remove the fundamental bottleneck.
4. Horizontal Scaling – “More Stalls, Same Chef per Stall”
Analogy: The vendor opens a second, then a third, suya spot. Each stall has its own (smaller) grill, but the total output is much higher.
Term: Horizontal scaling – adding more machines (or instances) to share the load. The limiting factor becomes cost, not hardware.
5. Read Replicas – “A Separate Person Handles ‘Is My Order Ready?’”
Analogy: Most customers just want to know if their order is ready. The vendor hires a second person who only answers status queries, using a copy of the order list. The main chef can now focus on cooking.
Term: Read replicas – a primary database handles writes; replicas serve read traffic, preventing reads from choking writes.
Caveat: Replicas don’t increase write capacity. If 10 000 new orders per second arrive, the primary still hits its ceiling.
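A rough sketch of the routing decision, assuming a hypothetical `QueryRouter` that only looks at the SQL text. Real drivers and proxies do this in more robust ways; the node names are placeholders.

```python
import itertools

class QueryRouter:
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # rotate through replicas

    def route(self, sql):
        statement = sql.strip().upper()
        # Locking reads (SELECT ... FOR UPDATE) must still go to the primary.
        is_plain_read = statement.startswith("SELECT") and "FOR UPDATE" not in statement
        return next(self._replicas) if is_plain_read else self.primary

router = QueryRouter("primary", ["replica-1", "replica-2"])
print(router.route("SELECT status FROM orders WHERE id = 7"))  # replica-1
print(router.route("INSERT INTO orders (id) VALUES (8)"))      # primary
```

Every write still lands on the one primary, which is the caveat above in code form. A replica can also lag slightly behind the primary, so a balance check that guards a debit belongs on the primary.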
6. Sharding – “Multiple Stalls Share the Load”
Analogy: Customers 1‑2 000 go to stall 1, 2 001‑4 000 go to stall 2, etc. New orders are partitioned across several locations.
Term: Sharding – splitting data (or traffic) across multiple independent databases/servers.
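The stall assignment in the analogy is range-based partitioning, which fits in a few lines. The 2 000-customer range comes from the analogy; the function name is mine.

```python
CUSTOMERS_PER_STALL = 2000  # range size taken from the analogy

def stall_for(customer_id):
    """Map a customer number to the stall (shard) that owns it."""
    if customer_id < 1:
        raise ValueError("customer numbers start at 1")
    return (customer_id - 1) // CUSTOMERS_PER_STALL + 1

print(stall_for(1), stall_for(2000), stall_for(2001), stall_for(4001))  # 1 1 2 3
```

Range-based keys are easy to reason about but can leave one stall much busier than the others. Hashing the customer ID is a common alternative when the load needs to spread more evenly.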
7. Distributed Transactions – “The Saga Pattern”
Problem: Spot 1 takes money, then needs Spot 3 to release the order. If the phone dies, Spot 3 never gets the message → the customer has paid, but the order is never delivered.
Naïve fix: Two‑phase commit, where a coordinator asks both sides to agree before either one commits. It works, but it is slow, and both sides hold their locks and wait if the coordinator goes quiet.
Better fix: Spot 1 records “I took money, Spot 3 owes a release.” If Spot 3 never confirms, Spot 1 automatically refunds. Each service performs its own step and has a compensating action ready.
Term: Saga – a pattern for managing distributed transactions with explicit compensation steps.
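A minimal in-memory sketch of that compensation logic. `run_saga`, the ledger, and the step functions are hypothetical; a real saga would also persist its progress so it can resume after a crash.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action fails, undo the already-completed steps in reverse order.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True

ledger = {"customer": 100, "spot_1": 0}

def take_money():
    ledger["customer"] -= 40
    ledger["spot_1"] += 40

def refund():
    ledger["spot_1"] -= 40
    ledger["customer"] += 40

def release_order():
    raise ConnectionError("Spot 3 unreachable")  # simulate the dead phone

def cancel_release():
    pass  # nothing to undo; the release never happened

ok = run_saga([(take_money, refund), (release_order, cancel_release)])
print(ok, ledger)  # False {'customer': 100, 'spot_1': 0}
```

The money moved, the release failed, and the refund put the ledger back where it started.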
8. Load Balancing – “The Entrance Greeter”
Analogy: Five stalls are open, but customers don’t know which line is shortest. A greeter looks at all stalls and points each new customer to the next free one.
Term: Load balancer – sits in front of servers and distributes incoming traffic so no single server is overloaded (e.g., Nginx, HAProxy, Envoy).
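The greeter's rule ("send the customer to the least busy stall") is often called least-connections balancing. A toy version, with a hypothetical `LeastBusyBalancer` class and made-up stall names:

```python
class LeastBusyBalancer:
    def __init__(self, servers):
        # Number of in-flight requests per server.
        self.active = {name: 0 for name in servers}

    def acquire(self):
        """Pick the server with the fewest in-flight requests."""
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Call when a request finishes."""
        self.active[server] -= 1

greeter = LeastBusyBalancer(["stall-1", "stall-2", "stall-3"])
first = greeter.acquire()   # stall-1
second = greeter.acquire()  # stall-2
third = greeter.acquire()   # stall-3
greeter.release(second)     # stall-2 finishes its order
print(greeter.acquire())    # stall-2
```

Production load balancers such as Nginx, HAProxy, and Envoy offer this strategy alongside simpler ones like round robin.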
9. High Availability & Failover – “Backup Greeter”
Problem: The greeter is a single point of failure.
Solution: Deploy a hot‑standby greeter that automatically takes over if the primary fails.
Term: Failover – automatic switchover to a redundant component, forming a high‑availability (HA) setup.
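The switchover decision itself is simple once you have a health check. This sketch assumes a hypothetical `is_healthy` callback; in practice that is a heartbeat or probe, and the hard part is agreeing on who is actually down.

```python
def pick_greeter(greeters, is_healthy):
    """Return the first healthy greeter, in priority order (primary first)."""
    for greeter in greeters:
        if is_healthy(greeter):
            return greeter
    raise RuntimeError("no healthy greeter available")

health = {"primary": True, "standby": True}
print(pick_greeter(["primary", "standby"], health.get))  # primary

health["primary"] = False  # the primary greeter goes down
print(pick_greeter(["primary", "standby"], health.get))  # standby
```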
10. Rate Limiting – “Throttle Per‑Customer Orders”
Analogy: Limit how many orders a single person can place per minute. This won’t stop legitimate traffic, but it prevents a bad actor from overwhelming the system.
Term: Rate limiting – caps request rates per client, API key, IP, etc.
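One simple way to cap orders per minute is a fixed-window counter. This is a sketch only: the class name and the limit are assumptions, the clock is passed in to keep it testable, and old windows are never cleaned up.

```python
from collections import defaultdict

class FixedWindowLimiter:
    def __init__(self, max_orders, window_seconds=60):
        self.max_orders = max_orders
        self.window_seconds = window_seconds
        self._counts = defaultdict(int)  # (customer, window index) -> orders seen

    def allow(self, customer, now):
        """Return True if this customer may place an order at time `now` (seconds)."""
        window = int(now // self.window_seconds)
        key = (customer, window)
        if self._counts[key] >= self.max_orders:
            return False
        self._counts[key] += 1
        return True

limiter = FixedWindowLimiter(max_orders=2)
print(limiter.allow("ada", 0))    # True
print(limiter.allow("ada", 10))   # True
print(limiter.allow("ada", 59))   # False, third order in the same minute
print(limiter.allow("bola", 59))  # True, a different customer
print(limiter.allow("ada", 60))   # True, a new minute has started
```

Token buckets and sliding windows smooth out the burst that a fixed window allows at each minute boundary.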
11. Observability at Scale – “Where Do I Start Looking?”
At 10 000 TPS a transaction could fail anywhere:
- Entrance (load balancer)
- Order taker (application server)
- Payment handler (service)
- Records store (database)
- Notification sender (messaging system)
Logs, metrics, and traces are scattered across dozens of machines.
Solution:
- Structured logging – JSON logs with request IDs.
- Centralized log aggregation – ELK/EFK stack, Loki, or Splunk.
- Distributed tracing – OpenTelemetry, Jaeger, Zipkin.
- Metrics & dashboards – Prometheus + Grafana.
- Alerting – PagerDuty/Opsgenie on latency, error rates, resource saturation.
These give you a single place to start digging when something goes wrong.
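Structured logging needs nothing beyond the standard library to try out. In this sketch the `JsonFormatter` class, the field names, and the `req-123` ID are my own choices, and the output goes to an in-memory buffer instead of a real log shipper.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        })

stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payment-handler")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output in our buffer only

# `extra` attaches request_id to the record so the formatter can read it.
logger.info("debit applied", extra={"request_id": "req-123"})

entry = json.loads(stream.getvalue())
print(entry["service"], entry["request_id"])  # payment-handler req-123
```

Because every line is a JSON object, the aggregation layer can index `request_id` as a field instead of grepping free text.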
12. Putting It All Together
| Layer | Technique | Why It Matters |
|---|---|---|
| Concurrency control | Pessimistic locking / optimistic concurrency | Prevent race conditions / double spend |
| Scaling | Vertical → Horizontal → Sharding | Move from single‑node limits to distributed capacity |
| Read/write separation | Primary + read replicas | Keep reads from throttling writes |
| Distributed transactions | Saga pattern | Ensure eventual consistency without blocking |
| Traffic distribution | Load balancer + HA failover | Avoid single points of failure, spread load |
| Abuse protection | Rate limiting | Guard against malicious spikes |
| Observability | Centralized logs, tracing, metrics | Fast root‑cause analysis at massive scale |
TL;DR
- Concurrency → lock the wallet (race condition)
- Vertical scaling → bigger grill (limited)
- Horizontal scaling → more grills (cost‑limited)
- Read replicas → separate “order‑status” staff (writes still limited)
- Sharding → split customers across grills (true write scaling)
- Saga → compensate instead of lock‑step (resilient distributed ops)
- Load balancer → greeter (even traffic)
- Failover → backup greeter (high availability)
- Rate limiting → per‑customer caps (protect resources)
- Observability → centralized logs/metrics/traces (find bugs fast)
With these building blocks you can design a payment system that reliably processes 10 000 transactions per second while keeping data integrity, availability, and observability intact.
Correlation ID
When a request first enters the system, it is assigned a unique ticket number (the correlation ID). This ID is written into every record that the order touches across every service and storage location.
Why it matters
- When something breaks, you can search for that ticket number and see the full journey of the order in one place.
- Every step, every spot, is linked by the same ID, giving you a complete, end‑to‑end view.
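A toy version of the ticket number in action. The three "services" here are plain functions, `log_store` is a list standing in for the central log repository, and all names are hypothetical.

```python
import uuid

log_store = []  # stand-in for the centralized log repository

def log(service, correlation_id, message):
    log_store.append(
        {"service": service, "correlation_id": correlation_id, "message": message}
    )

def handle_request(amount):
    correlation_id = str(uuid.uuid4())  # assigned once, at the entrance
    log("load-balancer", correlation_id, "request received")
    charge(correlation_id, amount)
    return correlation_id

def charge(correlation_id, amount):
    log("payment-handler", correlation_id, f"debited {amount}")
    notify(correlation_id)

def notify(correlation_id):
    log("notification-sender", correlation_id, "receipt sent")

def journey(correlation_id):
    """Return the services one request touched, in order."""
    return [e["service"] for e in log_store if e["correlation_id"] == correlation_id]

first = handle_request(50)
second = handle_request(75)
print(journey(first))  # ['load-balancer', 'payment-handler', 'notification-sender']
```

The key design choice is that the ID is created once and passed along, never regenerated by a downstream service.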
Distributed Tracing
Following a request across services using the correlation ID is called distributed tracing. It lets you understand how a single transaction moves through a complex, multi‑service architecture.
Logging & Observability Stack
- Tools such as Datadog or the ELK stack pull logs from all your servers into a single searchable repository.
- In the ELK stack, Elasticsearch stores the logs.
- Kibana provides the dashboard you use to search and visualise them.
Typical Payment‑System Flow
- Load Balancer – receives the request and routes it to one of several servers; a standby load balancer is ready to take over if the active one fails.
- Rate Limiting – filters abusive traffic before it reaches the application.
- Server Processing – handles the transaction using pessimistic locking to prevent race conditions.
- Read/Write Strategy – reads are served from replicas, writes go to the primary node.
- Sharding – distributes write load across multiple database nodes.
- Cross‑Spot Transactions – use the Saga pattern so that failures don’t leave money in limbo.
- Logging – every step logs the correlation ID, which flows into the centralized logging system.
Because each log entry carries the same correlation ID, any transaction can be traced end‑to‑end in seconds.
“That’s a payment system that doesn’t lose money.”
I’m Damola, a backend engineer.
Find the rest of this series on GitHub and follow me on Dev.to for the next one.