Outbox Pattern: The Hard Parts (and How Namastack Outbox Helps)
Most people know the Transactional Outbox Pattern at a high level.
What’s often missing are the production‑grade details – the “hard parts” that decide whether your outbox is reliable under real load and failures:
- Ordering semantics (usually per aggregate/key, not global) and what happens when one record in a sequence fails
- Scaling across multiple instances without lock pain (partitioning + rebalancing)
- Retries that behave well during outages
- A clear strategy for permanently failed records
- Monitoring and operations (backlog, failures, partitions, cluster health)
This article focuses on those hard parts and how Namastack Outbox addresses them.
Quick links to the docs
If you want a quick primer first, this video introduces Namastack Outbox and recaps the basic concepts behind the Outbox Pattern.
The Hard Parts
Ordering: what you actually need in production
When people say “we need ordering”, they often mean global ordering. In production that’s usually the wrong goal.
What you typically need is ordering per business key (often per aggregate):
- For a given `order-123`, process records strictly in creation order.
- For different keys (`order-456`, `order-789`), process in parallel.
How Namastack Outbox defines ordering
Ordering is defined by the record key:
- Same key → sequential, deterministic processing
- Different keys → concurrent processing
```kotlin
@Service
class OrderService(
    private val outbox: Outbox,
    private val orderRepository: OrderRepository
) {
    @Transactional
    fun createOrder(command: CreateOrderCommand) {
        val order = Order.create(command)
        orderRepository.save(order)
        // Schedule event – saved atomically with the order
        outbox.schedule(
            payload = OrderCreatedEvent(order.id, order.customerId),
            key = "order-${order.id}" // Groups records for ordered processing
        )
    }
}
```
With Spring events:
```kotlin
@OutboxEvent(key = "#this.orderId")
data class OrderCreatedEvent(val orderId: String)
```
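Presumably the annotated event is then published as a regular Spring application event inside the same transaction and picked up by the outbox; a minimal sketch under that assumption (the `ApplicationEventPublisher` wiring here is illustrative, not taken from the docs):

```kotlin
@Service
class OrderService(
    private val orderRepository: OrderRepository,
    private val events: ApplicationEventPublisher
) {
    @Transactional
    fun createOrder(command: CreateOrderCommand) {
        val order = Order.create(command)
        orderRepository.save(order)
        // Published inside the transaction; the @OutboxEvent key expression
        // ("#this.orderId") determines the record key.
        events.publishEvent(OrderCreatedEvent(order.id))
    }
}
```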
Failure behavior: should later records wait?
The key production question is what happens if one record in the sequence fails.
- Default (`outbox.processing.stop-on-first-failure=true`): later records with the same key wait. This preserves strict semantics when records depend on each other.
- If records are independent, set `outbox.processing.stop-on-first-failure=false` so failures don't block later records for the same key.
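In `application.yml`, the same property shown in YAML form:

```yaml
outbox:
  processing:
    stop-on-first-failure: false # independent records: a failure does not block later records with the same key
```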
Choosing good keys
Use the key for the unit where ordering matters:
- `order-${orderId}`
- `customer-${customerId}`

Avoid keys that are:
- Too coarse (serializes everything), e.g. `"global"`
- Too fine (no ordering), e.g. a random UUID
Why ordering still works when scaling out
Namastack Outbox combines key‑based ordering with hash‑based partitioning, so a key consistently routes to the same partition and only one active instance processes it at a time.
Scaling: partitioning and rebalancing
Scaling an outbox from 1 instance to N instances is where many implementations fall apart. You need both:
- Work distribution – all instances can help.
- No double processing + preserved ordering – especially for the same key.
A common approach is “just use database locks”. That can work, but it often brings lock contention, hot rows, and unpredictable latency once traffic grows.
Namastack Outbox approach: hash‑based partitioning
Instead of distributed locking, Namastack Outbox uses hash‑based partitioning:
- 256 fixed partitions.
- Each record key is mapped to a partition using consistent hashing.
- Each application instance owns a subset of those partitions.
- An instance only polls/processes records for its assigned partitions.
Result
- Different instances don’t compete for the same records (low lock contention).
- Ordering stays meaningful: same key → same partition → processed sequentially.
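Conceptually, the routing works like this (a simplified illustration of the idea, not the library's actual hash function or assignment logic):

```kotlin
const val PARTITION_COUNT = 256

// Every record key maps deterministically to one of the fixed partitions.
fun partitionFor(key: String): Int =
    Math.floorMod(key.hashCode(), PARTITION_COUNT)

// An instance only polls and processes records in the partitions it currently owns.
fun shouldProcess(key: String, ownedPartitions: Set<Int>): Boolean =
    partitionFor(key) in ownedPartitions
```

Because the partition count is fixed, the key-to-partition mapping never changes; only the ownership of partitions moves between instances.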
What rebalancing means
In production the number of active instances changes:
- Deploy a new version (rolling restart)
- Autoscaling adds/removes pods
- An instance crashes
Namastack Outbox periodically re‑evaluates which instances are alive and redistributes partitions. This is the rebalancing step.
Important: Rebalancing is designed to be automatic – you shouldn’t need a separate coordinator.
Instance‑coordination knobs
These settings control how instances coordinate and detect failures:
```yaml
outbox:
  rebalance-interval: 10000 # ms between rebalance checks
  instance:
    heartbeat-interval-seconds: 5 # how often to send heartbeats
    stale-instance-timeout-seconds: 30 # when to consider an instance dead
    graceful-shutdown-timeout-seconds: 15 # time to hand over partitions on shutdown
```
Rules of thumb
- Lower heartbeat + stale timeout → faster failover, more DB chatter.
- Higher values → less overhead, slower reaction to node failure.
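As a worked example with the defaults above: a crashed instance stops sending heartbeats, is considered stale after at most ~30 seconds, and its partitions are picked up on the next rebalance check (every 10 seconds), so failover should typically complete within roughly 30 to 40 seconds.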
Practical guidance
- Keep your key design intentional (see the Ordering chapter). It drives both ordering and partitioning.
- If one key is extremely “hot” (e.g., `tenant-1`), it will map to a single partition and become a throughput bottleneck. In that case, consider a more granular key (e.g., `tenant-1-order-${orderId}`) or increase the number of partitions (if you control the implementation); see the sketch after this list.
- Monitor partition lag, backlog size, and failed‑record counts. Alerts on sudden spikes help you react before the system stalls.
- Define a dead‑letter strategy: after N retries, move the record to a dead‑letter table or topic for manual investigation.
- Test failure scenarios (DB outage, instance crash, network partition) in a staging environment to verify that ordering, rebalancing, and retry semantics behave as expected.
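For the hot‑key case, the fix is usually just a more granular key (sketch, using the `schedule` API from earlier; `event` and `order` are placeholders):

```kotlin
// Hot key: every record for tenant-1 lands in the same partition and is processed sequentially.
outbox.schedule(payload = event, key = "tenant-1")

// More granular key: ordering is still preserved per order, while tenant-1's records
// spread across many partitions and can be processed in parallel.
outbox.schedule(payload = event, key = "tenant-1-order-${order.id}")
```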
Outages: retries and failed records
Outages and transient failures are not edge cases — they’re normal: rate limits, broker downtime, flaky networks, credential rollovers.
The hard part is making retries predictable:
- Retry too aggressively → you amplify the outage and overload your own system.
- Retry too slowly → your backlog grows and delivery latency explodes.
Namastack Outbox retry model (high level)
Each record is processed by a handler. If the handler throws, the record is not lost — it is rescheduled for another attempt.
Records move through a simple lifecycle:
| State | Meaning |
|---|---|
| `NEW` | Waiting / retrying |
| `COMPLETED` | Successfully processed |
| `FAILED` | Retries exhausted (needs attention) |
Default configuration knobs
You can tune polling, batching, and retry via configuration:
```yaml
outbox:
  poll-interval: 2000 # ms
  batch-size: 10
  retry:
    policy: exponential
    max-retries: 3
    # Optional: only retry specific exceptions
    include-exceptions:
      - java.net.SocketTimeoutException
    # Optional: never retry these exceptions
    exclude-exceptions:
      - java.lang.IllegalArgumentException
```
A good production default is exponential backoff, because it naturally reduces pressure during outages.
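To illustrate why (the library's actual backoff formula and base delay aren't shown here and may differ), an exponential policy spaces attempts further and further apart:

```kotlin
// Illustrative exponential backoff: the delay doubles with each attempt, capped at a maximum.
fun backoffDelayMillis(attempt: Int, baseMillis: Long = 1_000, maxMillis: Long = 60_000): Long =
    minOf(baseMillis * (1L shl attempt), maxMillis)

// attempt 0 -> 1 s, attempt 1 -> 2 s, attempt 2 -> 4 s, attempt 3 -> 8 s, ...
```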
What happens after retries are exhausted?
Namastack Outbox supports fallback handlers, which are invoked when either:
- retries are exhausted, or
- the exception is non‑retryable.
For annotation‑based handlers, the fallback method must be on the same Spring bean as the handler.
```kotlin
@Component
class OrderHandlers(
    private val publisher: OrderPublisher,
    private val deadLetter: DeadLetterPublisher,
) {
    @OutboxHandler
    fun handle(event: OrderCreatedEvent) {
        publisher.publish(event)
    }

    @OutboxFallbackHandler
    fun onFailure(event: OrderCreatedEvent, ctx: OutboxFailureContext) {
        deadLetter.publish(event, ctx.lastFailure)
    }
}
```
- If a fallback succeeds, the record is marked `COMPLETED`.
- If there’s no fallback (or the fallback fails), the record becomes `FAILED`.
Practical guidance
- Decide early what “FAILED” means in your organization: alert, dashboard, dead‑letter queue, or manual replay.
- Keep retry counts conservative when handlers talk to external systems; rely on backoff rather than fast loops.
- For critical flows, use metrics to alert when `FAILED` records appear or when the backlog grows.
Next steps
If you found this useful, I’d really appreciate a ⭐ on GitHub — and feel free to share the article or leave a comment with feedback / questions.
- Quickstart:
- Features overview (recommended):
- Example projects:
- GitHub repository: