# Building a ‘simple’ async service in Rust (and why it wasn’t simple)

Published: April 6, 2026 at 12:29 PM EDT

Source: Dev.to

## I Thought This Async Rust Service Would Be Simple

I wanted to build a small async service in Rust that:

- Accepts events
- Processes them
- Retries on failure

Nothing fancy. It looked like a weekend project, but it turned into a lesson in how quickly “simple” systems stop being simple once you care about correctness.

The full project is linked in the Code section at the end of this post.

## The Naïve Version

The initial design looked something like this:

HTTP → queue → worker pool

1. A handler receives an event.
2. It pushes the event into a channel.
3. Workers pull from the channel and process it.

That works fine—until you actually try to make it correct. As soon as you introduce retries, idempotency, and failure handling, things start to break in ways that aren’t obvious at first.
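To make the starting point concrete, here is a minimal sketch of that naive pipeline using only the standard library. Threads and a bounded `std::sync::mpsc` channel stand in for the async runtime, and the HTTP layer is left out. `Event`, `handle`, and `spawn_workers` are hypothetical names for illustration, not taken from the project.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{Receiver, SyncSender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// Hypothetical event type; the real payload shape is not shown in the post.
pub struct Event {
    pub id: String,
    pub payload: String,
}

/// "Handler": push the event into the channel and move on.
/// The send result is ignored, which is exactly the naive part.
pub fn handle(tx: &SyncSender<Event>, event: Event) {
    let _ = tx.try_send(event);
}

/// Spawn `n` workers that pull from one shared receiver until the channel closes.
/// `processed` counts how many events the workers actually saw.
pub fn spawn_workers(
    n: usize,
    rx: Receiver<Event>,
    processed: Arc<AtomicUsize>,
) -> Vec<JoinHandle<()>> {
    let rx = Arc::new(Mutex::new(rx));
    (0..n)
        .map(|_| {
            let rx = Arc::clone(&rx);
            let processed = Arc::clone(&processed);
            thread::spawn(move || loop {
                // The lock guard is released at the end of this statement,
                // so it is not held while the event is being processed.
                let next = rx.lock().unwrap().recv();
                match next {
                    Ok(_event) => {
                        processed.fetch_add(1, Ordering::SeqCst);
                    }
                    Err(_) => break, // every sender is gone and the buffer is empty
                }
            })
        })
        .collect()
}
```

With a channel capacity of 2 and no worker draining it yet, a third call to `handle` returns normally but the event never reaches a worker. Nothing in this version notices that.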

## Problem 1: Idempotency Isn’t Just “Don’t Insert Twice”

I wanted ingestion to be idempotent by event_id. At first that meant:

- If the ID exists, return the existing record.
- Otherwise, insert it.

But that leaves a hole. What if the same ID comes in with a different payload? That’s not a duplicate—it’s a conflict.

Fix: Store a hash of the payload and reject mismatches.

| Situation | Result |
| --- | --- |
| Same ID + same payload | OK (deduped) |
| Same ID + different payload | `409 Conflict` |

A small change, but it forced me to treat idempotency as a real constraint instead of a convenience.
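A minimal sketch of that check, assuming an in-memory map keyed by `event_id`. `Store`, `IngestOutcome`, and `hash_payload` are hypothetical names, and the standard library’s `DefaultHasher` is used only to keep the example dependency-free. It is a 64-bit non-cryptographic hash, so a real service would more likely store a cryptographic digest.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// The three outcomes of ingesting an event.
#[derive(Debug, PartialEq, Eq)]
pub enum IngestOutcome {
    Inserted,
    Deduped,
    /// Same ID, different payload: the HTTP layer would map this to 409 Conflict.
    Conflict,
}

#[derive(Default)]
pub struct Store {
    // event_id -> hash of the payload first seen for that ID
    payload_hashes: HashMap<String, u64>,
}

fn hash_payload(payload: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    payload.hash(&mut hasher);
    hasher.finish()
}

impl Store {
    pub fn ingest(&mut self, event_id: &str, payload: &[u8]) -> IngestOutcome {
        let incoming = hash_payload(payload);
        match self.payload_hashes.get(event_id) {
            // Known ID and matching payload: a true duplicate.
            Some(&stored) if stored == incoming => IngestOutcome::Deduped,
            // Known ID but the payload differs: reject instead of overwriting.
            Some(_) => IngestOutcome::Conflict,
            None => {
                self.payload_hashes.insert(event_id.to_owned(), incoming);
                IngestOutcome::Inserted
            }
        }
    }
}
```

Returning an enum rather than a boolean keeps the conflict case visible to the caller, which is what lets the handler pick a distinct status code for it.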

## Problem 2: You Can Lose Work Even If You “Queued” It

Originally I assumed:

If I push an event into the queue, it will eventually be processed.

That’s not actually true. Two things break this:

- Queue is full: `try_send` fails.
- Queue is broken: the receiver was dropped.

In both cases the event exists in the system, but it never reaches a worker.

Fix: Separate “exists” from “scheduled.” Each record now tracks:

- `status`: the lifecycle state (`Received`, `Processing`, …)
- `queued`: whether we think it’s scheduled

If enqueue fails, the record still exists, but it isn’t reliably scheduled anymore. Which leads to the next problem.
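Both failure modes can be reproduced with the standard library’s bounded channel, which is used here as a stand-in for whatever async channel the service uses. In `std` the two errors are `TrySendError::Full` and `TrySendError::Disconnected`; tokio’s equivalent names the second one `Closed`. `Record`, `Enqueue`, and `try_schedule` are hypothetical names, and clearing `queued` on failure is an assumption about how the flag is maintained.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Received,
    Processing,
}

#[derive(Debug)]
pub struct Record {
    pub event_id: String,
    pub status: Status,
    /// Whether we believe the event is sitting in the channel.
    pub queued: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Enqueue {
    Scheduled,
    QueueFull,
    QueueClosed,
}

/// Try to schedule a record. The record exists either way;
/// only the `queued` flag says whether a worker can be expected to see it.
pub fn try_schedule(record: &mut Record, tx: &SyncSender<String>) -> Enqueue {
    match tx.try_send(record.event_id.clone()) {
        Ok(()) => {
            record.queued = true;
            Enqueue::Scheduled
        }
        Err(TrySendError::Full(_)) => {
            record.queued = false; // stored, but not scheduled
            Enqueue::QueueFull
        }
        Err(TrySendError::Disconnected(_)) => {
            record.queued = false; // the receiver is gone
            Enqueue::QueueClosed
        }
    }
}
```

Note that a record can still end up with `queued == true` and never be processed, for example if the receiver is dropped after the send succeeded. The flag records a belief, not a guarantee.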

## Problem 3: You Need a Sweeper (Even If It Feels Wrong)

I didn’t initially want a background task scanning state; it felt like a workaround. But without it, there are many ways for events to get stuck:

- Enqueue fails
- Worker crashes mid‑processing
- Retry timing gets missed

Solution: Add a sweeper that runs periodically and looks for:

- Events ready to retry
- Events marked queued but not processed for too long

It re‑enqueues those events. It’s not elegant, but it’s robust and gives you eventual correctness without requiring every code path to be perfect.
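The selection step of such a sweeper can be sketched as a pure function over the stored records. Everything here is an assumption made for illustration: the `queued_at` and `retry_at` fields, the `stale_after` threshold, and the idea that `queued` is cleared as soon as a worker picks the event up. The periodic timer and the actual re-enqueue are left out.

```rust
use std::time::{Duration, Instant};

pub enum Status {
    Received,
    Processing,
    FailedRetry,
}

pub struct Record {
    pub event_id: String,
    pub status: Status,
    pub queued: bool,
    /// When the record was last marked queued (hypothetical field).
    pub queued_at: Option<Instant>,
    /// Earliest time a retry may run (hypothetical field).
    pub retry_at: Option<Instant>,
}

/// Return the IDs that should be pushed back onto the channel.
pub fn sweep(records: &[Record], now: Instant, stale_after: Duration) -> Vec<String> {
    records
        .iter()
        .filter(|r| {
            // Case 1: a failed event whose retry time has arrived.
            let retry_due = matches!(r.status, Status::FailedRetry)
                && r.retry_at.map_or(false, |at| at <= now);
            // Case 2: marked queued, but nobody picked it up for too long.
            let stuck = r.queued
                && r.queued_at
                    .map_or(false, |at| now.duration_since(at) >= stale_after);
            retry_due || stuck
        })
        .map(|r| r.event_id.clone())
        .collect()
}
```

Taking `now` as a parameter instead of calling `Instant::now()` inside keeps the function deterministic, so the sweep rules can be tested without sleeping.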

## Problem 4: “Queue Depth” Is Not One Number

At first I tracked queue depth as a single value, which turned out to be misleading. There are at least three different things happening:

| Metric | Meaning |
| --- | --- |
| Channel depth | How many items are currently in the channel |
| Backlog | How many events are marked `queued == true` |
| Inflight | How many workers are actively processing |

These are not the same. For example:

- Channel depth can be 0 while backlog is high.
- Inflight can be maxed out while the queue stays empty.

Fix: Split them into separate metrics:

- `queue_channel_depth`
- `backlog_queued`
- `processing_inflight`

Once I did that, the system became much easier to reason about.
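As a small illustration of why the three numbers diverge, the sketch below derives them from different sources: channel depth from the channel itself, the other two from record state. The `Record` fields and the `snapshot` function are hypothetical, and a real service would more likely maintain gauges incrementally than rescan every record.

```rust
/// Hypothetical per-event state, reduced to the two flags that matter here.
pub struct Record {
    pub queued: bool,
    pub processing: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub struct Snapshot {
    pub queue_channel_depth: usize,
    pub backlog_queued: usize,
    pub processing_inflight: usize,
}

/// `channel_len` comes from the channel; the other two come from the store.
pub fn snapshot(records: &[Record], channel_len: usize) -> Snapshot {
    Snapshot {
        queue_channel_depth: channel_len,
        backlog_queued: records.iter().filter(|r| r.queued).count(),
        processing_inflight: records.iter().filter(|r| r.processing).count(),
    }
}
```

If three records are marked `queued` but their channel items were lost, this reports a channel depth of 0 next to a backlog of 3. A single "queue depth" number would hide that state.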

## Problem 5: Concurrency Needs to Be Bounded Explicitly

The simplest approach is to spawn a task per event. That works—until it doesn’t. I ended up using a Semaphore to limit concurrency:

- Each task acquires a permit.
- The permit is held for the duration of processing.
- Max concurrency is fixed.

Instead of a fixed worker pool, this lets me:

- Keep the code simple
- Avoid idle workers
- Still enforce limits

It also makes shutdown behavior much easier to control.
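The permit pattern can be shown without an async runtime. The sketch below is a thread-based analogue, not the service’s code: it hand-rolls a small counting semaphore on `Mutex` and `Condvar` where an async service would use its runtime’s semaphore type. `Semaphore`, `Permit`, and `run_bounded` are hypothetical names, and acquiring the permit before spawning is one possible design (acquiring inside the task is the other).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore built on Mutex + Condvar.
pub struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

/// RAII permit: it is returned to the semaphore when dropped.
pub struct Permit(Arc<Semaphore>);

impl Semaphore {
    pub fn new(permits: usize) -> Arc<Self> {
        Arc::new(Self { permits: Mutex::new(permits), available: Condvar::new() })
    }

    /// Block until a permit is free, then take it.
    pub fn acquire(self: &Arc<Self>) -> Permit {
        let mut free = self.permits.lock().unwrap();
        while *free == 0 {
            free = self.available.wait(free).unwrap();
        }
        *free -= 1;
        Permit(Arc::clone(self))
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        *self.0.permits.lock().unwrap() += 1;
        self.0.available.notify_one();
    }
}

/// Run `jobs` tasks with at most `limit` in flight.
/// Returns the peak concurrency that was observed.
pub fn run_bounded(jobs: usize, limit: usize) -> usize {
    let sem = Semaphore::new(limit);
    let inflight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            // Acquire before spawning, so the number of live tasks is bounded too.
            let permit = sem.acquire();
            let (inflight, peak) = (Arc::clone(&inflight), Arc::clone(&peak));
            thread::spawn(move || {
                let _permit = permit; // held for the duration of processing
                let now = inflight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(5)); // stand-in for real work
                inflight.fetch_sub(1, Ordering::SeqCst);
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

Because the permit is a value owned by the task, it is released when the task ends for any reason, including a panic. That removes one manual decrement that could otherwise be forgotten.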

## Problem 6: Graceful Shutdown Is Where Things Get Messy

Stopping a system like this is harder than starting it. You need to:

- Stop accepting new work.
- Stop dispatching new tasks.
- Let in‑flight work finish (within reason).
- Not hang forever.

What I ended up with:

- A watch channel for shutdown signalling.
- A dispatch loop that exits on signal.
- A `JoinSet` tracking worker tasks.
- A drain timeout, followed by a forced abort of anything still running.

Shutdown flow

1. Signal shutdown.
2. Stop pulling from the queue.
3. Wait up to N milliseconds for workers to finish.
4. Abort anything still running.

It’s not perfect, but it’s predictable.
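The same flow can be approximated with standard-library threads, under some loud assumptions: an `AtomicBool` replaces the watch channel, a `Vec<JoinHandle>` replaces the `JoinSet`, and each job is just a sleep duration in milliseconds. Threads in `std` cannot be aborted, so the last step here only detaches whatever is still running, whereas an async task set can cancel its tasks. `dispatch`, `drain`, and `DrainReport` are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Dispatch loop: pull jobs until shutdown is signalled or all senders are gone.
/// Returns the handles of every worker it started.
pub fn dispatch(rx: Receiver<u64>, shutdown: Arc<AtomicBool>) -> Vec<JoinHandle<()>> {
    let mut workers = Vec::new();
    // Once the flag is set, stop pulling from the queue.
    while !shutdown.load(Ordering::Acquire) {
        match rx.recv_timeout(Duration::from_millis(10)) {
            Ok(ms) => workers.push(thread::spawn(move || {
                thread::sleep(Duration::from_millis(ms)); // stand-in for real work
            })),
            Err(RecvTimeoutError::Timeout) => continue, // wake up to re-check the flag
            Err(RecvTimeoutError::Disconnected) => break, // no producers left
        }
    }
    workers
}

pub struct DrainReport {
    pub finished: usize,
    pub abandoned: usize,
}

/// Wait up to `timeout` for workers to finish, then give up on the rest.
pub fn drain(mut pending: Vec<JoinHandle<()>>, timeout: Duration) -> DrainReport {
    let deadline = Instant::now() + timeout;
    let mut finished = 0;
    loop {
        let (done, still): (Vec<_>, Vec<_>) =
            pending.into_iter().partition(|h| h.is_finished());
        for handle in done {
            let _ = handle.join(); // does not block: the thread has already exited
            finished += 1;
        }
        pending = still;
        if pending.is_empty() || Instant::now() >= deadline {
            break;
        }
        thread::sleep(Duration::from_millis(5));
    }
    // Dropping the remaining handles detaches those threads (std has no abort).
    DrainReport { finished, abandoned: pending.len() }
}
```

Reporting how many workers were abandoned, rather than silently dropping them, gives the shutdown path something to log or export as a metric.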

## Problem 7: Metrics Will Lie to You If You’re Not Careful

I added metrics early, but they were wrong at first. The issue was trying to track counts by incrementing and decrementing in multiple places—easy to get wrong in a concurrent system.

What worked:

- Counters: only ever increment.
- State counts: only update on real state transitions.

For example, `queued_count` only changes when:

- `queued` flips `false` → `true`
- `queued` flips `true` → `false`

Anything else introduces drift.
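One way to enforce that rule is to route every change of the flag through a single function that touches the gauge only when the value actually flips. This is a sketch under assumptions: `Record`, `Metrics`, and `set_queued` are hypothetical names, and the record is assumed to be exclusively borrowed while it is updated.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

pub struct Record {
    pub queued: bool,
}

#[derive(Default)]
pub struct Metrics {
    /// Gauge: number of records with `queued == true`.
    pub queued_count: AtomicUsize,
}

/// Set the flag and adjust the gauge only on a real transition.
/// Returns true if the flag changed.
pub fn set_queued(record: &mut Record, queued: bool, metrics: &Metrics) -> bool {
    if record.queued == queued {
        return false; // no transition, so the gauge must not move
    }
    record.queued = queued;
    if queued {
        metrics.queued_count.fetch_add(1, Ordering::SeqCst);
    } else {
        metrics.queued_count.fetch_sub(1, Ordering::SeqCst);
    }
    true
}
```

Calling `set_queued(&mut record, true, &metrics)` twice in a row leaves the gauge at 1, and clearing the flag twice cannot push it below 0.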

## The Resulting Model

The final system looks like this:

HTTP → Ingest → Store → Channel → Dispatcher → Workers
                                 ↑
                              Sweeper

## State Machine

Received → Processing → Completed
                    ↘ FailedRetry → Failed
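The diagram translates fairly directly into an enum with an explicit transition check. The sketch below is an illustration rather than the project’s code. It includes one edge the diagram does not draw, `FailedRetry` back to `Processing`, on the assumption that this is how a retry re-enters the flow.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Received,
    Processing,
    Completed,
    FailedRetry,
    Failed,
}

impl Status {
    /// True if moving from `self` to `next` is an allowed edge.
    pub fn can_transition_to(self, next: Status) -> bool {
        use Status::*;
        matches!(
            (self, next),
            (Received, Processing)
                | (Processing, Completed)
                | (Processing, FailedRetry)
                | (FailedRetry, Processing) // assumption: a retry goes back to Processing
                | (FailedRetry, Failed)
        )
    }
}

/// Apply a transition, or leave the state untouched and report the illegal move.
pub fn transition(current: &mut Status, next: Status) -> Result<(), String> {
    if current.can_transition_to(next) {
        *current = next;
        Ok(())
    } else {
        Err(format!("illegal transition {:?} -> {:?}", current, next))
    }
}
```

Keeping every state change behind one function like `transition` also gives a single place to update state-count metrics.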

## Metrics

- Ingress
- Deduplication
- Processing success / failure
- Backlog
- Queue state
- Concurrency
- Latency

## What I Took Away

- “Simple async system” is usually not simple once you care about correctness.
- State machines make concurrency problems easier to reason about.
- Back‑pressure is multi‑dimensional, not a single number.
- A sweeper is often the simplest way to guarantee eventual progress.
- Shutdown needs to be designed, not added later.
- Observability changes how you design the system.

## What I Didn’t Do (On Purpose)

This is an in‑memory system.

I didn’t add:

- persistence  
- distributed processing  
- external queues  

Those would be the next steps, but the goal here was to get the core behavior right first.

---

## Closing

This ended up being more about edge cases than features.

Most of the code is just making sure the system behaves correctly when things don’t go as planned — which is most of the time in real systems.

That was the interesting part.  

And honestly, the part I didn’t expect going in.

---

## Code

If you want to see the full implementation:

[https://github.com/yourname/eventful](https://github.com/yourname/eventful)
