Background Jobs in Production: The Problems Queues Don’t Solve

Published: 1 day ago (March 8, 2026, 8:17 PM GMT+9)

6 min read

Source: Dev.to

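Moving work off the request path is one of the most common ways to make backend systems faster.

Emails are sent asynchronously.
Invoices are generated by workers.
Webhooks are delivered through queues.
Image processing and indexing run as background jobs.

Latency improves immediately, but many teams eventually notice strange behavior in production:

Duplicate emails appear
Retries increase system load
Dead-letter queues slowly grow
Workflows technically “succeed”… but the results are wrong

The queue is healthy. The workers are running. Yet the system is not behaving correctly.

Moving work to the background changes where failures happen. It does not make the failures go away.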

The Assumption Behind Background Jobs

Background job systems are usually introduced with a simple expectation: if a job fails, the queue will retry it until it succeeds.

Queues also provide useful features:

buffering traffic spikes
independent worker scaling
retry handling
isolation from request latency

Because of this, async processing often feels safer than synchronous execution.
But that assumption depends on something rarely guaranteed in production: that running a job multiple times produces the same result as running it once.
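To make that assumption concrete, here is a minimal sketch of the same job applied twice by two handlers. The account object, the job shape, and both function names are hypothetical and exist only for illustration. One handler gives a different result on a retry, the other does not.

```javascript
// Hypothetical in-memory account; not taken from the original article.
function makeAccount() {
  return { balance: 100, appliedJobs: new Set() };
}

// Not idempotent: every run adds the credit again.
function applyCreditNaive(account, job) {
  account.balance += job.amount;
}

// Idempotent: the job id records that this credit was already applied.
function applyCreditIdempotent(account, job) {
  if (account.appliedJobs.has(job.id)) return;
  account.appliedJobs.add(job.id);
  account.balance += job.amount;
}

const job = { id: "job-1", amount: 25 };

const naive = makeAccount();
applyCreditNaive(naive, job);
applyCreditNaive(naive, job); // simulated retry

const safe = makeAccount();
applyCreditIdempotent(safe, job);
applyCreditIdempotent(safe, job); // simulated retry

console.log(naive.balance); // 150: the retry changed the outcome
console.log(safe.balance); // 125: same result as a single run
```

In a real system the set of applied job ids would have to live in durable storage and be updated in the same transaction as the balance. The in-memory set here only shows the shape of the check.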

What “At-Least-Once Delivery” Actually Means

Most queue systems guarantee at-least-once delivery. That means the system will try hard to deliver a message, even if it results in duplicate execution. It does not mean:

the job runs exactly once
side effects happen exactly once
messages are processed in order

In other words, the queue protects against message loss, not duplicate work. Once duplicate execution becomes possible, correctness has to come from elsewhere, typically:

idempotent handlers
deduplication keys
explicit state transitions
retry boundaries

Without those protections, the infrastructure is reliable while the workflow is not.
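Of the protections listed above, a retry boundary is probably the easiest to sketch. The example below is an illustration under assumptions: the attempt limit, the backoff values, and the names `processWithRetryBoundary` and `backoffDelayMs` are made up, and the loop retries immediately, whereas a real worker would reschedule the job after the delay.

```javascript
// Assumed limit for illustration; tune per workload.
const MAX_ATTEMPTS = 3;

// Exponential backoff with a cap: 200 ms, 400 ms, 800 ms, ... up to capMs.
function backoffDelayMs(attempt, baseMs = 200, capMs = 5000) {
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}

// Runs the handler at most MAX_ATTEMPTS times, then parks the job
// in a dead-letter list instead of retrying forever.
function processWithRetryBoundary(job, handler, deadLetters) {
  let lastError = null;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      handler(job);
      return { status: "done", attempts: attempt };
    } catch (err) {
      lastError = err;
      // A real worker would reschedule after backoffDelayMs(attempt)
      // rather than looping immediately.
    }
  }
  deadLetters.push({ job, reason: lastError.message });
  return { status: "dead-lettered", attempts: MAX_ATTEMPTS };
}

const deadLetters = [];
let flakyCalls = 0;
const flaky = () => {
  flakyCalls += 1;
  if (flakyCalls < 2) throw new Error("temporary failure");
};
const broken = () => {
  throw new Error("invalid payload");
};

const first = processWithRetryBoundary({ id: "a" }, flaky, deadLetters);
const second = processWithRetryBoundary({ id: "b" }, broken, deadLetters);

console.log(first); // succeeds on the second attempt
console.log(second); // gives up after MAX_ATTEMPTS
console.log(deadLetters); // only job "b" is parked
```

The boundary limits how much load a permanently failing job can generate. It does nothing for correctness on its own, so the handler still has to be safe to run more than once.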

The Classic Failure Scenario

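// Example worker that sends a payment receipt
// Side effect first: once this resolves, the email has left the system.
await emailClient.send(...);

// A crash before this update leaves no record of the send,
// so the retried job sends the email again.
await db.payment.update({
  receiptSentAt: new Date()
});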

If the worker crashes after sending the email but before updating the database, the job is retried. The customer receives two receipts. The queue behaved exactly as designed, but the business outcome is wrong.
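One common way to close this gap is an idempotency key derived from the business event, so that every retry of the same job presents the same key. The sketch below is not the article's implementation. `FakeEmailProvider`, the `idempotencyKey` option, and the in-memory `payments` map are hypothetical stand-ins, and the sketch assumes the email provider drops sends that reuse a key, which not every provider does.

```javascript
// Hypothetical in-memory provider. Assumption: it ignores a send
// that reuses an idempotency key it has already seen.
class FakeEmailProvider {
  constructor() {
    this.delivered = [];
    this.seenKeys = new Set();
  }
  async send(message, { idempotencyKey }) {
    if (this.seenKeys.has(idempotencyKey)) return { duplicate: true };
    this.seenKeys.add(idempotencyKey);
    this.delivered.push(message);
    return { duplicate: false };
  }
}

async function sendReceipt(job, { emailClient, payments, crashAfterSend = false }) {
  const payment = payments.get(job.paymentId);
  if (payment.receiptSentAt) return "skipped"; // an earlier run finished

  // The key comes from the payment, not the attempt, so retries reuse it.
  const idempotencyKey = `receipt:${job.paymentId}`;
  await emailClient.send(
    { to: job.email, paymentId: job.paymentId },
    { idempotencyKey }
  );

  if (crashAfterSend) throw new Error("simulated crash before the database update");

  payment.receiptSentAt = new Date();
  return "sent";
}

async function demo() {
  const emailClient = new FakeEmailProvider();
  const payments = new Map([["pay_1", { receiptSentAt: null }]]);
  const job = { paymentId: "pay_1", email: "customer@example.com" };
  const deps = { emailClient, payments };

  // First delivery crashes between the send and the database update.
  await sendReceipt(job, { ...deps, crashAfterSend: true }).catch(() => {});
  // The queue redelivers the same job, twice.
  const retry = await sendReceipt(job, deps);
  const later = await sendReceipt(job, deps);

  return { emailsDelivered: emailClient.delivered.length, retry, later };
}

demo().then((result) => console.log(result));
```

In this simulation the customer gets one email even though the handler ran three times. If the provider offers no deduplication, the same key can be stored in a local table with a unique constraint before sending. That trades the duplicate for a possible missed email, so the choice depends on which failure is cheaper.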

Why Production Systems Break Here

Background job systems introduce two factors that make correctness harder.

1. Duplicate execution

Workers can crash after performing side effects but before acknowledging the message.
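The redelivery itself is easy to reproduce with a toy queue. The class below is a hypothetical in-memory model, not the API of any real broker: a message is removed only after the handler returns, which stands in for the acknowledgement.

```javascript
// Toy at-least-once queue: a message leaves the queue only on ack.
class AtLeastOnceQueue {
  constructor() {
    this.pending = [];
  }
  publish(message) {
    this.pending.push(message);
  }
  // Delivers the oldest message. Returns false when the queue is empty.
  deliverNext(handler) {
    const message = this.pending[0];
    if (!message) return false;
    try {
      handler(message);
      this.pending.shift(); // ack: handler returned normally
    } catch (err) {
      // No ack: the message stays in the queue for redelivery.
    }
    return true;
  }
}

const sideEffects = [];
let crashOnce = true;

function handler(message) {
  sideEffects.push(`receipt sent for ${message.paymentId}`);
  if (crashOnce) {
    crashOnce = false;
    throw new Error("worker crashed before ack");
  }
}

const queue = new AtLeastOnceQueue();
queue.publish({ paymentId: "pay_1" });
queue.deliverNext(handler); // side effect happens, then the crash
queue.deliverNext(handler); // redelivery: side effect happens again

console.log(sideEffects.length); // 2
console.log(queue.pending.length); // 0
```

From the queue's point of view nothing went wrong: one message in, one message acknowledged. The duplicate exists only in the side effects.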

2. Time separation

Jobs may execute minutes or hours after they were created, when system state has already changed. Retries often interact with partial state or outdated context.
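A handler can defend against stale context by carrying only identifiers in the payload and re-reading current state when it actually runs. The sketch below assumes a hypothetical invoice reminder job. The `invoices` map, the one-hour age limit, and the returned strings are illustrative choices.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Hypothetical current state, standing in for a database table.
const invoices = new Map([
  ["inv_1", { status: "unpaid" }],
  ["inv_2", { status: "unpaid" }],
]);

// The payload holds only the id and the enqueue time; everything else
// is read at execution time, when it is guaranteed to be current.
function handleReminder(job, now, maxAgeMs = HOUR_MS) {
  const invoice = invoices.get(job.invoiceId);
  if (!invoice) return "skipped: invoice not found";
  if (invoice.status !== "unpaid") return `skipped: invoice is ${invoice.status}`;
  if (now - job.enqueuedAt > maxAgeMs) return "skipped: job too old";
  return "reminder sent";
}

const jobs = [
  { invoiceId: "inv_1", enqueuedAt: 0 },
  { invoiceId: "inv_2", enqueuedAt: 0 },
];

// State changes after the jobs were enqueued but before they run.
invoices.get("inv_1").status = "paid";

const onTime = jobs.map((job) => handleReminder(job, 10 * 60 * 1000));
const veryLate = handleReminder(jobs[1], 2 * HOUR_MS);

console.log(onTime); // inv_1 is skipped because it was paid in the meantime
console.log(veryLate); // skipped because the job outlived its usefulness
```

Whether an old job should be skipped, rescheduled, or escalated is a product decision. The point of the sketch is only that the decision is made against current state, not against a snapshot taken at enqueue time.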

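The Design Rule Most Teams Learn Late

Background jobs should not be treated as operations that run exactly once. They should be treated as re-runnable commands. Every handler must stay safe in each of these situations:

when it runs twice
when it runs later than expected
after it has partially completed
when it runs out of order

If any of these conditions breaks the workflow, retries will eventually corrupt the system's behavior.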
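Two of these conditions, out-of-order execution and partial completion, can be illustrated with small guards. This is a sketch under assumptions: the version field, the step names, and both helper functions are hypothetical, and the progress record would need durable storage in practice.

```javascript
// Out-of-order or duplicate updates: accept only a strictly newer version.
function applyUpdate(record, update) {
  if (update.version <= record.version) return false;
  record.version = update.version;
  record.status = update.status;
  return true;
}

// Partial completion: record finished steps so a retry resumes
// where the last run stopped instead of starting over.
function runSteps(progress, steps) {
  for (const [name, step] of steps) {
    if (progress.completed.has(name)) continue;
    step();
    // A crash between step() and this line still repeats the step,
    // so each step should also be idempotent on its own.
    progress.completed.add(name);
  }
}

const shipment = { status: "created", version: 0 };
const accepted = [
  applyUpdate(shipment, { version: 2, status: "delivered" }),
  applyUpdate(shipment, { version: 1, status: "in_transit" }), // arrives late
  applyUpdate(shipment, { version: 2, status: "delivered" }), // duplicate
];

const calls = { charge: 0, notify: 0 };
let failNotifyOnce = true;
const steps = [
  ["charge", () => { calls.charge += 1; }],
  ["notify", () => {
    if (failNotifyOnce) {
      failNotifyOnce = false;
      throw new Error("notify failed");
    }
    calls.notify += 1;
  }],
];

const progress = { completed: new Set() };
try {
  runSteps(progress, steps); // first run fails at "notify"
} catch (err) {
  // The job fails here and the queue retries it.
}
runSteps(progress, steps); // retry skips "charge" and finishes "notify"

console.log(accepted); // only the first update is accepted
console.log(shipment.status); // "delivered"
console.log(calls); // charge ran once, notify succeeded once
```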

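The Monitoring Trap

Teams often monitor the queue infrastructure:

queue depth
worker throughput
retry count
dead-letter volume

These metrics matter, but they cannot answer questions like these:

Did a user receive a duplicate email?
Did a payment create multiple accounting entries?
Did a downstream system receive conflicting updates?

A queue dashboard can look completely healthy even when the workflow is wrong.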
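Answering those questions usually requires a check on business outcomes instead of queue metrics. As a minimal sketch, assume each side effect is appended to an audit log with a type and the id of the entity it belongs to. The log shape and the function name `findDuplicateEffects` are hypothetical.

```javascript
// Counts side effects per (type, entity) pair and reports any pair
// that occurred more than once.
function findDuplicateEffects(effectLog) {
  const counts = new Map();
  for (const effect of effectLog) {
    const key = `${effect.type}:${effect.entityId}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return [...counts]
    .filter(([, count]) => count > 1)
    .map(([key, count]) => ({ key, count }));
}

// Hypothetical audit log of side effects written by the workers.
const effectLog = [
  { type: "receipt_email", entityId: "pay_1" },
  { type: "receipt_email", entityId: "pay_2" },
  { type: "receipt_email", entityId: "pay_1" },
  { type: "ledger_entry", entityId: "pay_2" },
];

const duplicates = findDuplicateEffects(effectLog);
console.log(duplicates); // one duplicate: receipt_email for pay_1, seen twice
```

A check like this can run on a schedule and raise an alert when the result is not empty. That gives a signal that stays red even while queue depth and throughput look normal.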

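Read the Full Production Analysis

This post covers only the core failure pattern. The full article explains:

why retries can make outages worse
how idempotent background jobs are designed
why dead-letter queues grow silently
what production teams monitor beyond queue depth
a practical rollout checklist for new background jobs

👉 Full article: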
