How to Design a Notification System: A Complete Guide

Published: (December 6, 2025 at 11:36 AM EST)
3 min read
Source: Dev.to

Introduction

This guide outlines how to build a scalable notification service that supports email, SMS, push, and in‑app channels. It covers user preferences, rate‑limiting, synchronous & batch delivery, queueing with retries, high availability, and the trade‑offs between latency, cost, and reliability.

Notification Types & Use Cases

  • Push notifications – Mobile and desktop alerts via services like FCM or APNs. (Firebase)
  • Email notifications – Transactional emails such as password resets, receipts, or promotions. (SendGrid)
  • SMS notifications – Time‑sensitive alerts like OTPs or delivery updates. (Twilio)
  • In‑app notifications – Alerts that appear inside the app itself, often using real‑time connections like WebSockets.

Typical scenarios

  • User engagement (encouraging return visits)
  • Transaction updates (payments, orders, deliveries)
  • Security alerts (login warnings, password changes)
  • System communication (downtime, maintenance, feature changes)

Core Requirements

| Requirement | Description |
| --- | --- |
| Multi‑channel support | Push, SMS, email, and in‑app alerts |
| Guaranteed delivery | Reliable sending with retries |
| User preferences | Quiet hours, preferred channels, opt‑outs |
| Personalization | Context‑aware messages (e.g., “Hi John, your package is on the way”) |
| Retry mechanism | Resend on failure with back‑off |
| Scalability | Millions of notifications per minute |
| Low latency | Seconds‑level delivery for OTPs, security alerts |
| High availability | Operate despite failures |
| Fault tolerance | Recover without data loss |
| Observability | Metrics, logs, tracing for delivery status |

Challenges

  • High concurrency – Delivering massive volumes in short bursts.
  • Channel complexity – Each channel has distinct failure modes and limits.
  • Delivery guarantees – Choosing between at‑most‑once, at‑least‑once, or exactly‑once semantics.
  • User preferences at scale – Enforcing opt‑in/out and quiet hours efficiently.
  • Failure handling – Retries, exponential back‑off, dead‑letter queues, and fallbacks for external services.
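
The failure-handling bullet above can be illustrated with a minimal sketch of retries with exponential back‑off and a dead‑letter queue. The names `backoff_delay`, `deliver_with_retries`, and the list used as a DLQ are hypothetical stand-ins, and the base delay, cap, and attempt count are assumed values rather than recommendations from the article.

```python
import random
import time
from typing import Callable


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter delay in seconds for a zero-based attempt number."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def deliver_with_retries(
    event: dict,
    send: Callable[[dict], None],
    dead_letters: list,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call send(event) up to max_attempts times.

    Returns True on success. If every attempt raises, the event is
    appended to dead_letters (a stand-in for a real DLQ) and False
    is returned.
    """
    for attempt in range(max_attempts):
        try:
            send(event)
            return True
        except Exception:
            # Wait before the next attempt, but not after the last one.
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    dead_letters.append(event)
    return False
```

Jitter spreads retries out so that many workers failing at once do not hit a recovering provider at the same moment. A production worker would likely retry only on errors it knows to be transient instead of catching every exception.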

High‑Level Architecture

[Producers] → [Ingress API] → [Message Broker] → [Worker Pool] → [Channel Adapters] → External Providers

Components

Producer (Event Source)

Generates notification events (e.g., order placed, message received, system alert).

Message Broker / Queue

Acts as a buffer between producers and workers. Common choices:

  • Apache Kafka – High‑throughput, replayability, partitioning. (Apache Kafka)
  • RabbitMQ – Flexible routing, suitable for complex patterns.
  • AWS SQS / Google Pub/Sub – Fully managed, lower operational overhead. (DataCamp)

Notification Service (Workers)

  • Reads events from the broker.
  • Applies business logic and checks user preferences.
  • Selects appropriate channel(s).
  • Formats payloads for each channel.
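
As a rough sketch of the preference check and channel selection steps listed above, the snippet below filters an event's channels against a user's settings. The `Preferences` shape, the `DEFAULT_CHANNELS` fallback, the `"HIGH"` priority value, and the rule that only in‑app delivery is kept during quiet hours are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import List, Optional, Set

# Assumed fallback when an event does not name its channels.
DEFAULT_CHANNELS = ["PUSH", "EMAIL"]


@dataclass
class Preferences:
    enabled_channels: Set[str] = field(
        default_factory=lambda: {"PUSH", "EMAIL", "SMS", "IN_APP"}
    )
    quiet_start: Optional[time] = None
    quiet_end: Optional[time] = None


def in_quiet_hours(prefs: Preferences, now: time) -> bool:
    """True if `now` falls inside the user's quiet window."""
    if prefs.quiet_start is None or prefs.quiet_end is None:
        return False
    if prefs.quiet_start <= prefs.quiet_end:
        return prefs.quiet_start <= now < prefs.quiet_end
    # The window wraps past midnight, e.g. 22:00 to 07:00.
    return now >= prefs.quiet_start or now < prefs.quiet_end


def select_channels(event: dict, prefs: Preferences, now: time) -> List[str]:
    """Return the channels to use, in the order the event requested them."""
    requested = event.get("channels") or DEFAULT_CHANNELS
    allowed = [c for c in requested if c in prefs.enabled_channels]
    if in_quiet_hours(prefs, now) and event.get("priority") != "HIGH":
        # Assumed policy: only the non-interrupting in-app channel is kept.
        allowed = [c for c in allowed if c == "IN_APP"]
    return allowed
```

Here `now` is expected to be the user's local time, so a real worker would convert the event timestamp using the user's time zone before calling `select_channels`.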

Channel Integrations

  • Push – APNs / FCM adapters.
  • Email – SMTP, SendGrid, or Amazon SES.
  • SMS – Twilio or telecom gateways. (Twilio)

Databases

Store:

  • User preferences and device tokens.
  • Delivery logs, rate‑limits, and notification history.
  • Idempotency keys for deduplication, which gives effectively‑once processing on top of at‑least‑once delivery.
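
To show how stored idempotency keys can suppress duplicate sends, here is a minimal in‑memory sketch. `IdempotencyStore` and `handle` are hypothetical names, the 24‑hour TTL is an assumed value, and a real deployment would keep the keys in a shared database or cache instead of a per‑process dictionary.

```python
import time
from typing import Callable, Dict


class IdempotencyStore:
    """In-memory stand-in for a shared key store with expiry."""

    def __init__(self, ttl_seconds: float = 24 * 3600,
                 clock: Callable[[], float] = time.monotonic):
        self._expiry: Dict[str, float] = {}
        self._ttl = ttl_seconds
        self._clock = clock

    def claim(self, key: str) -> bool:
        """Return True the first time a key is seen within the TTL."""
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False  # already processed recently
        self._expiry[key] = now + self._ttl
        return True


def handle(event: dict, store: IdempotencyStore,
           send: Callable[[dict], None]) -> bool:
    """Send the event unless its idempotency key was already claimed."""
    if not store.claim(event["idempotency_key"]):
        return False  # duplicate delivery from the broker, skip it
    send(event)
    return True
```

Claiming the key before sending means a crash between the two steps drops the notification, while claiming after sending risks a duplicate. Which order is acceptable depends on the delivery guarantee chosen for that notification type.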

Monitoring & Logging

Collect metrics, logs, and traces, and build dashboards on top of them to track delivery success, failures, and retry counts.

Event Payload Example

{
  "event_id": "uuid-v4",
  "event_type": "ORDER_SHIPPED",
  "priority": "MEDIUM",
  "user_id": "user-123",
  "tenant_id": "org-456",
  "timestamp": "2025-12-06T12:34:56Z",
  "payload": {
    "order_id": "order-789",
    "tracking_url": "https://carrier/track/..."
  },
  "channels": ["PUSH", "EMAIL"],
  "idempotency_key": "user-123-order-789"
}

The channels field is optional; when present, it overrides the channels the service would otherwise select.

  • Queues decouple producers and consumers, provide buffering, and enable back‑pressure.
  • A partitioning key (e.g., hash(user_id) % partitions) distributes load while preserving per‑user ordering, because all of a user's events land on the same partition.
  • Dead Letter Queue (DLQ) captures events that repeatedly fail after retries.
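
The partitioning idea above can be sketched as a stable hash of the user ID. `partition_for` is a hypothetical helper; it uses SHA‑256 rather than Python's built‑in `hash()`, because string hashes from `hash()` are randomized per process and would send the same user to different partitions on different workers.

```python
import hashlib


def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a user ID to a partition index that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_partitions


# The same user always maps to the same partition, which is what
# keeps that user's events in order.
assert partition_for("user-123", 12) == partition_for("user-123", 12)
```

With Kafka, setting the record key to the user ID is usually enough, since the default partitioner hashes the key for you. Note that changing the partition count changes the mapping, so ordering is only preserved while the count stays fixed.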

Choosing a Queue

| Queue | Strengths | Typical Use |
| --- | --- | --- |
| Apache Kafka | Very high throughput, durable log, replayability | Heavy‑volume streaming pipelines |
| RabbitMQ | Rich routing, acknowledgments | Complex routing, moderate scale |
| AWS SQS / Google Pub/Sub | Managed service, simple ops | When you prefer minimal operational overhead |

Channel Integration Details

Push Notifications

  • Use FCM for Android and cross‑platform delivery; it can proxy to APNs for iOS.
  • Store device tokens, handle invalidation, and rotate stale tokens.
  • Respect payload size limits; keep messages concise.

Email Notifications

  • Prefer transactional providers (SendGrid, Amazon SES) for deliverability and reputation.
  • Implement rate‑limiting and back‑off to avoid throttling.
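
One common way to implement the rate‑limiting mentioned above is a token bucket, sketched below. The `TokenBucket` class and its parameters are illustrative assumptions; actual limits depend on the provider and account, and a multi‑worker deployment would need the bucket state in a shared store.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._tokens = capacity
        self._clock = clock
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False when throttled."""
        now = self._clock()
        # Refill according to elapsed time, never exceeding capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A worker would call `try_acquire()` before each send and requeue or delay the message when it returns False, so throttling happens on your side before the provider starts rejecting requests.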

SMS Notifications

  • Use Twilio or carrier gateways for global reach.
  • Account for carrier‑specific rate limits and message length restrictions.
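
To make the message length restriction concrete, the sketch below estimates how many SMS segments a text needs. It assumes the standard limits of 160 characters for a single GSM‑7 message (153 per part when concatenated) and 70 UTF‑16 code units for UCS‑2 (67 per part). `sms_segments` is a hypothetical helper, and it is an approximation: it ignores the rule that a two‑septet character cannot straddle a segment boundary, and providers may count differently.

```python
# GSM 03.38 default alphabet (one septet each) and the extension
# characters that cost two septets (escape plus character).
GSM7_BASIC = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
GSM7_EXTENDED = set("^{}\\[~]|€")


def sms_segments(text: str) -> int:
    """Estimate the number of SMS segments needed for `text`."""
    if all(ch in GSM7_BASIC or ch in GSM7_EXTENDED for ch in text):
        units = sum(2 if ch in GSM7_EXTENDED else 1 for ch in text)
        single, multi = 160, 153
    else:
        # Any other character forces UCS-2, counted in UTF-16 code units.
        units = len(text.encode("utf-16-be")) // 2
        single, multi = 70, 67
    if units <= single:
        return 1
    return -(-units // multi)  # ceiling division
```

Since providers typically bill per segment, a check like this can flag templates where a single non‑GSM character, such as an emoji, more than doubles the cost of a message.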

Conclusion

Designing a notification system means balancing latency, reliability, and cost while supporting multiple channels and respecting user preferences. Structuring the architecture around a robust event pipeline (producer → broker → workers → channel adapters) gives you scalability, fault tolerance, and observability, which prepares the design for both interview discussions and real‑world production workloads.

