Tracing Discord's Elixir Systems (Without Melting Everything)
Source: Discord Engineering
Background
At Discord, we aim for chat, reactions, and meme posting to feel instantaneous. We achieve this at scale by leveraging Elixir’s powerful concurrency mechanisms to run each Discord server (a “guild”) fully independently of the others.
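As a rough illustration of that isolation, here is a minimal sketch of a one-process-per-guild design. The module and registry names (`Guild`, `GuildRegistry`) are illustrative assumptions, not Discord's actual code; the point is that each guild's state lives in its own process, so a slow guild cannot block the others.

```elixir
defmodule Guild do
  use GenServer

  # Each guild gets its own GenServer, registered under its guild id.
  # Assumes a Registry named GuildRegistry has been started.
  def start_link(guild_id),
    do: GenServer.start_link(__MODULE__, guild_id, name: via(guild_id))

  defp via(guild_id), do: {:via, Registry, {GuildRegistry, guild_id}}

  # Route an action to one specific guild; other guilds are unaffected.
  def handle_action(guild_id, action),
    do: GenServer.cast(via(guild_id), {:action, action})

  @impl true
  def init(guild_id), do: {:ok, %{id: guild_id, actions: 0}}

  @impl true
  def handle_cast({:action, _action}, state),
    do: {:noreply, %{state | actions: state.actions + 1}}
end
```

Because each guild is a separate process with its own mailbox, a burst of activity on one guild queues up only in that guild's mailbox.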
Observability Workflow
When a guild can’t keep up with user activity, it may feel laggy or suffer a complete outage. If the system degrades beyond the point where it can self‑heal, an on‑call engineer intervenes and then uses our observability tools to understand the cause and prevent a recurrence.
The investigation begins by looking at metrics and logs. We instrument a wide array of measurements, including how frequently each user‑action type is processed and how long processing takes. These metrics often reveal bursty activity—such as a flurry of hype and reactions on a newly released game—but they don’t fully capture the user experience. It’s similar to a car’s dashboard: it can tell you the engine temperature but not the consequences of it running hot.
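The two measurements described above (how often each action type runs, and how long it takes) could be captured with something like the following sketch. This is not Discord's instrumentation; `ToyMetrics` is a hypothetical in-memory stand-in for a real metrics backend, used here only to make the shape of the data concrete.

```elixir
defmodule ToyMetrics do
  use Agent

  def start_link, do: Agent.start_link(fn -> %{} end, name: __MODULE__)

  # Record one processed action of `type` that took `micros` microseconds.
  # State is a map of action type -> {count, total_microseconds}.
  def record(type, micros) do
    Agent.update(__MODULE__, fn state ->
      Map.update(state, type, {1, micros}, fn {count, total} ->
        {count + 1, total + micros}
      end)
    end)
  end

  def snapshot, do: Agent.get(__MODULE__, & &1)
end

# Usage sketch: time an action, then record it.
#   {micros, result} = :timer.tc(fn -> process(action) end)
#   ToyMetrics.record(:message_create, micros)
```

A real deployment would ship these values to a time-series database rather than holding them in an Agent.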
Guild Timings
If metrics don’t yield results, the on‑call engineer turns to our custom‑built tool called guild timings. Every time a guild processes an action, it records to an in‑memory store how much of the current minute has been spent on each action type. This data is far more detailed than our metrics, but it is emitted at such a high volume that we can’t store it all. Consequently, the data is rotated frequently for all but our largest guilds. Even when retrieved promptly, it still doesn’t provide a complete picture of the end‑to‑end experience because it doesn’t capture downstream effects.
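A minimal sketch of the guild-timings idea, under our own assumptions rather than Discord's implementation: accumulate, per action type, how much time was spent during each minute, keyed by the minute number so that old entries can be rotated out. The retention window of five minutes is an illustrative choice.

```elixir
defmodule GuildTimings do
  @keep_minutes 5  # retention window (illustrative, not Discord's value)

  # `timings` is a map of minute -> %{action_type => total_microseconds}.
  # `now` is a Unix timestamp in seconds.
  def record(timings, action_type, micros, now \\ System.system_time(:second)) do
    minute = div(now, 60)

    timings
    |> Map.update(minute, %{action_type => micros}, fn per_type ->
      Map.update(per_type, action_type, micros, &(&1 + micros))
    end)
    |> rotate(minute)
  end

  # Drop minutes that fell out of the retention window, mirroring the
  # frequent rotation described above.
  defp rotate(timings, current_minute) do
    Map.reject(timings, fn {minute, _} ->
      minute < current_minute - @keep_minutes
    end)
  end
end
```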
Distributed Tracing
Other teams at Discord have derived enormous value from distributed tracing (Application Performance Monitoring), which shows how long each constituent part of an operation took. Adding tracing to our Elixir stack, however, required additional work: most tracing tools propagate operation information via metadata layers like HTTP headers, but Elixir’s built‑in communication tools lack an equivalent layer out of the box.
Implementation
We built our own tracing layer to propagate metadata across services. Despite changing how our services communicate, we integrated the solution without downtime.
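The core idea can be sketched as follows, with our own illustrative names rather than Discord's actual modules: wrap each message in an envelope that carries trace metadata, and have the receiver unwrap it and adopt the context before handling the message. Accepting both wrapped and bare messages is one way such a change can roll out incrementally, without downtime.

```elixir
defmodule Trace do
  # Current trace context, stored in the process dictionary purely
  # for illustration (a real layer might thread it explicitly).
  def current, do: Process.get(:trace_ctx, %{})
  def put(ctx), do: Process.put(:trace_ctx, ctx)

  # Sending side: attach the caller's trace context to the message,
  # playing the role HTTP headers play in other stacks.
  def send_traced(dest, msg), do: send(dest, {:traced, current(), msg})

  # Receiving side: unwrap the envelope and adopt its context.
  def unwrap({:traced, ctx, msg}) do
    put(ctx)
    msg
  end

  # Legacy messages sent without an envelope still work unchanged.
  def unwrap(msg), do: msg
end
```

Because `unwrap/1` falls through for plain messages, senders and receivers can be upgraded independently while both message shapes coexist.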