Datadog: Observability Lessons from 50+ AWS Apps
Source: Dev.to
Lesson 1 – Datadog Goes Beyond Observability; It’s a Reliability Tool
While I call myself an Observability practitioner, I'm also an SRE. My end goal is to enable a world-class customer experience for end users, so I rely heavily on Site Reliability Engineering (SRE) concepts. In the world of SRE, we focus on a few pillars:
- Architecture – Reliability comes from strong architectures and design patterns
- Observability – Full‑stack visibility across systems
- SLI/SLO & Error Budgets – Measuring customer experience
- Release & Incident Engineering – Treating operations as a software problem
- Automation – Eliminate, reduce, simplify, and automate
- Resilience Engineering – Chaos engineering and failure testing
- People & Awareness – The human factor in reliability
Observability is a key pillar of reliability engineering. We enable observability so we can measure customer experience. When that experience degrades, we can quickly isolate the root cause and resolve the issue. Datadog supports all of the above pillars, which is why I view it as a reliability-enhancing tool, not just an observability tool.
Lesson 2 – Datadog Is Your Partner: Observability Is a Journey
Generally, we start with keeping the lights on, then make systems observable, correlate data, and finally enable AIOps. It’s a journey. I have published a complete guide to the AWS Observability Maturity Model V2. Datadog is well‑equipped to enable each step of that journey.
Lesson 3 – Datadog SLOs: Measuring Customer Experience
I treat observability as a by‑product of measuring customer experience. The typical flow is:
- Define Service Level Indicators (SLIs) for any app.
- Convert those SLIs into Service Level Objectives (SLOs).
Once you enable Application Performance Monitoring (APM) with Datadog and have logs, metrics, and traces, you can build an SLI dashboard that serves as a single source of truth for your system. You then convert those SLIs into meaningful SLOs in Datadog.
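Datadog provides three types of SLOs:
- By count (metric-based) – Good events ÷ total events.
- By monitor uptime (monitor-based) – The share of time that one or more Datadog monitors, such as a synthetic test, stay out of alert.
- By time slices – Uptime computed from a custom definition of what counts as a good slice of time.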
If you have SLOs, you already measure customer experience and you’re way ahead of the game.
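To make the by-count arithmetic concrete, here is a minimal sketch of how an SLI and its remaining error budget fall out of good and total event counts. The class name, field names, and sample numbers are my own illustration, not Datadog's API; Datadog does this calculation for you once the SLO exists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CountSLO:
    """A by-count SLO: the share of good events among all events."""
    target_pct: float   # e.g. 99.9 means 99.9% of events must be good
    good_events: int
    total_events: int

    @property
    def sli_pct(self) -> float:
        # With no traffic there is nothing to violate, so treat the SLI as 100%.
        if self.total_events == 0:
            return 100.0
        return 100.0 * self.good_events / self.total_events

    @property
    def allowed_bad_events(self) -> float:
        # The error budget expressed as the number of events that may fail.
        return self.total_events * (100.0 - self.target_pct) / 100.0

    @property
    def budget_remaining_pct(self) -> float:
        # Share of the error budget still unspent; negative once the SLO is breached.
        allowed = self.allowed_bad_events
        bad = self.total_events - self.good_events
        if allowed == 0:
            return 100.0 if bad == 0 else float("-inf")
        return 100.0 * (1.0 - bad / allowed)


# Illustrative numbers only: one million requests, 500 of them bad.
slo = CountSLO(target_pct=99.9, good_events=999_500, total_events=1_000_000)
print(f"SLI: {slo.sli_pct:.2f}%")
print(f"Allowed bad events: {slo.allowed_bad_events:.0f}")
print(f"Error budget remaining: {slo.budget_remaining_pct:.1f}%")
```

With a 99.9% target, roughly 1,000 of those requests may fail, so 500 bad requests spend about half of the budget.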
Lesson 4 – Datadog Real User Monitoring (RUM): Know What Your End Users Are Doing
Observability gives you insight into your system’s internal state, but you also need to know what end users are experiencing. That’s where RUM shines. It not only surfaces metrics related to end‑user experience, but features like Session Replay let you watch exactly what customers are doing. When a customer complains that something isn’t working, you’re only a few steps away from pinpointing the issue with Datadog RUM.
Lesson 5 – Enhance Built‑In Telemetry with Small Code Changes
Datadog works great out‑of‑the‑box, but a few targeted code changes can unlock massive benefits:
- Attach important details to RUM sessions, encrypted where they are sensitive, so you can filter RUM data by user, product, and so on.
- Add custom instrumentation to APM for deeper visibility in hard‑to‑reach corners.
Even modest enhancements can produce “magic” results.
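As one possible shape for the first enhancement, the sketch below prepares session attributes on the backend before the frontend attaches them to RUM (for example through the browser SDK's `setUser` call). It uses a keyed hash (HMAC-SHA256) as a stand-in for encryption, which is my assumption rather than something this article prescribes; the function names, attribute names, and secret are all hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret: bytes) -> str:
    """Return a stable, non-reversible token for a sensitive value (HMAC-SHA256)."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def build_session_context(user_id: str, plan: str, product: str, secret: bytes) -> dict:
    """Build attributes a frontend could attach to its RUM session.

    Attribute names are hypothetical; only the hashed ID leaves the backend.
    """
    return {
        "id": pseudonymize(user_id, secret),
        "plan": plan,
        "product": product,
    }


# Hypothetical secret; in practice load it from a secrets manager.
SECRET = b"replace-me"
context = build_session_context("user-12345", plan="enterprise", product="checkout", secret=SECRET)
print(context)
```

Because the token is stable for a given user, support can still filter sessions by it without the raw identifier ever appearing in telemetry.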
Lesson 6 – Use Datadog Monitors Wisely
At a high level, Datadog monitors fall into these categories:
| Category | Monitor Types |
|---|---|
| Infrastructure & Host Reliability | Metric, Host, Process Check, Live Process, Service Check, Change, Integration |
| Application Performance & Error Detection | APM, Error Tracking, Anomaly, Outlier, Forecast, Composite |
| User Experience & Frontend Reliability | Real User Monitoring, CI & Tests, Network Check |
| Logs, Events & Operational Intelligence | Logs, Event, Watchdog, LLM Observability |
| Network & Dependency Reliability | NDM NetFlow |
| Reliability Objectives & Governance | SLO |
| Observability Data Quality | Data Quality (preview) |
Choose the right monitor for the problem you’re solving.
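Once you have picked a monitor type, keeping its definition in code makes it reviewable. The sketch below assembles a metric monitor as a plain dictionary using fields from Datadog's monitor API (`name`, `type`, `query`, `message`, `tags`, `options.thresholds`). The metric name, tags, and notification handle are hypothetical, and the payload is only printed, not sent; check the current API reference before submitting anything like it.

```python
import json


def build_metric_monitor(name: str, metric_query: str, critical: float,
                         warning: float, notify: str) -> dict:
    """Assemble a metric monitor definition as a plain dict (nothing is sent)."""
    if warning >= critical:
        raise ValueError("warning threshold must be below the critical threshold")
    return {
        "name": name,
        "type": "metric alert",
        # The threshold in the query has to match options.thresholds.critical.
        "query": f"avg(last_5m):{metric_query} > {critical}",
        "message": f"{name} breached its threshold. {notify}",
        "tags": ["managed-by:code"],
        "options": {"thresholds": {"critical": critical, "warning": warning}},
    }


monitor = build_metric_monitor(
    name="High latency on checkout",
    # Hypothetical metric and tag; substitute a metric your services emit.
    metric_query="avg:checkout.request.latency{env:prod}",
    critical=0.5,
    warning=0.3,
    notify="@slack-checkout-oncall",
)
print(json.dumps(monitor, indent=2))
```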
Lesson 7 – Datadog Scorecards for Observability Governance
We define systems in Datadog, register services in the Datadog service catalog, and then enable Datadog scorecards. This gives you an automatic way to measure where each service stands. The built-in capabilities are great, and you can always extend them with customizations via the provided APIs.
Key scorecard dimensions:
- Observability Best Practices – Ensure services emit the right signals by validating deployment tracking, log ingestion, and log‑trace correlation.
- Ownership & Documentation – Confirm every service has clear ownership (teams, contacts, repos, docs) to enable fast escalation and effective incident response.
- Production Readiness – Verify services are operationally ready by checking recent deployments, active monitors, on‑call coverage, and defined SLOs.
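To show what a readiness check boils down to, here is a small, self-contained sketch that scores a service against rules modeled on the Production Readiness dimension. It is not Datadog's scorecard implementation or API; the `Service` fields, the rule names, and the 90-day deployment window are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


@dataclass
class Service:
    """Hypothetical service metadata; field names are illustrative."""
    name: str
    team: Optional[str]
    last_deploy: Optional[datetime]
    monitor_count: int
    has_on_call: bool
    slo_count: int


def production_readiness(service: Service, now: datetime,
                         max_deploy_age_days: int = 90) -> Dict[str, bool]:
    """Evaluate each readiness rule and return pass/fail per rule."""
    recent = (
        service.last_deploy is not None
        and now - service.last_deploy <= timedelta(days=max_deploy_age_days)
    )
    return {
        "has_owner": service.team is not None,
        "recent_deployment": recent,
        "active_monitors": service.monitor_count > 0,
        "on_call_coverage": service.has_on_call,
        "slos_defined": service.slo_count > 0,
    }


def score(results: Dict[str, bool]) -> float:
    """Fraction of rules that passed, between 0.0 and 1.0."""
    return sum(results.values()) / len(results)


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
checkout = Service("checkout", "payments", now - timedelta(days=10), 4, True, 0)
results = production_readiness(checkout, now)
failed = [rule for rule, ok in results.items() if not ok]
print(f"{checkout.name}: {score(results):.0%} ready, failing: {failed}")
```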
Lesson 8 – Build Incident Management with Datadog On‑Call & Incident Management
Datadog On‑Call is a one‑stop place for incident and escalation management. You can define teams, on‑call schedules, and escalation policies. It handles on‑call alerting and provides useful metrics. Initially you may see a lot of noise, but over time you can trim it down to a bare minimum. If you’re already in Datadog, there’s no need for a separate on‑call management solution.
Lesson 9 – Datadog Synthetic Tests
- Purpose: Proactively test your AWS infrastructure.
- Why it matters: You only get telemetry when end‑users are using the system. Synthetic tests mimic those users, giving you visibility even when traffic is low.
- Key points:
  - Not just a simple URL check – you can automate full-stack smoke tests.
  - Datadog offers many test locations worldwide, so you can run tests from any region.
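A simple starting point is an HTTP test with a status and a response-time assertion, run from more than one location. The sketch below builds such a definition as a dictionary, following the Synthetics API test format as I understand it (`type`, `subtype`, `config.request`, `config.assertions`, `locations`, `options.tick_every`). The URL, test name, tag, and chosen locations are hypothetical, and nothing is sent to Datadog; verify the fields against the current API reference.

```python
import json
from typing import List


def build_http_synthetic(name: str, url: str, locations: List[str],
                         max_response_ms: int, every_seconds: int = 300) -> dict:
    """Assemble an HTTP synthetic test definition as a plain dict."""
    if not locations:
        raise ValueError("at least one test location is required")
    return {
        "name": name,
        "type": "api",
        "subtype": "http",
        "config": {
            "request": {"method": "GET", "url": url},
            "assertions": [
                # The endpoint must answer 200 ...
                {"type": "statusCode", "operator": "is", "target": 200},
                # ... and do so within the latency budget (milliseconds).
                {"type": "responseTime", "operator": "lessThan", "target": max_response_ms},
            ],
        },
        "locations": locations,
        "options": {"tick_every": every_seconds},
        "message": f"Synthetic check failed for {url}",
        "tags": ["managed-by:code"],
    }


synthetic = build_http_synthetic(
    name="Checkout health from two regions",
    url="https://shop.example.com/health",      # hypothetical endpoint
    locations=["aws:us-east-1", "aws:eu-west-1"],
    max_response_ms=2000,
)
print(json.dumps(synthetic, indent=2))
```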
Lesson 10 – Datadog CI Visibility & Software Changes
- Purpose: Keep track of what developers are shipping.
- How it works: Integrate your CI/CD pipeline so Datadog knows when a team deploys to production.
- Benefits:
  - Enable deployment-version tracking in Datadog APM.
  - Compare response times across releases.
  - Act on insights proactively.
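Comparing releases comes down to grouping latency by version and checking whether the new version is meaningfully slower. Datadog APM does this for you when spans carry a `version` tag (for example via `DD_VERSION` in unified service tagging); the sketch below only illustrates the underlying comparison. The sample data, function names, and 10% tolerance are assumptions.

```python
from collections import defaultdict
from statistics import quantiles
from typing import Dict, Iterable, Tuple


def p95_by_version(samples: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group (version, latency_ms) samples and return the p95 latency per version."""
    grouped = defaultdict(list)
    for version, latency_ms in samples:
        grouped[version].append(latency_ms)
    # quantiles(..., n=20) returns 19 cut points; the last one is the 95th percentile.
    # Each version needs at least two samples for quantiles() to work.
    return {
        version: quantiles(values, n=20, method="inclusive")[-1]
        for version, values in grouped.items()
    }


def regressed(baseline_ms: float, candidate_ms: float, tolerance: float = 0.10) -> bool:
    """True when the candidate is more than `tolerance` slower than the baseline."""
    return candidate_ms > baseline_ms * (1.0 + tolerance)


# Hypothetical latency samples tagged with the release that served them.
samples = [
    ("1.4.0", 100), ("1.4.0", 110), ("1.4.0", 120), ("1.4.0", 130), ("1.4.0", 140),
    ("1.5.0", 100), ("1.5.0", 120), ("1.5.0", 150), ("1.5.0", 200), ("1.5.0", 260),
]
p95 = p95_by_version(samples)
print(p95)
print("Regression detected:", regressed(p95["1.4.0"], p95["1.5.0"]))
```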
Lesson 11 – Datadog Workflow Automations
- Purpose: Automate remediation.
- Features:
  - Build complex remediation workflows that can be triggered by monitors.
  - First step toward “automating your job away.”
  - Integrates with almost all AWS services, allowing you to automate AWS infrastructure and other operational workflows.
Lesson 12 – Datadog Code Security
- Purpose: Secure your AWS‑based systems.
- Capabilities:
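  - SCA – Software Composition Analysis (vulnerabilities in third-party libraries)
  - SAST – Static Application Security Testing (static code analysis)
  - IAST – Interactive Application Security Testing (runtime code analysis)
  - Secret Scanning – Detect exposed secrets
  - IaC Scanning – Infrastructure-as-Code security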
- How to start: Connect your code repositories to Datadog Code Security. This is the first step to using its protection.
Lesson 13 – Datadog AI Observability
- Purpose: Measure AI/LLM performance across the stack.
- Why it matters: Modern systems increasingly embed large language models; you need full‑stack AI observability to monitor latency, errors, and resource usage.
Lesson 14 – Datadog Bits AI (SRE Agent)
- Purpose: Provide an on‑call teammate that accelerates root‑cause analysis.
- Highlights:
  - Can reduce root-cause analysis (RCA) time to a few minutes.
  - Leverages complete telemetry, internal system state, end-user activity, and code behavior to pinpoint issues quickly.
  - Correlates signals faster than a manual investigation can.
Lesson 15 – Datadog UI
- Purpose: Deliver business‑level visibility to every stakeholder.
- Features:
  - Simple, intuitive interface that abstracts complexity.
  - Views tailored to personas such as SREs, developers, senior executives, and CTOs.
  - Enables organization-wide transparency and data-driven decision-making.
Closing Thoughts
These are some of the key lessons I’ve learned while using Datadog with AWS. There are many more, but this list captures the most impactful capabilities:
- Observability partner: Datadog offers deep, built‑in integrations for AWS.
- Free trial: Start with a 14‑day Datadog trial.
- Cost vs. value: It can be pricey, but the reliability and operational leverage it provides are often worth every penny, especially when you need visibility and reliability at scale.
Give Datadog a try and see how it can transform your AWS observability strategy.