A Natural Language Interface for Datadog Log Search
Source: Dev.to
Introduction
It’s 2 AM. PagerDuty fires. Something’s wrong with the payment service.
You open Log Explorer and stare at the query bar. Is it service:payment or @service:payment? Does negation use NOT or -? What’s the facet for authentication failures again?
The logs have the answer, but the syntax is the bottleneck. This post walks through building a tool that translates plain English into valid Datadog Log Search queries and breaks down the Datadog‑specific gotchas that make this problem interesting.
Where Log Search Syntax Trips People Up
@ Prefix Rule
Reserved attributes do not use @. These are the core fields Datadog provides:
```
service:payment-service
status:error
host:web-server-01
```
Custom facets and log attributes require @. Anything you’ve indexed yourself:
```
@http.status_code:500
@duration:>2000000000
@error.message:*timeout*
```
Getting the prefix backwards returns no results—no error, just an empty set. This is especially frustrating when debugging under pressure.
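The rule above is mechanical enough to encode. Here is a hypothetical helper that normalizes a field name; the reserved-attribute set below covers only the fields mentioned in this post, not Datadog's full list:

```python
# Illustrative sketch of the @ prefix rule. RESERVED_ATTRIBUTES is a partial
# list for demonstration; a real tool would load the full set from the docs.
RESERVED_ATTRIBUTES = {"service", "status", "host", "source", "env"}

def normalize_field(field: str) -> str:
    """Add or strip the @ prefix so the field follows Datadog's rule."""
    name = field.lstrip("@")
    # Reserved attributes take no prefix; everything else requires @.
    if name.split(".")[0] in RESERVED_ATTRIBUTES:
        return name
    return "@" + name

print(normalize_field("@service"))          # service (reserved: prefix stripped)
print(normalize_field("http.status_code"))  # @http.status_code (custom: prefix added)
```

A check like this can run on generated queries before they ever hit Log Explorer, catching the silent-empty-result failure mode.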
Duration Gotcha
Datadog stores duration in nanoseconds, not seconds or milliseconds.
Filtering for requests over 2 seconds:
```
@duration:>2000000000
```
Missing a zero filters for 200 ms; adding an extra zero looks for 20‑second requests. During an incident, this mistake costs time.
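Rather than counting zeros by hand, the conversion can be done in code. A minimal sketch (the function name is illustrative):

```python
# Build duration filters from human-friendly units instead of raw nanoseconds.
NS_PER_SECOND = 1_000_000_000

def duration_over(seconds: float) -> str:
    """Return a Log Search filter for requests longer than `seconds`."""
    return f"@duration:>{int(seconds * NS_PER_SECOND)}"

print(duration_over(2))    # @duration:>2000000000
print(duration_over(0.2))  # @duration:>200000000 (200 ms)
```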
Less‑Common Facets
When working with Cloud SIEM or audit logs, you’ll encounter facets like:
```
@evt.name:authentication
@evt.outcome:failure
@network.client.geoip.country_name:*
```
These aren’t used daily, so they never stick in memory, yet they are exactly the queries you need when investigating suspicious activity.
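One low-tech way to stop relearning these facets is a template lookup keyed by intent. The intent names and the template dictionary below are hypothetical, built from the queries shown above:

```python
# Illustrative lookup of SIEM-style query templates, so rarely-used facets
# don't have to be recalled from memory during an investigation.
SECURITY_TEMPLATES = {
    "auth_failures": "@evt.name:authentication @evt.outcome:failure",
    "console_logins": "source:cloudtrail @evt.name:ConsoleLogin",
    "by_country": "@network.client.geoip.country_name:{country}",
}

def security_query(intent: str, **params: str) -> str:
    """Fill a template with parameters and return the Log Search query."""
    return SECURITY_TEMPLATES[intent].format(**params)

print(security_query("by_country", country="*"))
# @network.client.geoip.country_name:*
```

Templates only cover known intents, though, which is exactly why the rest of the post reaches for an LLM.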
Prompt Engineering for LLMs
LLMs have seen Datadog queries in training data, but not enough to be reliable. They often generate plausible‑looking syntax that’s subtly wrong. The fix is a system prompt that is explicit about the rules:
```
# Reserved attributes (no @ prefix)
service:payment-service
status:error
host:web-server-01

# Facets and custom attributes (@ prefix required)
@http.status_code:500
@duration:>1000000000   # nanoseconds
@error.message:*timeout*

# Common mistakes to avoid
@service:payment   # wrong (reserved attribute)
@duration:>2       # wrong (not in nanoseconds)
```
The nanoseconds callout is critical; without it the model may generate @duration:>2 for “requests over 2 seconds,” which is completely wrong.
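Assembling these rules into a prompt is simple string construction. A minimal sketch; the wording below is illustrative, not the exact prompt from this project:

```python
# Build the system prompt from the rule sections above. The phrasing is a
# hypothetical condensation, shown to illustrate the structure.
RULES = """\
Reserved attributes take no @ prefix: service, status, host.
Custom facets and attributes require @: @http.status_code:500.
@duration is in NANOSECONDS: 2 seconds -> @duration:>2000000000.
Never emit @service:... or bare-second durations like @duration:>2.
"""

def build_system_prompt() -> str:
    return (
        "You translate plain English into Datadog Log Search queries.\n"
        "Output only the query, no explanation.\n\n" + RULES
    )

print(build_system_prompt())
```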
Security Patterns
Explicit examples are needed for facets that aren’t guessable:
```
# Authentication failures
@evt.name:authentication @evt.outcome:failure

# CloudTrail console logins
source:cloudtrail @evt.name:ConsoleLogin

# External IPs only
NOT @network.client.ip:10.* NOT @network.client.ip:192.168.*
```
These prompts get you to roughly 80% accuracy. The remaining 20% are edge cases: obscure facet names, integration-specific attributes, and syntax variations that a static prompt can't cover.
Retrieval‑Augmented Generation
To handle the edge cases, we index Datadog’s documentation and retrieve relevant sections at query time, injecting them into the prompt.
Dual Retrieval Methods
- Dense embeddings (e.g., OpenAI `text-embedding-3-large`) capture semantic similarity.
- Sparse embeddings (e.g., SPLADE) capture exact keyword overlap, ensuring strings like `@evt.outcome` are found.
We merge the two result sets with Reciprocal Rank Fusion (RRF):
```python
from qdrant_client.models import Prefetch, FusionQuery, Fusion

results = qdrant_client.query_points(
    collection_name=collection,
    prefetch=[
        # Over-fetch from each retriever so RRF has candidates to merge.
        Prefetch(query=dense_vector, using="dense", limit=limit * 2),
        Prefetch(query=sparse_vector, using="sparse", limit=limit * 2),
    ],
    query=FusionQuery(fusion=Fusion.RRF),
    limit=limit,
)
```
The combination surfaces both conceptual matches and exact syntax matches, covering the edge cases that broke the static prompt.
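To see what the fusion step actually computes, here is RRF in plain Python. This is a standalone sketch of the algorithm Qdrant runs internally, not Qdrant's code; k=60 is the constant from the original RRF paper, and Qdrant's default may differ:

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists: score(id) = sum over lists of 1/(k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]   # semantic matches
sparse = ["doc_c", "doc_a", "doc_d"]  # exact keyword matches
print(rrf([dense, sparse]))  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents ranked well by both retrievers (like `doc_a` here, ranked 1st and 2nd) float to the top, while documents found by only one list still survive.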
Using the System
Explanation as a Validation Tool
When inheriting a dashboard with a complex query:
```
@evt.name:authentication @evt.outcome:failure \
NOT @network.client.ip:10.* NOT @network.client.ip:192.168.* \
NOT @network.client.ip:172.16.*
```
You can ask the assistant “what does this do?” and receive:
“Failed authentication attempts from IPs outside your internal network ranges.”
This accelerates understanding and lets you validate generated queries by asking the model to explain them.
Natural‑Language → Log Search Examples
| Input (natural language) | Generated Log Search query |
|---|---|
| “Errors from the payment service” | service:payment-service status:error |
| “Slow requests over 2 seconds” | @duration:>2000000000 |
| “Failed logins from external IPs” | @evt.name:authentication @evt.outcome:failure NOT @network.client.ip:10.* NOT @network.client.ip:192.168.* |
Security queries show the highest value: engineers memorize observability queries (e.g., service:api status:error) but rarely use SIEM‑style facets, making the tool especially useful.
Conclusion
The interesting part of this project wasn't the LLM integration, which is straightforward, but learning the Datadog-specific details deeply enough to teach them to a model. The @ prefix rule, the nanoseconds gotcha, and the security facet patterns separate a query that works from one that silently fails. Encoding that knowledge explicitly and augmenting it with retrieved documentation makes the tool reliable.
If you want to explore the implementation, the code is on GitHub.
Views expressed are my own and do not represent my employer.