LLM evaluation guide: When to add online evals to your AI application
The quick decision framework
Note: Online evals for AI Configs are currently in closed beta. Judges must be installed in your project before they can be attached to AI Config variations.
Online evals provide real‑time quality monitoring for LLM applications. Using LLM‑as‑a‑judge methodology, they run automated quality checks on a configurable percentage of your production traffic, producing structured scores and pass/fail judgments you can act on programmatically. LaunchDarkly includes three built‑in judges:
- accuracy
- relevance
- toxicity
Skip online evals if
- Your checks are purely deterministic (schema validation, compile tests)
- You have low volume and can manually review outputs in observability dashboards
- You’re primarily debugging execution problems
Add online evals when
- You need quantified quality scores to trigger automated actions (rollback, rerouting, alerts)
- Manual quality review doesn’t scale to your traffic volume
- You’re measuring multiple quality dimensions (accuracy, relevance, toxicity)
- You want statistical quality trends across segments for AI governance and compliance
- You need to monitor token usage and cost alongside quality metrics
- You’re running A/B tests or guarded releases and need automated quality gates
Most teams add them within 2‑3 sprints when manual quality review becomes the bottleneck. Configurable sampling rates let you balance evaluation coverage with cost and latency.
Online evals vs. LLM observability
LLM observability shows you what happened. Online evals automatically assess quality and trigger actions based on those assessments.
LLM observability: your security camera
LLM observability shows everything that happened through distributed tracing: full conversations, tool calls, token usage, latency breakdowns, and cost attribution. It’s perfect for debugging and understanding what went wrong. But when you’re handling 10 000 conversations daily, manually reviewing them for quality patterns doesn’t scale.
Online evals: your security guard
Online evals automatically score every sampled request using LLM‑as‑a‑judge methodology across your quality rubric (accuracy, relevance, toxicity) and take action. Instead of exporting conversations to spreadsheets for manual review, you get real‑time quality monitoring with drift detection that triggers alerts, rollbacks, or rerouting.
The 3 AM difference
- Without evals: “Let’s meet tomorrow to review samples and decide if we should roll back.”
- With evals: “Quality dropped below threshold, automatic rollback triggered, here’s what failed…”
How online evals actually work
LaunchDarkly’s online evals use LLM‑as‑a‑judge methodology with three built‑in judges you can configure directly in the dashboard—no code changes required.
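Conceptually, an LLM‑as‑a‑judge check prompts a second model to grade a response against a rubric and parse a structured verdict. The sketch below is illustrative only, not LaunchDarkly’s implementation: the `call_judge_model` helper, the prompt wording, and the canned verdict are all assumptions standing in for whatever judge model you would actually call.

```python
import json

# Hypothetical stand-in for a judge-model call (OpenAI, Bedrock, etc.);
# here it returns a canned verdict so the sketch runs end to end.
def call_judge_model(prompt: str) -> str:
    return '{"score": 0.85, "reasoning": "Correct answer, one edge case missed."}'

JUDGE_PROMPT = (
    "You are an accuracy judge. Grade the assistant response against the "
    "user's question. Return JSON with keys 'score' (0.0-1.0) and 'reasoning'.\n"
    "Question: {question}\nResponse: {response}"
)

def judge_accuracy(question: str, response: str) -> dict:
    """Ask the judge model for a structured accuracy verdict and parse it."""
    raw = call_judge_model(JUDGE_PROMPT.format(question=question, response=response))
    return json.loads(raw)  # e.g. {"score": 0.85, "reasoning": "..."}

print(judge_accuracy("How do I retry failed requests?", "Use exponential backoff..."))
```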
Getting started
- Install judges from the AI Configs menu.
- Attach judges to AI Config variations.
- Configure sampling rates to balance coverage with cost and latency (a sampling sketch follows this list).
- Evaluation metrics are automatically emitted as custom events.
- Metrics are automatically available for A/B tests and guarded releases.
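The sampling rate controls what fraction of live requests get judged. LaunchDarkly applies sampling for you once a judge is attached; the sketch below only illustrates the trade‑off the rate encodes, and the 10% figure is an arbitrary example.

```python
import random

def should_evaluate(sampling_rate: float) -> bool:
    """Return True for roughly `sampling_rate` of requests (rate is 0.0-1.0)."""
    return random.random() < sampling_rate

# At a 10% rate, about 1 in 10 production requests gets judged, which caps
# judge-model cost and added latency while still yielding a useful quality signal.
evaluate_this_request = should_evaluate(0.10)
```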
What you get from each built‑in judge
Accuracy judge
{
"score": 0.85,
"reasoning": "Response correctly answered the question but missed one edge case regarding error handling"
}
Relevance judge
{
"score": 0.92,
"reasoning": "Response directly addressed the user's query with appropriate context and examples"
}
Toxicity judge
{
"score": 0.0,
"reasoning": "Content is professional and appropriate with no toxic language detected"
}
Each judge returns a score from 0.0 to 1.0 plus reasoning that explains the assessment. The built‑in judges have fixed evaluation criteria and are configured only by selecting the provider and model.
Configuration
- Install judges from the AI Configs menu in your LaunchDarkly dashboard.
- They appear as pre‑configured AI configs (e.g., AI Judge – Accuracy).
- When configuring your AI Config variations in completion mode, select which judges to attach and set the desired sampling rate.
- Use different judge combinations for different environments to match quality requirements and cost constraints.
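As a rough illustration of “different judge combinations for different environments,” the mapping below sketches one way to think about it. The environment names, judge keys, and rates are assumptions, and this is not a LaunchDarkly configuration format; in practice you make these choices in the dashboard.

```python
# Illustrative only: judge and sampling choices you might make per environment.
EVAL_PLAN = {
    "staging": {
        "judges": ["accuracy", "relevance", "toxicity"],  # full rubric on cheap traffic
        "sampling_rate": 1.00,                            # judge everything pre-release
    },
    "production": {
        "judges": ["accuracy", "relevance"],              # core quality dimensions
        "sampling_rate": 0.10,                            # keep cost and latency bounded
    },
}
```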
Real problems online evals solve
- Scale for production applications – Your SQL generator handles 50 000 queries daily. Observability shows every query; online evals tell you the proportion that are semantically wrong, automatically, with hallucination detection built in.
- Multi‑dimensional quality monitoring – Customer‑service AI isn’t just “did it respond?” It must be accurate, relevant, non‑toxic, compliant, and appropriate. Online evals score all dimensions simultaneously, each with its own threshold and reasoning.
- RAG pipeline validation – Retrieval‑augmented generation systems need continuous monitoring of both retrieval quality and generation accuracy. Online evals assess whether retrieved context is relevant and whether the response correctly uses that context, preventing hallucinations and ensuring factual grounding.
- Cost and performance optimization – Monitor token usage alongside quality metrics. If certain queries consume 10× more tokens than others, online evals help identify these patterns so you can optimize prompts or routing logic to reduce costs without sacrificing quality.
- Actionable metrics for AI governance – Transform 10 000 responses from data to decisions with evaluator‑driven quality gates (a sketch follows the alert list below):
  - Accuracy trending below 0.8? Automate a rollback.
  - Relevance dropping under 0.7? Trigger a reroute to a fallback model.
  - Toxicity spikes above 0.1? Raise an alert for immediate human review.
Alerts to the team
- Toxicity above 0.2? Immediate review and potential rollback.
- Relevance dropping for specific user segments? Targeted configuration updates.
- Metrics automatically feed A/B tests and guarded releases for continuous improvement.
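A minimal sketch of gates like the ones above, assuming you already have per‑request judge scores in hand. The threshold values mirror the article; `trigger_rollback`, `reroute_to_fallback`, and `page_team` are hypothetical hooks standing in for whatever rollback, routing, and alerting automation you actually run.

```python
# Hypothetical automation hooks; replace with your rollback/alerting integrations.
def trigger_rollback(reason: str) -> None:
    print(f"rollback: {reason}")

def reroute_to_fallback(reason: str) -> None:
    print(f"reroute: {reason}")

def page_team(reason: str) -> None:
    print(f"alert: {reason}")

def apply_quality_gates(scores: dict[str, float]) -> None:
    """Turn judge scores into actions. Thresholds mirror the examples above."""
    if scores.get("accuracy", 1.0) < 0.8:
        trigger_rollback(reason="accuracy trending below 0.8")
    if scores.get("relevance", 1.0) < 0.7:
        reroute_to_fallback(reason="relevance below 0.7")
    if scores.get("toxicity", 0.0) > 0.1:
        page_team(reason="toxicity above 0.1")

apply_quality_gates({"accuracy": 0.85, "relevance": 0.92, "toxicity": 0.0})
```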
Example implementation path
Week 1‑2: Define quality dimensions and install judges
- Use LLM observability alone at first. Manually review samples to understand your system.
- Define your quality dimensions (e.g., accuracy, relevance, toxicity, or any other criteria specific to your application).
- Install the built‑in judges from the AI Configs menu in LaunchDarkly.
Week 3‑4: Attach judges with sampling
- Attach judges to AI Config variations in LaunchDarkly.
- Start with one or two key judges (accuracy and relevance are good defaults).
- Configure sampling rates between 10% and 20% of traffic to balance coverage with cost and latency.
- Compare automated scores with human judgment to validate that the judges work for your use case.
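One simple way to compare automated scores with human judgment is mean absolute disagreement on a small labeled sample. This is a sketch under the assumption that you have paired judge and human scores for the same responses; the 0.15 tolerance and the example numbers are arbitrary.

```python
def mean_absolute_disagreement(judge_scores: list[float], human_scores: list[float]) -> float:
    """Average gap between judge and human scores on the same responses."""
    gaps = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    return sum(gaps) / len(gaps)

# Example: judge vs. human scores for five sampled responses.
judge = [0.85, 0.92, 0.40, 0.78, 0.95]
human = [0.90, 0.85, 0.50, 0.80, 0.90]
if mean_absolute_disagreement(judge, human) > 0.15:  # arbitrary tolerance
    print("Judge disagrees with reviewers; tune the judge setup before gating on it.")
```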
Week 5+: Operationalize with quality gates
- Add more evaluation dimensions as you learn.
- Connect scores to automated actions and evaluator‑driven quality gates:
- When accuracy drops below 0.7, trigger alerts.
- When toxicity exceeds 0.2, investigate immediately.
- Leverage custom events and metrics for A/B testing and guarded releases to continuously improve your application’s performance.
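Online evals emit evaluation metrics as custom events for you, so no code is required. For orientation only, the sketch below shows the general custom‑event mechanism those metrics ride on, using the LaunchDarkly server‑side Python SDK; the SDK key placeholder, context key, event key, and metric value are all assumptions, and you should confirm the exact `track()` signature against the SDK documentation for your version.

```python
import ldclient
from ldclient import Context
from ldclient.config import Config

# Sketch only: initialize the server-side SDK and send a custom metric event.
ldclient.set_config(Config("sdk-key-placeholder"))  # assumed placeholder key
client = ldclient.get()

context = Context.create("end-user-123")            # assumed context key
client.track("ai-accuracy-score", context, metric_value=0.85)  # assumed event key
```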
The bottom line
- You don’t need online evaluations on day 1. Start with LLM observability to understand your AI system through distributed tracing.
- Add evaluations when you hear yourself saying, “We need to review more conversations,” or “How do we know if quality is degrading?”
LaunchDarkly’s three built‑in judges (accuracy, relevance, toxicity) provide LLM‑as‑a‑judge evaluation that you can attach to any AI Config variation in completion mode with configurable sampling rates.
Note: Online evaluations currently only work with completion‑mode AI Configs. Agent‑based configs are not yet supported.
Evaluation metrics are automatically emitted as custom events and feed directly into A/B tests and guarded releases, enabling continuous AI governance and quality improvement without code changes.
LLM observability is your security camera. Online evals are your security guard.
Next steps
Ready to get started?
Sign up for a free LaunchDarkly account if you haven’t already.
Build a complete quality pipeline
- AI Config CI/CD Pipeline – Add automated quality gates and LLM‑as‑a‑judge testing to your deployment process.
- Combine offline evaluation (in CI/CD) with online evals (in production) for comprehensive quality coverage.
Learn more about AI Configs
- AI Config documentation – Understand how AI Configs enable real‑time LLM configuration.
- Online evals documentation – Deep dive into judge installation and configuration.
- Guardrail metrics – Monitor quality during A/B tests and guarded releases.
See it in action
- LLM observability in the LaunchDarkly dashboard – Track your AI application performance with distributed tracing.
Industry standards
LaunchDarkly’s approach aligns with emerging AI observability standards, including OpenTelemetry’s semantic conventions for AI monitoring, ensuring your evaluation infrastructure integrates with the broader observability ecosystem.