LLM evaluation guide: When to add online evals to your AI application
The quick decision framework
Note: Online evals for AI Configs are currently in closed beta. Judges must be installed in your project before they can be attached to AI Config variations.
Online evals provide real‑time quality monitoring for LLM applications. Using LLM‑as‑a‑judge methodology, they run automated quality checks on a configurable percentage of your production traffic, producing structured scores and pass/fail judgments you can act on programmatically. LaunchDarkly includes three built‑in judges:
- accuracy
- relevance
- toxicity
Skip online evals if
- Your checks are purely deterministic (schema validation, compile tests)
- You have low volume and can manually review outputs in observability dashboards
- You’re primarily debugging execution problems
Add online evals when
- You need quantified quality scores to trigger automated actions (rollback, rerouting, alerts)
- Manual quality review doesn’t scale to your traffic volume
- You’re measuring multiple quality dimensions (accuracy, relevance, toxicity)
- You want statistical quality trends across segments for AI governance and compliance
- You need to monitor token usage and cost alongside quality metrics
- You’re running A/B tests or guarded releases and need automated quality gates
Most teams add them within 2‑3 sprints when manual quality review becomes the bottleneck. Configurable sampling rates let you balance evaluation coverage with cost and latency.
Online evals vs. LLM observability
LLM observability shows you what happened. Online evals automatically assess quality and trigger actions based on those assessments.
LLM observability: your security camera
LLM observability shows everything that happened through distributed tracing: full conversations, tool calls, token usage, latency breakdowns, and cost attribution. It’s perfect for debugging and understanding what went wrong. But when you’re handling 10 000 conversations daily, manually reviewing them for quality patterns doesn’t scale.
Online evals: your security guard
Online evals automatically score every sampled request using LLM‑as‑a‑judge methodology across your quality rubric (accuracy, relevance, toxicity) and take action. Instead of exporting conversations to spreadsheets for manual review, you get real‑time quality monitoring with drift detection that triggers alerts, rollbacks, or rerouting.
The 3 AM difference
- Without evals: “Let’s meet tomorrow to review samples and decide if we should roll back.”
- With evals: “Quality dropped below threshold, automatic rollback triggered, here’s what failed…”
How online evals actually work
LaunchDarkly’s online evals use LLM‑as‑a‑judge methodology with three built‑in judges you can configure directly in the dashboard—no code changes required.
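Conceptually, an LLM‑as‑a‑judge check prompts a second model to grade a response against a rubric and parse a structured verdict. The sketch below is illustrative only, not LaunchDarkly’s implementation: the `call_judge_model` helper, the prompt wording, and the canned verdict are all assumptions standing in for whatever judge model you would actually call.

```python
import json

# Hypothetical stand-in for a judge-model call (OpenAI, Bedrock, etc.);
# here it returns a canned verdict so the sketch runs end to end.
def call_judge_model(prompt: str) -> str:
    return '{"score": 0.85, "reasoning": "Correct answer, one edge case missed."}'

JUDGE_PROMPT = (
    "You are an accuracy judge. Grade the assistant response against the "
    "user's question. Return JSON with keys 'score' (0.0-1.0) and 'reasoning'.\n"
    "Question: {question}\nResponse: {response}"
)

def judge_accuracy(question: str, response: str) -> dict:
    """Ask the judge model for a structured accuracy verdict and parse it."""
    raw = call_judge_model(JUDGE_PROMPT.format(question=question, response=response))
    return json.loads(raw)  # e.g. {"score": 0.85, "reasoning": "..."}

print(judge_accuracy("How do I retry failed requests?", "Use exponential backoff..."))
```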
Getting started
- Install judges from the AI Configs menu.
- Attach judges to AI Config variations.
- Configure sampling rates to balance coverage with cost and latency (a sampling sketch follows this list).
- Evaluation metrics are automatically emitted as custom events.
- Metrics are automatically available for A/B tests and guarded releases.
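The sampling rate controls what fraction of live requests get judged. LaunchDarkly applies sampling for you once a judge is attached; the sketch below only illustrates the trade‑off the rate encodes, and the 10% figure is an arbitrary example.

```python
import random

def should_evaluate(sampling_rate: float) -> bool:
    """Return True for roughly `sampling_rate` of requests (rate is 0.0-1.0)."""
    return random.random() < sampling_rate

# At a 10% rate, about 1 in 10 production requests gets judged, which caps
# judge-model cost and added latency while still yielding a useful quality signal.
evaluate_this_request = should_evaluate(0.10)
```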
What you get from each built‑in judge
Accuracy judge
{
"score": 0.85,
"reasoning": "Response correctly answered the question but missed one edge case regarding error handling"
}
Relevance judge
{
"score": 0.92,
"reasoning": "Response directly addressed the user's query with appropriate context and examples"
}
Toxicity judge
{
"score": 0.0,
"reasoning": "Content is professional and appropriate with no toxic language detected"
}
Each judge returns a score from 0.0 to 1.0 plus reasoning that explains the assessment. The built‑in judges have fixed evaluation criteria and are configured only by selecting the provider and model.
Configuration
- Install judges from the AI Configs menu in your LaunchDarkly dashboard.
- They appear as pre‑configured AI configs (e.g., AI Judge – Accuracy).
- When configuring your AI Config variations in completion mode, select which judges to attach and set the desired sampling rate.
- Use different judge combinations for different environments to match quality requirements and cost constraints.
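As a rough illustration of “different judge combinations for different environments,” the mapping below sketches one way to think about it. The environment names, judge keys, and rates are assumptions, and this is not a LaunchDarkly configuration format; in practice you make these choices in the dashboard.

```python
# Illustrative only: judge and sampling choices you might make per environment.
EVAL_PLAN = {
    "staging": {
        "judges": ["accuracy", "relevance", "toxicity"],  # full rubric on cheap traffic
        "sampling_rate": 1.00,                            # judge everything pre-release
    },
    "production": {
        "judges": ["accuracy", "relevance"],              # core quality dimensions
        "sampling_rate": 0.10,                            # keep cost and latency bounded
    },
}
```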
Real problems online evals solve
- Scale for production applications – Your SQL generator handles 50 000 queries daily. Observability shows every query; online evals tell you the proportion that are semantically wrong, automatically, with hallucination detection built in.
- Multi‑dimensional quality monitoring – Customer‑service AI isn’t just “did it respond?” It must be accurate, relevant, non‑toxic, compliant, and appropriate. Online evals score all dimensions simultaneously, each with its own threshold and reasoning.
- RAG pipeline validation – Retrieval‑augmented generation systems need continuous monitoring of both retrieval quality and generation accuracy. Online evals assess whether retrieved context is relevant and whether the response correctly uses that context, preventing hallucinations and ensuring factual grounding.
- Cost and performance optimization – Monitor token usage alongside quality metrics. If certain queries consume 10× more tokens than others, online evals help identify these patterns so you can optimize prompts or routing logic to reduce costs without sacrificing quality.
- Actionable metrics for AI governance – Transform 10 000 responses from data to decisions with evaluator‑driven quality gates (a sketch follows the alert list below):
  - Accuracy trending below 0.8? Automate a rollback.
  - Relevance dropping under 0.7? Trigger a reroute to a fallback model.
  - Toxicity spikes above 0.1? Raise an alert for immediate human review.
Alerts to the team
- Toxicity above 0.2? Immediate review and potential rollback.
- Relevance dropping for specific user segments? Targeted configuration updates.
- Metrics automatically feed A/B tests and guarded releases for continuous improvement.
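A minimal sketch of gates like the ones above, assuming you already have per‑request judge scores in hand. The threshold values mirror the article; `trigger_rollback`, `reroute_to_fallback`, and `page_team` are hypothetical hooks standing in for whatever rollback, routing, and alerting automation you actually run.

```python
# Hypothetical automation hooks; replace with your rollback/alerting integrations.
def trigger_rollback(reason: str) -> None:
    print(f"rollback: {reason}")

def reroute_to_fallback(reason: str) -> None:
    print(f"reroute: {reason}")

def page_team(reason: str) -> None:
    print(f"alert: {reason}")

def apply_quality_gates(scores: dict[str, float]) -> None:
    """Turn judge scores into actions. Thresholds mirror the examples above."""
    if scores.get("accuracy", 1.0) < 0.8:
        trigger_rollback(reason="accuracy trending below 0.8")
    if scores.get("relevance", 1.0) < 0.7:
        reroute_to_fallback(reason="relevance below 0.7")
    if scores.get("toxicity", 0.0) > 0.1:
        page_team(reason="toxicity above 0.1")

apply_quality_gates({"accuracy": 0.85, "relevance": 0.92, "toxicity": 0.0})
```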
Example implementation path
Week 1‑2: Define quality dimensions and install judges
- Use LLM observability alone at first. Manually review samples to understand your system.
- Define your quality dimensions (e.g., accuracy, relevance, toxicity, or any other criteria specific to your application).
- Install the built‑in judges from the AI Configs menu in LaunchDarkly.
Week 3‑4: Attach judges with sampling
- Attach judges to AI Config variations in LaunchDarkly.
- Start with one or two key judges (accuracy and relevance are good defaults).
- Configure sampling rates between 10% and 20% of traffic to balance coverage with cost and latency.
- Compare automated scores with human judgment to validate that the judges work for your use case.
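One simple way to compare automated scores with human judgment is mean absolute disagreement on a small labeled sample. This is a sketch under the assumption that you have paired judge and human scores for the same responses; the 0.15 tolerance and the example numbers are arbitrary.

```python
def mean_absolute_disagreement(judge_scores: list[float], human_scores: list[float]) -> float:
    """Average gap between judge and human scores on the same responses."""
    gaps = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    return sum(gaps) / len(gaps)

# Example: judge vs. human scores for five sampled responses.
judge = [0.85, 0.92, 0.40, 0.78, 0.95]
human = [0.90, 0.85, 0.50, 0.80, 0.90]
if mean_absolute_disagreement(judge, human) > 0.15:  # arbitrary tolerance
    print("Judge disagrees with reviewers; tune the judge setup before gating on it.")
```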
Week 5+: Operationalize with quality gates
- Add more evaluation dimensions as you learn.
- Connect scores to automated actions and evaluator‑driven quality gates:
- When accuracy drops below 0.7, trigger alerts.
- When toxicity exceeds 0.2, investigate immediately.
- Leverage custom events and metrics for A/B testing and guarded releases to continuously improve your application’s performance.
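Online evals emit evaluation metrics as custom events for you, so no code is required. For orientation only, the sketch below shows the general custom‑event mechanism those metrics ride on, using the LaunchDarkly server‑side Python SDK; the SDK key placeholder, context key, event key, and metric value are all assumptions, and you should confirm the exact `track()` signature against the SDK documentation for your version.

```python
import ldclient
from ldclient import Context
from ldclient.config import Config

# Sketch only: initialize the server-side SDK and send a custom metric event.
ldclient.set_config(Config("sdk-key-placeholder"))  # assumed placeholder key
client = ldclient.get()

context = Context.create("end-user-123")            # assumed context key
client.track("ai-accuracy-score", context, metric_value=0.85)  # assumed event key
```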
The bottom line
- You don’t need online evaluations on day 1. Start with LLM observability to understand your AI system through distributed tracing.
- Add evaluations when you hear yourself saying, “We need to review more conversations,” or “How do we know if quality is degrading?”
LaunchDarkly’s three built‑in judges (accuracy, relevance, toxicity) provide LLM‑as‑a‑judge evaluation that you can attach to any AI Config variation in completion mode with configurable sampling rates.
Note: Online evaluations currently only work with completion‑mode AI Configs. Agent‑based configs are not yet supported.
Evaluation metrics are automatically emitted as custom events and feed directly into A/B tests and guarded releases, enabling continuous AI governance and quality improvement without code changes.
LLM observability is your security camera. Online evals are your security guard.
Next steps
Ready to get started?
Sign up for a free LaunchDarkly account if you haven’t already.
Build a complete quality pipeline
- AI Config CI/CD Pipeline – Add automated quality gates and LLM‑as‑a‑judge testing to your deployment process.
- Combine offline evaluation (in CI/CD) with online evals (in production) for comprehensive quality coverage.
Learn more about AI Configs
- AI Config documentation – Understand how AI Configs enable real‑time LLM configuration.
- Online evals documentation – Deep dive into judge installation and configuration.
- Guardrail metrics – Monitor quality during A/B tests and guarded releases.
See it in action
- LLM observability in the LaunchDarkly dashboard – Track your AI application performance with distributed tracing.
Industry standards
LaunchDarkly’s approach aligns with emerging AI observability standards, including OpenTelemetry’s semantic conventions for AI monitoring, ensuring your evaluation infrastructure integrates with the broader observability ecosystem.