AI Trading: Lesson Learned #129: Backtest Evaluation Bugs Discovered via Deep Research

Published: January 10, 2026 at 06:17 PM EST
2 min read
Source: Dev.to

Context

The CEO requested deep research into Anthropic’s “Demystifying evals for AI agents” article to determine whether its evaluation framework would improve our trading system.

Discovery

Analyzing the article against our existing setup revealed that we already have more rigorous evaluation infrastructure than typical AI agent evals, because we have real financial accountability. However, the research also uncovered critical bugs in that same evaluation system:

Bugs Found and Fixed

Bug 1: Slippage Model Disabled

  • Location: scripts/run_backtest_matrix.py
  • Issue: The code claimed slippage_model_enabled: True, but all execution costs were hard-coded to 0.0.
  • Impact: Backtests overestimated returns by 20–50% (per the slippage_model.py documentation).
  • Fix: Integrated the actual SlippageModel into backtest execution, applying slippage and fees to trades.
# scripts/run_backtest_matrix.py
slippage_model_enabled = True

# before: execution_cost = 0.0
# after: execution_cost = SlippageModel.calculate(trade)
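
A slightly fuller sketch of the integration, assuming a SlippageModel that exposes a calculate(trade) method returning a dollar cost (the exact interface in slippage_model.py may differ):

# Hypothetical sketch; execute_trade and the trade fields are illustrative.
from slippage_model import SlippageModel

def execute_trade(trade, portfolio):
    # Apply slippage and fees before crediting backtest P&L.
    execution_cost = SlippageModel.calculate(trade)  # previously hard-coded to 0.0
    net_pnl = trade.gross_pnl - execution_cost - trade.fees
    portfolio.apply(net_pnl)
    return net_pnl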

Bug 2: Win Rate Without Context (ll_118 violation)

  • Location: scripts/run_backtest_matrix.py and output JSONs
  • Issue: Win rate was displayed without the accompanying average return, allowing misleading metrics (e.g., an “80% win rate” with a -6.97% avg return).
  • Impact: False confidence in strategy performance.
  • Fix: Added an avg_return_pct field and a win_rate_with_context string that always shows both together.
{
  "win_rate": 0.80,
  "avg_return_pct": -6.97,
  "win_rate_with_context": "80 % win rate (‑6.97 % avg return)"
}
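
A helper along these lines can produce the paired fields; the function itself is a hypothetical sketch, with names matching the JSON above:

def win_rate_with_context(wins: int, trades: int, total_return_pct: float) -> dict:
    # Always report win rate and average return together (per ll_118).
    win_rate = wins / trades if trades else 0.0
    avg_return_pct = total_return_pct / trades if trades else 0.0
    return {
        "win_rate": round(win_rate, 2),
        "avg_return_pct": round(avg_return_pct, 2),
        "win_rate_with_context": f"{win_rate:.0%} win rate ({avg_return_pct:+.2f}% avg return)",
    }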

Bug 3: Missing Bidirectional Learning

  • Location: src/analytics/live_vs_backtest_tracker.py
  • Issue: The tracker recorded live slippage but never synced the observations back to backtest assumptions.
  • Impact: The same outdated slippage assumptions were reused despite real-world evidence.
  • Fix: Added a sync_to_backtest_assumptions() method and a load_live_slippage_assumptions() helper for backtests.
# src/analytics/live_vs_backtest_tracker.py (simplified)
def sync_to_backtest_assumptions(self):
    # Push live-observed slippage into the assumptions backtests load,
    # instead of letting stale hard-coded values persist.
    backtest_config.update(slippage=self.observed_slippage_bps)
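
For backtests launched as separate processes, the observations need a persisted home; here is a sketch assuming a JSON file as that store (the actual storage format isn’t specified in the fix):

import json
from pathlib import Path

# Hypothetical persisted store for live slippage observations.
ASSUMPTIONS_PATH = Path("data/live_slippage_assumptions.json")

def load_live_slippage_assumptions(default_bps: float = 5.0) -> float:
    # Return live-observed slippage in basis points, with a conservative fallback.
    if ASSUMPTIONS_PATH.exists():
        return json.loads(ASSUMPTIONS_PATH.read_text())["avg_slippage_bps"]
    return default_bps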

Key Insight

The Anthropic article is useful for evaluating LLM agents (like Claude Code), NOT for trading systems.

Our trading system already includes:

  • Quantitative metrics (Sharpe, win rate, drawdown; see the sketch after this list)
  • Survival-gate validation (95% capital preservation)
  • 19 historical scenarios, including crash replays
  • Live vs. backtest tracking
  • Anomaly detection
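
For reference, the first bullet reduces to a few lines of arithmetic; this is a generic sketch, not the system’s actual implementation:

import numpy as np

def sharpe_ratio(daily_returns: np.ndarray, periods_per_year: int = 252) -> float:
    # Annualized Sharpe ratio, assuming a zero risk-free rate.
    return float(np.sqrt(periods_per_year) * daily_returns.mean() / daily_returns.std())

def max_drawdown(equity_curve: np.ndarray) -> float:
    # Largest peak-to-trough decline as a fraction of the running peak
    # (assumes a strictly positive equity curve).
    peaks = np.maximum.accumulate(equity_curve)
    return float(((peaks - equity_curve) / peaks).max())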

The real value of the research was uncovering implementation bugs, not adopting a new evaluation framework.

Prevention

  • Code must actually implement what documentation claims (e.g., slippage_model_enabled).
  • Always display average return together with win rate per ll_118.
  • Implement bidirectional feedback loops from production to testing.
  • Regularly audit evaluation infrastructure for silent failures (see the test sketch below).
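
One way to audit for silent failures is a regression test that asserts the config flag has teeth; a hypothetical pytest-style sketch, where run_backtest() and make_sample_trade() stand in for the repository’s real fixtures:

def test_slippage_flag_is_not_a_no_op():
    # Bug 1 regression: slippage_model_enabled=True must produce nonzero costs.
    result = run_backtest(trades=[make_sample_trade()], slippage_model_enabled=True)
    assert result.total_execution_cost > 0.0, "slippage enabled but costs are zero"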

Files Changed

  • scripts/run_backtest_matrix.py – Integrated slippage model, added avg_return_pct.
  • src/analytics/live_vs_backtest_tracker.py – Added bidirectional learning functions.

Tags

#evaluation #backtest #slippage #win-rate #bidirectional-learning #system-audit

This lesson was auto-published from our AI Trading repository.

More lessons: rag_knowledge/lessons_learned
