AI Trading: Lesson Learned #129: Backtest Evaluation Bugs Discovered via Deep Research

Published: January 10, 2026 at 06:17 PM EST
2 min read
Source: Dev.to

Context

The CEO requested deep research into Anthropic’s “Demystifying evals for AI agents” article to determine whether its evaluation framework would improve our trading system.

Discovery

Analyzing the article against our existing setup revealed that we already have more rigorous evaluation infrastructure than typical AI agent evals, because we have real financial accountability. However, the research also uncovered critical bugs in that same evaluation system:

Bugs Found and Fixed

Bug 1: Slippage Model Disabled

  • Location: scripts/run_backtest_matrix.py
  • Issue: The code claimed slippage_model_enabled: True, but all execution costs were hard-coded to 0.0.
  • Impact: Backtests overestimated returns by 20–50% (per the slippage_model.py documentation).
  • Fix: Integrated the actual SlippageModel into backtest execution, applying slippage and fees to trades.
# scripts/run_backtest_matrix.py
slippage_model_enabled = True

# before: execution_cost = 0.0
# after: execution_cost = SlippageModel.calculate(trade)
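
A slightly fuller sketch of the integration, assuming a SlippageModel that exposes a calculate(trade) method returning a dollar cost (the exact interface in slippage_model.py may differ):

# Hypothetical sketch; execute_trade and the trade fields are illustrative.
from slippage_model import SlippageModel

def execute_trade(trade, portfolio):
    # Apply slippage and fees before crediting backtest P&L.
    execution_cost = SlippageModel.calculate(trade)  # previously hard-coded to 0.0
    net_pnl = trade.gross_pnl - execution_cost - trade.fees
    portfolio.apply(net_pnl)
    return net_pnl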

Bug 2: Win Rate Without Context (ll_118 violation)

  • Location: scripts/run_backtest_matrix.py and output JSONs
  • Issue: Win rate was displayed without the accompanying average return, allowing misleading metrics (e.g., an “80% win rate” with a -6.97% avg return).
  • Impact: False confidence in strategy performance.
  • Fix: Added an avg_return_pct field and a win_rate_with_context string that always shows both together.
{
  "win_rate": 0.80,
  "avg_return_pct": -6.97,
  "win_rate_with_context": "80 % win rate (‑6.97 % avg return)"
}
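
A helper along these lines can produce the paired fields; the function itself is a hypothetical sketch, with names matching the JSON above:

def win_rate_with_context(wins: int, trades: int, total_return_pct: float) -> dict:
    # Always report win rate and average return together (per ll_118).
    win_rate = wins / trades if trades else 0.0
    avg_return_pct = total_return_pct / trades if trades else 0.0
    return {
        "win_rate": round(win_rate, 2),
        "avg_return_pct": round(avg_return_pct, 2),
        "win_rate_with_context": f"{win_rate:.0%} win rate ({avg_return_pct:+.2f}% avg return)",
    }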

Bug 3: Missing Bidirectional Learning

  • Location: src/analytics/live_vs_backtest_tracker.py
  • Issue: The tracker recorded live slippage but never synced the observations back to backtest assumptions.
  • Impact: The same outdated slippage assumptions were reused despite real-world evidence.
  • Fix: Added a sync_to_backtest_assumptions() method and a load_live_slippage_assumptions() helper for backtests.
# src/analytics/live_vs_backtest_tracker.py (simplified)
def sync_to_backtest_assumptions(self):
    # Push live-observed slippage into the assumptions backtests load,
    # instead of letting stale hard-coded values persist.
    backtest_config.update(slippage=self.observed_slippage_bps)
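
For backtests launched as separate processes, the observations need a persisted home; here is a sketch assuming a JSON file as that store (the actual storage format isn’t specified in the fix):

import json
from pathlib import Path

# Hypothetical persisted store for live slippage observations.
ASSUMPTIONS_PATH = Path("data/live_slippage_assumptions.json")

def load_live_slippage_assumptions(default_bps: float = 5.0) -> float:
    # Return live-observed slippage in basis points, with a conservative fallback.
    if ASSUMPTIONS_PATH.exists():
        return json.loads(ASSUMPTIONS_PATH.read_text())["avg_slippage_bps"]
    return default_bps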

Key Insight

The Anthropic article is useful for evaluating LLM agents (like Claude Code), NOT for trading systems.

Our trading system already includes:

  • Quantitative metrics (Sharpe, win rate, drawdown; see the sketch after this list)
  • Survival-gate validation (95% capital preservation)
  • 19 historical scenarios, including crash replays
  • Live vs. backtest tracking
  • Anomaly detection
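
For reference, the first bullet reduces to a few lines of arithmetic; this is a generic sketch, not the system’s actual implementation:

import numpy as np

def sharpe_ratio(daily_returns: np.ndarray, periods_per_year: int = 252) -> float:
    # Annualized Sharpe ratio, assuming a zero risk-free rate.
    return float(np.sqrt(periods_per_year) * daily_returns.mean() / daily_returns.std())

def max_drawdown(equity_curve: np.ndarray) -> float:
    # Largest peak-to-trough decline as a fraction of the running peak
    # (assumes a strictly positive equity curve).
    peaks = np.maximum.accumulate(equity_curve)
    return float(((peaks - equity_curve) / peaks).max())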

The real value of the research was uncovering implementation bugs, not adopting a new evaluation framework.

Prevention

  • Code must actually implement what documentation claims (e.g., slippage_model_enabled).
  • Always display average return together with win rate per ll_118.
  • Implement bidirectional feedback loops from production to testing.
  • Regularly audit evaluation infrastructure for silent failures (see the test sketch below).
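
One way to audit for silent failures is a regression test that asserts the config flag has teeth; a hypothetical pytest-style sketch, where run_backtest() and make_sample_trade() stand in for the repository’s real fixtures:

def test_slippage_flag_is_not_a_no_op():
    # Bug 1 regression: slippage_model_enabled=True must produce nonzero costs.
    result = run_backtest(trades=[make_sample_trade()], slippage_model_enabled=True)
    assert result.total_execution_cost > 0.0, "slippage enabled but costs are zero"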

Files Changed

  • scripts/run_backtest_matrix.py – Integrated slippage model, added avg_return_pct.
  • src/analytics/live_vs_backtest_tracker.py – Added bidirectional learning functions.

Tags

#evaluation #backtest #slippage #win-rate #bidirectional-learning #system-audit

This lesson was auto-published from our AI Trading repository.

More lessons: rag_knowledge/lessons_learned
