AI Trading: Lesson Learned #129: Backtest Evaluation Bugs Discovered via Deep Research
Source: Dev.to
Context
The CEO requested deep research into Anthropic's "Demystifying evals for AI agents" article to determine whether its evaluation framework would improve our trading system.
Discovery
Analyzing the article against our existing infrastructure revealed that we already have more rigorous evaluation machinery than typical AI agent evals, because ours carries real financial accountability. However, the research uncovered critical bugs in our existing evaluation system:
Bugs Found and Fixed
Bug 1: Slippage Model Disabled
- Location: `scripts/run_backtest_matrix.py`
- Issue: The code claimed `slippage_model_enabled: True`, but all execution costs were hard-coded to `0.0`.
- Impact: Backtests overestimated returns by 20–50% (per `slippage_model.py` documentation).
- Fix: Integrated the actual `SlippageModel` into backtest execution, applying slippage and fees to trades.
```python
# scripts/run_backtest_matrix.py
slippage_model_enabled = True

# before: execution_cost = 0.0
# after: apply the real slippage model to each trade
execution_cost = SlippageModel.calculate(trade)
```
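For illustration, here is a minimal sketch of the corrected execution path; the `Trade` dataclass, the `BPS` constant, and the `calculate` signature are assumptions standing in for the repository's actual classes:

```python
from dataclasses import dataclass

@dataclass
class Trade:
    symbol: str
    quantity: float
    price: float

class SlippageModel:
    """Hypothetical stand-in for the repo's slippage_model.py."""
    BPS = 5  # assumed slippage of 5 basis points per fill

    @classmethod
    def calculate(cls, trade: Trade) -> float:
        # cost scales with the notional value of the trade
        notional = abs(trade.quantity) * trade.price
        return notional * cls.BPS / 10_000

def net_pnl(trade: Trade, gross_pnl: float) -> float:
    # net results now reflect execution costs instead of a hard-coded 0.0
    return gross_pnl - SlippageModel.calculate(trade)
```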
Bug 2: Win Rate Without Context (ll_118 violation)
- Location: `scripts/run_backtest_matrix.py` and output JSONs
- Issue: Win rate was displayed without the accompanying average return, allowing misleading metrics (e.g., an "80% win rate" with a -6.97% average return).
- Impact: False confidence in strategy performance.
- Fix: Added an `avg_return_pct` field and a `win_rate_with_context` string that always shows both together.
```json
{
  "win_rate": 0.80,
  "avg_return_pct": -6.97,
  "win_rate_with_context": "80% win rate (-6.97% avg return)"
}
```
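As a sketch, one way such a field could be derived from per-trade returns; the helper below is illustrative, not the repository's actual code:

```python
def win_rate_with_context(returns_pct: list[float]) -> dict:
    """Pair win rate with average return so neither is shown alone (per ll_118)."""
    wins = sum(1 for r in returns_pct if r > 0)
    win_rate = wins / len(returns_pct)
    avg_return = sum(returns_pct) / len(returns_pct)
    return {
        "win_rate": round(win_rate, 2),
        "avg_return_pct": round(avg_return, 2),
        "win_rate_with_context": f"{win_rate:.0%} win rate ({avg_return:+.2f}% avg return)",
    }

# e.g. many small wins wiped out by two large losses:
# win_rate_with_context([1, 1, 1, 1, 1, 1, 1, 1, -35, -35])
# -> "80% win rate (-6.20% avg return)"
```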
Bug 3: Missing Bidirectional Learning
- Location: `src/analytics/live_vs_backtest_tracker.py`
- Issue: Tracked live slippage but didn't sync the observations back to backtest assumptions.
- Impact: The same outdated slippage assumptions were reused despite real-world evidence.
- Fix: Added a `sync_to_backtest_assumptions()` method and a `load_live_slippage_assumptions()` helper for backtests.
```python
# src/analytics/live_vs_backtest_tracker.py
def sync_to_backtest_assumptions(self):
    # feed observed live slippage back into the backtest configuration
    live_slippage = self.load_live_slippage_assumptions()
    self.backtest_config.update(slippage=live_slippage)
```
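A self-contained sketch of the feedback loop under stated assumptions; the field names, basis-point units, and use of a median are illustrative choices, not the tracker's real API:

```python
from statistics import median

class LiveVsBacktestTracker:
    def __init__(self, backtest_config: dict):
        self.backtest_config = backtest_config
        self.live_slippage_bps: list[float] = []

    def record_fill(self, expected_price: float, fill_price: float) -> None:
        # store observed slippage in basis points for each live fill
        slip = abs(fill_price - expected_price) / expected_price * 10_000
        self.live_slippage_bps.append(slip)

    def load_live_slippage_assumptions(self) -> float:
        # median is robust to occasional outlier fills
        return median(self.live_slippage_bps)

    def sync_to_backtest_assumptions(self) -> None:
        # close the loop: backtests consume observed, not assumed, slippage
        self.backtest_config.update(slippage=self.load_live_slippage_assumptions())
```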
Key Insight
The Anthropic article is useful for evaluating LLM agents (like Claude Code), NOT for trading systems.
Our trading system already includes:
- Quantitative metrics (Sharpe, win rate, drawdown)
- Survival-gate validation (95% capital preservation; sketched after this list)
- 19 historical scenarios, including crash replays
- Live vs. backtest tracking
- Anomaly detection
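A hedged sketch of what such a survival gate can look like; the 95% floor comes from the list above, while the function shape and equity-curve input are assumptions:

```python
def survival_gate(equity_curve: list[float], floor: float = 0.95) -> bool:
    """Pass only if equity never falls below 95% of starting capital."""
    start = equity_curve[0]
    return min(equity_curve) >= floor * start

# survival_gate([100_000, 98_500, 97_200, 101_300])  -> True
# survival_gate([100_000, 93_000, 104_000])          -> False (breached the floor)
```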
The real value of the research was uncovering implementation bugs, not adopting a new evaluation framework.
Prevention
- Code must actually implement what documentation claims (e.g., `slippage_model_enabled`).
- Always display average return together with win rate, per ll_118.
- Implement bidirectional feedback loops from production to testing.
- Regularly audit evaluation infrastructure for silent failures (see the sketch after this list).
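For example, a hypothetical regression test for the Bug 1 failure mode; `run_backtest_execution` and the config shape are invented for illustration, and `Trade` is the sketch class from above:

```python
def test_slippage_actually_applied():
    # guard against the config claiming slippage while execution ignores it
    config = {"slippage_model_enabled": True}
    trade = Trade(symbol="BTC-USD", quantity=1.0, price=50_000.0)
    cost = run_backtest_execution(trade, config)  # hypothetical entry point
    assert cost > 0.0, "slippage enabled in config but not applied in execution"
```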
Files Changed
- `scripts/run_backtest_matrix.py` – Integrated slippage model, added `avg_return_pct`.
- `src/analytics/live_vs_backtest_tracker.py` – Added bidirectional learning functions.
Tags
#evaluation #backtest #slippage #win-rate #bidirectional-learning #system-audit
This lesson was auto‑published from our AI Trading repository.
More lessons: rag_knowledge/lessons_learned