Tenacious-Bench: Building a Sales Domain Evaluation Benchmark When No Dataset Exists

Published: May 1, 2026, 03:13 PM EDT
Source: Dev.to

The Gap

General‑purpose LLM benchmarks like τ²‑Bench evaluate task completion in retail domains: cancelling orders, processing returns, checking inventory. They cannot answer the question a B2B sales team actually needs answered: does this outreach email say the right thing to the right buyer?

The Audit

We documented eight specific failure modes from real pipeline traces that existing benchmarks miss:

  • Segment misrouting – email pitched to the wrong buyer segment despite correct ICP classification
  • Signal overclaiming – asserting aggressive hiring intent from a single job post
  • Tone drift – condescension or urgency language that violates the style guide
  • Injection edge cases – prompt injection via the prospect notes field bypassing ToneGuard
  • Bench over‑commitment – promising consultant availability not reflected in the current bench summary
  • Competitor gap framing – technically correct gap analysis that reads as arrogant
  • AI maturity mismatch – pitching an ML platform migration to a company with no data layer
  • Multi‑thread leakage – simultaneous outreach to a co‑founder and a VP leaking context between the two threads

Each failure mode maps to at least three real traces from our Week 10 pipeline run.
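
These eight modes are the label space a judge has to learn. For concreteness, here is one way they could be encoded; a hypothetical sketch, with identifiers that are our own shorthand rather than the benchmark's published schema:

```python
# Hypothetical label set for the eight audited failure modes.
# Names are illustrative shorthand, not Tenacious-Bench's actual identifiers.
from enum import Enum

class FailureMode(str, Enum):
    SEGMENT_MISROUTING = "segment_misrouting"
    SIGNAL_OVERCLAIMING = "signal_overclaiming"
    TONE_DRIFT = "tone_drift"
    INJECTION_EDGE_CASE = "injection_edge_case"
    BENCH_OVERCOMMITMENT = "bench_overcommitment"
    COMPETITOR_GAP_FRAMING = "competitor_gap_framing"
    AI_MATURITY_MISMATCH = "ai_maturity_mismatch"
    MULTI_THREAD_LEAKAGE = "multi_thread_leakage"
```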

Building the Dataset With No Labeled Data

Tenacious had no historical labeled prospects, so we created all 202 tasks from scratch using a four‑mode authoring pipeline (the modes include programmatic generation and LLM synthesis, the two compared in the error analysis below).
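
The post doesn't publish the task schema, but a single task record might look something like this; field names are hypothetical, inferred from the error analysis later in the post:

```python
# Hypothetical shape of one benchmark task; field names are inferred,
# not official. ground_truth is PASS/FAIL; FAIL tasks carry a failure
# category matching the enum sketched above.
task = {
    "task_id": "tb-0042",
    "authoring_mode": "programmatic",  # one of the four authoring modes
    "prospect_context": "Series B fintech, 120 employees, hiring 2 data engineers",
    "email_draft": "Hi Dana, noticed you're scaling your data team...",
    "ground_truth": "FAIL",
    "failure_mode": "segment_misrouting",  # None when ground_truth == "PASS"
}
```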

The Training Experiment

Why Path B (preference‑tuned judge)?
Our failure modes are judgment failures, not generation failures. The pipeline already produces fluent, well‑written emails; the problem is that they sometimes target the wrong segment or overclaim a signal. Supervised fine‑tuning (SFT) would only polish the surface quality of already‑good emails, whereas a DPO‑trained judge learns to catch the judgment errors themselves.
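
The post doesn't include training code, but a minimal preference‑tuning setup along these lines is possible with HuggingFace TRL's DPOTrainer. The base model and the preference pair below are placeholders:

```python
# Minimal sketch of preference-tuning a judge with HuggingFace TRL.
# The base model name and the preference pair are placeholders.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "Qwen/Qwen2.5-7B-Instruct"  # placeholder; any causal LM works
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Each row pairs a good email (chosen) with one exhibiting a failure
# mode (rejected) for the same prospect prompt.
pairs = Dataset.from_list([{
    "prompt": "Prospect: seed-stage fintech, no data platform. Draft outreach.",
    "chosen": "Email scoped to what the bench summary actually supports...",
    "rejected": "Email pitching an ML platform migration...",  # AI maturity mismatch
}])

config = DPOConfig(output_dir="judge-dpo", beta=0.1, per_device_train_batch_size=2)
trainer = DPOTrainer(
    model=model,
    ref_model=None,              # TRL clones a frozen reference model automatically
    args=config,
    train_dataset=pairs,
    processing_class=tokenizer,  # `tokenizer=` on older TRL versions
)
trainer.train()
```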

Reward Formulation

reward = β × (log π_DPO(email | prompt) − log π_ref(email | prompt))
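
Concretely, scoring an email with this implicit reward only requires the summed token log‑probabilities of the email under the DPO‑tuned judge and the frozen reference model. A minimal sketch (model loading omitted; β = 0.1 is an assumed value):

```python
# Sketch: implicit DPO reward for one (prompt, email) pair.
# Assumes `judge` and `ref` are causal LMs sharing `tokenizer`.
import torch

def sequence_logprob(model, tokenizer, prompt: str, email: str) -> torch.Tensor:
    """Sum of log-probs of the email's tokens, conditioned on the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + email, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position t-1 predict the token at position t.
    # (BPE may merge across the prompt/email boundary; fine for a sketch.)
    logps = torch.log_softmax(logits[:, :-1], dim=-1)
    token_logps = logps.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logps[:, prompt_len - 1:].sum()  # email tokens only

def dpo_reward(judge, ref, tokenizer, prompt, email, beta=0.1):
    # reward = beta * (log pi_DPO - log pi_ref)
    return beta * (
        sequence_logprob(judge, tokenizer, prompt, email)
        - sequence_logprob(ref, tokenizer, prompt, email)
    )
```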

Known Limitations

  • Held‑out tasks: 25 of the 50 held‑out tasks contain a labeling artifact from the LLM synthesis pipeline (ground truth = FAIL with no failure category attached). This inflates error counts and suppresses accuracy on synthesis tasks (36% vs. 62% on programmatic tasks).

Future Work (Tenacious‑Bench v0.2)

  • Add multi‑turn trajectory tasks
  • Persona‑aware tone scoring
  • Live bench inventory validation
  • Double‑validation step for LLM‑synthesis ground truth

Resources

  • Dataset:
  • Judge LoRA:
  • Code repository: