Tenacious-Bench: Building a Sales Domain Evaluation Benchmark When No Dataset Exists

Published: May 1, 2026, 03:13 PM EDT
Source: Dev.to

The Gap

General‑purpose LLM benchmarks like τ²‑Bench evaluate task completion in retail domains: cancelling orders, processing returns, checking inventory. They cannot answer the question a B2B sales team actually needs answered: does this outreach email say the right thing to the right buyer?

The Audit

We documented eight specific failure modes from real pipeline traces that existing benchmarks miss:

  • Segment misrouting – email pitched to the wrong buyer segment despite correct ICP classification
  • Signal overclaiming – asserting aggressive hiring intent from a single job post
  • Tone drift – condescension or urgency language that violates the style guide
  • Injection edge cases – prompt injection via the prospect notes field bypassing ToneGuard
  • Bench over‑commitment – promising consultant availability not reflected in the current bench summary
  • Competitor gap framing – technically correct gap analysis that reads as arrogant
  • AI maturity mismatch – pitching an ML platform migration to a company with no data layer
  • Multi‑thread leakage – simultaneous outreach to a co‑founder and a VP leaking context between the two threads

Each failure mode maps to at least three real traces from our Week 10 pipeline run.
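
These eight modes are the label space a judge has to learn. For concreteness, here is one way they could be encoded; a hypothetical sketch, with identifiers that are our own shorthand rather than the benchmark's published schema:

```python
# Hypothetical label set for the eight audited failure modes.
# Names are illustrative shorthand, not Tenacious-Bench's actual identifiers.
from enum import Enum

class FailureMode(str, Enum):
    SEGMENT_MISROUTING = "segment_misrouting"
    SIGNAL_OVERCLAIMING = "signal_overclaiming"
    TONE_DRIFT = "tone_drift"
    INJECTION_EDGE_CASE = "injection_edge_case"
    BENCH_OVERCOMMITMENT = "bench_overcommitment"
    COMPETITOR_GAP_FRAMING = "competitor_gap_framing"
    AI_MATURITY_MISMATCH = "ai_maturity_mismatch"
    MULTI_THREAD_LEAKAGE = "multi_thread_leakage"
```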

Building the Dataset With No Labeled Data

Tenacious had no historical labeled prospects, so we created all 202 tasks from scratch using a four‑mode authoring pipeline (the modes include programmatic generation and LLM synthesis, the two compared in the error analysis below).
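
The post doesn't publish the task schema, but a single task record might look something like this; field names are hypothetical, inferred from the error analysis later in the post:

```python
# Hypothetical shape of one benchmark task; field names are inferred,
# not official. ground_truth is PASS/FAIL; FAIL tasks carry a failure
# category matching the enum sketched above.
task = {
    "task_id": "tb-0042",
    "authoring_mode": "programmatic",  # one of the four authoring modes
    "prospect_context": "Series B fintech, 120 employees, hiring 2 data engineers",
    "email_draft": "Hi Dana, noticed you're scaling your data team...",
    "ground_truth": "FAIL",
    "failure_mode": "segment_misrouting",  # None when ground_truth == "PASS"
}
```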

The Training Experiment

Why Path B (preference‑tuned judge)?
Our failure modes are judgment failures, not generation failures. The pipeline already produces fluent, well‑written emails; the problem is that they sometimes target the wrong segment or overclaim a signal. Supervised fine‑tuning (SFT) would only polish the surface quality of already‑good emails, whereas a DPO‑trained judge learns to catch the judgment errors themselves.
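
The post doesn't include training code, but a minimal preference‑tuning setup along these lines is possible with HuggingFace TRL's DPOTrainer. The base model and the preference pair below are placeholders:

```python
# Minimal sketch of preference-tuning a judge with HuggingFace TRL.
# The base model name and the preference pair are placeholders.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "Qwen/Qwen2.5-7B-Instruct"  # placeholder; any causal LM works
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Each row pairs a good email (chosen) with one exhibiting a failure
# mode (rejected) for the same prospect prompt.
pairs = Dataset.from_list([{
    "prompt": "Prospect: seed-stage fintech, no data platform. Draft outreach.",
    "chosen": "Email scoped to what the bench summary actually supports...",
    "rejected": "Email pitching an ML platform migration...",  # AI maturity mismatch
}])

config = DPOConfig(output_dir="judge-dpo", beta=0.1, per_device_train_batch_size=2)
trainer = DPOTrainer(
    model=model,
    ref_model=None,              # TRL clones a frozen reference model automatically
    args=config,
    train_dataset=pairs,
    processing_class=tokenizer,  # `tokenizer=` on older TRL versions
)
trainer.train()
```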

Reward Formulation

reward = β × (log π_DPO(email | prompt) − log π_ref(email | prompt))
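
Concretely, scoring an email with this implicit reward only requires the summed token log‑probabilities of the email under the DPO‑tuned judge and the frozen reference model. A minimal sketch (model loading omitted; β = 0.1 is an assumed value):

```python
# Sketch: implicit DPO reward for one (prompt, email) pair.
# Assumes `judge` and `ref` are causal LMs sharing `tokenizer`.
import torch

def sequence_logprob(model, tokenizer, prompt: str, email: str) -> torch.Tensor:
    """Sum of log-probs of the email's tokens, conditioned on the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + email, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position t-1 predict the token at position t.
    # (BPE may merge across the prompt/email boundary; fine for a sketch.)
    logps = torch.log_softmax(logits[:, :-1], dim=-1)
    token_logps = logps.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logps[:, prompt_len - 1:].sum()  # email tokens only

def dpo_reward(judge, ref, tokenizer, prompt, email, beta=0.1):
    # reward = beta * (log pi_DPO - log pi_ref)
    return beta * (
        sequence_logprob(judge, tokenizer, prompt, email)
        - sequence_logprob(ref, tokenizer, prompt, email)
    )
```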

Known Limitations

  • Held‑out tasks: 25 of the 50 held‑out tasks contain a labeling artifact from the LLM synthesis pipeline (ground truth = FAIL with no failure category attached). This inflates error counts and suppresses accuracy on synthesis tasks (36% vs. 62% on programmatic tasks).

Future Work (Tenacious‑Bench v0.2)

  • Add multi‑turn trajectory tasks
  • Persona‑aware tone scoring
  • Live bench inventory validation
  • Double‑validation step for LLM‑synthesis ground truth

Resources

  • Dataset:
  • Judge LoRA:
  • Code repository: