Tenacious-Bench v0.1: a small B2B sales-outreach benchmark with contamination checks
Overview
General sales benchmarks often miss how real outbound agents fail: over‑claiming on weak signals, unsafe “bench” commitments, tone that drifts into pushy follow‑ups, and gaps between what the rep promises and what delivery can support. For a class project (TRP1 Week 11) I built Tenacious‑Bench v0.1, a compact, machine‑scored task set aimed at those failure modes—not generic helpfulness.
What’s in the dataset
The public release is on Hugging Face.
It currently shows 168 rows in the hub viewer, split as:
- train: 105 rows
- validation: 63 rows
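If you just want to poke at the splits, here's a minimal loading sketch using the standard `datasets` API. The repo id below is a placeholder, not the actual dataset path, so substitute the real one from the Hugging Face page.

```python
from datasets import load_dataset

# Placeholder repo id -- swap in the real dataset path from the Hub.
REPO_ID = "your-org/tenacious-bench-v0.1"

ds = load_dataset(REPO_ID)  # DatasetDict with the published splits
print(ds)                   # expect: train (105 rows), validation (63 rows)

for split_name, split in ds.items():
    print(split_name, len(split), split.column_names)
```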
Tasks mix several authoring modes—programmatic sweeps, multi‑LLM synthesis with judge filtering, trace‑informed scenarios, and hand‑authored adversarial cases—so the bench isn’t a single‑generator monoculture.
Each row includes (a one-row sketch follows the list):
- Structured inputs (prospect context, stack, headcount, signal confidence, bench availability, etc.)
- A candidate outreach payload (subject / body / CTA)
- Explicit ground‑truth expectations (e.g., when to hand off vs. qualify)
- A versioned scoring rubric so scores are reproducible without hand‑waving
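To make that concrete, here is roughly what a single record looks like. The field names are my guesses at the schema described above, not the dataset's actual column names.

```python
# Illustrative shape of one row; names are assumptions, not the real schema.
example_row = {
    "prospect_context": {
        "company": "Acme Logistics",
        "stack": ["Salesforce", "Outreach"],
        "headcount": 240,
        "signal_confidence": 0.35,   # weak signal: the agent should not over-claim
        "bench_available": False,    # no delivery capacity: no "we can start Monday"
    },
    "candidate_outreach": {
        "subject": "Quick question about your Q3 routing project",
        "body": "...",
        "cta": "Worth a 15-minute call next week?",
    },
    "expected": {
        "action": "qualify_further",   # vs. "hand_off" when confidence is high enough
        "must_not_commit_bench": True,
    },
    "rubric_version": "0.1.0",
}
```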
Why contamination and provenance matter
Synthetic benchmarks leak in boring ways: near‑duplicate phrasing across splits, embedding neighbors that are too close, or “eval” tasks that are effectively the same scenario as training with a date tweak. I run:
- n‑gram overlap checks
- Embedding similarity analysis
- An explicit signal‑window / provenance policy (time‑window labels that distinguish train/dev data from held‑out data)
Outcomes are recorded in a JSON report in the repository. The goal isn’t perfection—it’s to make leakage visible and actionable.
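For a sense of what the n‑gram check involves, here's a minimal sketch that writes its result to a small JSON report. It is not the project's actual script; the 8‑gram window and the "any shared n‑gram counts as a hit" rule are assumptions for illustration.

```python
import json

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All whitespace-token n-grams of a text, lowercased."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_rate(train_texts: list[str], val_texts: list[str], n: int = 8) -> float:
    """Fraction of validation rows sharing at least one n-gram with any train row."""
    train_grams: set[tuple[str, ...]] = set()
    for t in train_texts:
        train_grams |= ngrams(t, n)
    hits = sum(1 for v in val_texts if ngrams(v, n) & train_grams)
    return hits / max(len(val_texts), 1)

if __name__ == "__main__":
    # e.g. the concatenated subject + body of each row in each split
    train_texts = ["..."]
    val_texts = ["..."]
    report = {
        "ngram_n": 8,
        "val_rows_overlapping_train": overlap_rate(train_texts, val_texts),
    }
    print(json.dumps(report, indent=2))
```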
Training angle (Path B)
I’m not publishing a giant SFT corpus here; the project emphasizes a preference‑style critic path (ORPO/DPO‑style data preparation + LoRA training) to catch inconsistency and unsafe commitments. The dataset is the artifact reviewers can actually load; training code and logs live alongside the project README.
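For readers unfamiliar with the preference‑style setup, a scored row can be turned into a chosen/rejected pair along these lines. The field names and pairing rule are illustrative, not the project's actual pipeline; the output follows the usual prompt/chosen/rejected layout that preference trainers such as TRL's DPOTrainer consume.

```python
# Illustrative conversion of one benchmark row into a preference pair.
# "safe_outreach" / "unsafe_outreach" are hypothetical fields, not real columns.
def to_preference_pair(row: dict) -> dict:
    prompt = f"Prospect context: {row['prospect_context']}\nDraft an outreach email."
    return {
        "prompt": prompt,
        "chosen": row["safe_outreach"],      # hedges on weak signals, no bench promise
        "rejected": row["unsafe_outreach"],  # over-claims or commits unavailable bench
    }
```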
Limitations (stated plainly)
- Tasks are synthetic and English‑first; they don’t replace live A/B tests or compliance review.
- The bench is meant as a regression harness for product teams iterating on sales agents, not as proof of real‑world lift.
Call to action
If you’re building outbound agents, try grading your model on a slice of these tasks and compare against your internal rubric. I’m especially interested in cases where the model is “fluent” but violates bench/signal safety—those are the rows worth expanding next.