Tenacious-Bench v0.1: a small B2B sales-outreach benchmark with contamination checks
Overview
General sales benchmarks often miss how real outbound agents fail: over‑claiming on weak signals, unsafe “bench” commitments, tone that drifts into pushy follow‑ups, and gaps between what the rep promises and what delivery can support. For a class project (TRP1 Week 11) I built Tenacious‑Bench v0.1, a compact, machine‑scored task set aimed at those failure modes—not generic helpfulness.
What’s in the dataset
The public release is on Hugging Face.
It currently shows 168 rows in the hub viewer, split as:
- train: 105 rows
- validation: 63 rows
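If you just want to poke at the splits, here's a minimal loading sketch using the standard `datasets` API. The repo id below is a placeholder, not the actual dataset path, so substitute the real one from the Hugging Face page.

```python
from datasets import load_dataset

# Placeholder repo id -- swap in the real dataset path from the Hub.
REPO_ID = "your-org/tenacious-bench-v0.1"

ds = load_dataset(REPO_ID)  # DatasetDict with the published splits
print(ds)                   # expect: train (105 rows), validation (63 rows)

for split_name, split in ds.items():
    print(split_name, len(split), split.column_names)
```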
Tasks mix several authoring modes—programmatic sweeps, multi‑LLM synthesis with judge filtering, trace‑informed scenarios, and hand‑authored adversarial cases—so the bench isn’t a single‑generator monoculture.
Each row includes (a one-row sketch follows the list):
- Structured inputs (prospect context, stack, headcount, signal confidence, bench availability, etc.)
- A candidate outreach payload (subject / body / CTA)
- Explicit ground‑truth expectations (e.g., when to hand off vs. qualify)
- A versioned scoring rubric so scores are reproducible without hand‑waving
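To make that concrete, here is roughly what a single record looks like. The field names are my guesses at the schema described above, not the dataset's actual column names.

```python
# Illustrative shape of one row; names are assumptions, not the real schema.
example_row = {
    "prospect_context": {
        "company": "Acme Logistics",
        "stack": ["Salesforce", "Outreach"],
        "headcount": 240,
        "signal_confidence": 0.35,   # weak signal: the agent should not over-claim
        "bench_available": False,    # no delivery capacity: no "we can start Monday"
    },
    "candidate_outreach": {
        "subject": "Quick question about your Q3 routing project",
        "body": "...",
        "cta": "Worth a 15-minute call next week?",
    },
    "expected": {
        "action": "qualify_further",   # vs. "hand_off" when confidence is high enough
        "must_not_commit_bench": True,
    },
    "rubric_version": "0.1.0",
}
```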
Why contamination and provenance matter
Synthetic benchmarks leak in boring ways: near‑duplicate phrasing across splits, embedding neighbors that are too close, or “eval” tasks that are effectively the same scenario as training with a date tweak. I run:
- n‑gram overlap checks
- Embedding similarity analysis
- An explicit signal‑window / provenance policy (time‑window labels that distinguish train/dev data from held‑out data)
Outcomes are recorded in a JSON report in the repository. The goal isn’t perfection—it’s to make leakage visible and actionable.
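For a sense of what the n‑gram check involves, here's a minimal sketch that writes its result to a small JSON report. It is not the project's actual script; the 8‑gram window and the "any shared n‑gram counts as a hit" rule are assumptions for illustration.

```python
import json

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All whitespace-token n-grams of a text, lowercased."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_rate(train_texts: list[str], val_texts: list[str], n: int = 8) -> float:
    """Fraction of validation rows sharing at least one n-gram with any train row."""
    train_grams: set[tuple[str, ...]] = set()
    for t in train_texts:
        train_grams |= ngrams(t, n)
    hits = sum(1 for v in val_texts if ngrams(v, n) & train_grams)
    return hits / max(len(val_texts), 1)

if __name__ == "__main__":
    # e.g. the concatenated subject + body of each row in each split
    train_texts = ["..."]
    val_texts = ["..."]
    report = {
        "ngram_n": 8,
        "val_rows_overlapping_train": overlap_rate(train_texts, val_texts),
    }
    print(json.dumps(report, indent=2))
```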
Training angle (Path B)
I’m not publishing a giant SFT corpus here; the project emphasizes a preference‑style critic path (ORPO/DPO‑style data preparation + LoRA training) to catch inconsistency and unsafe commitments. The dataset is the artifact reviewers can actually load; training code and logs live alongside the project README.
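For readers unfamiliar with the preference‑style setup, a scored row can be turned into a chosen/rejected pair along these lines. The field names and pairing rule are illustrative, not the project's actual pipeline; the output follows the usual prompt/chosen/rejected layout that preference trainers such as TRL's DPOTrainer consume.

```python
# Illustrative conversion of one benchmark row into a preference pair.
# "safe_outreach" / "unsafe_outreach" are hypothetical fields, not real columns.
def to_preference_pair(row: dict) -> dict:
    prompt = f"Prospect context: {row['prospect_context']}\nDraft an outreach email."
    return {
        "prompt": prompt,
        "chosen": row["safe_outreach"],      # hedges on weak signals, no bench promise
        "rejected": row["unsafe_outreach"],  # over-claims or commits unavailable bench
    }
```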
Limitations (stated plainly)
- Tasks are synthetic and English‑first; they don’t replace live A/B tests or compliance review.
- The bench is meant as a regression harness for product teams iterating on sales agents, not as proof of real‑world lift.
Call to action
If you’re building outbound agents, try grading your model on a slice of these tasks and compare against your internal rubric. I’m especially interested in cases where the model is “fluent” but violates bench/signal safety—those are the rows worth expanding next.