Your Benchmarks Are Lying to You (And This 148-Star Crate Knows Why)

Published: February 28, 2026 at 08:11 PM EST
6 min read
Source: Dev.to

Overview

Microbenchmarks lie – not maliciously, but structurally. You write a tight loop, measure it a thousand times, compare two implementations, and declare a winner. Except your CPU was thermally throttling during the second run, or the OS scheduled a background process halfway through your baseline, or the memory allocator fragmented differently between runs because you ate lunch and came back.

Most benchmarking harnesses deal with this by collecting more samples and hoping statistics will save you: run it ten thousand times instead of a thousand, throw out outliers, compute confidence intervals. It helps, but it’s merely patching a fundamental problem: you measured the baseline and the candidate at different times, under different system conditions.

What if you didn’t have to?

Tango is a Rust micro‑benchmarking harness built by Denis Bazhenov around a simple idea: run the baseline and the candidate together, not sequentially.
Baseline → candidate → baseline → candidate … all within the same process, alternating on every iteration.

  • Thermal drift affects both equally.
  • Scheduling jitter affects both equally.

By the time you compare results, you’re comparing two things that experienced the same system conditions at the same moments. The project calls this “paired testing.” It produces tighter confidence intervals and fewer false positives than traditional sequential benchmarking.
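The idea can be sketched in a few lines of plain Rust. This is only an illustration of interleaved sampling (Tango’s actual implementation loads the two sides as separate dylibs and analyzes the samples statistically); `paired_samples` is a hypothetical helper name:

```rust
use std::time::Instant;

/// Measure `baseline` and `candidate` in alternating pairs, so both
/// closures experience the same thermal and scheduling conditions.
fn paired_samples(
    mut baseline: impl FnMut(),
    mut candidate: impl FnMut(),
    iterations: usize,
) -> Vec<i128> {
    let mut diffs = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        let t0 = Instant::now();
        baseline();
        let base_ns = t0.elapsed().as_nanos() as i128;

        let t1 = Instant::now();
        candidate();
        let cand_ns = t1.elapsed().as_nanos() as i128;

        // Keep the per-pair difference: noise that hits one iteration
        // (throttling, a scheduler preemption) hits both sides of it.
        diffs.push(cand_ns - base_ns);
    }
    diffs
}

fn main() {
    let diffs = paired_samples(
        || { std::hint::black_box((0..1_000).sum::<u64>()); },
        || { std::hint::black_box((0..2_000).sum::<u64>()); },
        1_000,
    );
    let mean = diffs.iter().sum::<i128>() / diffs.len() as i128;
    println!("mean candidate - baseline: {mean} ns");
}
```

Because the comparison is computed per pair, a slow patch of system time inflates both terms of the difference and largely cancels out.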

Stars: ~148 (at time of writing)
License: MIT (implied)

Project Summary

| Item | Details |
| --- | --- |
| Name | tango |
| Stars | ~148 |
| Maintainer | Solo developer, actively committing |
| Code health | Small, dense, well‑organized |
| Docs | Solid README with methodology explanation; API docs are thin |
| Contributor UX | Clear architecture, responsive maintainer, open to contributions |
| Worth using? | Yes, if you benchmark and care about result stability |

The whole workspace is about 3,900 lines of Rust, with the core tango-bench crate at ~3,350 lines – small for the functionality it provides. The architecture justifies its line count.

Technical Highlights

Paired‑Testing Implementation

  • Dynamic library loading via the libloading crate.
  • Benchmarks compile as dylibs; Tango loads two copies into the same process.
  • On Linux, it goes further with GOT/PLT patching (using goblin for ELF parsing) to interpose function calls.
  • On Windows, it patches the Import Address Table.

This is real systems programming, not a thin wrapper around std::time::Instant.

Benchmarking API

benchmark_fn("my_algorithm", |b| {
    b.iter(|| my_function(1000))
});

Metric selection is generic, chosen via turbofish syntax. The Metric trait is a single method:

pub trait Metric {
    fn measure_fn(f: impl FnMut()) -> u64;
}

The closure is wrapped, measured, and the result returned. The trait is monomorphized, so there’s no v‑table dispatch in the hot path.

  • WallClock uses std::time::Instant by default (or rdtscp directly with the hw-timer feature flag).
  • Users can swap metrics per‑benchmark:
b.metric::<CpuTime>().iter(|| …);
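Given the one‑method trait above, writing a custom metric is small. Below is a hedged sketch: the trait shape is reproduced locally, and `WallClockNanos` is a hypothetical implementation backed by `std::time::Instant` (the crate’s real `WallClock` may differ in detail):

```rust
use std::time::Instant;

// Local copy of the trait shape described above.
pub trait Metric {
    fn measure_fn(f: impl FnMut()) -> u64;
}

/// A wall-clock metric backed by `std::time::Instant`.
pub struct WallClockNanos;

impl Metric for WallClockNanos {
    fn measure_fn(mut f: impl FnMut()) -> u64 {
        let start = Instant::now();
        f();
        start.elapsed().as_nanos() as u64
    }
}

fn main() {
    // Monomorphized call: no v-table dispatch, the closure can inline.
    let ns = WallClockNanos::measure_fn(|| {
        std::hint::black_box((0..10_000).sum::<u64>());
    });
    println!("took {ns} ns");
}
```

Because `measure_fn` takes `impl FnMut()` rather than a `&dyn FnMut()`, each metric/closure combination compiles to its own specialized function, keeping the hot path free of indirect calls.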

Dependencies

| Crate | Purpose |
| --- | --- |
| clap | CLI parsing |
| rand | Shuffled iteration orders |
| libc / windows | Platform‑specific calls |
| goblin / scroll | ELF parsing & patching (Linux) |
| alloca | Stack‑allocated sampling buffers (keeps allocation noise out of measurements) |

No unnecessary bloat.

Rough Spots

  • API documentation beyond the README is sparse.
  • A lingering Clippy warning in cli.rs (function with too many arguments).
  • Test coverage is solid for core statistics & measurement code, thinner for CLI and dylib‑loading paths – typical for a focused solo project.

Recent Development: The Metric Trait

When Tango added the Metric trait in PR #60, it shipped with a single implementation: WallClock. An earlier PR (#57) proposed switching the default timer to clock_gettime(CLOCK_THREAD_CPUTIME_ID) for per‑thread CPU time, but the maintainer correctly pushed back: CPU time ≠ wall time (a sleep(100 ms) registers 100 ms wall time but near‑zero CPU time).

With the pluggable Metric trait in place, both can coexist.

My Contribution: CpuTime Metric

  • Unix – uses clock_gettime(CLOCK_THREAD_CPUTIME_ID) (nanosecond precision).
  • Windows – calls GetThreadTimes(GetCurrentThread()) and sums user + kernel time.

No new crates were needed; the implementation lives behind cfg attributes, mirroring WallClock. The change touches two files, ~115 lines total (including tests).
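On the Unix side, the measurement boils down to two `clock_gettime` calls around the closure. Here is a minimal Linux‑only sketch that declares the libc function directly instead of pulling in the `libc` crate; the struct layout and the constant value `3` for `CLOCK_THREAD_CPUTIME_ID` assume 64‑bit Linux, and the PR’s actual code may differ:

```rust
#[cfg(target_os = "linux")]
mod cpu_time {
    // Matches the C `struct timespec` layout on 64-bit Linux.
    #[repr(C)]
    struct Timespec {
        tv_sec: i64,
        tv_nsec: i64,
    }

    const CLOCK_THREAD_CPUTIME_ID: i32 = 3; // from <time.h> on Linux

    extern "C" {
        fn clock_gettime(clk_id: i32, tp: *mut Timespec) -> i32;
    }

    /// Nanoseconds of CPU time consumed by the current thread so far.
    pub fn thread_cpu_ns() -> u64 {
        let mut ts = Timespec { tv_sec: 0, tv_nsec: 0 };
        let rc = unsafe { clock_gettime(CLOCK_THREAD_CPUTIME_ID, &mut ts) };
        assert_eq!(rc, 0, "clock_gettime failed");
        ts.tv_sec as u64 * 1_000_000_000 + ts.tv_nsec as u64
    }
}

#[cfg(target_os = "linux")]
fn main() {
    use std::time::Duration;

    let before = cpu_time::thread_cpu_ns();
    std::thread::sleep(Duration::from_millis(50)); // 50 ms wall time, ~0 CPU time
    let after_sleep = cpu_time::thread_cpu_ns();

    let mut x = 0u64;
    while cpu_time::thread_cpu_ns() - after_sleep < 5_000_000 {
        x = x.wrapping_add(1); // busy loop: burns CPU time
    }
    std::hint::black_box(x);

    println!("sleep cost {} ns of CPU time", after_sleep - before);
}

#[cfg(not(target_os = "linux"))]
fn main() {}
```

The same shape carries over to Windows, with `GetThreadTimes` supplying user and kernel time in 100 ns ticks instead.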

Integration Test (PR #72)

#[test]
fn cpu_time_vs_sleep() {
    // Sleep 50 ms → low CPU time
    // Busy loop → high CPU time
    // Assert busy loop reports ≥10× more CPU time than sleep
}

The test demonstrates the metric’s purpose: thread::sleep consumes wall time but not CPU time.

Outlook

Tango targets Rust developers who benchmark and have been burned by inconsistent results. If you’ve ever seen a 5 % regression that turned out to be your laptop’s fan kicking in, the paired‑testing approach directly addresses that.

The project’s trajectory is clear:

  • Metric trait landed recently.
  • Async benchmark support is in progress.
  • The maintainer engages thoughtfully with PRs and issues.

It’s not a stalled side project – it’s actively evolving, and the clean architecture should support further growth.

What would push it further?

  • More extensive API documentation (e.g., Rustdoc for the public crate).
  • Expanded test coverage for Windows‑specific loading/patching paths.
  • Additional built‑in metrics (e.g., InstructionsRetired, CacheMisses).
  • Community‑driven plugins for custom statistical analysis.

If you benchmark Rust code and care about result stability, give Tango a try.

Review Bomb #4

What I’d like to see next: more metrics (instruction counts via perf_event_open, anyone?), better API docs, and broader awareness.

The paired‑testing methodology is the kind of idea that, once you understand it, makes sequential benchmarking feel obviously wrong. More people should know about it.

If you benchmark Rust code, go look at tango:

  • Read the Methodology section of the README.
  • Run one of the examples.
  • The paired‑testing approach is worth understanding even if you stick with Criterion for now.

How to get involved

  • ⭐️ Star the repo.
  • Try it on a real benchmark.
  • Pick up one of the open issues.

Here’s my PR adding CpuTime if you want to see what a small contribution looks like:

[PR #: Add CpuTime metric](https://github.com///pull/)