Building a Unified Benchmarking Pipeline for Computer Vision — Without Rewriting Code for Every Task

Published: December 7, 2025 at 03:25 PM EST
3 min read
Source: Dev.to

Problem Overview

Each task came with its own data format, output type, and metrics:

| Task | Data Format | Output | Metrics |
| --- | --- | --- | --- |
| Classification | Folder structure | Label index | Accuracy, F1 |
| Detection | COCO / YOLO (JSON / TXT) | Bounding boxes | mAP |
| Segmentation | PNG masks | Pixel-level mask | IoU |

The initial state featured:

  • Non‑uniform pipelines
  • Model‑specific scripts
  • Benchmark‑specific code paths
  • No consistent evaluation flow

To build a real benchmarking platform—not just a collection of scripts—we needed a unified execution model.

1. A Declarative Approach: One YAML Defines the Entire Benchmark

The first architectural decision was to replace hard‑coded logic with a declarative configuration model. Each benchmark is defined by a single YAML file that specifies:

  • Task type
  • Dataset format and paths
  • Splits (train/val/test)
  • Evaluation metrics
  • Runtime parameters (device, batch size, etc.)

Example YAML

task: detection
dataset:
  kind: coco
  root: datasets/fruit
  splits:
    val: val
eval:
  metrics: ["map50", "map"]
  device: auto

Why this matters

  • The YAML becomes the single source of truth for the entire system.
  • Adding a new benchmark only requires creating a new YAML file—no code changes, no new scripts, no duplicated logic.
  • This design directly enables extensibility.
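For example, adding a hypothetical classification benchmark would only mean dropping another YAML file next to the existing ones. The dataset path, kind value, and metric identifiers below are illustrative assumptions, mirroring the detection example above:

task: classification
dataset:
  kind: folder
  root: datasets/animals
  splits:
    val: val
eval:
  metrics: ["accuracy", "f1"]
  device: auto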

2. YAML Alone Isn’t Enough — Enter Pydantic AppConfig

YAML is flexible but fragile; a typo or missing field can break an evaluation. To enforce correctness, we built a strongly‑typed AppConfig layer using Pydantic models.

Features of AppConfig

  • Deep validation – types, allowed values, required fields, structural consistency.
  • Normalization – path resolution, default values, device handling, metric validation.
  • Deterministic interpretation – converting YAML → stable Python object.
  • Clear contract – DatasetAdapters, Runners, Metrics, and the UI all rely on the same structured config.

Example Pydantic Models

from pathlib import Path
from typing import Dict, List
from pydantic import BaseModel

class DatasetConfig(BaseModel):
    kind: str                # dataset format, e.g. "coco", "yolo", or a classification folder
    root: Path
    splits: Dict[str, str]   # split name -> location, e.g. {"val": "val"}

class EvalConfig(BaseModel):
    metrics: List[str]       # e.g. ["map50", "map"]
    device: str = "auto"
    batch_size: int = 16

class AppConfig(BaseModel):
    task: str                # "classification", "detection", or "segmentation"
    dataset: DatasetConfig
    eval: EvalConfig
A correct AppConfig guarantees predictable pipeline behavior; an incorrect YAML is caught immediately, before any runner starts executing.
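As a minimal sketch of how this looks in practice (assuming PyYAML and an illustrative file name), loading and validating a benchmark definition takes a couple of lines, and a broken YAML surfaces as a readable error:

import yaml
from pydantic import ValidationError

try:
    with open("benchmarks/fruit_detection.yaml") as f:   # illustrative path
        cfg = AppConfig(**yaml.safe_load(f))
except ValidationError as err:
    # A typo like "metrcs" or a missing "root" field is reported here,
    # before any runner starts executing.
    raise SystemExit(f"Invalid benchmark config:\n{err}")

print(cfg.eval.batch_size)  # 16 -- default applied because the YAML omits it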

3. Unifying Inconsistent Formats: DatasetAdapters

After validation, the next challenge is handling incompatible dataset formats. We introduced a modular DatasetAdapter layer that converts any dataset into a uniform iteration interface:

for image, target in adapter:
    # model inference
    ...

Available Adapters

  • ClassificationFolderAdapter
  • CocoDetectionAdapter
  • YoloDetectionAdapter
  • MaskSegmentationAdapter

Each adapter:

  • Reads the original dataset.
  • Converts annotations to a normalized structure.
  • Exposes consistent outputs across tasks.

This eliminates dozens of conditional branches and format‑specific parsing logic.
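The post does not spell out the adapter interface itself; a minimal sketch of the contract, assuming adapters simply yield (image, target) pairs in a normalized form, could look like this:

from pathlib import Path
from typing import Any, Dict, Iterator, Protocol, Tuple

Target = Dict[str, Any]  # normalized annotation structure (assumed shape)

class DatasetAdapter(Protocol):
    """The uniform iteration contract every adapter satisfies."""
    def __iter__(self) -> Iterator[Tuple[Path, Target]]: ...

def evaluate(adapter: DatasetAdapter) -> None:
    # Task-agnostic consumption: the caller never touches COCO JSON,
    # YOLO TXT, or mask PNGs directly -- only (image, target) pairs.
    for image, target in adapter:
        ...  # model inference and metric updates happen here

Because every adapter satisfies the same contract, the evaluation loop stays identical whether the underlying data is a folder tree, a COCO JSON file, or a directory of PNG masks.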

4. Task Runners: Executing Models Consistently Across Benchmarks

With datasets unified, we built three modular runners:

  • ClassifierRunner
  • DetectorRunner
  • SegmenterRunner

All runners share the same API:

result = runner.run(dataset, model, config)

Each runner handles:

  • Forward passes
  • Output normalization
  • Prediction logging
  • Metric computation
  • Artifact generation
  • Real‑time UI reporting

The design allows any model to run on any benchmark, provided the configuration matches.
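How a runner gets selected is not shown in the post; one plausible sketch is a small registry keyed by the task field of the validated config (the module path in the import is hypothetical):

from runners import ClassifierRunner, DetectorRunner, SegmenterRunner  # hypothetical module

RUNNERS = {
    "classification": ClassifierRunner,
    "detection": DetectorRunner,
    "segmentation": SegmenterRunner,
}

def run_benchmark(config: AppConfig, model, dataset):
    runner_cls = RUNNERS.get(config.task)
    if runner_cls is None:
        raise ValueError(f"Unsupported task: {config.task}")
    # Same call regardless of task -- this is what lets any model run on
    # any benchmark as long as the configuration matches.
    return runner_cls().run(dataset, model, config)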

5. From Script to System: Client–Server Architecture

To support multiple users and parallel evaluations, the project evolved into a full client–server system.

Server responsibilities

  • Job scheduling and queue management
  • Load balancing across workers
  • Artifact storage (e.g., MinIO)
  • Model/version tracking
  • Failure isolation

Client (PyQt) responsibilities

  • Uploading models
  • Selecting benchmarks and configuring runs
  • Viewing real‑time logs
  • Comparing metrics across runs
  • Downloading prediction artifacts

This architecture transformed the pipeline into a usable, scalable research tool.
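From the client's point of view, submitting a run can be as simple as an HTTP upload. The endpoint, payload fields, and response shape below are purely hypothetical and only illustrate the division of responsibilities:

import requests

def submit_run(server_url: str, model_path: str, benchmark_yaml: str) -> str:
    """Upload a model plus its benchmark config and enqueue an evaluation job."""
    with open(model_path, "rb") as weights, open(benchmark_yaml, "rb") as cfg:
        resp = requests.post(f"{server_url}/jobs",
                             files={"model": weights, "config": cfg})
    resp.raise_for_status()
    return resp.json()["job_id"]  # used later to stream logs and fetch artifacts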

6. Key Engineering Lessons Learned

  • Configuration should drive execution, not the other way around.
  • Strong validation (Pydantic) saves hours of debugging.
  • Adapters normalize complexity and prevent format‑specific logic explosion.
  • Modular runners make task logic replaceable and easy to extend.
  • Incremental evaluation is essential for real‑world datasets.
  • Client–server separation turns a pipeline into a production‑grade system.

Conclusion

By combining:

  • Declarative YAML configuration
  • A strongly typed AppConfig layer
  • Dataset normalization through adapters
  • Modular runners
  • Incremental computation
  • A client–server architecture

we built a unified benchmarking pipeline capable of running any computer‑vision model on any benchmark—without writing new code for each task. This approach provides stability, extensibility, and reproducibility—essential qualities for a real‑world evaluation system.
