Building a Unified Benchmarking Pipeline for Computer Vision — Without Rewriting Code for Every Task

Published: December 7, 2025 at 03:25 PM EST
3 min read
Source: Dev.to

Problem Overview

Each task came with its own data format, output type, and metrics:

| Task | Data Format | Output | Metrics |
| --- | --- | --- | --- |
| Classification | Folder structure | Label index | Accuracy, F1 |
| Detection | COCO / YOLO (JSON / TXT) | Bounding boxes | mAP |
| Segmentation | PNG masks | Pixel-level mask | IoU |

The initial state featured:

  • Non‑uniform pipelines
  • Model‑specific scripts
  • Benchmark‑specific code paths
  • No consistent evaluation flow

To build a real benchmarking platform—not just a collection of scripts—we needed a unified execution model.

1. A Declarative Approach: One YAML Defines the Entire Benchmark

The first architectural decision was to replace hard‑coded logic with a declarative configuration model. Each benchmark is defined by a single YAML file that specifies:

  • Task type
  • Dataset format and paths
  • Splits (train/val/test)
  • Evaluation metrics
  • Runtime parameters (device, batch size, etc.)

Example YAML

task: detection
dataset:
  kind: coco
  root: datasets/fruit
  splits:
    val: val
eval:
  metrics: ["map50", "map"]
  device: auto

Why this matters

  • The YAML becomes the single source of truth for the entire system.
  • Adding a new benchmark only requires creating a new YAML file—no code changes, no new scripts, no duplicated logic.
  • This design directly enables extensibility.
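For example, adding a hypothetical classification benchmark would only mean dropping another YAML file next to the existing ones. The dataset path, kind value, and metric identifiers below are illustrative assumptions, mirroring the detection example above:

task: classification
dataset:
  kind: folder
  root: datasets/animals
  splits:
    val: val
eval:
  metrics: ["accuracy", "f1"]
  device: auto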

2. YAML Alone Isn’t Enough — Enter Pydantic AppConfig

YAML is flexible but fragile; a typo or missing field can break an evaluation. To enforce correctness, we built a strongly‑typed AppConfig layer using Pydantic models.

Features of AppConfig

  • Deep validation – types, allowed values, required fields, structural consistency.
  • Normalization – path resolution, default values, device handling, metric validation.
  • Deterministic interpretation – converting YAML → stable Python object.
  • Clear contract – DatasetAdapters, Runners, Metrics, and the UI all rely on the same structured config.

Example Pydantic Models

from pathlib import Path
from typing import Dict, List
from pydantic import BaseModel

class DatasetConfig(BaseModel):
    kind: str                # dataset format, e.g. "coco", "yolo", or a classification folder
    root: Path
    splits: Dict[str, str]   # split name -> location, e.g. {"val": "val"}

class EvalConfig(BaseModel):
    metrics: List[str]       # e.g. ["map50", "map"]
    device: str = "auto"
    batch_size: int = 16

class AppConfig(BaseModel):
    task: str                # "classification", "detection", or "segmentation"
    dataset: DatasetConfig
    eval: EvalConfig
A correct AppConfig guarantees predictable pipeline behavior; an incorrect YAML is caught immediately, before any runner starts executing.
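As a minimal sketch of how this looks in practice (assuming PyYAML and an illustrative file name), loading and validating a benchmark definition takes a couple of lines, and a broken YAML surfaces as a readable error:

import yaml
from pydantic import ValidationError

try:
    with open("benchmarks/fruit_detection.yaml") as f:   # illustrative path
        cfg = AppConfig(**yaml.safe_load(f))
except ValidationError as err:
    # A typo like "metrcs" or a missing "root" field is reported here,
    # before any runner starts executing.
    raise SystemExit(f"Invalid benchmark config:\n{err}")

print(cfg.eval.batch_size)  # 16 -- default applied because the YAML omits it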

3. Unifying Inconsistent Formats: DatasetAdapters

After validation, the next challenge is handling incompatible dataset formats. We introduced a modular DatasetAdapter layer that converts any dataset into a uniform iteration interface:

for image, target in adapter:
    # model inference
    ...

Available Adapters

  • ClassificationFolderAdapter
  • CocoDetectionAdapter
  • YoloDetectionAdapter
  • MaskSegmentationAdapter

Each adapter:

  • Reads the original dataset.
  • Converts annotations to a normalized structure.
  • Exposes consistent outputs across tasks.

This eliminates dozens of conditional branches and format‑specific parsing logic.
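The post does not spell out the adapter interface itself; a minimal sketch of the contract, assuming adapters simply yield (image, target) pairs in a normalized form, could look like this:

from pathlib import Path
from typing import Any, Dict, Iterator, Protocol, Tuple

Target = Dict[str, Any]  # normalized annotation structure (assumed shape)

class DatasetAdapter(Protocol):
    """The uniform iteration contract every adapter satisfies."""
    def __iter__(self) -> Iterator[Tuple[Path, Target]]: ...

def evaluate(adapter: DatasetAdapter) -> None:
    # Task-agnostic consumption: the caller never touches COCO JSON,
    # YOLO TXT, or mask PNGs directly -- only (image, target) pairs.
    for image, target in adapter:
        ...  # model inference and metric updates happen here

Because every adapter satisfies the same contract, the evaluation loop stays identical whether the underlying data is a folder tree, a COCO JSON file, or a directory of PNG masks.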

4. Task Runners: Executing Models Consistently Across Benchmarks

With datasets unified, we built three modular runners:

  • ClassifierRunner
  • DetectorRunner
  • SegmenterRunner

All runners share the same API:

result = runner.run(dataset, model, config)

Each runner handles:

  • Forward passes
  • Output normalization
  • Prediction logging
  • Metric computation
  • Artifact generation
  • Real‑time UI reporting

The design allows any model to run on any benchmark, provided the configuration matches.
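How a runner gets selected is not shown in the post; one plausible sketch is a small registry keyed by the task field of the validated config (the module path in the import is hypothetical):

from runners import ClassifierRunner, DetectorRunner, SegmenterRunner  # hypothetical module

RUNNERS = {
    "classification": ClassifierRunner,
    "detection": DetectorRunner,
    "segmentation": SegmenterRunner,
}

def run_benchmark(config: AppConfig, model, dataset):
    runner_cls = RUNNERS.get(config.task)
    if runner_cls is None:
        raise ValueError(f"Unsupported task: {config.task}")
    # Same call regardless of task -- this is what lets any model run on
    # any benchmark as long as the configuration matches.
    return runner_cls().run(dataset, model, config)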

5. From Script to System: Client–Server Architecture

To support multiple users and parallel evaluations, the project evolved into a full client–server system.

Server responsibilities

  • Job scheduling and queue management
  • Load balancing across workers
  • Artifact storage (e.g., MinIO)
  • Model/version tracking
  • Failure isolation

Client (PyQt) responsibilities

  • Uploading models
  • Selecting benchmarks and configuring runs
  • Viewing real‑time logs
  • Comparing metrics across runs
  • Downloading prediction artifacts

This architecture transformed the pipeline into a usable, scalable research tool.
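From the client's point of view, submitting a run can be as simple as an HTTP upload. The endpoint, payload fields, and response shape below are purely hypothetical and only illustrate the division of responsibilities:

import requests

def submit_run(server_url: str, model_path: str, benchmark_yaml: str) -> str:
    """Upload a model plus its benchmark config and enqueue an evaluation job."""
    with open(model_path, "rb") as weights, open(benchmark_yaml, "rb") as cfg:
        resp = requests.post(f"{server_url}/jobs",
                             files={"model": weights, "config": cfg})
    resp.raise_for_status()
    return resp.json()["job_id"]  # used later to stream logs and fetch artifacts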

6. Key Engineering Lessons Learned

  • Configuration should drive execution, not the other way around.
  • Strong validation (Pydantic) saves hours of debugging.
  • Adapters normalize complexity and prevent format‑specific logic explosion.
  • Modular runners make task logic replaceable and easy to extend.
  • Incremental evaluation is essential for real‑world datasets.
  • Client–server separation turns a pipeline into a production‑grade system.

Conclusion

By combining:

  • Declarative YAML configuration
  • A strongly typed AppConfig layer
  • Dataset normalization through adapters
  • Modular runners
  • Incremental computation
  • A client–server architecture

we built a unified benchmarking pipeline capable of running any computer‑vision model on any benchmark—without writing new code for each task. This approach provides stability, extensibility, and reproducibility—essential qualities for a real‑world evaluation system.
