Building a Unified Benchmarking Pipeline for Computer Vision — Without Rewriting Code for Every Task
Source: Dev.to
Problem Overview
Every computer‑vision task ships with its own data format, output type, and metrics:
| Task | Data Format | Output | Metrics |
|---|---|---|---|
| Classification | Folder structure | Label index | Accuracy, F1 |
| Detection | COCO / YOLO JSON / TXT | Bounding boxes | mAP |
| Segmentation | PNG masks | Pixel‑level mask | IoU |
The initial setup suffered from:
- Non‑uniform pipelines
- Model‑specific scripts
- Benchmark‑specific code paths
- No consistent evaluation flow
To build a real benchmarking platform—not just a collection of scripts—we needed a unified execution model.
1. A Declarative Approach: One YAML Defines the Entire Benchmark
The first architectural decision was to replace hard‑coded logic with a declarative configuration model. Each benchmark is defined by a single YAML file that specifies:
- Task type
- Dataset format and paths
- Splits (train/val/test)
- Evaluation metrics
- Runtime parameters (device, batch size, etc.)
Example YAML
```yaml
task: detection
dataset:
  kind: coco
  root: datasets/fruit
  splits:
    val: val
eval:
  metrics: ["map50", "map"]
  device: auto
```
Why this matters
- The YAML becomes the single source of truth for the entire system.
- Adding a new benchmark only requires creating a new YAML file—no code changes, no new scripts, no duplicated logic.
- This design directly enables extensibility.
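For instance, switching to a classification benchmark is just another YAML file rather than a new script. A hypothetical example (the `kind` name and dataset path are illustrative, not taken from the article):

```yaml
task: classification
dataset:
  kind: classification_folder   # hypothetical kind handled by a folder-based adapter
  root: datasets/flowers        # hypothetical dataset path
  splits:
    val: val
eval:
  metrics: ["accuracy", "f1"]
  device: auto
```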
2. YAML Alone Isn’t Enough — Enter Pydantic AppConfig
YAML is flexible but fragile; a typo or missing field can break an evaluation. To enforce correctness, we built a strongly‑typed AppConfig layer using Pydantic models.
Features of AppConfig
- Deep validation – types, allowed values, required fields, structural consistency.
- Normalization – path resolution, default values, device handling, metric validation.
- Deterministic interpretation – converting YAML → stable Python object.
- Clear contract – DatasetAdapters, Runners, Metrics, and the UI all rely on the same structured config.
Example Pydantic Models
```python
from pathlib import Path
from typing import Dict, List

from pydantic import BaseModel


class DatasetConfig(BaseModel):
    kind: str
    root: Path
    splits: Dict[str, str]


class EvalConfig(BaseModel):
    metrics: List[str]
    device: str = "auto"
    batch_size: int = 16


class AppConfig(BaseModel):
    task: str
    dataset: DatasetConfig
    eval: EvalConfig
```
A correct AppConfig guarantees predictable pipeline behavior; an incorrect YAML is caught immediately, before any runner starts executing.
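A minimal loading step could look like the sketch below. The `load_config` helper and the YAML path are illustrative (not from the article), and `model_validate` assumes Pydantic v2:

```python
import yaml
from pydantic import ValidationError


def load_config(path: str) -> AppConfig:
    """Parse a benchmark YAML and validate it into an AppConfig."""
    with open(path) as f:
        raw = yaml.safe_load(f)
    try:
        return AppConfig.model_validate(raw)  # Pydantic v2; AppConfig(**raw) on v1
    except ValidationError as err:
        # Fail fast: a broken YAML never reaches a runner
        raise SystemExit(f"Invalid benchmark config {path}:\n{err}")


config = load_config("benchmarks/fruit_detection.yaml")  # hypothetical path
print(config.eval.metrics)  # ["map50", "map"]
```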
3. Unifying Inconsistent Formats: DatasetAdapters
After validation, the next challenge is handling incompatible dataset formats. We introduced a modular DatasetAdapter layer that converts any dataset into a uniform iteration interface:
```python
for image, target in adapter:
    # model inference
    ...
```
Available Adapters
- ClassificationFolderAdapter
- CocoDetectionAdapter
- YoloDetectionAdapter
- MaskSegmentationAdapter
Each adapter:
- Reads the original dataset.
- Converts annotations to a normalized structure.
- Exposes consistent outputs across tasks.
This eliminates dozens of conditional branches and format‑specific parsing logic.
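The article does not show the adapter code; a minimal sketch of the idea, with a hypothetical base class and a folder‑based classification adapter as an example, might look like this:

```python
from pathlib import Path
from typing import Any, Iterator, Tuple


class DatasetAdapter:
    """Hypothetical base class: every adapter yields (image, target) pairs."""
    def __iter__(self) -> Iterator[Tuple[Path, Any]]:
        raise NotImplementedError


class ClassificationFolderAdapter(DatasetAdapter):
    """Maps a folder-per-class layout onto the common iteration interface."""
    def __init__(self, root: Path, split: str):
        self.split_dir = root / split
        # Class names are the subfolder names, sorted for a stable label index.
        self.classes = sorted(p.name for p in self.split_dir.iterdir() if p.is_dir())

    def __iter__(self):
        for idx, cls in enumerate(self.classes):
            for img_path in sorted((self.split_dir / cls).glob("*.jpg")):
                yield img_path, idx  # normalized target: integer label index
```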
4. Task Runners: Executing Models Consistently Across Benchmarks
With datasets unified, we built three modular runners:
- ClassifierRunner
- DetectorRunner
- SegmenterRunner
All runners share the same API:
```python
result = runner.run(dataset, model, config)
```
Each runner handles:
- Forward passes
- Output normalization
- Prediction logging
- Metric computation
- Artifact generation
- Real‑time UI reporting
The design allows any model to run on any benchmark, provided the configuration matches.
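The runner internals are not shown in the article either; a rough sketch of the shared interface, with a hypothetical `BaseRunner`, a `RunResult` container, and an illustrative classification flow (the `model.predict()` call is an assumption), could be:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class RunResult:
    """Hypothetical result container shared by all runners."""
    metrics: Dict[str, float]
    predictions: List[Any] = field(default_factory=list)


class BaseRunner:
    def run(self, dataset, model, config) -> RunResult:
        raise NotImplementedError


class ClassifierRunner(BaseRunner):
    def run(self, dataset, model, config) -> RunResult:
        correct, total, preds = 0, 0, []
        for image, target in dataset:      # uniform adapter interface
            pred = model.predict(image)    # assumes a model exposing .predict()
            preds.append(pred)
            correct += int(pred == target)
            total += 1
        return RunResult(
            metrics={"accuracy": correct / max(total, 1)},
            predictions=preds,
        )
```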
5. From Script to System: Client–Server Architecture
To support multiple users and parallel evaluations, the project evolved into a full client–server system.
Server responsibilities
- Job scheduling and queue management
- Load balancing across workers
- Artifact storage (e.g., MinIO)
- Model/version tracking
- Failure isolation
Client (PyQt) responsibilities
- Uploading models
- Selecting benchmarks and configuring runs
- Viewing real‑time logs
- Comparing metrics across runs
- Downloading prediction artifacts
This architecture transformed the pipeline into a usable, scalable research tool.
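The article includes no server code; as a rough illustration of the job lifecycle only, one might model queued evaluations like this (all names and fields are hypothetical):

```python
import queue
import uuid
from dataclasses import dataclass, field


@dataclass
class EvalJob:
    """Hypothetical job record tracked by the server."""
    config_path: str                      # benchmark YAML chosen by the client
    model_uri: str                        # e.g. an object-store key for the model weights
    job_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: str = "queued"                # queued -> running -> done / failed


job_queue: "queue.Queue[EvalJob]" = queue.Queue()


def submit(config_path: str, model_uri: str) -> str:
    """Client-side call: enqueue an evaluation and return its id for polling."""
    job = EvalJob(config_path=config_path, model_uri=model_uri)
    job_queue.put(job)
    return job.job_id
```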
6. Key Engineering Lessons Learned
- Configuration should drive execution, not the other way around.
- Strong validation (Pydantic) saves hours of debugging.
- Adapters normalize complexity and prevent format‑specific logic explosion.
- Modular runners make task logic replaceable and easy to extend.
- Incremental evaluation is essential for real‑world datasets.
- Client–server separation turns a pipeline into a production‑grade system.
Conclusion
By combining:
- Declarative YAML configuration
- A strongly typed AppConfig layer
- Dataset normalization through adapters
- Modular runners
- Incremental computation
- A client–server architecture
we built a unified benchmarking pipeline capable of running any computer‑vision model on any benchmark—without writing new code for each task. This approach provides stability, extensibility, and reproducibility—essential qualities for a real‑world evaluation system.