How I Built a Self-Correcting ML Pipeline That Runs for Days Without Human Intervention
Source: Dev.to
Overview
Most ML pipelines are batch jobs: data in, predictions out, human reviews results. But what if you need a pipeline that runs autonomously for days, continuously ingesting data from external sources, making decisions, and—critically—correcting its own mistakes without anyone watching?
That’s what I built for AstroLens, an open‑source tool for detecting astronomical anomalies in sky‑survey images. Version 1.1.0 introduces Streaming Discovery: a mode where the system runs for days, downloads tens of thousands of images from sky surveys, analyzes each one with an ML ensemble, and generates daily reports—while continuously adjusting its own parameters.
In a 3‑day validation run it:
- Processed 20,997 images
- Flagged 3,458 anomaly candidates
- Independently found known supernovae and gravitational lenses
- Ran 140 self‑correction cycles with zero errors and zero human intervention
Below is a walk‑through of the self‑correcting architecture.
The Core Pipeline
```
┌─────────────────────────────────────────────────────┐
│ STREAMING DISCOVERY │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ Survey │──▶│ ML │──▶│ Catalog ││
│ │ Ingestor │ │ Ensemble │ │ Cross‑reference ││
│ └─────┬────┘ └────┬─────┘ └────────┬─────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────┐│
│ │ State Manager (aiosqlite) ││
│ └──────────────────────┬──────────────────────────┘│
│ │ │
│ ┌──────────────────────▼──────────────────────────┐│
│ │ Self‑Correction Controller ││
│ │ • Threshold decay • Source rebalancing ││
│ │ • OOD recalibration • Error recovery ││
│ │ • YOLO auto‑retrain • Health monitoring ││
│ └───────────────────────────────────────────────┘│
│ │ │
│ ┌────▼────┐ │
│ │ Reports │ │
│ └─────────┘ │
└─────────────────────────────────────────────────────┘
```
The ML ensemble itself has three stages:
- Vision Transformer (ViT‑B/16) – feature extraction
- OOD detection ensemble – Mahalanobis + energy‑based + k‑NN with majority voting
- YOLOv8 – transient localization
The architecture decisions that matter most for autonomous operation aren’t in the models—they’re in everything around them.
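To make the voting stage concrete, here is a minimal sketch of majority voting across the three OOD detectors. The score values, thresholds, and the convention that "score above threshold" means a yes-vote are illustrative assumptions, not the pipeline's actual calibration:

```python
def majority_vote_ood(scores: dict, thresholds: dict) -> bool:
    """Flag an image as OOD if a majority of detectors exceed their thresholds."""
    votes = sum(scores[name] > thresholds[name] for name in scores)
    return votes > len(scores) / 2

# Example: Mahalanobis and energy vote "anomalous", k-NN does not -> 2 of 3 -> flagged
scores = {"mahalanobis": 12.4, "energy": -3.1, "knn": 0.9}
thresholds = {"mahalanobis": 10.0, "energy": -5.0, "knn": 1.2}
majority_vote_ood(scores, thresholds)  # → True
```

Requiring agreement between methods with different failure modes is what makes the ensemble's intersection more trustworthy than any single score.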
Self‑Correction #1 – Adaptive Threshold Decay
Running OOD detection for days leads to threshold drift. Initial thresholds are calibrated on a small reference set, but as thousands of images are ingested the notion of “normal” evolves.
The pipeline uses a decay function that gradually adjusts OOD thresholds based on the accumulated distribution of scores:
```
threshold_adjustment = base_threshold * (1 - decay_rate * log(1 + images_processed))
```
- Early in a run thresholds are tight—better to miss an anomaly than flood the system with false positives while the reference distribution is thin.
- Later thresholds relax proportionally to the logarithm of processed images. The logarithmic curve gives rapid early adjustment and diminishing change as the distribution stabilises.
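In code, the decay schedule looks roughly like this (the `decay_rate` default is an illustrative value, not the project's actual configuration):

```python
import math

def adjusted_threshold(base_threshold: float, images_processed: int,
                       decay_rate: float = 0.02) -> float:
    """Relax the OOD threshold logarithmically as more images accumulate."""
    return base_threshold * (1 - decay_rate * math.log(1 + images_processed))

adjusted_threshold(10.0, 0)      # → 10.0 (tight at the start of a run)
adjusted_threshold(10.0, 20000)  # ≈ 8.02 (relaxed after ~20k images)
```

The first few thousand images move the threshold quickly; by the end of a multi-day run, each new image barely nudges it.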
The system also tracks the running mean and variance of OOD scores per source. If a source’s score distribution shifts significantly (measured by a simple KL‑divergence check against the last calibration window), a recalibration is triggered.
Self‑Correction #2 – Source Rebalancing
AstroLens downloads from multiple sky surveys: SDSS, ZTF, DECaLS, Pan‑STARRS, Hubble, Galaxy Zoo, and specialised catalogs. Not all sources yield anomalies equally.
During the 3‑day run:
| Source | Anomaly Rate |
|---|---|
| ZTF (transient survey) | 32.5 % |
| Gravitational‑lens catalogs | 60 % |
| General‑purpose surveys | much lower |
The pipeline tracks anomaly yield per source over a rolling window and rebalances query allocation proportionally. Sources that produce more interesting candidates receive more queries; sources that mostly return normal galaxies receive fewer.
```python
# Simplified rebalancing logic
def compute_source_weights(source_stats, window_size=500):
    weights = {}
    for source, stats in source_stats.items():
        recent = stats.last_n(window_size)  # rolling window of recent queries
        anomaly_rate = recent.anomaly_count / max(recent.total_count, 1)
        error_rate = recent.error_count / max(recent.total_count, 1)
        # Reward productive sources, penalize flaky ones
        weights[source] = anomaly_rate * (1 - error_rate)
    total = sum(weights.values()) or 1  # guard against all-zero weights
    return {s: w / total for s, w in weights.items()}
```
A floor prevents any source from being starved completely—maintaining diversity in the input avoids feedback loops.
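One way to implement that floor (the 5 % minimum here is an assumed value, not the project's actual setting):

```python
def apply_weight_floor(weights: dict, floor: float = 0.05) -> dict:
    """Clamp every source weight to a minimum, then renormalize to sum to 1."""
    clamped = {s: max(w, floor) for s, w in weights.items()}
    total = sum(clamped.values())
    return {s: w / total for s, w in clamped.items()}

apply_weight_floor({"ztf": 0.7, "sdss": 0.3, "decals": 0.0})
# ≈ {"ztf": 0.667, "sdss": 0.286, "decals": 0.048}
```

Even a source with zero recent anomalies keeps a trickle of queries, so the pipeline can notice if that source becomes interesting again.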
Self‑Correction #3 – YOLO Auto‑Retrain
This yielded the most dramatic improvement. The YOLOv8 transient‑detection model started the 3‑day run at 51.5 % mAP₅₀ and finished at 99.5 %.
How it works
- As images are processed, high‑confidence OOD candidates that successfully cross‑reference with catalogs become pseudo‑labeled training data.
- When enough new labeled samples accumulate (configurable threshold), the pipeline triggers a YOLO fine‑tuning cycle on the expanded dataset.
The result is a continuously improving detector that adapts to the specific data distribution it sees in the wild.
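The trigger logic can be sketched like this. The `min_new_samples` default and the `fine_tune` callback are placeholders: in the real pipeline the threshold is configurable and the callback would be a YOLOv8 fine-tuning run on the expanded dataset:

```python
def maybe_retrain(pseudo_labels: list, trained_count: int,
                  fine_tune, min_new_samples: int = 200) -> int:
    """Fine-tune when enough new pseudo-labeled samples have accumulated
    since the last cycle; returns the updated trained-sample count."""
    new_samples = len(pseudo_labels) - trained_count
    if new_samples >= min_new_samples:
        fine_tune(pseudo_labels)  # e.g. a YOLOv8 training call on the expanded set
        return len(pseudo_labels)
    return trained_count
```

Gating on a minimum batch of new samples is what keeps early retrain cycles from over-fitting to a handful of pseudo-labels.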
Self‑Correction #4 – Error Recovery
External APIs fail. Survey servers rate‑limit you. Images download corrupted. Over a 3‑day run, things go wrong constantly.
The error‑recovery system operates at multiple levels:
Request level – Exponential back‑off with jitter on transient HTTP failures. Configurable retry limits per source.
Image level – Corrupted or unparseable images are logged and skipped. If a source’s error rate exceeds a threshold, it’s temporarily deprioritized (feeds into source rebalancing).
Pipeline level – The state manager persists every pipeline decision to aiosqlite. If the process crashes — power failure, OOM, whatever — it restarts and picks up from the last committed state. No image is processed twice, no candidate is lost.
Session level – Each streaming session tracks cumulative health metrics. If aggregate error rates or processing latency breach configurable bounds, the pipeline can pause, generate an interim report, and alert.
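The request-level retry is the classic exponential back-off with full jitter. A minimal sketch (retry limits, base delay, and the caught exception types are illustrative; real code would also catch the HTTP client's own error classes):

```python
import random
import time

def fetch_with_backoff(fetch, max_retries: int = 5, base_delay: float = 1.0):
    """Retry a flaky fetch with exponential back-off plus full jitter."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # Full jitter: sleep a random amount up to the exponential cap
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

The jitter matters when many queries hit the same throttled survey server: it spreads the retries out instead of hammering the server in synchronized waves.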
During the 3‑day run, 140 self‑correction events were triggered across these systems. All were handled autonomously.
Daily reporting
At configurable intervals (default: 24 h), the pipeline generates a Markdown report containing:
- Total images processed, anomaly candidates found, unique sky regions covered
- Per‑source breakdown: anomaly rates, error rates, top candidates
- YOLO training status and performance trajectory
- Threshold‑adjustment history
- Pipeline health: uptime, memory usage, processing‑rate trends
- Highlighted candidates with coordinates, survey thumbnails, and catalog cross‑matches
These reports serve as both audit logs and scientific outputs.
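Rendering such a report is mostly string assembly. A trimmed sketch, with invented field names standing in for the pipeline's actual stats schema:

```python
def render_daily_report(stats: dict) -> str:
    """Render a minimal Markdown daily report from accumulated session stats."""
    lines = [
        "# Streaming Discovery Daily Report",
        f"- Images processed: {stats['images_processed']}",
        f"- Anomaly candidates: {stats['anomaly_candidates']}",
        "",
        "## Per-source breakdown",
        "| Source | Anomaly rate | Error rate |",
        "|---|---|---|",
    ]
    for src, s in stats["sources"].items():
        lines.append(f"| {src} | {s['anomaly_rate']:.1%} | {s['error_rate']:.1%} |")
    return "\n".join(lines)
```

Emitting plain Markdown keeps the reports greppable, diffable, and readable both in a terminal and on GitHub.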
What went wrong (and how the system handled it)
Survey API rate limits – ZTF and SDSS both throttled requests during peak hours. The back‑off system handled this gracefully, but processing throughput varied by time of day. The pipeline adapted by shifting more queries to less‑loaded sources during peak hours.
YOLO training on small batches – Early retrain cycles had very few samples, risking over‑fitting. The minimum‑sample threshold prevents this from happening too early, but there’s a tension between waiting for enough data and getting improved detection sooner. The current heuristic works but could be more sophisticated.
OOD threshold sensitivity – The initial thresholds matter more than expected. If they’re too tight at the start, you miss anomalies that would have improved downstream training data. If too loose, early YOLO training data is noisy. The logarithmic decay is a compromise, but the cold‑start problem is real.
The Most Significant Detections
Found images are published here.
Every detection was cross‑referenced against SIMBAD, the international astronomical reference database maintained by the Centre de Données astronomiques de Strasbourg. All 269 queried positions returned matches to known catalogued objects, validating that the pipeline is looking at real astronomical sources.
Key takeaways
- State persistence is non‑negotiable for long‑running pipelines. Every decision, intermediate result, and parameter change must be recoverable. aiosqlite with WAL mode handles this well for single‑node operation.
- Self‑correction beats monitoring. Instead of alerting a human when thresholds drift, let the system adjust itself. Save human attention for genuinely novel decisions.
- Majority voting across diverse methods is more robust than any single OOD‑detection approach. Mahalanobis, energy‑based, and k‑NN each fail differently; their intersection is much more reliable.
- Logarithmic schedules appear everywhere in this system. Threshold decay, rebalancing windows, retrain triggers: diminishing adjustment as data accumulates is a recurring pattern.
- Test with known objects. The pipeline’s ability to independently find SN 2014J, NGC 3690, and SDSS J0252+0039 is validation, not discovery. Building this kind of ground truth into autonomous systems is essential for trust.
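The persistence pattern looks roughly like this. For brevity this sketch uses the synchronous sqlite3 module rather than aiosqlite, and the table schema is invented for illustration:

```python
import sqlite3

def open_state_db(path: str) -> sqlite3.Connection:
    """Open the state store in WAL mode so readers never block the writer."""
    db = sqlite3.connect(path)
    db.execute("PRAGMA journal_mode=WAL")
    db.execute("""CREATE TABLE IF NOT EXISTS processed (
        image_id TEXT PRIMARY KEY, verdict TEXT, score REAL)""")
    return db

def already_processed(db, image_id: str) -> bool:
    row = db.execute("SELECT 1 FROM processed WHERE image_id = ?",
                     (image_id,)).fetchone()
    return row is not None

def record(db, image_id: str, verdict: str, score: float) -> None:
    db.execute("INSERT OR IGNORE INTO processed VALUES (?, ?, ?)",
               (image_id, verdict, score))
    db.commit()  # every decision is durable before moving on
```

On restart, the pipeline checks `already_processed` before re-downloading anything, which is what guarantees no image is processed twice after a crash.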
Try it
AstroLens is MIT‑licensed and runs on any laptop (CPU, Apple Silicon, or NVIDIA GPU). Desktop app, web interface, and CLI.
```shell
git clone https://github.com/samantaba/astroLens.git
cd astroLens
pip install -r requirements.txt
python main.py --mode web
```
Contributions welcome — especially around new survey integrations, alternative OOD methods, and visualization improvements. If you’ve built long‑running ML pipelines, I’d love to hear what patterns worked for you.
Built by Saman Tabatabaeian — AWS Solutions Architect, Cloud & DevOps, ML Engineer.