I Built 4,882 Self-Healing AI Agents on 8 GB VRAM — Here's the Architecture
Source: Dev.to
Most AI agents break. They hit an error, they stop, they wait for a human. Mine don’t.
I built a system of 4,882 autonomous AI agents that detect failures, recover from them, and continue operating — all running on a single machine with 8 GB VRAM. No cloud. No API calls. No supervision.
In blind evaluation, my debate agents scored a 96.5 % win rate (201 out of 208 debates) with a 4.68/5.0 average judge score. I’m also competing in the $100K MedGemma Impact Challenge on Kaggle.
Here’s exactly how I built it.
The Problem: AI Agents Are Fragile
Most LLM‑based agents follow a simple pattern:
receive task → call model → return result
When something goes wrong — hallucination, timeout, OOM error — they crash or produce garbage.
The standard fix? Wrap everything in try‑catch and hope for the best. That’s duct‑tape, not architecture.
I needed agents that could detect their own failures, diagnose the root cause, and recover autonomously — at scale, on consumer hardware.
Architecture: The Self‑Healing Loop
Every agent runs inside a continuous self‑healing loop:
┌─────────────┐
│ EXECUTE │ ← Agent performs its task
└──────┬──────┘
│
┌──────▼──────┐
│ MONITOR │ ← Health scoring in real‑time
└──────┬──────┘
│
┌──────▼──────┐
│ RECOVER │ ← Cascading recovery strategies
└──────┬──────┘
│
└──────→ Back to EXECUTE
This isn’t a simple retry loop; each stage is its own subsystem with independent logic.
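In code, the loop above can be sketched like this. The `ToyAgent` stub and its method names (`run_task`, `try_recover`, and so on) are placeholders of mine, not the production agents:

```python
class ToyAgent:
    """Stand-in agent (hypothetical): degraded at first, healthy after one recovery."""
    def __init__(self):
        self.tasks, self.recovered, self.failed = 2, 0, False

    def has_work(self):
        return self.tasks > 0

    def run_task(self):
        self.tasks -= 1
        return "ok" if self.recovered else "degraded"

    def health_score(self, output):
        return 0.9 if output == "ok" else 0.4

    def try_recover(self, health):
        self.recovered += 1
        return True

    def mark_failed(self):
        self.failed = True

def self_healing_loop(agent, threshold=0.6):
    """EXECUTE -> MONITOR -> RECOVER, looping until work runs out or the agent fails."""
    while agent.has_work():
        output = agent.run_task()             # EXECUTE
        health = agent.health_score(output)   # MONITOR
        if health < threshold:                # RECOVER
            if not agent.try_recover(health):
                agent.mark_failed()
                return

agent = ToyAgent()
self_healing_loop(agent)
print(agent.failed, agent.recovered)  # False 1
```

The point of the sketch: recovery is part of the loop itself, not an exception handler bolted on from outside.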
1. Agent State Machine
Every agent maintains an explicit state:
```python
from enum import Enum

class AgentState(Enum):
    IDLE = "idle"
    RUNNING = "running"
    DEGRADED = "degraded"      # Functional but impaired
    RECOVERING = "recovering"  # Actively self-repairing
    FAILED = "failed"          # Requires external intervention
```
Key insight: DEGRADED ≠ FAILED. An agent that produces lower‑quality output is still useful—it just needs a lighter workload or a recovery cycle. This single distinction eliminated 73 % of false‑positive failures.
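As a minimal sketch of that distinction, a health score can map onto the state values above. The 0.8 / 0.5 / 0.2 cut-offs here are illustrative guesses of mine, not the system's tuned thresholds:

```python
def next_state(health: float) -> str:
    """Map a composite health score to an AgentState value.
    Cut-offs are illustrative, not the article's tuned thresholds."""
    if health >= 0.8:
        return "running"
    if health >= 0.5:
        return "degraded"    # impaired but still useful: lighten its workload
    if health >= 0.2:
        return "recovering"  # kick off a recovery cycle
    return "failed"          # the only case needing external intervention

print(next_state(0.6))  # degraded
```

Everything in the 0.2–0.8 band stays in service in some form, which is exactly where the false-positive reduction comes from.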
2. Health Scoring
Each agent computes a composite health score after every execution cycle:
```python
def compute_health(agent_output, context):
    scores = {
        "coherence": check_coherence(agent_output),
        "completeness": check_completeness(agent_output, context),
        "latency": check_latency(context.elapsed_time),
        "memory": check_memory_usage(),
        "consistency": check_cross_agent_consistency(agent_output),
    }
    # Keyed weights can't silently mis-pair if the dict above is reordered
    weights = {"coherence": 0.25, "completeness": 0.20, "latency": 0.15,
               "memory": 0.25, "consistency": 0.15}
    return sum(scores[k] * weights[k] for k in scores)
```
3. Cascading Recovery
When the health score drops below threshold, the agent walks a ladder of recovery strategies, cheapest first:

```python
if health_score < THRESHOLD:
    # Choose recovery strategy based on severity
    strategy = select_strategy(health_score)
    if strategy:
        return execute_strategy(strategy, agent)
    return agent.transition(AgentState.FAILED)
```
Most recoveries resolve at the first two levels; only 2.3 % of agents ever hit the FAILED state.
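The article doesn't show `select_strategy` itself; one way to sketch the cascading ladder — the strategy names and cut-offs below are my assumptions, not the production levels — is:

```python
def select_strategy(health: float):
    """Cheapest fix first; escalate as health drops.
    Levels and cut-offs are illustrative assumptions."""
    ladder = [
        (0.7, "retry_with_backoff"),       # level 1: transient glitch
        (0.5, "reduce_context_window"),    # level 2: lighten the workload
        (0.3, "reload_model_weights"),     # level 3: reset GPU state
        (0.1, "restart_from_checkpoint"),  # level 4: full restart
    ]
    for cutoff, strategy in ladder:
        if health >= cutoff:
            return strategy
    return None  # below every cut-off: transition to FAILED

print(select_strategy(0.55))  # reduce_context_window
```

The shape matters more than the exact levels: each rung is strictly cheaper than the one below it, so healthy-ish agents never pay for a full restart.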
Running 4,882 Agents on 8 GB VRAM
You can’t load 4,882 models into 8 GB. The trick is dynamic agent pooling—only ~12 agents are GPU‑resident at any time:
```python
from itertools import count
from queue import PriorityQueue

class AgentPool:
    def __init__(self, max_concurrent=12, vram_budget_mb=7168):
        self.active = PriorityQueue()  # Lowest priority value = least urgent, evicted first
        self.dormant = {}              # agent_id -> serialized state on CPU/disk
        self.vram_budget = vram_budget_mb
        self._tiebreak = count()       # Breaks priority ties without comparing agents

    def activate(self, agent_id, priority):
        # Evict least-urgent agents until we're under 85% of the VRAM budget
        while self.current_vram() > self.vram_budget * 0.85:
            _, _, evicted = self.active.get()
            self.dormant[evicted.id] = evicted.serialize()
            evicted.release_gpu()
        agent = self.dormant.pop(agent_id).deserialize()
        self.active.put((priority, next(self._tiebreak), agent))
        return agent
```
Combined with 4‑bit quantization and KV‑cache sharing across agents with similar context windows, the average activation latency is ~850 ms.
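The back-of-the-envelope capacity math is worth making explicit. The numbers below are my illustrative assumptions (roughly a 7B model quantized to 4 bits plus a per-agent KV cache), not measurements from the system:

```python
def max_concurrent_agents(vram_budget_mb, base_model_mb, kv_cache_per_agent_mb):
    """Rough capacity estimate: one shared 4-bit base model,
    plus one KV cache per GPU-resident agent."""
    return max(0, (vram_budget_mb - base_model_mb) // kv_cache_per_agent_mb)

# Illustrative numbers: ~4.3 GB for a 4-bit 7B model, ~240 MB KV cache per agent
print(max_concurrent_agents(7168, 4300, 240))  # 11
```

Under these assumed sizes the budget supports roughly a dozen resident agents, in the same ballpark as the ~12 figure above; the other 4,800+ agents exist as serialized state waiting their turn.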
The Debate System: 96.5 % Win Rate
The self‑healing architecture powers a multi‑agent debate system where agents argue opposing positions, and a judge agent scores them.
| Metric | Score |
|---|---|
| Win Rate | 96.5 % (201/208) |
| Avg Judge Score | 4.68/5.0 |
| Overall Quality | 93.6 % |
| Accessibility | 5.0/5.0 |
| Analogy Quality | 5.0/5.0 |
| Actionability | 4.6/5.0 |
| Safety Score | 4.6/5.0 |
| Medical Accuracy | 4.4/5.0 |
| Completeness | 4.5/5.0 |
These numbers come from blind evaluation using an independent LLM judge with structured rubrics—not self‑reporting.
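A judge with a structured rubric reduces to something like the sketch below. The rubric dimensions come from the table above; the equal weighting is my assumption, since the article doesn't publish its judge weights:

```python
def judge_score(ratings: dict) -> float:
    """Aggregate per-dimension rubric ratings (1-5) into one judge score.
    Equal weighting is an assumption, not the article's scheme."""
    return round(sum(ratings.values()) / len(ratings), 2)

# Dimension ratings taken from the results table above
debate_ratings = {
    "accessibility": 5.0, "analogy_quality": 5.0, "actionability": 4.6,
    "safety": 4.6, "medical_accuracy": 4.4, "completeness": 4.5,
}
print(judge_score(debate_ratings))  # 4.68
```

Averaging the table's six dimensions does land on 4.68, matching the reported judge score, though that may be coincidental if the real judge weights dimensions unevenly.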
What I’m Building Now: MedGemma Health Education
I’m applying this architecture to health education through the CellRepair Health Educator — a MedGemma‑powered AI that explains medical topics in plain language.
- 🎯 Live Demo: Hugging Face Space
- 💻 GitHub: 20+ open-source repos
- 📺 YouTube Demo: YouTube Video
- 🏆 Kaggle: MedGemma Impact Challenge – https://www.kaggle.com/competitions/medgemma-impact-challenge
- 🌐 cellrepair.ai – https://cellrepair.ai/
> **Currently competing in the $100K MedGemma Impact Challenge**
> **Deadline:** February 25, 2026
Key Takeaways
- DEGRADED ≠ FAILED – Most “failures” are recoverable if you detect them early.
- Cascading recovery beats retry loops – Try cheap fixes first; escalate only when needed.
- You don’t need a GPU cluster – Smart pooling + quantization lets you run thousands of agents on consumer hardware.
- Health scoring is everything – If you can’t measure agent health, you can’t fix it.
What’s the biggest reliability challenge you’ve hit building with AI agents? Drop a comment — I read and respond to every one.
Follow Me
- GitHub: https://github.com/PowerForYou74
- X/Twitter: https://twitter.com/cellrepairai
- LinkedIn: https://www.linkedin.com/in/cellrepair-systems
- Kaggle: https://www.kaggle.com/cellrepairai