# How I built an AI tool that diagnoses CI/CD pipeline failures in seconds
Source: Dev.to
## What it does
When a CI/CD pipeline fails, PipelineIQ automatically:
- Captures the error logs
- Sends them to Claude AI for analysis
- Delivers a Slack alert with the exact root cause and fix steps — within seconds
### Example Slack alert

> 🔴 **Pipeline Failure:** Stripe API connection timeout blocking payment webhooks
>
> **AI Diagnosis:** The deployment is failing because the application cannot establish a connection to Stripe's API within the 30‑second timeout limit. This is preventing payment webhook processing.
>
> **Recommended Fix:** Check `STRIPE_SECRET_KEY` and `STRIPE_PUBLISHABLE_KEY` in production environment variables. Test connectivity to api.stripe.com from your deployment environment. Increase the API timeout from 30s to 60s.
No log diving. No guessing. Specific, actionable steps.
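An alert like the one above can be assembled with Slack's Block Kit. Here is a minimal sketch of a builder for that payload; `build_alert_blocks` is an illustrative helper (not PipelineIQ's actual code), and the `insight` keys mirror the JSON fields Claude returns later in the post:

```python
def build_alert_blocks(insight: dict, run: dict) -> list[dict]:
    """Build Slack Block Kit blocks for a pipeline-failure alert."""
    return [
        {
            "type": "header",
            "text": {
                "type": "plain_text",
                "text": f"🔴 Pipeline Failure: {insight['title']}",
            },
        },
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f"*AI Diagnosis:* {insight['diagnosis']}",
            },
        },
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f"*Recommended Fix:* {insight['recommendation']}",
            },
        },
        {
            # Context line pinpointing the failing run
            "type": "context",
            "elements": [
                {
                    "type": "mrkdwn",
                    "text": f"{run['repo_full_name']} · {run['branch']} · {run['commit_sha'][:7]}",
                }
            ],
        },
    ]
```

The resulting list would be sent to Slack as `{"blocks": [...]}` via an incoming webhook or `chat.postMessage`.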
## The stack
- FastAPI – Python backend with async support
- Supabase – PostgreSQL database with Row Level Security
- Anthropic Claude API – AI diagnosis engine
- Slack API – Rich block‑based alerts
- Railway – Production deployment
- GitHub Actions – Integration via one workflow step
## How the integration works
Add a single step to any existing GitHub Actions workflow:
```yaml
- name: Notify PipelineIQ
  if: always()
  run: |
    curl -X POST $PIPELINEIQ_URL/api/v1/pipelines/runs \
      -H "X-PipelineIQ-Key: $PIPELINEIQ_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "repo_full_name": "${{ github.repository }}",
        "branch": "${{ github.ref_name }}",
        "commit_sha": "${{ github.sha }}",
        "commit_message": "${{ github.event.head_commit.message }}",
        "workflow_name": "${{ github.workflow }}",
        "status": "${{ job.status }}",
        "started_at": "${{ github.event.head_commit.timestamp }}"
      }'
```
Every run—success or failure—is stored. Failures automatically trigger AI diagnosis.
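On the receiving end, the endpoint mainly needs to validate the payload and decide whether to kick off a diagnosis. A stdlib-only sketch of that decision logic, assuming the field names from the workflow payload above (`validate_run` and `should_diagnose` are illustrative names, not PipelineIQ's actual API):

```python
# Fields the webhook payload must carry before we store the run
REQUIRED_FIELDS = {
    "repo_full_name", "branch", "commit_sha",
    "workflow_name", "status",
}


def validate_run(payload: dict) -> list[str]:
    """Return the names of any required fields missing from the payload."""
    return sorted(REQUIRED_FIELDS - payload.keys())


def should_diagnose(payload: dict) -> bool:
    """Only failed runs trigger the AI diagnosis background task."""
    # GitHub Actions reports job.status as "success", "failure", or "cancelled"
    return payload.get("status") == "failure"
```

In the FastAPI route, a run that passes validation and fails `should_diagnose` checks would be stored and, if failed, handed to `BackgroundTasks.add_task(...)` so the webhook can return immediately.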
## The AI diagnosis engine
A FastAPI background task calls Claude as soon as a failed run is stored:

```python
async def run_ai_diagnosis(run: dict, org_id: str, supabase: Client):
    # Ask Claude for a structured diagnosis of the failed run
    insight = await diagnose_from_run(run)
    if not insight:
        return

    # Persist the insight, scoped to the org for Row Level Security
    supabase.table("insights").insert({
        "org_id": org_id,
        "severity": insight.get("severity"),
        "title": insight.get("title"),
        "diagnosis": insight.get("diagnosis"),
        "recommendation": insight.get("recommendation"),
        "estimated_time_save_minutes": insight.get("estimated_time_save_minutes"),
        "confidence": insight.get("confidence"),
    }).execute()

    # Push the rich Slack alert to the team
    await send_pipeline_alert(insight, run)
```
Claude returns structured JSON with severity, diagnosis, recommendation, confidence score, and estimated time saved. The whole process runs in under 5 seconds.
## What’s next
- Web dashboard with pipeline health across all repos
- DORA metrics (deployment frequency, change failure rate, recovery time)
- Environment drift detection
- Industry benchmarks — how does your team compare?
## Try it free
PipelineIQ is in free beta. I’m looking for engineering teams to try it and give honest feedback on what’s missing.
- Website: https://pipelineiq.dev
- API docs: https://pipelineiq-production-3496.up.railway.app/docs
Happy to answer questions in the comments—especially from DevOps engineers who deal with pipeline failures daily.