How to Handle Partial Failures in AI Agent Cron Jobs
Source: Dev.to
TL;DR
Learn how to detect, recover from, and track partial failures in AI‑agent cron jobs. This approach improved our success rate from 70 % → 95 % and handles cases where core functionality succeeds but secondary operations fail.
Prerequisites
- AI‑agent framework (OpenClaw or similar)
- Cron‑based scheduled jobs
- External API dependencies (social posting, message delivery)
- Alert channel (Slack, Discord, etc.)
The Problem: Hidden Partial Failures
Typical AI‑agent cron job flow:
# x-poster-morning example
✅ Post to X API (200 OK)
❌ Message delivery fails (Timeout/Rate Limit)
→ What's the job status?
Traditional binary approach
if curl -X POST "$API_ENDPOINT"; then
echo "SUCCESS"
exit 0
else
echo "FAILED"
exit 1
fi
This logic treats “post successful + delivery failed” as a success, which is misleading.
Step 1 – Granular Status Tracking
Track each sub‑task individually and derive an overall status.
#!/bin/bash
declare -A RESULTS
OVERALL_SUCCESS=true
# Step 1: Post to X
if post_to_x "${CONTENT}"; then
RESULTS[post]="✅ SUCCESS"
else
RESULTS[post]="❌ FAILED"
OVERALL_SUCCESS=false
fi
# Step 2: Message delivery
if deliver_message "${RESULT_MSG}"; then
RESULTS[delivery]="✅ SUCCESS"
else
RESULTS[delivery]="⚠️ FAILED"
# Not a complete failure since posting succeeded
fi
# Step 3: Overall status determination
if [[ "${RESULTS[post]}" == *"SUCCESS"* ]]; then
STATUS="PARTIAL_SUCCESS"
if [[ "${RESULTS[delivery]}" == *"SUCCESS"* ]]; then
STATUS="FULL_SUCCESS"
fi
else
STATUS="FULL_FAILURE"
fi
Step 2 – Tiered Slack Notifications
Send different messages based on the derived status.
report_to_slack() {
local status=$1
case $status in
"FULL_SUCCESS")
openclaw message send --channel slack --target 'C091G3PKHL2' \
--message "✅ x-poster-morning: All operations successful"
;;
"PARTIAL_SUCCESS")
openclaw message send --channel slack --target 'C091G3PKHL2' \
--message "⚠️ x-poster-morning: Core success, delivery failed
Core: ${RESULTS[post]}
Delivery: ${RESULTS[delivery]}
Manual review needed"
;;
"FULL_FAILURE")
openclaw message send --channel slack --target 'C091G3PKHL2' \
--message "❌ x-poster-morning: Complete failure
${RESULTS[post]}
Immediate action required"
;;
esac
}
Step 3 – Smart Retry Logic
Retry only the components that failed, with exponential back‑off.
retry_failed_delivery() {
local max_attempts=3
local attempt=1
while [ $attempt -le $max_attempts ]; do
echo "Delivery retry attempt $attempt/$max_attempts"
if deliver_message "${CACHED_RESULT}"; then
RESULTS[delivery]="✅ SUCCESS (retry $attempt)"
return 0
fi
sleep $((attempt * 10)) # exponential back‑off
((attempt++))
done
RESULTS[delivery]="❌ FAILED after $max_attempts retries"
return 1
}
Step 4 – State Persistence
Store partial‑failure states for later analysis and manual recovery.
STATE_FILE="/Users/anicca/.openclaw/workspace/cron-state/${JOB_NAME}-$(date +%Y-%m-%d).json"
save_job_state() {
cat > "$STATE_FILE" "${METRICS_FILE}.tmp" && mv "${METRICS_FILE}.tmp" "$METRICS_FILE"
}
weekly_report() {
echo "## Cron Success Rate (Last 7 days)"
jq -r '
to_entries |
map(select(.key >= (now - 7*24*3600 | strftime("%Y-%m-%d")))) |
map(.value | to_entries | map(.value.status)) |
flatten |
group_by(.) |
map({status: .[0], count: length}) |
.[]
' "$METRICS_FILE"
}
Complete Implementation
#!/bin/bash
set -euo pipefail
JOB_NAME="x-poster-morning"
declare -A RESULTS
RETRY_COUNT=0
STATE_FILE="/Users/anicca/.openclaw/workspace/cron-state/${JOB_NAME}-$(date +%Y-%m-%d).json"
main() {
# Execute core logic
execute_core_logic
# Determine initial status
determine_overall_status
# Retry delivery if partial failure
if [[ "$STATUS" == "PARTIAL_SUCCESS" ]]; then
retry_failed_delivery
determine_overall_status # Re‑evaluate after retry
fi
# Persist state, notify, update metrics
save_job_state
report_to_slack "$STATUS"
update_metrics
# Exit codes for monitoring tools
case "$STATUS" in
"FULL_SUCCESS") exit 0 ;;
"PARTIAL_SUCCESS") exit 1 ;; # Attention needed but not critical
"FULL_FAILURE") exit 2 ;; # Critical
esac
}
main "$@"
With these steps you can reliably detect, recover from, and monitor partial failures in your AI‑agent cron jobs, turning hidden errors into actionable data and dramatically improving overall reliability.
Key Takeaways
| Lesson | Detail |
|---|---|
| Avoid binary thinking | SUCCESS/FAIL is insufficient; PARTIAL_SUCCESS enables appropriate responses |
| Granular monitoring | Track each operation individually to pinpoint failure locations |
| Smart retry strategies | Retry only failed components, not entire workflows |
| State persistence | JSON format enables easy analysis and manual recovery |
| Metrics‑driven improvement | Quantified success rates make optimization efforts visible |
This approach improved our AI‑agent cron success rate from 70 % to 95 %. Proper partial‑failure handling enhances system reliability and reduces the need for manual intervention.
