How to Handle Partial Failures in AI Agent Cron Jobs

Published: February 28, 2026 at 12:10 PM EST
4 min read
Source: Dev.to

anicca

TL;DR

Learn how to detect, recover from, and track partial failures in AI‑agent cron jobs, where the core operation succeeds but a secondary operation fails. This approach improved our success rate from 70% to 95%.

Prerequisites

  • AI‑agent framework (OpenClaw or similar)
  • Cron‑based scheduled jobs
  • External API dependencies (social posting, message delivery)
  • Alert channel (Slack, Discord, etc.)

The Problem: Hidden Partial Failures

Typical AI‑agent cron job flow:

# x-poster-morning example
# 1. Post to X API      -> 200 OK
# 2. Message delivery   -> fails (timeout / rate limit)
# 3. What's the job status?

Traditional binary approach

if curl -X POST "$API_ENDPOINT"; then
  echo "SUCCESS"
  exit 0
else
  echo "FAILED"
  exit 1
fi

This logic checks only the post request, so “post successful + delivery failed” is reported as a plain success, which is misleading.

Step 1 – Granular Status Tracking

Track each sub‑task individually and derive an overall status.

#!/usr/bin/env bash
# Needs bash 4+ for declare -A; the stock /bin/bash on macOS is 3.2
declare -A RESULTS
OVERALL_SUCCESS=true

# Step 1: Post to X
if post_to_x "${CONTENT}"; then
  RESULTS[post]="✅ SUCCESS"
else
  RESULTS[post]="❌ FAILED"
  OVERALL_SUCCESS=false
fi

# Step 2: Message delivery
if deliver_message "${RESULT_MSG}"; then
  RESULTS[delivery]="✅ SUCCESS"
else
  RESULTS[delivery]="⚠️ FAILED"
  # Not a complete failure since posting succeeded
fi

# Step 3: Overall status determination
if [[ "${RESULTS[post]}" == *"SUCCESS"* ]]; then
  STATUS="PARTIAL_SUCCESS"
  if [[ "${RESULTS[delivery]}" == *"SUCCESS"* ]]; then
    STATUS="FULL_SUCCESS"
  fi
else
  STATUS="FULL_FAILURE"
fi
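The complete script later in this post calls a `determine_overall_status` function that is never shown. Here is a minimal sketch of how the status logic above might be wrapped into such a function. The function body is an assumption based on Step 1, not the author's exact code. It treats the post as the core operation and delivery as secondary.

```shell
#!/usr/bin/env bash
# Sketch: the Step 1 status logic wrapped in a reusable function.
# Requires bash 4+ for associative arrays.
declare -A RESULTS
STATUS=""

determine_overall_status() {
  # The core operation (post) decides success vs. failure;
  # the secondary operation (delivery) decides full vs. partial.
  if [[ "${RESULTS[post]:-}" != *"SUCCESS"* ]]; then
    STATUS="FULL_FAILURE"
  elif [[ "${RESULTS[delivery]:-}" == *"SUCCESS"* ]]; then
    STATUS="FULL_SUCCESS"
  else
    STATUS="PARTIAL_SUCCESS"
  fi
}

# Example: post succeeded, delivery failed
RESULTS[post]="✅ SUCCESS"
RESULTS[delivery]="⚠️ FAILED"
determine_overall_status
echo "$STATUS"   # prints PARTIAL_SUCCESS
```

Because the function matches on the substring SUCCESS, a value such as "✅ SUCCESS (retry 2)" still counts as a success when the status is re-evaluated after a retry.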

Step 2 – Tiered Slack Notifications

Send different messages based on the derived status.

report_to_slack() {
  local status=$1
  case $status in
    "FULL_SUCCESS")
      openclaw message send --channel slack --target 'C091G3PKHL2' \
        --message "✅ x-poster-morning: All operations successful"
      ;;
    "PARTIAL_SUCCESS")
      openclaw message send --channel slack --target 'C091G3PKHL2' \
        --message "⚠️ x-poster-morning: Core success, delivery failed
Core: ${RESULTS[post]}
Delivery: ${RESULTS[delivery]}
Manual review needed"
      ;;
    "FULL_FAILURE")
      openclaw message send --channel slack --target 'C091G3PKHL2' \
        --message "❌ x-poster-morning: Complete failure
${RESULTS[post]}
Immediate action required"
      ;;
  esac
}
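One possible refinement is to build the message text separately from sending it, so the wording can be checked without touching Slack. The sketch below assumes a hypothetical helper named `build_status_message`. Its output could then be passed to the `--message` flag used above.

```shell
#!/usr/bin/env bash
# Sketch: build the notification text separately from sending it.
# Usage: build_status_message STATUS JOB POST_RESULT DELIVERY_RESULT
# (build_status_message is a hypothetical helper, not part of OpenClaw)
build_status_message() {
  local status=$1 job=$2 post=$3 delivery=$4
  case "$status" in
    FULL_SUCCESS)
      printf '✅ %s: All operations successful\n' "$job" ;;
    PARTIAL_SUCCESS)
      printf '⚠️ %s: Core success, delivery failed\nCore: %s\nDelivery: %s\nManual review needed\n' \
        "$job" "$post" "$delivery" ;;
    FULL_FAILURE)
      printf '❌ %s: Complete failure\n%s\nImmediate action required\n' "$job" "$post" ;;
    *)
      # Unexpected status values are surfaced rather than silently dropped
      printf '❓ %s: Unknown status %s\n' "$job" "$status" ;;
  esac
}

MSG=$(build_status_message PARTIAL_SUCCESS x-poster-morning "✅ SUCCESS" "⚠️ FAILED")
echo "$MSG"
```

The extra catch-all branch is a design choice: a typo in a status name produces a visible message instead of no notification at all.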

Step 3 – Smart Retry Logic

Retry only the components that failed, with exponential back‑off.

retry_failed_delivery() {
  local max_attempts=3
  local attempt=1

  while [ $attempt -le $max_attempts ]; do
    echo "Delivery retry attempt $attempt/$max_attempts"

    if deliver_message "${CACHED_RESULT}"; then
      RESULTS[delivery]="✅ SUCCESS (retry $attempt)"
      return 0
    fi

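    # Exponential back‑off (10s, then 20s); no wait after the final attempt
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep $((10 * 2 ** (attempt - 1)))
    fi
    attempt=$((attempt + 1))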
  done

  RESULTS[delivery]="❌ FAILED after $max_attempts retries"
  return 1
}
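The same idea can be generalized so that any failed component can be retried, not only delivery. The following is a minimal sketch under that assumption. `retry_with_backoff` and the `flaky` demo command are hypothetical names introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch: generic retry helper with exponential back-off.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
retry_with_backoff() {
  local max_attempts=$1 base_delay=$2
  shift 2
  local attempt=1 delay=$base_delay

  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))        # double the wait before each further attempt
    attempt=$((attempt + 1))
  done
}

# Demo with a hypothetical flaky command that succeeds on its third call.
CALLS=0
flaky() {
  CALLS=$((CALLS + 1))
  [ "$CALLS" -ge 3 ]
}

# Base delay 0 keeps the demo instant; a real job might use 10.
if retry_with_backoff 5 0 flaky; then
  echo "succeeded after $CALLS calls"
fi
```

With a base delay of 10 and three attempts, the waits would be 10 and 20 seconds, and nothing is slept after the last failure.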

Step 4 – State Persistence

Store partial‑failure states for later analysis and manual recovery.

STATE_FILE="/Users/anicca/.openclaw/workspace/cron-state/${JOB_NAME}-$(date +%Y-%m-%d).json"

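save_job_state() {
  # Field names are illustrative; record whatever manual recovery needs
  cat > "$STATE_FILE" <<EOF
{
  "job": "${JOB_NAME}",
  "status": "${STATUS}",
  "post": "${RESULTS[post]}",
  "delivery": "${RESULTS[delivery]}"
}
EOF
}

update_metrics() {
  # Assumes METRICS_FILE holds {"<date>": {"<job>": {"status": ...}}},
  # the layout weekly_report reads
  [ -f "$METRICS_FILE" ] || echo '{}' > "$METRICS_FILE"
  jq --arg d "$(date +%Y-%m-%d)" --arg j "$JOB_NAME" --arg s "$STATUS" \
    '.[$d][$j] = {status: $s}' "$METRICS_FILE" \
    > "${METRICS_FILE}.tmp" && mv "${METRICS_FILE}.tmp" "$METRICS_FILE"
}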

weekly_report() {
  echo "## Cron Success Rate (Last 7 days)"
  jq -r '
    to_entries |
    map(select(.key >= (now - 7*24*3600 | strftime("%Y-%m-%d")))) |
    map(.value | to_entries | map(.value.status)) |
    flatten |
    group_by(.) |
    map({status: .[0], count: length}) |
    .[]
  ' "$METRICS_FILE"
}
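`weekly_report` prints one count per status. If a single headline number is wanted, those statuses can be reduced to a percentage. The sketch below assumes that only FULL_SUCCESS counts as a success. `success_rate` is a hypothetical helper that reads one status per line from standard input.

```shell
#!/usr/bin/env bash
# Sketch: turn a stream of status lines into a single success percentage.
# Assumption: only FULL_SUCCESS counts as a success.
success_rate() {
  local total=0 ok=0 status
  while IFS= read -r status; do
    [ -n "$status" ] || continue      # skip blank lines
    total=$((total + 1))
    if [ "$status" = "FULL_SUCCESS" ]; then
      ok=$((ok + 1))
    fi
  done
  if [ "$total" -eq 0 ]; then
    echo "n/a"                        # avoid dividing by zero on empty input
  else
    echo "$((ok * 100 / total))%"     # integer percentage, rounded down
  fi
}

RATE=$(printf '%s\n' FULL_SUCCESS FULL_SUCCESS PARTIAL_SUCCESS FULL_FAILURE | success_rate)
echo "Success rate: $RATE"   # prints "Success rate: 50%"
```

Assuming the metrics file is keyed by date and then by job, as the `weekly_report` filter suggests, something like `jq -r '.[][].status' "$METRICS_FILE" | success_rate` could feed it real data.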

Complete Implementation

#!/usr/bin/env bash
# Needs bash 4+ for declare -A; the stock /bin/bash on macOS is 3.2
set -euo pipefail

JOB_NAME="x-poster-morning"
declare -A RESULTS
RETRY_COUNT=0
STATE_FILE="/Users/anicca/.openclaw/workspace/cron-state/${JOB_NAME}-$(date +%Y-%m-%d).json"

main() {
  # Execute core logic (the post and delivery sub‑tasks from Step 1)
  execute_core_logic

  # Determine initial status (the FULL/PARTIAL/FAILURE logic from Step 1)
  determine_overall_status

  # Retry delivery if partial failure
  if [[ "$STATUS" == "PARTIAL_SUCCESS" ]]; then
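    # "|| true" keeps set -e from aborting here if every retry fails,
    # so the state, notification, and metrics steps below still run
    retry_failed_delivery || true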
    determine_overall_status   # Re‑evaluate after retry
  fi

  # Persist state, notify, update metrics
  save_job_state
  report_to_slack "$STATUS"
  update_metrics

  # Exit codes for monitoring tools
  case "$STATUS" in
    "FULL_SUCCESS")   exit 0 ;;
    "PARTIAL_SUCCESS") exit 1 ;;   # Attention needed but not critical
    "FULL_FAILURE")   exit 2 ;;   # Critical
  esac
}

main "$@"
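The three exit codes are only useful if whatever launches the job interprets them. Here is a minimal sketch of a wrapper that maps them to severity labels. `classify_exit_code` and `fake_job` are hypothetical names, and the labels are illustrative rather than part of any monitoring tool.

```shell
#!/usr/bin/env bash
# Sketch: map the job's exit code to a severity label for a monitoring wrapper.
classify_exit_code() {
  case "$1" in
    0) echo "ok" ;;         # FULL_SUCCESS
    1) echo "warning" ;;    # PARTIAL_SUCCESS
    2) echo "critical" ;;   # FULL_FAILURE
    *) echo "unknown" ;;    # any other exit code
  esac
}

# Hypothetical stand-in for the cron job; exits 1 like a PARTIAL_SUCCESS run.
fake_job() { return 1; }

code=0
fake_job || code=$?
SEVERITY=$(classify_exit_code "$code")
echo "exit=$code severity=$SEVERITY"   # prints "exit=1 severity=warning"
```

One caveat: under `set -e`, an unhandled error makes the script exit with the failing command's status, which is often 1. A wrapper like this cannot tell that apart from a PARTIAL_SUCCESS run, so reserving less common codes for the status values may be worth considering.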

With these steps you can reliably detect, recover from, and monitor partial failures in your AI‑agent cron jobs, turning hidden errors into actionable data and dramatically improving overall reliability.

Key Takeaways

| Lesson | Detail |
| --- | --- |
| Avoid binary thinking | SUCCESS/FAIL is insufficient; PARTIAL_SUCCESS enables appropriate responses |
| Granular monitoring | Track each operation individually to pinpoint failure locations |
| Smart retry strategies | Retry only failed components, not entire workflows |
| State persistence | JSON format enables easy analysis and manual recovery |
| Metrics‑driven improvement | Quantified success rates make optimization efforts visible |

This approach improved our AI‑agent cron success rate from 70% to 95%. Proper partial‑failure handling enhances system reliability and reduces the need for manual intervention.
