如何构建在网关故障下仍能存活的弹性分布式 AI Agent 系统

发布: 3天前 (2026年3月1日 GMT+8 01:09)

4 分钟阅读

Source: Dev.to

TL;DR

实现了一种分布式 AI‑agent 设计，即使网关出现 WebSocket 或网络故障，技能仍能持续运行。会话管理与执行基础设施解耦，确保自动（基于 cron 的）技能实现 100 % 的正常运行时间。

架构概览

OpenClaw 由三个独立组件组成：

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│  Session Layer  │    │  Gateway Core    │    │  Skill Runtime  │
│  (WebSocket)    │◄──►│  (HTTP/REST)     │◄──►│  (File/Process)│
└─────────────────┘    └──────────────────┘    └─────────────────┘
      ↕ ERROR                  ↕ OK                 ↕ OK
  History/State            API Calls         Skill Execution/FileIO

Session Layer – 处理 WebSocket 连接并维护会话历史。
Gateway Core – 提供用于技能编排的 HTTP/REST API。
Skill Runtime – 将技能作为文件式进程执行，独立于会话层。

关键洞察： WebSocket 故障仅影响会话层；核心层和运行时仍可继续工作。

代码示例

❌ 错误示例：所有逻辑依赖于网关

// runSkill blocks if the gateway is unavailable
async function runSkill() {
  const session = await gateway.getSession(); // Failure stops skill
  const result = await executeSkill(session);
  await gateway.updateStatus(result);
}

✅ 正确示例：执行独立于会话

// Skill execution proceeds even if session updates fail
async function runSkillIndependent() {
  // Skill execution is independent (file‑based)
  const result = await executeSkillFromFile();

  // Session updates are best‑effort
  try {
    await gateway.updateStatus(result);
  } catch (error) {
    console.log('Session update failed, but skill succeeded');
  }
}

持久化技能状态（Bash）

# Store status of the daily‑memory skill
echo "status=success,timestamp=$(date)" > ~/.openclaw/skills/status/daily-memory.txt

# Store status of the TikTok poster skill
echo "posts=4,account=en,last_run=$(date)" > ~/.openclaw/skills/status/tiktok-poster.txt

健康检查脚本（Bash）

#!/bin/bash
GATEWAY_STATUS=$(curl -s http://localhost:3019/health || echo "FAIL")
SKILL_STATUS=$(find ~/.openclaw/skills/status -name "*.txt" -mmin -60 | wc -l)

if [[ "$GATEWAY_STATUS" == "FAIL" && "$SKILL_STATUS" -gt 0 ]]; then
  echo "⚠️ Gateway down but skills running - Graceful degradation mode"
else
  echo "✅ All systems operational"
fi

Failure Scenarios

场景	会话层	网关核心	技能运行时	结果
WebSocket 断开	❌ Failed	✅ OK	✅ OK	技能继续
网络分区	❌ Failed	❌ Failed	✅ OK	仅本地技能运行
进程重启	❌ Paused	❌ Paused	✅ OK	Cron 自动恢复

指标（在 Mac Mini / VPS 上观察）

正常运行时间： 100 %（无技能停机）
网关连接问题： 3 次
技能执行成功率： 100 %（包括失败期间）
自动恢复时间： 平均 30 秒

经验教训

关注点分离

将会话管理与业务逻辑分离，以防止级联故障。

基于文件的状态

将状态持久化到文件系统，避免网络依赖并实现快速恢复。

优雅降级

与其完全停机，不如提供受限功能，以保持更好的用户体验。

分布式监控

在每一层部署独立的健康检查，以避免单点故障。

在分布式系统中，“故障时的部分功能”胜过“全有或全无”。OpenClaw 的设计表明，弹性模式对任何 AI 代理架构都是实用且有效的。