Building Blue/Green Deployments with Nginx Automatic Failover
Source: Dev.to
Introduction
Blue/Green deployment lets you run two identical instances of your application (Blue and Green) and switch traffic from the active instance to the standby instantly when something goes wrong. In this guide we build a complete container-based Blue/Green environment using Nginx as the traffic director, a small Node.js service backing both pools, and, optionally, a Python watcher that reads Nginx's JSON logs and sends alerts to Slack.
Prerequisites
| Item | Why |
|---|---|
| Docker + Docker Compose | Run the services locally without Kubernetes |
| Node.js (for building the app) | Build the demo service |
| (Optional) Slack webhook URL | Receive failover alerts |
| A terminal and a text editor | Create and edit the files |
Project structure
.
├─ app/
│ ├─ package.json
│ ├─ app.js
│ └─ Dockerfile
├─ nginx/
│ └─ nginx.conf.template
├─ watcher/
│ ├─ requirements.txt
│ └─ watcher.py
├─ docker-compose.yaml
└─ .env
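If you are following along from scratch, a few commands create this layout (nginx/log is the host directory the containers later share for the JSON logs):

# Scaffold the directory layout
mkdir -p app nginx/log watcher
touch app/package.json app/app.js app/Dockerfile \
      nginx/nginx.conf.template \
      watcher/requirements.txt watcher/watcher.py \
      docker-compose.yaml .env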
1. The Node.js app
package.json
{
  "name": "blue-green-app",
  "version": "1.0.0",
  "main": "app.js",
  "license": "MIT",
  "scripts": {
    "start": "node app.js"
  },
  "dependencies": {
    "express": "^4.18.2"
  }
}
app.js
const express = require('express');

const app = express();

const APP_POOL = process.env.APP_POOL || 'unknown';
const RELEASE_ID = process.env.RELEASE_ID || 'unknown';
const PORT = process.env.PORT || 3000;

let chaosMode = false;
let chaosType = 'error'; // 'error' or 'timeout'

// Add tracing headers
app.use((req, res, next) => {
  res.setHeader('X-App-Pool', APP_POOL);
  res.setHeader('X-Release-Id', RELEASE_ID);
  next();
});

app.get('/', (req, res) => {
  res.json({
    service: 'Blue/Green Demo',
    pool: APP_POOL,
    releaseId: RELEASE_ID,
    status: chaosMode ? 'chaos' : 'healthy',
    chaosMode,
    chaosType: chaosMode ? chaosType : null,
    timestamp: new Date().toISOString(),
    endpoints: {
      version: '/version',
      health: '/healthz',
      chaos: '/chaos/start, /chaos/stop'
    }
  });
});

app.get('/healthz', (req, res) => {
  res.status(200).json({ status: 'healthy', pool: APP_POOL });
});

app.get('/version', (req, res) => {
  if (chaosMode && chaosType === 'error')
    return res.status(500).json({ error: 'Chaos: server error' });
  if (chaosMode && chaosType === 'timeout')
    return; // simulate a hang: never respond, let the proxy time out
  res.json({
    version: '1.0.0',
    pool: APP_POOL,
    releaseId: RELEASE_ID,
    timestamp: new Date().toISOString()
  });
});

app.post('/chaos/start', (req, res) => {
  const mode = req.query.mode || 'error';
  chaosMode = true;
  chaosType = mode;
  res.json({ message: 'Chaos started', mode, pool: APP_POOL });
});

app.post('/chaos/stop', (req, res) => {
  chaosMode = false;
  chaosType = 'error';
  res.json({ message: 'Chaos stopped', pool: APP_POOL });
});

app.listen(PORT, '0.0.0.0', () => {
  console.log(`App (${APP_POOL}) listening on ${PORT}`);
  console.log(`Release ID: ${RELEASE_ID}`);
});
The service exposes:
- GET /healthz – health check used by Nginx
- GET /version – returns version info; chaos mode can force errors or timeouts
- POST /chaos/start?mode=error|timeout – enable failure simulation
- POST /chaos/stop – turn chaos off
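Before wiring up Nginx, you can sanity-check the endpoints by running the service directly (a quick local test, assuming Node.js 18+ on your machine):

# Install dependencies and run the app outside Docker
(cd app && npm install)
APP_POOL=blue RELEASE_ID=local-test PORT=3000 node app/app.js &
curl -s http://localhost:3000/version                        # healthy JSON
curl -s -X POST "http://localhost:3000/chaos/start?mode=error"
curl -i http://localhost:3000/version                        # HTTP 500 now
curl -s -X POST http://localhost:3000/chaos/stop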
2. One Docker image for both pools
Dockerfile
FROM node:18-alpine
WORKDIR /app
# Install production dependencies
COPY package*.json ./
RUN npm install --only=production
# Copy source code
COPY . .
EXPOSE 3000
CMD ["npm", "start"]
Both the Blue and Green containers are built from this image; they differ only in their environment variables (APP_POOL, RELEASE_ID, and so on).
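Because the pools differ only by environment, you can verify this with two throwaway containers (names and host ports here are illustrative, not from the article):

# Build once, run the same image as two pools
docker build -t blue-green-app ./app
docker run -d --name demo-blue  -e APP_POOL=blue  -p 3001:3000 blue-green-app
docker run -d --name demo-green -e APP_POOL=green -p 3002:3000 blue-green-app
curl -s http://localhost:3001/healthz   # {"status":"healthy","pool":"blue"}
curl -s http://localhost:3002/healthz   # {"status":"healthy","pool":"green"}
docker rm -f demo-blue demo-green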
3. Nginx as the traffic director
nginx/nginx.conf.template
events {
    worker_connections 1024;
}

http {
    # Structured JSON access logs
    log_format custom_json '{"time":"$time_iso8601"'
        ',"remote_addr":"$remote_addr"'
        ',"method":"$request_method"'
        ',"uri":"$request_uri"'
        ',"status":$status'
        ',"bytes_sent":$bytes_sent'
        ',"request_time":$request_time'
        ',"upstream_response_time":"$upstream_response_time"'
        ',"upstream_status":"$upstream_status"'
        ',"upstream_addr":"$upstream_addr"'
        ',"pool":"$sent_http_x_app_pool"'
        ',"release":"$sent_http_x_release_id"}';

    upstream blue_pool {
        server app-blue:3000 max_fails=1 fail_timeout=3s;
        server app-green:3000 backup;
    }

    upstream green_pool {
        server app-green:3000 max_fails=1 fail_timeout=3s;
        server app-blue:3000 backup;
    }

    server {
        listen 80;
        server_name localhost;

        # JSON access log (shared volume)
        access_log /var/log/nginx/access.json custom_json;

        # Simple health endpoint for the load balancer itself
        location /healthz {
            access_log off;
            return 200 "healthy\n";
            add_header Content-Type text/plain;
        }

        location / {
            # $UPSTREAM_POOL is substituted by envsubst at container start,
            # using the value Docker Compose passes in
            proxy_pass http://$UPSTREAM_POOL;

            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            # Fast timeouts → quick failover
            proxy_connect_timeout 1s;
            proxy_send_timeout 3s;
            proxy_read_timeout 3s;

            # Retry on errors / timeouts, try backup upstream
            proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
            proxy_next_upstream_tries 2;
            proxy_next_upstream_timeout 10s;

            proxy_pass_request_headers on;
            proxy_hide_header X-Powered-By;
        }
    }
}
Key settings
- max_fails=1 fail_timeout=3s – a single failure marks the upstream as down for a short period.
- Short proxy_*_timeout values keep clients from waiting long when the primary pool misbehaves.
- proxy_next_upstream plus the retry settings automatically route the request to the backup pool.
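To watch these settings in action, tail the structured access log and pull out the fields that matter (assumes jq is installed on the host; the path matches the bind mount defined below):

# Status, serving pool, and upstream address for each request
tail -f nginx/log/access.json | jq -r '[.status, .pool, .upstream_addr] | @tsv'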
4. Optional Slack watcher
watcher/requirements.txt
requests==2.32.3
watcher/watcher.py
import json
import os
import time
from collections import deque
from datetime import datetime, timezone

import requests

LOG_PATH = os.getenv("NGINX_LOG_FILE", "/var/log/nginx/access.json")
SLACK_WEBHOOK_URL = os.getenv("SLACK_WEBHOOK_URL", "")
SLACK_PREFIX = os.getenv("SLACK_PREFIX", "from: @Watcher")
ACTIVE_POOL = os.getenv("ACTIVE_POOL", "blue")
ERROR_RATE_THRESHOLD = float(os.getenv("ERROR_RATE_THRESHOLD", "2"))
WINDOW_SIZE = int(os.getenv("WINDOW_SIZE", "200"))
ALERT_COOLDOWN_SEC = int(os.getenv("ALERT_COOLDOWN_SEC", "300"))
MAINTENANCE_MODE = os.getenv("MAINTENANCE_MODE", "false").lower() == "true"


def now_iso():
    return datetime.now(timezone.utc).isoformat()


def post_to_slack(message):
    if not SLACK_WEBHOOK_URL:
        return
    payload = {"text": f"{SLACK_PREFIX} {message}"}
    try:
        requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=5)
    except Exception as e:
        print(f"Slack post failed: {e}")


def parse_log_line(line):
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None


def main():
    recent = deque(maxlen=WINDOW_SIZE)
    last_alert = 0
    while True:
        try:
            with open(LOG_PATH, "r") as f:
                # Seek to the end and follow new lines as they are written
                f.seek(0, os.SEEK_END)
                while True:
                    line = f.readline()
                    if not line:
                        time.sleep(0.5)
                        continue
                    entry = parse_log_line(line.strip())
                    if not entry:
                        continue
                    recent.append(entry)
                    if MAINTENANCE_MODE:
                        continue  # suppress alerts during planned work
                    now = time.time()
                    # Detect failover: pool header differs from ACTIVE_POOL
                    if entry.get("pool") and entry["pool"] != ACTIVE_POOL:
                        if now - last_alert > ALERT_COOLDOWN_SEC:
                            msg = (f"Failover detected! Traffic switched "
                                   f"from {ACTIVE_POOL} to {entry['pool']}")
                            post_to_slack(msg)
                            print(now_iso(), msg)
                            last_alert = now
                    # Alert when the 5xx rate over the window crosses the threshold
                    errors = sum(1 for e in recent if int(e.get("status", 0)) >= 500)
                    error_rate = 100.0 * errors / len(recent)
                    if error_rate > ERROR_RATE_THRESHOLD and now - last_alert > ALERT_COOLDOWN_SEC:
                        msg = f"Error rate {error_rate:.1f}% over last {len(recent)} requests"
                        post_to_slack(msg)
                        print(now_iso(), msg)
                        last_alert = now
        except FileNotFoundError:
            time.sleep(1)
        except Exception as e:
            print(f"Watcher error: {e}")
            time.sleep(2)


if __name__ == "__main__":
    main()
The watcher tails the JSON log in real time, keeps a simple sliding error-rate window, and posts a Slack alert when traffic shifts to the non-active pool or the error rate crosses the threshold.
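If you use the Slack integration, it is worth confirming the webhook before relying on the watcher; this is the standard incoming-webhook payload:

# Send a test message to the configured webhook
curl -X POST -H 'Content-type: application/json' \
  --data '{"text": "Watcher test message"}' \
  "$SLACK_WEBHOOK_URL"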
5. Environment variables (.env)
# Choose which pool is primary (blue or green)
ACTIVE_POOL=blue
# Labels for the two app containers
APP_BLUE_POOL=blue
APP_GREEN_POOL=green
# Release identifiers (optional, useful for tracing)
RELEASE_ID_BLUE=2025-12-09-blue
RELEASE_ID_GREEN=2025-12-09-green
# Nginx upstream selector – will be substituted in the template
UPSTREAM_POOL=${ACTIVE_POOL}_pool
# Watcher settings (adjust as needed)
ERROR_RATE_THRESHOLD=2
WINDOW_SIZE=200
ALERT_COOLDOWN_SEC=300
# Slack webhook (leave empty to disable alerts)
SLACK_WEBHOOK_URL=
With ACTIVE_POOL=blue, UPSTREAM_POOL resolves to blue_pool in the Nginx template, making the Blue service the primary upstream.
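You can confirm the substitution without starting anything, since docker compose config renders the fully resolved configuration:

# UPSTREAM_POOL should appear as blue_pool (or green_pool)
docker compose --env-file .env config | grep UPSTREAM_POOL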
6. The Docker Compose file
version: "3.9"

services:
  app-blue:
    build: ./app
    environment:
      - APP_POOL=${APP_BLUE_POOL}
      - RELEASE_ID=${RELEASE_ID_BLUE}
    ports: []  # not exposed directly
    healthcheck:
      # node:18-alpine ships busybox wget but not curl
      test: ["CMD", "wget", "-qO-", "http://localhost:3000/healthz"]
      interval: 5s
      timeout: 2s
      retries: 2

  app-green:
    build: ./app
    environment:
      - APP_POOL=${APP_GREEN_POOL}
      - RELEASE_ID=${RELEASE_ID_GREEN}
    ports: []  # not exposed directly
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://localhost:3000/healthz"]
      interval: 5s
      timeout: 2s
      retries: 2

  nginx:
    image: nginx:1.25-alpine
    depends_on:
      - app-blue
      - app-green
    ports:
      - "8080:80"
    volumes:
      - ./nginx/nginx.conf.template:/etc/nginx/nginx.conf.template:ro
      - ./nginx/log:/var/log/nginx
    environment:
      - ACTIVE_POOL=${ACTIVE_POOL}
      - UPSTREAM_POOL=${UPSTREAM_POOL}
    # $$ escapes the dollar sign so Compose passes a literal $UPSTREAM_POOL to envsubst
    command: /bin/sh -c "envsubst '$$UPSTREAM_POOL' < /etc/nginx/nginx.conf.template > /etc/nginx/nginx.conf && nginx -g 'daemon off;'"

  watcher:
    build:
      context: ./watcher
    depends_on:
      - nginx
    volumes:
      - ./nginx/log:/var/log/nginx
    environment:
      - ACTIVE_POOL=${ACTIVE_POOL}
      - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL}
      - ERROR_RATE_THRESHOLD=${ERROR_RATE_THRESHOLD}
      - WINDOW_SIZE=${WINDOW_SIZE}
      - ALERT_COOLDOWN_SEC=${ALERT_COOLDOWN_SEC}
    # Remove this service if you don't need Slack alerts
Before starting Nginx, the service runs envsubst to substitute $UPSTREAM_POOL in the template and writes the result to /etc/nginx/nginx.conf.
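One gap to be aware of: the compose file builds the watcher from ./watcher, but the project tree above lists no Dockerfile there. A minimal sketch that would satisfy the build (assumed, not part of the original article):

# Hypothetical watcher/Dockerfile; adjust the Python version to taste
cat > watcher/Dockerfile <<'EOF'
FROM python:3.12-alpine
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY watcher.py .
CMD ["python", "watcher.py"]
EOF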
7. Running the demo
# Start everything
docker compose --env-file .env up -d
# Verify Nginx health endpoint
curl http://localhost:8080/healthz
# → should return "healthy"
# Call the application through the load balancer
curl http://localhost:8080/
You should see a JSON response, and the X-App-Pool response header will tell you which pool served it (Blue by default).
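To inspect the tracing headers directly, print only the response headers:

# X-App-Pool and X-Release-Id identify the serving pool and release
curl -s -D - -o /dev/null http://localhost:8080/version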
Simulating a failure
# Put the Blue app into chaos mode (force 500 errors)
curl -X POST "http://localhost:8080/chaos/start?mode=error"
# Or simulate a timeout
curl -X POST "http://localhost:8080/chaos/start?mode=timeout"
Once the failure is active, subsequent requests to http://localhost:8080/ are served by the Green pool automatically, courtesy of Nginx's proxy_next_upstream logic. If the watcher is enabled, it posts a failover alert to Slack.
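A short loop makes the switch easy to see (requires jq; each response reports which pool served it):

# After chaos starts on Blue, every line should read "green"
for i in 1 2 3 4 5; do curl -s http://localhost:8080/ | jq -r .pool; done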
To stop the chaos:
curl -X POST "http://localhost:8080/chaos/stop"
8. Switching the active pool
To make Green the primary pool without triggering a failover, just edit .env:
ACTIVE_POOL=green
Then restart Nginx (or the whole stack) so the template is re-rendered:
docker compose up -d --no-deps --build nginx
From then on, new traffic prefers Green while Blue stays on standby.
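Verify the switch the same way as during failover, this time via the headers:

# Should now report X-App-Pool: green
curl -s -D - -o /dev/null http://localhost:8080/ | grep -i x-app-pool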
9. Cleanup
docker compose down -v
The -v flag removes named and anonymous volumes. Note that the Nginx logs live in the bind-mounted nginx/log directory, which down -v does not touch; delete that folder manually if you want a clean slate.
10. What you learned
- The Blue/Green pattern with plain Docker Compose, no Kubernetes required.
- How max_fails, fail_timeout, and proxy_next_upstream in the Nginx upstream config deliver near-instant failover.
- Structured JSON access logs that expose the upstream pool through custom headers.
- A simple chaos API for testing resilience.
- An optional watcher that tails the logs and alerts Slack on failover or elevated error rates.