Solved: Pacemaker/DRBD: Auto-failback kills active DRBD Sync Primary to Secondary. How to prevent this?
Source: Dev.to
Executive Summary
TL;DR: Pacemaker's default auto-failback behavior can disrupt an active DRBD primary by attempting premature promotion on a recovering node, leading to service outages and potential data risks. This issue can be prevented by:
- Configuring high resource stickiness (e.g., 10000 or INFINITY) on the DRBD promotable (master/slave) clone resource, so the resource stays where it is.
- Implementing manual failback (standby mode or location constraints).
- Setting up graceful, delayed promotion with robust STONITH, an increased `cluster-delay`, and generous `promoted-stop-timeout` values.
Why This Happens
Pacemaker/DRBD clusters provide high availability, but the default behavior often tries to "fail back" resources to their preferred node as soon as that node recovers. In a DRBD setup this can be disastrous:
| Symptom | Description |
|---|---|
| Service outages | Applications on the active DRBD primary stop or become unresponsive. |
| DRBD status changes | The primary flips to Secondary, Unknown, or shows a conflict state (e.g., WFConnection, StandAlone). |
| Pacemaker log entries | Logs show promotion attempts on the recovering node and demotion or fencing actions on the current primary. Look for drbd_promote, drbd_demote, or conflict messages. |
Example Pacemaker Log Snippet
Sep 20 10:35:01 node-a pacemakerd[12345]: info: Status: Requesting promote of drbd_res on node-a
Sep 20 10:35:01 node-a pacemakerd[12345]: crit: Result: promote_drbd_res_on_node-a: CIB_R_ERR_OP_FAILED
Sep 20 10:35:01 node-b pacemakerd[12345]: info: Status: Requesting demote of drbd_res on node-b
Sep 20 10:35:01 node-b pacemakerd[12345]: info: drbd_demote: stdout [drbd_demote: Attempting to demote resource 'r0']
Sep 20 10:35:02 node-b pacemakerd[12345]: warn: drbd_demote: stderr [drbd_demote: Cannot demote 'r0', it is still in use.]
Sep 20 10:35:02 node-b pacemakerd[12345]: crit: Result: demote_drbd_res_on_node-b: CIB_R_ERR_OP_FAILED
Running drbd-overview during the issue will show unexpected roles or connections.
Example drbd-overview Output During Conflict
0:r0 Connected Primary/Primary UpToDate/UpToDate
[WARNING: This indicates split-brain in a two-node cluster, which Pacemaker should prevent]
[More likely, you'll see a quick flip or errors.]
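When triaging an event like this, it helps to reduce the logs to just the promotion, demotion, and fencing lines. A minimal sketch (the `journalctl` unit name in the usage comment is an assumption; adjust for your distribution):

```shell
#!/bin/sh
# Filter a Pacemaker log stream down to the events that matter here.
pcmk_events() {
  grep -E 'promote|demote|fence|WFConnection|StandAlone'
}

# Example usage on a live node (unit name is an assumption):
#   journalctl -u pacemaker --since '-15 min' | pcmk_events | tail -n 50
```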
Core Problem
- Resource locality preference: Pacemaker prefers to keep resources on their "preferred" nodes. When a node recovers, Pacemaker treats it as a suitable candidate again.
- DRBD primary requirement: in a two-node synchronous (Protocol C) DRBD setup, only one node may be Primary at a time.
- Premature promotion attempt: Pacemaker may try to promote the DRBD resource on the recovering node immediately, before the current primary can be safely demoted.
- Resulting conflict: this can lead to:
  - Promotion failure (if the resource agent detects another primary).
  - A race condition where both nodes briefly think they should be primary.
  - DRBD's internal conflict resolution (automatic demotion or fencing), which can cause I/O disruption and application failure.
Solutions
1. High Resource Stickiness
# Example: set a very high stickiness on the DRBD clone so it stays put.
# Note: stickiness must be POSITIVE to keep the resource where it is;
# set it on the clone itself (pcs resource defaults would apply it to
# every resource in the cluster). Clone id drbd_res-clone is an example.
pcs resource meta drbd_res-clone resource-stickiness=10000
- Guarantees that Pacemaker won't automatically fail back the DRBD resource.
- The resource stays on the current primary until an administrator moves it manually.
2. Manual Failback Strategies
| Method | How It Works |
|---|---|
| Standby mode | Put the recovering node into standby (`pcs node standby <node>`). Pacemaker will not schedule any resources on it until you clear standby. |
| Location constraints with negative scores | Create a location rule that gives the Promoted role a large negative score on the recovering node (example below). |
| Explicit move | Use `pcs resource move drbd_res <node>` when you're ready to promote. |

Example constraints:

```bash
# Prefer node-a for the resource
pcs constraint location drbd_res prefers node-a=INFINITY
# Penalise promotion on node-b (role-specific scores need a rule)
pcs constraint location drbd_res rule role=Promoted score=-1000 "#uname" eq node-b
```
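Putting these methods together, a manual fail-back might look like this end to end (a sketch; the node and resource names are examples, and `pcs resource clear` assumes pcs 0.10 or later):

```shell
#!/bin/sh
# Hypothetical manual fail-back workflow.
set -eu

NODE=node-a      # the recovered node we eventually want as primary
RES=drbd_res     # the DRBD resource managed by Pacemaker

pcs node standby "$NODE"          # keep Pacemaker off the node while DRBD resyncs
# ... verify DRBD reports UpToDate on $NODE before continuing ...
pcs node unstandby "$NODE"        # let it rejoin as Secondary
pcs resource move "$RES" "$NODE"  # promote there when you are ready
pcs resource clear "$RES"         # remove the temporary move constraint afterwards
```

The `clear` step matters: `pcs resource move` works by injecting a location constraint, which will pin the resource until you remove it.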
3. Graceful & Delayed Promotion
1. **Robust STONITH**: ensure fencing works reliably; otherwise Pacemaker may skip safe demotion.
2. **Increase `cluster-delay`**: gives the cluster more time to propagate state changes before acting.
```bash
pcs property set cluster-delay=30s
```
3. **Set a generous `promoted-stop-timeout`**: allows the old primary to finish demotion cleanly (`pcs resource update drbd_res meta promoted-stop-timeout=120s`).
4. **Optional `migration-threshold`**: prevents rapid back-and-forth moves (`pcs resource defaults migration-threshold=3`).
Checklist for a Safe DRBD-Pacemaker Cluster
- STONITH configured and tested on all nodes.
- High stickiness (or equivalent location constraints) applied to DRBD resources.
- `cluster-delay` and `promoted-stop-timeout` values tuned for your workload.
- Monitoring (e.g., `pcs status`, `drbd-overview`, Pacemaker logs) set up to alert on promotion/demotion events.
- Documentation of manual failback procedures for operators.
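For the monitoring item, even a small cron-driven check can catch unexpected role flips. A sketch (the resource name `r0`, the state-file path, and the alert delivery in the usage comment are assumptions):

```shell
#!/bin/sh
# Report when the DRBD role seen now differs from the role seen last run.
role_changed() {
  prev=$1; cur=$2
  if [ "$prev" != "$cur" ]; then
    echo "DRBD role changed: $prev -> $cur"
  fi
}

# Example cron usage on a live node:
#   cur=$(drbdadm role r0)
#   prev=$(cat /var/tmp/r0.role 2>/dev/null || echo unknown)
#   role_changed "$prev" "$cur" | mail -s "DRBD alert on $(uname -n)" ops@example.com
#   printf '%s\n' "$cur" > /var/tmp/r0.role
```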
TL;DR (Restated)
- Set a high `resource-stickiness` (or use location constraints) to stop automatic failback.
- When you need to move the primary, do it manually (standby, `pcs resource move`, or explicit location rules).
- If you prefer automatic failback, make sure STONITH works, increase `cluster-delay`, and raise `promoted-stop-timeout` so the old primary can demote cleanly before the new one promotes.
By following these steps you can avoid "kill" scenarios, keep DRBD resources stable, and maintain true high availability for your services.
Preventing "kill" Scenarios When a DRBD Node Recovers
When a node that was previously Primary comes back online while another node is already acting as Primary, Pacemaker may try to promote the recovering node again. If the promotion cannot be performed cleanly, the active node is forced out of its role, resulting in an ungraceful shutdown of services (the so-called "kill").
Typical Causes
- Lack of graceful demotion: Pacemaker does not have enough time or a clear mandate to demote the current Primary before the recovering node asserts itself.
- Insufficient or slow fencing (STONITH): the cluster cannot reliably isolate the failing node.
Below are three proven ways to avoid this situation while keeping the cluster highly available.
1️⃣ Keep the Primary "Sticky" (High Resource-Stickiness)
Idea: tell Pacemaker never to move the DRBD resource back to a node that has just recovered, unless an administrator does it manually.
How it works
- A large positive `resource-stickiness` on the DRBD clone makes the current Primary "sticky": staying put outweighs any preference for the recovered node.
- Optionally add a location constraint that prefers the node that already holds the Primary.
Example configuration
# 1. Define the DRBD resource as a promotable clone (example;
#    operation timeouts and intervals are illustrative)
pcs resource create drbd_r0 ocf:linbit:drbd \
    drbd_resource=r0 \
    op monitor interval=60s role=Unpromoted \
    op monitor interval=30s role=Promoted \
    op promote timeout=90s \
    op demote timeout=90s \
    promotable promoted-max=1 promoted-node-max=1 \
    clone-max=2 clone-node-max=1 notify=true interleave=true
# 2. Add a very high stickiness so the current Primary stays put;
#    this is what disables automatic fail-back
pcs resource meta drbd_r0-clone resource-stickiness=10000
# 3. Filesystem resource that depends on drbd_r0 being Primary
pcs resource create fs_data ocf:heartbeat:Filesystem \
device="/dev/drbd/by-res/r0" directory="/mnt/data" fstype="ext4" \
op monitor interval="30s"
# 4. Ensure fs_data runs only where drbd_r0-clone is promoted
pcs constraint colocation add fs_data with promoted drbd_r0-clone INFINITY
# 5. Ensure ordering: promote DRBD first, then start the filesystem
pcs constraint order promote drbd_r0-clone then start fs_data
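Before trusting the setup, it is worth dry-running what Pacemaker would do. A quick read-only verification pass:

```shell
# Confirm the stickiness and operation timeouts landed where intended
pcs resource config drbd_r0-clone

# List all constraints with their IDs (useful later when removing one)
pcs constraint --full

# Ask Pacemaker what it would do right now without touching anything
crm_simulate --simulate --live-check
```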
| Pros | Cons |
|---|---|
| Highly predictable and reliable. | Requires a manual `pcs resource move` to fail back after recovery. |
| Prevents split-brain caused by aggressive auto-failback. | Downtime may increase if the admin is slow to intervene. |
| Simplifies troubleshooting: no resource flapping. | Resources can stay on a less-preferred node for a long time. |
2️⃣ Put the Recovering Node in Standby (Maintenance Mode)
Idea: prevent Pacemaker from starting any resources on the node that just came back, giving you time to verify its health before allowing a promotion.
Steps
# On the admin workstation (or any cluster node)
pcs node standby <node>
- The node stays in standby; Pacemaker will not schedule resources on it.
- After verification, bring it back:
pcs node unstandby <node>
- Then manually move the DRBD resource if you want a fail-back.
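Before running `pcs node unstandby`, it is worth checking that DRBD has finished resynchronising. A minimal sketch (the DRBD 9 `drbdadm status` wording is assumed; on DRBD 8.4, parse `/proc/drbd` instead):

```shell
#!/bin/sh
# Return success only when the given `drbdadm status` text looks fully synced.
drbd_synced() {
  case "$1" in
    *Inconsistent*|*Outdated*|*Diskless*) return 1 ;;
    *UpToDate*)                           return 0 ;;
    *)                                    return 1 ;;
  esac
}

# Example usage on a live node (resource name r0 is an assumption):
#   drbd_synced "$(drbdadm status r0)" && pcs node unstandby "$(uname -n)"
```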
| Pros | Cons |
|---|---|
| Gives complete administrative control over fail-back. | Requires continuous monitoring and manual steps after each recovery. |
| Minimises risk of unintentional primary conflicts. | Potentially longer downtime because human interaction is needed. |
| Guarantees node health before any promotion. | Less "automatic" for a typical HA environment. |
3️⃣ Use a Location Constraint to Block Promotion on Recovery
Idea: assign a very low (or `-INFINITY`) score to the recovering node for the Promoted role, so Pacemaker will never promote the DRBD resource there automatically.
Example
# Assume node-a is the preferred primary.
# When node-a recovers we want to keep it from promoting drbd_r0.
# Prefer the other node (node-b) for the Promoted role
# (role-specific scores require a rule-based location constraint)
pcs constraint location drbd_r0-clone rule role=Promoted score=100 "#uname" eq node-b
# Explicitly forbid promotion on the recovering node
pcs constraint location drbd_r0-clone rule role=Promoted score=-INFINITY "#uname" eq node-a
- You can combine this with the high stickiness from Solution 1 for extra safety.
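When node-a is healthy again and you do want it promoted, the blocking rule has to be removed by hand. A sketch (the constraint ID is cluster-generated, so look it up first; `<constraint-id>` is a placeholder):

```shell
# Find the ID of the -INFINITY promotion rule added above
pcs constraint location --full

# Remove it by ID (placeholder - substitute the ID printed above)
pcs constraint remove <constraint-id>

# Then move the primary deliberately and drop the temporary move constraint
pcs resource move drbd_r0-clone node-a
pcs resource clear drbd_r0-clone
```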
| Pros | Cons |
|---|---|
| Provides fine-grained control without putting the whole node in standby. | Still requires manual intervention to perform a fail-back. |
| Prevents automatic promotion, thus avoiding split-brain. | Slightly more complex to maintain the constraints. |
| Works together with stickiness for a "double-lock". | May need adjustments if the cluster topology changes. |
Summary
| Solution | How it stops the "kill" scenario | When to use it |
|---|---|---|
| High stickiness (Solution 1) | Keeps the current Primary on its node; the recovered node stays Secondary until an admin moves it. | Preferred when you want automatic fail-over but manual fail-back. |
| Standby/maintenance mode (Solution 2) | Stops Pacemaker from touching the recovering node at all, giving you time to verify health. | Useful in environments where node health checks are mandatory before any resource runs. |
| Location constraint (Solution 3) | Gives the recovering node a score that forbids promotion, while still allowing it to run as Secondary. | Good when you need per-resource control without taking the whole node offline. |
All three approaches achieve the same goal: the recovering node never automatically promotes its DRBD resource without explicit administrative approval. Choose the one that best matches your operational workflow and the level of automation you desire.
Solution Overview
The solution leverages several Pacemaker global options and resource meta-attributes to ensure a sequential and controlled transition of the DRBD primary role. Key elements include:
- Robust fencing (STONITH)
- An increased `cluster-delay` for state propagation
- Carefully configured timeouts for resource actions
1. Ensure Robust Fencing (STONITH)
Why?
If Pacemaker cannot reliably fence a failed node, no failāback strategy is truly safe.
# Enable STONITH globally
pcs property set stonith-enabled=true
# Define the quorum policy (choose stop or freeze as required)
pcs property set no-quorum-policy=stop # or 'freeze'
# Create STONITH devices (example using fence_ipmilan; run
# `pcs stonith list` to see the agents available on your system)
pcs stonith create fence_ipmi_node1 fence_ipmilan \
    ip=192.168.1.10 pcmk_host_list=node-a \
    username=admin password=password \
    op monitor interval=60s
pcs stonith create fence_ipmi_node2 fence_ipmilan \
    ip=192.168.1.11 pcmk_host_list=node-b \
    username=admin password=password \
    op monitor interval=60s
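Fencing should be exercised, not just configured. A minimal test pass (warning: this really reboots the target node, so run it in a maintenance window):

```shell
# Check that both fence devices are running
pcs stonith status

# Deliberately fence one node from its peer
pcs stonith fence node-b

# After node-b returns, confirm the fencing event was recorded
stonith_admin --history node-b
```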
2. Increase cluster-delay
Give Pacemaker more time to propagate state changes and avoid premature decisions.
pcs property set cluster-delay=60s
This gives Pacemaker up to 60 extra seconds for actions on other nodes to report their results before it treats them as failed and reacts, which is when premature placement decisions are most likely.
3. Configure DRBD Timeouts
| Parameter | Purpose |
|---|---|
| `promoted-stop-timeout` | Maximum time Pacemaker will wait for a demote/stop to complete on the promoted instance. |
| `on-fail=block` (per operation) | Leaves the resource blocked for manual intervention when a demote (stop) fails, instead of escalating immediately. |
pcs resource update drbd_r0-clone \
    op demote timeout=120s on-fail=block \
    op stop timeout=120s on-fail=block \
    meta promoted-stop-timeout=180s
4. Optional: Resource Stickiness for Preferred Node
If you want a graceful auto-failback to a preferred node, keep stickiness modest and add a location preference for that node whose score outweighs the stickiness, but ensure the safeguards above are in place.
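One way to sketch that trade-off (the scores are illustrative, not recommendations):

```shell
# Modest stickiness damps flapping without pinning the resource forever
pcs resource meta drbd_r0-clone resource-stickiness=100

# A preference that outweighs the stickiness lets the resource return
# to node-a automatically once it is healthy again
pcs constraint location drbd_r0-clone prefers node-a=500
```

Because the preference (500) is larger than the stickiness (100), the resource fails back to node-a on recovery; swap the magnitudes and it stays put, which is exactly the manual fail-back behaviour of Solution 1.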