Solved: Pacemaker/DRBD: Auto-failback kills active DRBD Sync Primary to Secondary. How to prevent this?
Source: Dev.to
Executive Summary
TL;DR: Pacemaker's default auto-failback behavior can disrupt an active DRBD primary by attempting premature promotion on a recovering node, leading to service outages and potential data risks. This issue can be prevented by:
- Configuring high resource stickiness (e.g., 10000 or INFINITY) on the DRBD promotable (master/slave) clone resource, so the resource stays where it is.
- Implementing manual failback (standby mode or location constraints).
- Setting up graceful, delayed promotion with robust STONITH, an increased `cluster-delay`, and generous `promoted-stop-timeout` values.
Why This Happens
Pacemaker/DRBD clusters provide high availability, but the default behavior often tries to "fail back" resources to their preferred node as soon as that node recovers. In a DRBD setup this can be disastrous:
| Symptom | Description |
|---|---|
| Service outages | Applications on the active DRBD primary stop or become unresponsive. |
| DRBD status changes | The primary flips to Secondary, Unknown, or shows a conflict state (e.g., WFConnection, StandAlone). |
| Pacemaker log entries | Logs show promotion attempts on the recovering node and demotion or fencing actions on the current primary. Look for drbd_promote, drbd_demote, or conflict messages. |
Example Pacemaker Log Snippet
Sep 20 10:35:01 node-a pacemakerd[12345]: info: Status: Requesting promote of drbd_res on node-a
Sep 20 10:35:01 node-a pacemakerd[12345]: crit: Result: promote_drbd_res_on_node-a: CIB_R_ERR_OP_FAILED
Sep 20 10:35:01 node-b pacemakerd[12345]: info: Status: Requesting demote of drbd_res on node-b
Sep 20 10:35:01 node-b pacemakerd[12345]: info: drbd_demote: stdout [drbd_demote: Attempting to demote resource 'r0']
Sep 20 10:35:02 node-b pacemakerd[12345]: warn: drbd_demote: stderr [drbd_demote: Cannot demote 'r0', it is still in use.]
Sep 20 10:35:02 node-b pacemakerd[12345]: crit: Result: demote_drbd_res_on_node-b: CIB_R_ERR_OP_FAILED
Running drbd-overview during the issue will show unexpected roles or connections.
Example drbd-overview Output During Conflict
0:r0 Connected Primary/Primary UpToDate/UpToDate
[WARNING: This indicates split-brain in a two-node cluster, which Pacemaker should prevent]
[More likely, you'll see a quick flip or errors.]
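When triaging an event like this, it helps to reduce the logs to just the promotion, demotion, and fencing lines. A minimal sketch (the `journalctl` unit name in the usage comment is an assumption; adjust for your distribution):

```shell
#!/bin/sh
# Filter a Pacemaker log stream down to the events that matter here.
pcmk_events() {
  grep -E 'promote|demote|fence|WFConnection|StandAlone'
}

# Example usage on a live node (unit name is an assumption):
#   journalctl -u pacemaker --since '-15 min' | pcmk_events | tail -n 50
```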
Core Problem
- Resource locality preference: Pacemaker prefers to keep resources on their "preferred" nodes. When a node recovers, Pacemaker treats it as a suitable candidate again.
- DRBD primary requirement: in a two-node synchronous (Protocol C) DRBD setup, only one node may be Primary at a time.
- Premature promotion attempt: Pacemaker may try to promote the DRBD resource on the recovering node immediately, before the current primary can be safely demoted.
- Resulting conflict: this can lead to:
  - Promotion failure (if the resource agent detects another primary).
  - A race condition where both nodes briefly think they should be primary.
  - DRBD's internal conflict resolution (automatic demotion or fencing), which can cause I/O disruption and application failure.
Solutions
1. High Resource Stickiness
# Example: set a very high stickiness on the DRBD clone so it stays put.
# Note: stickiness must be POSITIVE to keep the resource where it is;
# set it on the clone itself (pcs resource defaults would apply it to
# every resource in the cluster). Clone id drbd_res-clone is an example.
pcs resource meta drbd_res-clone resource-stickiness=10000
- Guarantees that Pacemaker won't automatically fail back the DRBD resource.
- The resource stays on the current primary until an administrator moves it manually.
2. Manual Failback Strategies
| Method | How It Works |
|---|---|
| Standby mode | Put the recovering node into standby (`pcs node standby <node>`). Pacemaker will not schedule any resources on it until you clear standby. |
| Location constraints with negative scores | Create a location rule that gives the Promoted role a large negative score on the recovering node (example below). |
| Explicit move | Use `pcs resource move drbd_res <node>` when you're ready to promote. |

Example constraints:

```bash
# Prefer node-a for the resource
pcs constraint location drbd_res prefers node-a=INFINITY
# Penalise promotion on node-b (role-specific scores need a rule)
pcs constraint location drbd_res rule role=Promoted score=-1000 "#uname" eq node-b
```
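Putting these methods together, a manual fail-back might look like this end to end (a sketch; the node and resource names are examples, and `pcs resource clear` assumes pcs 0.10 or later):

```shell
#!/bin/sh
# Hypothetical manual fail-back workflow.
set -eu

NODE=node-a      # the recovered node we eventually want as primary
RES=drbd_res     # the DRBD resource managed by Pacemaker

pcs node standby "$NODE"          # keep Pacemaker off the node while DRBD resyncs
# ... verify DRBD reports UpToDate on $NODE before continuing ...
pcs node unstandby "$NODE"        # let it rejoin as Secondary
pcs resource move "$RES" "$NODE"  # promote there when you are ready
pcs resource clear "$RES"         # remove the temporary move constraint afterwards
```

The `clear` step matters: `pcs resource move` works by injecting a location constraint, which will pin the resource until you remove it.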
3. Graceful & Delayed Promotion
1. **Robust STONITH**: ensure fencing works reliably; otherwise Pacemaker may skip safe demotion.
2. **Increase `cluster-delay`**: gives the cluster more time to propagate state changes before acting.
```bash
pcs property set cluster-delay=30s
```
3. **Set a generous `promoted-stop-timeout`**: allows the old primary to finish demotion cleanly (`pcs resource update drbd_res meta promoted-stop-timeout=120s`).
4. **Optional `migration-threshold`**: prevents rapid back-and-forth moves (`pcs resource defaults migration-threshold=3`).
Checklist for a Safe DRBD-Pacemaker Cluster
- STONITH configured and tested on all nodes.
- High stickiness (or equivalent location constraints) applied to DRBD resources.
- `cluster-delay` and `promoted-stop-timeout` values tuned for your workload.
- Monitoring (e.g., `pcs status`, `drbd-overview`, Pacemaker logs) set up to alert on promotion/demotion events.
- Documentation of manual failback procedures for operators.
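For the monitoring item, even a small cron-driven check can catch unexpected role flips. A sketch (the resource name `r0`, the state-file path, and the alert delivery in the usage comment are assumptions):

```shell
#!/bin/sh
# Report when the DRBD role seen now differs from the role seen last run.
role_changed() {
  prev=$1; cur=$2
  if [ "$prev" != "$cur" ]; then
    echo "DRBD role changed: $prev -> $cur"
  fi
}

# Example cron usage on a live node:
#   cur=$(drbdadm role r0)
#   prev=$(cat /var/tmp/r0.role 2>/dev/null || echo unknown)
#   role_changed "$prev" "$cur" | mail -s "DRBD alert on $(uname -n)" ops@example.com
#   printf '%s\n' "$cur" > /var/tmp/r0.role
```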
TL;DR (Restated)
- Set a high `resource-stickiness` (or use location constraints) to stop automatic failback.
- When you need to move the primary, do it manually (standby, `pcs resource move`, or explicit location rules).
- If you prefer automatic failback, make sure STONITH works, increase `cluster-delay`, and raise `promoted-stop-timeout` so the old primary can demote cleanly before the new one promotes.
By following these steps you can avoid "kill" scenarios, keep DRBD resources stable, and maintain true high availability for your services.
Preventing "kill" Scenarios When a DRBD Node Recovers
When a node that was previously Primary comes back online while another node is already acting as Primary, Pacemaker may try to promote the recovering node again. If the promotion cannot be performed cleanly, the active node is forced out of its role, resulting in an ungraceful shutdown of services (the so-called "kill").
Typical Causes
- Lack of graceful demotion: Pacemaker does not have enough time or a clear mandate to demote the current Primary before the recovering node asserts itself.
- Insufficient or slow fencing (STONITH): the cluster cannot reliably isolate the failing node.
Below are three proven ways to avoid this situation while keeping the cluster highly available.
1️⃣ Keep the Primary "Sticky" (High Resource-Stickiness)
Idea: tell Pacemaker never to move the DRBD resource back to a node that has just recovered, unless an administrator does it manually.
How it works
- A large positive `resource-stickiness` on the DRBD clone makes the current Primary "sticky": staying put outweighs any preference for the recovered node.
- Optionally add a location constraint that prefers the node that already holds the Primary.
Example configuration
# 1. Define the DRBD resource as a promotable clone (example;
#    operation timeouts and intervals are illustrative)
pcs resource create drbd_r0 ocf:linbit:drbd \
    drbd_resource=r0 \
    op monitor interval=60s role=Unpromoted \
    op monitor interval=30s role=Promoted \
    op promote timeout=90s \
    op demote timeout=90s \
    promotable promoted-max=1 promoted-node-max=1 \
    clone-max=2 clone-node-max=1 notify=true interleave=true
# 2. Add a very high stickiness so the current Primary stays put;
#    this is what disables automatic fail-back
pcs resource meta drbd_r0-clone resource-stickiness=10000
# 3. Filesystem resource that depends on drbd_r0 being Primary
pcs resource create fs_data ocf:heartbeat:Filesystem \
device="/dev/drbd/by-res/r0" directory="/mnt/data" fstype="ext4" \
op monitor interval="30s"
# 4. Ensure fs_data runs only where drbd_r0-clone is promoted
pcs constraint colocation add fs_data with promoted drbd_r0-clone INFINITY
# 5. Ensure ordering: promote DRBD first, then start the filesystem
pcs constraint order promote drbd_r0-clone then start fs_data
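Before trusting the setup, it is worth dry-running what Pacemaker would do. A quick read-only verification pass:

```shell
# Confirm the stickiness and operation timeouts landed where intended
pcs resource config drbd_r0-clone

# List all constraints with their IDs (useful later when removing one)
pcs constraint --full

# Ask Pacemaker what it would do right now without touching anything
crm_simulate --simulate --live-check
```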
| Pros | Cons |
|---|---|
| Highly predictable and reliable. | Requires a manual `pcs resource move` to fail back after recovery. |
| Prevents split-brain caused by aggressive auto-failback. | Downtime may increase if the admin is slow to intervene. |
| Simplifies troubleshooting: no resource flapping. | Resources can stay on a less-preferred node for a long time. |
2️⃣ Put the Recovering Node in Standby (Maintenance Mode)
Idea: prevent Pacemaker from starting any resources on the node that just came back, giving you time to verify its health before allowing a promotion.
Steps
# On the admin workstation (or any cluster node)
pcs node standby <node>
- The node stays in standby; Pacemaker will not schedule resources on it.
- After verification, bring it back:
pcs node unstandby <node>
- Then manually move the DRBD resource if you want a fail-back.
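Before running `pcs node unstandby`, it is worth checking that DRBD has finished resynchronising. A minimal sketch (the DRBD 9 `drbdadm status` wording is assumed; on DRBD 8.4, parse `/proc/drbd` instead):

```shell
#!/bin/sh
# Return success only when the given `drbdadm status` text looks fully synced.
drbd_synced() {
  case "$1" in
    *Inconsistent*|*Outdated*|*Diskless*) return 1 ;;
    *UpToDate*)                           return 0 ;;
    *)                                    return 1 ;;
  esac
}

# Example usage on a live node (resource name r0 is an assumption):
#   drbd_synced "$(drbdadm status r0)" && pcs node unstandby "$(uname -n)"
```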
| Pros | Cons |
|---|---|
| Gives complete administrative control over fail-back. | Requires continuous monitoring and manual steps after each recovery. |
| Minimises risk of unintentional primary conflicts. | Potentially longer downtime because human interaction is needed. |
| Guarantees node health before any promotion. | Less "automatic" for a typical HA environment. |
3️⃣ Use a Location Constraint to Block Promotion on Recovery
Idea: assign a very low (or `-INFINITY`) score to the recovering node for the Promoted role, so Pacemaker will never promote the DRBD resource there automatically.
Example
# Assume node-a is the preferred primary.
# When node-a recovers we want to keep it from promoting drbd_r0.
# Prefer the other node (node-b) for the Promoted role
# (role-specific scores require a rule-based location constraint)
pcs constraint location drbd_r0-clone rule role=Promoted score=100 "#uname" eq node-b
# Explicitly forbid promotion on the recovering node
pcs constraint location drbd_r0-clone rule role=Promoted score=-INFINITY "#uname" eq node-a
- You can combine this with the high stickiness from Solution 1 for extra safety.
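When node-a is healthy again and you do want it promoted, the blocking rule has to be removed by hand. A sketch (the constraint ID is cluster-generated, so look it up first; `<constraint-id>` is a placeholder):

```shell
# Find the ID of the -INFINITY promotion rule added above
pcs constraint location --full

# Remove it by ID (placeholder - substitute the ID printed above)
pcs constraint remove <constraint-id>

# Then move the primary deliberately and drop the temporary move constraint
pcs resource move drbd_r0-clone node-a
pcs resource clear drbd_r0-clone
```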
| Pros | Cons |
|---|---|
| Provides fine-grained control without putting the whole node in standby. | Still requires manual intervention to perform a fail-back. |
| Prevents automatic promotion, thus avoiding split-brain. | Slightly more complex to maintain the constraints. |
| Works together with stickiness for a "double-lock". | May need adjustments if the cluster topology changes. |
Summary
| Solution | How it stops the "kill" scenario | When to use it |
|---|---|---|
| High stickiness (Solution 1) | Keeps the current Primary on its node; the recovered node stays Secondary until an admin moves it. | Preferred when you want automatic fail-over but manual fail-back. |
| Standby/maintenance mode (Solution 2) | Stops Pacemaker from touching the recovering node at all, giving you time to verify health. | Useful in environments where node health checks are mandatory before any resource runs. |
| Location constraint (Solution 3) | Gives the recovering node a score that forbids promotion, while still allowing it to run as Secondary. | Good when you need per-resource control without taking the whole node offline. |
All three approaches achieve the same goal: the recovering node never automatically promotes its DRBD resource without explicit administrative approval. Choose the one that best matches your operational workflow and the level of automation you desire.
Solution Overview
The solution leverages several Pacemaker global options and resource meta-attributes to ensure a sequential and controlled transition of the DRBD primary role. Key elements include:
- Robust fencing (STONITH)
- An increased `cluster-delay` for state propagation
- Carefully configured timeouts for resource actions
1. Ensure Robust Fencing (STONITH)
Why?
If Pacemaker cannot reliably fence a failed node, no failāback strategy is truly safe.
# Enable STONITH globally
pcs property set stonith-enabled=true
# Define the quorum policy (choose stop or freeze as required)
pcs property set no-quorum-policy=stop # or 'freeze'
# Create STONITH devices (example using fence_ipmilan; run
# `pcs stonith list` to see the agents available on your system)
pcs stonith create fence_ipmi_node1 fence_ipmilan \
    ip=192.168.1.10 pcmk_host_list=node-a \
    username=admin password=password \
    op monitor interval=60s
pcs stonith create fence_ipmi_node2 fence_ipmilan \
    ip=192.168.1.11 pcmk_host_list=node-b \
    username=admin password=password \
    op monitor interval=60s
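Fencing should be exercised, not just configured. A minimal test pass (warning: this really reboots the target node, so run it in a maintenance window):

```shell
# Check that both fence devices are running
pcs stonith status

# Deliberately fence one node from its peer
pcs stonith fence node-b

# After node-b returns, confirm the fencing event was recorded
stonith_admin --history node-b
```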
2. Increase cluster-delay
Give Pacemaker more time to propagate state changes and avoid premature decisions.
pcs property set cluster-delay=60s
This gives Pacemaker up to 60 extra seconds for actions on other nodes to report their results before it treats them as failed and reacts, which is when premature placement decisions are most likely.
3. Configure DRBD Timeouts
| Parameter | Purpose |
|---|---|
| `promoted-stop-timeout` | Maximum time Pacemaker will wait for a demote/stop to complete on the promoted instance. |
| `on-fail=block` (per operation) | Leaves the resource blocked for manual intervention when a demote (stop) fails, instead of escalating immediately. |
pcs resource update drbd_r0-clone \
    op demote timeout=120s on-fail=block \
    op stop timeout=120s on-fail=block \
    meta promoted-stop-timeout=180s
4. Optional: Resource Stickiness for Preferred Node
If you want a graceful auto-failback to a preferred node, keep stickiness modest and add a location preference for that node whose score outweighs the stickiness, but ensure the safeguards above are in place.
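One way to sketch that trade-off (the scores are illustrative, not recommendations):

```shell
# Modest stickiness damps flapping without pinning the resource forever
pcs resource meta drbd_r0-clone resource-stickiness=100

# A preference that outweighs the stickiness lets the resource return
# to node-a automatically once it is healthy again
pcs constraint location drbd_r0-clone prefers node-a=500
```

Because the preference (500) is larger than the stickiness (100), the resource fails back to node-a on recovery; swap the magnitudes and it stays put, which is exactly the manual fail-back behaviour of Solution 1.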