We have a 2-node heartbeat cluster that serves a virtual IP. Previously, due to an error, the network interface on node1 died, and the cluster kicked node1 out of the virtual IP party.
Now that we have fixed the interface, node1 still does not get to rejoin the virtual IP party. Setting node2 to standby does not trigger failover to node1.
I am unfamiliar with heartbeat. Is there a configuration/command anywhere that allows me to reverse/configure/un-blacklist this?
After some digging, it turns out that the failcount had hit its limit during the network interface debacle. Hence, the resource refused to migrate back to the now-working node. I could view the failcount for each resource with:
$ pcs resource failcount show <resource_id> [node]
The syntax is listed in the output of:
$ pcs resource help
To solve it, I ran this:
That cleared all the failcounts for my resources (see https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-handling.html). Failover now works and everything is fine.
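
For anyone hitting the same thing, here is a minimal sketch of the check-then-clear sequence. It assumes the failcounts are cleared with `pcs resource cleanup`, which is one command that does this; the post does not show the exact command that was run. `RESOURCE_ID`, `my_virtual_ip`, `DRY_RUN`, `run` and `reset_failcount` are hypothetical names used only for illustration. By default the script only prints what it would do.

```shell
#!/bin/sh
# Sketch only: inspect, then clear, the failcount of one resource.
# Assumption: pcs is installed and the cluster is managed with it.

RESOURCE_ID="${RESOURCE_ID:-my_virtual_ip}"  # hypothetical resource id
DRY_RUN="${DRY_RUN:-1}"                      # 1 = print commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

reset_failcount() {
    # Show the current failcount for the resource.
    run pcs resource failcount show "$1"
    # Clear the failure history so the resource is allowed back on the node.
    run pcs resource cleanup "$1"
}

reset_failcount "$RESOURCE_ID"
```

Run it with `DRY_RUN=0 RESOURCE_ID=<resource_id>` once the printed commands look right. Without a resource id, `pcs resource cleanup` applies to all resources.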