Describe the circumstances that led to this incident
Router 2 was started while Router 1 was suspended for a memory snapshot of its VM.
We were snapshotting the machine in order to revert to an old configuration. This was not routine maintenance; it was part of the effort to completely resolve incident #3.
Describe what failed to work as expected
Router 2 saw that Router 1 was down and started routing traffic, which is expected behaviour. Router 1 then came out of suspension and kept operating as if nothing had happened.
Router 1 had a higher demotion level because it had not reinitialised its interfaces after noticing that it had lost time, yet it still entered MASTER mode on all VIPs.
Router 2 kept broadcasting because its demotion level was 0.
With both routers acting as MASTER, traffic was routed incorrectly.
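The failure described above is a dual-MASTER condition: two routers claiming the same VIP at once. A minimal sketch of how such a condition could be flagged is shown here. The router names, VIP names, and the `reported` data are hypothetical, and how the states are collected from each router is left out.

```python
from collections import defaultdict

# Hypothetical snapshot of (router, VIP, CARP state); illustrative values only.
reported = [
    ("router1", "vip-a", "MASTER"),
    ("router2", "vip-a", "MASTER"),
    ("router1", "vip-b", "MASTER"),
    ("router2", "vip-b", "BACKUP"),
]


def find_dual_master(reports):
    """Return {vip: [routers]} for every VIP claimed as MASTER by more than one router."""
    masters = defaultdict(list)
    for router, vip, state in reports:
        if state == "MASTER":
            masters[vip].append(router)
    return {vip: sorted(routers) for vip, routers in masters.items() if len(routers) > 1}


conflicts = find_dual_master(reported)
print(conflicts)
```

A check like this only helps if it runs somewhere that can still reach both routers, which was not the case once the VPN connection dropped.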
Describe how the incident was detected
We lost connection to the VPN soon after starting the snapshot.
Run a 5-whys analysis to understand the true causes of the incident
What steps did you take to resolve this incident?
Rebooted Router 1, accessed vSphere, shut down Router 2.
Restored Router 2 and monitored CARP states while it came back up.
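Monitoring CARP states during a restore could be scripted along these lines. This sketch assumes the routers print FreeBSD-style `ifconfig` lines such as `carp: MASTER vhid 1 advbase 1 advskew 0`; the platform is not stated in this report, so the line format and the `sample` text are assumptions.

```python
import re

# Assumed FreeBSD-style line: "carp: MASTER vhid 1 advbase 1 advskew 0"
CARP_LINE = re.compile(r"carp:\s+(?P<state>MASTER|BACKUP|INIT)\s+vhid\s+(?P<vhid>\d+)")


def carp_states(ifconfig_text):
    """Map each vhid found in the text to its reported CARP state."""
    return {int(m.group("vhid")): m.group("state") for m in CARP_LINE.finditer(ifconfig_text)}


# Hypothetical output captured from the restored router.
sample = """\
    carp: BACKUP vhid 1 advbase 1 advskew 100
    carp: MASTER vhid 2 advbase 1 advskew 0
"""

print(carp_states(sample))
```

Running this against output from both routers at intervals would show when each VIP settles on a single MASTER.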
What went well? What could have gone better? What else did you learn?