Major network incident at uksouth-1

Incident Report for Kuxo

Postmortem

Leadup

Describe the circumstances that led to this incident
Router 2 started routing while Router 1 was suspended for a memory snapshot (taking a memory snapshot suspends the VM).

We were snapshotting the machine so that we could revert to an old configuration. This was not routine maintenance; it was part of the effort to completely resolve incident #3.

Fault

Describe what failed to work as expected
Router 2 saw that Router 1 was down and started routing traffic, which is expected behaviour. Router 1 then exited its suspension and continued operating as if nothing had happened.

Although Router 1 had a higher demotion level (it had not reinitialised its interfaces after noticing it had lost time), it still entered MASTER mode on all VIPs.

Router 2 kept broadcasting, as its demotion level was 0.

This caused traffic to be incorrectly routed.
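To illustrate the behaviour we expected, here is a minimal, purely illustrative Python sketch of CARP-style master election; it is not our routers' actual implementation, and the names and numbers are hypothetical. A node's demotion counter is effectively added to its advertisement skew, so a demoted Router 1 should have deferred to Router 2 instead of entering MASTER:

```python
# Illustrative sketch only (not the routers' real CARP code): how a demotion
# counter is expected to influence master election. A node's advertisements
# carry (advbase, advskew); the demotion counter is added to the skew, so a
# demoted node advertises "worse" values and should defer to a healthy peer.
from dataclasses import dataclass


@dataclass
class CarpNode:
    name: str
    advbase: int   # base advertisement interval (seconds)
    advskew: int   # configured skew (0-254, lower = more preferred)
    demotion: int  # demotion counter, raised while interfaces are not ready

    def effective_skew(self) -> int:
        # Demotion makes the node less preferred; cap at the protocol maximum.
        return min(self.advskew + self.demotion, 254)


def expected_master(a: CarpNode, b: CarpNode) -> CarpNode:
    # The node advertising the lower (advbase, effective skew) pair should win.
    return min(a, b, key=lambda n: (n.advbase, n.effective_skew(), n.name))


# Hypothetical values: Router 1 has just resumed and is demoted,
# Router 2 is the healthy backup.
router1 = CarpNode("router1", advbase=1, advskew=0, demotion=240)
router2 = CarpNode("router2", advbase=1, advskew=100, demotion=0)

print(expected_master(router1, router2).name)  # -> router2
```

During the incident both routers behaved as if they had won this election and announced the same VIPs at once, producing the split-brain described above.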

Detection

Describe how the incident was detected

We lost connection to the VPN soon after starting the snapshot.

Root causes

Run a 5-whys analysis to understand the true causes of the incident

  • Routers 1 and 2 did not route traffic correctly
  • Both routers were broadcasting that they owned all VIPs
  • The routers were not parsing CARP multicast advertisements properly
  • Further investigation required

Mitigation and resolution

What steps did you take to resolve this incident?

Rebooted Router 1, accessed vSphere, and shut down Router 2.
Restored Router 2 and monitored CARP states as it came back up.
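The power-cycle of Router 2 was done by hand through vSphere. For reference, if we were to script it, a rough sketch using pyVmomi might look like the following; the vCenter host, credentials, and VM name below are placeholders, not our actual values:

```python
# Rough sketch only: power-cycle a router VM through the vSphere API using
# pyVmomi. Host, credentials, and the VM name are placeholders.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim


def find_vm(content, name):
    """Return the first VM whose name matches, or None."""
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True
    )
    try:
        return next((vm for vm in view.view if vm.name == name), None)
    finally:
        view.DestroyView()


ctx = ssl._create_unverified_context()  # lab-only: skips certificate checks
si = SmartConnect(host="vcenter.example.internal", user="ops",
                  pwd="REDACTED", sslContext=ctx)
try:
    content = si.RetrieveContent()
    router2 = find_vm(content, "router2")

    # Hard power-off the misbehaving backup router, then bring it back up
    # while CARP states are watched on Router 1.
    if router2.runtime.powerState == vim.VirtualMachinePowerState.poweredOn:
        WaitForTask(router2.PowerOffVM_Task())
    WaitForTask(router2.PowerOnVM_Task())
finally:
    Disconnect(si)
```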

Lessons learnt

What went well? What could have gone better? What else did you learn?

  • Our routers do not like memory snapshots!
  • All maintenance, including unscheduled and non-major work, should be broadcast, no matter how small we believe it to be
  • We should seek more approval and a steadier review flow for network changes like this
  • The network change was made without direct confirmation from the NOC lead during Shift 2; we need to ensure clear communication and that everyone's voice is heard
Posted Mar 12, 2022 - 00:53 UTC

Resolved

This incident has been resolved.

We are in contact with VMware to investigate the root cause of the issue.
Posted Mar 11, 2022 - 23:38 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Mar 11, 2022 - 23:37 UTC

Update

Basic networking and WAN have been restored.
Posted Mar 11, 2022 - 23:36 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Mar 11, 2022 - 23:36 UTC

Investigating

During a routine virtual machine snapshot, virtual machines across our infrastructure and all hypervisors became inaccessible.

All engineers are on deck and working to resolve the issue at hand.
Posted Mar 11, 2022 - 23:23 UTC
This incident affected: Sites (Kettering, UK (uksouth-1)) and Bare Metal Cloud (BMC) (Metal Networking Stack).