20-Oct-2021 - Core02.tor1 routing issues - Post-mortem

Post by **porcupine** » Thu Oct 21, 2021 7:04 pm

Hi Folks,

During the "pre-business" hours on 20-Oct-2021, some customers experienced dropped traffic on a number of routes transiting our core02.tor1 router.

Our core02.tor1 router faces our Level3 transit circuit, along with a number of private peering routes. As such, customers experienced issues connecting to several external routes, because traffic to those routes was being silently dropped by the device until the issue was reported and isolated.

The following post-mortem is a breakdown of the specific timing, events, and details surrounding this incident. All timestamps are in Eastern (local) time:

Timeline of Events:

20-Oct-2021:
7:09am - Customer contacts PC staff by email indicating they're experiencing a routing issue, providing an outbound traceroute from the Tor1 network with limited information (only 1 hop). PC staff respond requesting additional information.
7:23am - Same customer provides a second traceroute that gets a bit further into the remote provider's network (3 hops), giving PC staff enough information to begin troubleshooting.
7:35am - PC staff ask customer if they can provide a more detailed traceroute, as we still don't know the remote IP, if it's the remote network, etc.
7:39am - PC staff confirm the issue is isolated to routes over core02.tor1, and that it's almost certainly the same CEF issue which impacted core01.tor1 ~2 weeks prior (for which core02.tor1 already had mitigating maintenance scheduled).
7:42am - Core02.tor1's transit sessions are immediately shut down, and peering sessions begin shutting down.
7:44am - A tweet is posted confirming issue with core02.tor1 router, referencing previous event on core01.tor1 (date of previous core01.tor1 event in tweet is incorrect, human error due to haste in posting).
7:47am - PC staff confirm to the reporting customer that the issue has been identified, mitigation has commenced, and issues should subside (given the substantial reduction in table usage/etc. from pulling transit routes). PC staff discuss a plan of action, now that all external BGP sessions are down, or will be soon.
7:49am - Initial customer reports a new route (internally transiting from core01.tor1, over, and through core02.tor1) is having the same issues. PC staff recognize that limping the core02.tor1 router along for any amount of time is not going to be feasible. An immediate reboot is required to flush the CEF tables.
7:56am - All tor2 facility traffic transiting core02.tor1 is shut down to ensure issues remain isolated.
8:04am - A tweet is posted confirming that core02.tor1 will require an emergency reboot.
8:08am - All peering sessions on core02.tor1 have been shut down, along with all customer peering sessions (where we know customers have alternative BGP paths).
8:11am - Metrics across the network are bumped to ensure all traffic (via BGP, IS-IS, etc.) avoids the router outright. Core02.tor1 is now fully isolated from the network, and not impacting any customers (besides directly connected BGP customers, who now have one less path).
8:11am - PC staff determine the emergency reboot will require all the same steps as the planned maintenance. The router is prepped for the minor OS upgrade that was planned for 24-Oct-2021, since it will not add any additional impact to the network/event.
8:21am - Core02.tor1 is rebooted.
8:30am - Core02.tor1 is back online; the shutdown/restart went without any incidents/hiccups. PC staff realize they forgot the monitoring server was directly connected to core02.tor1, and several monitoring notices have slipped out (since the monitoring server lost its gateway). A tweet is posted to confirm/clarify this, and the mail queue on the monitoring server is flushed of such notices to prevent additional erroneous alerts from getting out.
8:30am - 9:30am - PC staff methodically reintroduce traffic/routes/BGP sessions to the core02.tor1 router to ensure that the device does not become overwhelmed, preventing any further impact to the network.
9:22am - A tweet is posted, confirming the incident is now closed out, all traffic has been fully returned to core02.tor1, and the maintenance scheduled for 24-Oct-2021 is no longer necessary, as all involved work has been performed.
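
For anyone who wants the key intervals at a glance, the timestamps above can be turned into durations. The snippet below is only a sketch: the milestone labels are illustrative shorthand for the timeline entries, not official incident metrics.

```python
from datetime import datetime

# Key timestamps taken from the timeline above (Eastern time, 20-Oct-2021).
# The milestone names are illustrative labels, not official terminology.
MILESTONES = {
    "first_report": "7:09am",
    "issue_identified": "7:39am",
    "router_isolated": "8:11am",
    "reboot_started": "8:21am",
    "router_back_online": "8:30am",
    "incident_closed": "9:22am",
}


def parse(ts: str) -> datetime:
    """Parse a timeline timestamp such as '7:09am' (all on the same day)."""
    return datetime.strptime(ts, "%I:%M%p")


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named milestones."""
    delta = parse(MILESTONES[end]) - parse(MILESTONES[start])
    return int(delta.total_seconds() // 60)


print("Report to identification:", minutes_between("first_report", "issue_identified"), "min")
print("Report to full isolation:", minutes_between("first_report", "router_isolated"), "min")
print("Reboot duration:", minutes_between("reboot_started", "router_back_online"), "min")
print("Report to close:", minutes_between("first_report", "incident_closed"), "min")
```

By this reckoning, roughly 30 minutes passed between the first report and identification, and the reboot itself took about 9 minutes.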

As noted in the tweets (and events above), to reiterate: The maintenance scheduled for core02.tor1 on 24-Oct-2021 is cancelled, as the work was completed during the emergency maintenance/reboot performed in this incident.

Preventative Measures:

As with the issue experienced on 07-Oct-2021 with core01.tor1, the best way to prevent CEF table bugs of this nature is to ensure that routers do not have extensive uptimes (into the multiple years). In light of the issues suffered by core01.tor1, we had already planned preventative maintenance for core02.tor1, specifically to address the excessive uptime; unfortunately the core02.tor1 router ran into the bug 4 days before said scheduled maintenance would have addressed the issue.
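
As a rough illustration of the uptime check described above, the sketch below flags routers whose uptime exceeds a threshold. The one-year threshold, the function name, and the router names and uptimes are all assumptions made up for the example; they are not our actual policy, tooling, or inventory.

```python
from datetime import timedelta

# Assumed threshold: the post only says uptimes "into the multiple years"
# are a problem; one year is an illustrative choice, not a documented policy.
MAX_UPTIME = timedelta(days=365)


def routers_due_for_reboot(uptimes, max_uptime=MAX_UPTIME):
    """Return names of routers whose uptime exceeds max_uptime, longest first."""
    overdue = [(name, up) for name, up in uptimes.items() if up > max_uptime]
    overdue.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in overdue]


# Hypothetical uptimes for illustration only; not real figures from our routers.
example = {
    "router-a": timedelta(days=900),
    "router-b": timedelta(days=40),
    "router-c": timedelta(days=1200),
}

print(routers_due_for_reboot(example))  # ['router-c', 'router-a']
```

In practice the uptimes would come from whatever monitoring is already polling the devices; the point is simply to schedule a reboot before a device drifts into multi-year uptime.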

We do not expect core01.tor1 or core02.tor1 to experience this sort of issue again in their existing hardware lifetimes (as plans have been in place for a while now to replace both routers with newer hardware models).

Please let me know directly if you require further details, or have any questions or concerns regarding either this event or this post-mortem. Once again, I'd like to extend my apologies for any inconvenience caused by this event.