Around 11:55pm on Friday, November 5th, 2021, during a scheduled maintenance window, the tor1 network experienced a significant routing event, which led to a substantial number of routes becoming unavailable and the network appearing to be down from a number of locations.
While not immediately apparent, this was the result of a CEF desync on the core02.tor1 router, a problem which we had experienced several times in the preceding weeks on the core01.tor1 router and had, until this event, attributed to failing hardware/components/etc.
The following post-mortem is a breakdown of the specific timing, events, and details surrounding this incident. All timestamps are in Eastern (local) time:
Timeline of Events:
04-Nov-2021 - Plans to perform maintenance on the core01.tor1 router were posted/announced, involving fairly straightforward activities, primarily turning up TorIX and Tata connections on the new core01.tor1 router. These were typical activities which would otherwise be performed unannounced, but we scheduled a maintenance period because the past several weeks had not gone well in terms of network stability. Maintenance was scheduled for "after 10:00pm" on 05-Nov-2021, to ensure it was done outside of prime-time hours for as many customers as possible.
[Previous state] - Due to the previous events indicated above, all traffic had been routed away from the old core01.tor1 router and through core02.tor1, as we were in the midst of replacing core01.tor1 and believed the previous CEF issue to be tied to core01.tor1's hardware.
10:00pm - Maintenance window begins. PC staff are busy putting final touches on the replacement router's configuration and turning up initial sessions using test prefixes, to ensure initial testing is not occurring with live traffic.
11:15pm - Staff, happy with the current configurations, start turning up TorIX peering sessions, while monitoring for any impact/routing issues/etc.
11:55pm - After monitoring for ~30 minutes, a tweet is posted, indicating that TorIX routes have been turned up without issue. PC staff move on to turning up the Tata transit session.
12:04am - PC staff notice that the core02.tor1 router appears to be lagging considerably, failing to keep up with the new routes from the new/faster core01.tor1 router and causing sessions to flap. A tweet is posted acknowledging that staff are aware of the issue. The Tata session is temporarily turned down in an effort to stabilize the network.
12:14am - Tata's unique route volume is significantly reduced by "squashing" (filtering out) routes more specific than /23 (a common practice for many networks), for the benefit of the older core02.tor1 router. The Tata session is turned back up.
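On the router itself this kind of squashing would normally be done with an inbound prefix-list or route-map on the BGP session; the post-mortem does not show the actual filter, but the selection logic amounts to something like the following Python sketch (the prefixes shown are documentation examples, not real routes from the incident):

```python
import ipaddress

def squash_prefixes(prefixes, max_len=23):
    # Keep only routes no more specific than /max_len. Destinations inside
    # the dropped (more-specific) routes remain reachable via any covering
    # aggregate or default route, at the cost of less precise path selection.
    return [p for p in prefixes if ipaddress.ip_network(p).prefixlen <= max_len]

table = ["203.0.113.0/24", "198.51.100.0/22", "192.0.2.0/25", "10.0.0.0/8"]
print(squash_prefixes(table))  # the /24 and /25 are dropped
```

The trade-off is a smaller FIB on the struggling router in exchange for coarser routing toward the filtered destinations.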
12:20am - Staff notice there are still connectivity issues, and start hunting across the public looking glasses of Zayo, Level3, and Tata. The transit providers' looking glasses perform very slowly, hampering troubleshooting efforts. Third-party looking glasses [route-views.routeviews.org] show expected paths to the tor1 network.
12:27am - Staff observe issues with routes to Rogers, which has a direct private peering link on the core02.tor1 router. Staff initially suspect the routes are being actively suppressed within the Rogers network, so the peering session is temporarily shut down as a test; however, issues reaching Rogers persist even with the session down.
12:34am - PC staff start looking for potential CEF issues, having exhausted most obvious avenues to explain the issue at hand.
12:42am - PC staff find/validate routes on the core02.tor1 router which have inconsistent CEF entries (present in software, yet missing in hardware). These routes match up with the now-known problem routes. Staff determine the CEF issue is now presenting on a second piece of equipment (core02.tor1).
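On Cisco IOS-style platforms (an assumption; the post does not name the router OS), a software/hardware forwarding mismatch like this can typically be surfaced with the built-in CEF consistency checkers, roughly along these lines (the prefix shown is illustrative):

```
show ip cef inconsistency            ! report findings from the CEF consistency checkers
show ip cef 198.51.100.0/22 detail   ! confirm the prefix exists in the software FIB
```

Comparing what the software FIB reports against what the forwarding hardware has actually programmed is what distinguishes this failure mode from an ordinary missing route.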
12:45am - As an emergency measure, the circuit between core03.tor2 and core01[old].tor1 is brought back up (it had been previously shut down, to avoid the old core01.tor1 router).
12:46am - A tweet is published acknowledging the issue has been identified and is CEF-related; a reboot of core02.tor1 will be necessary.
12:48am - Static routes are added to core02.tor1 to bypass the broken CEF entries, as an emergency stop-gap solution until the reboot.
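This stop-gap presumably works because installing a new route for an affected prefix triggers the platform to program a fresh forwarding entry. In Cisco-style syntax the workaround would look roughly like the following (both the prefix and the next-hop address are placeholders, not the actual values used during the incident):

```
ip route 198.51.100.0 255.255.252.0 192.0.2.1
```

Being a per-prefix fix, this only patches over the specific routes known to be broken, which is why a full reboot was still required.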
1:01am - Internal metrics are adjusted to steer traffic away from core02.tor1.
1:05am - Staff continue to investigate the networks across both facilities, checking for other factors that may have played a hand in this incident, ensuring there are no lingering issues, and continuing to prep core02.tor1 for reboot.
1:47am - Reboot is issued to core02.tor1 router.
1:51am - The core02.tor1 router comes back up from the reboot.
2:23am - Reboot is confirmed on Twitter. Traffic remains steered away from core02.tor1; the vast majority of traffic is now transiting the new core01.tor1 router.
Unfortunately, with this event, the conclusions drawn from the previous incidents appear to have been incorrect: the CEF issue is unlikely to be isolated to failing hardware if two physically distinct but identical routers are impacted in such short order. The timing of core02.tor1's failure could not have been worse, as all traffic had been steered toward this device based on those earlier conclusions, and the failure appreciably hampered staff efforts to cleanly move traffic to the new platform.
The only reasonable conclusion we can come to is that the [old] core routing platform in the tor1 facility is unsustainable. Two replacement QFX routers (one to replace core02.tor1, and one hardware spare) have been purchased, and will be installed as soon as practicable. Currently the vast majority of traffic has been redirected to the new core01.tor1 router, with some intra-facility traffic still transiting the old core01.tor1 router (due to the emergency measures put in place during this event).
We will be scheduling additional maintenance periods both to turn up the new core02.tor1 router and to repair/correct the temporary measures put in place during this event. Customers originating from the old core routers are being individually contacted to arrange service migration, and the old core routers will be retired once that is complete.
While we realize this event took place during a posted maintenance window, we also realize that events like this are undesirable, regardless of posted notices. We will continue to work tirelessly to try to avoid issues like this going forward.
Please let me know directly if you require further details, or have any questions or concerns regarding either this event or this post-mortem. Once again, I'd like to extend my apologies for the inconvenience caused by this and the other related events.