As a number of customers noticed throughout the early afternoon today, we experienced issues with our core01.tor1 router, which resulted in latency, dropped traffic, and unpredictable behaviour across a number of routes transiting this device.
Our core01.tor1 router faces our Toronto Internet Exchange (TorIX) connection, along with one of our transit providers - Tata Communications Canada. As such, a number of customers experienced issues connecting to major TorIX peers such as Microsoft, Google, and AWS. The impact on traffic transiting this router was seemingly random and unpredictable; often existing connections worked, but new connections could not be formed, etc.
The following post-mortem is a breakdown of the specific timing, events, and details surrounding this incident. All timestamps are in Eastern (local) time:
Timeline of Events:
1:20pm - Initial customer inquiry regarding an apparent issue with a single, unspecified route, asking if we see any problems on the network. A quick check of the active stats/logs ensues; the customer is informed of which checks have been made and that no issues have been detected. PC staff request more details from the initial reporting customer.
1:39pm - Initial reporting customer provides the remote IP address they're having difficulty reaching from within the network; investigation commences.
1:42pm - A second customer contacts PC staff suggesting they're experiencing a routing issue; PC staff request details.
1:47pm - Initial route of concern is identified as belonging to Microsoft, with the preferred path going over TorIX. PC staff log in to core01.tor1 to investigate further.
2:03pm - Second customer provides the requested details. PC staff note both paths transit TorIX, suggesting a potential issue with core01.tor1 or the Toronto Internet Exchange.
2:14pm - Additional staff are told to pause other work, to ensure all staff resources are available to diagnose and address the issue.
2:15pm - Investigation slows, as neither remote IP responds to ICMP ping queries. The lack of ping delays diagnostics, as PC staff have to perform additional checks using external networks/resources to help determine where connectivity/visibility to the IPs in question stops, and whether the problem is genuine, specific to our network, etc.
2:20pm - Microsoft peering sessions are shut down as a low-risk method to further diagnose the issue. The new preferred path to Microsoft is now over Tata Communications (also on core01.tor1).
2:31pm - PC staff post a tweet to the maintenance Twitter feed ( http://www.twitter.com/pcolomaint ), acknowledging a probable issue accessing routes that travel over TorIX.
2:35pm - PC staff determine core01.tor1 is dropping traffic internally for reasons unknown. All paths to/from core01.tor1 appear to be valid, but traffic is clearly disappearing somewhere. After investigating a number of potential causes, PC staff focus on the Cisco Express Forwarding (CEF) table of core01.tor1.
2:45pm - PC staff note that several of the reported routes transiting core01.tor1 have no corresponding CEF entries in the forwarding table, which results in traffic to those destinations being dropped, or "blackholed", silently on the device.
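For readers curious how a mismatch like this looks in practice: on Cisco IOS platforms it can often be seen by comparing the routing table (RIB) against the CEF forwarding table (FIB) for the same prefix. The prefix, AS number, and output below are purely illustrative (exact output varies by platform and software release); the point is that a route present in the RIB may have no usable FIB entry:

```
! Hypothetical illustration only - prefix/AS/output are examples.
! The route exists in the routing table (RIB):
router# show ip route 192.0.2.0
Routing entry for 192.0.2.0/24
  Known via "bgp 64500" ...

! ...but the CEF (FIB) lookup for the same prefix does not resolve,
! so traffic toward it is silently dropped:
router# show ip cef 192.0.2.0
```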
2:50pm - Twitter post updated to indicate we're certain the issue is with core01.tor1 and that we've identified issues within its CEF tables, along with plans for an immediate reboot of the device to resolve the issue.
2:53pm - Tata's transit connection to core01.tor1 is shut down; preparations to reboot the device are made internally.
3:00pm - All TorIX peering sessions are now shut down.
3:06pm - All internal customer, transit, and peering routes on core01.tor1 have been successfully shut down; traffic is effectively "steered" away from the device. Additional investigation commences before the reboot is finalized.
3:08pm - Twitter post updated to reflect the partial fix: traffic has been steered away from core01.tor1, etc. Most customers saw any observed issues cease at this point.
3:19pm - Reboot command is issued on core01.tor1; the router starts its shutdown process.
3:31pm - Core01.tor1 is back online and begins re-initiating its routing tables/BGP sessions/etc. Confirmation of this is tweeted.
3:35pm - PC staff note that the core01.tor1 router is being overwhelmed while reloading all of its tables/paths/peers/etc. Some paths experience blips/flaps as core01.tor1 struggles to keep up with the quantity of new routes/paths/etc. being fed to it upon re-connection to the network.
3:40pm - Update posted to Twitter indicating core01.tor1 is back online, but struggling to load all routes, which may result in route flaps/lag as the router finishes bringing itself online.
4:00pm+ - Routes (transit/peering/etc.) are slowly reintroduced to core01.tor1, to spread out the load and prevent the router from being overwhelmed.
In the ~15 minutes between all traffic being steered away from core01.tor1 and the device actually being rebooted, PC staff determined the cause of the CEF table errors was almost certainly uptime related. Core01.tor1 had an uptime of approximately 2 years, 11 months. Unfortunately, a known side effect on any routing platform is that excessive uptime on an individual device can lead to unpredictable behaviour (similar to most PCs/etc.). These sorts of errors are known to admins, but are not well documented or understood beyond their existence: they're difficult to reproduce given the amount of time required for them to occur in the first place, and to recur after a reboot (IE: years), which prevents meaningful troubleshooting.
Several customers had asked about logging or detecting missing routes of this nature in CEF. Unfortunately, no technical solution exists to do so in real time. All of the core routers already run CEF table consistency checks (designed to identify and notify of issues with the CEF tables), but those consistency checks can only run periodically, not in real time.
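For reference, the periodic checks mentioned above are configured along the following lines on Cisco IOS platforms. This is a hedged illustration only - the exact command keywords, check types, and scan intervals vary by platform and software release, so consult the documentation for your specific device:

```
! Illustrative only - keywords/options vary by platform and IOS release.
! Enable periodic background consistency checking of the CEF table:
router(config)# ip cef table consistency-check

! Review any inconsistencies the background scanners have found:
router# show ip cef inconsistency
```

Because these scanners sweep the tables on a timer rather than validating every forwarding decision, an entry can go missing (and silently blackhole traffic) well before the next scan reports it.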
With most CEF issues, visibility only occurs once traffic starts to disappear on a device, and the only known certain way to resolve the issue is to reboot the device, ensuring the tables are rebuilt from scratch and eliminating the current errors from memory.
For customers interested in understanding a bit more about CEF/etc., a short write up can be found here: https://routing-bits.com/2009/06/02/understanding-cef/
Ultimately since many of these errors occur on routers with very long uptimes, the easiest mitigation is to reboot the devices periodically (IE: every year or two).
Typically this is mitigated by the need to upgrade operating systems (which results in reboots). In instances where an upgrade is not necessary (from a feature or security standpoint, typically on platforms well into their life-cycles, IE: the current core routers), we will need to consider rebooting routers simply to prevent them from reaching such long uptimes.
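The scheduling side of this mitigation is simple enough to sketch. The snippet below is a minimal, hypothetical illustration (the function name, threshold, and router names are ours, not part of any existing tooling), assuming uptimes have already been collected by some other means (e.g. SNMP sysUpTime):

```python
from datetime import timedelta

# Hypothetical threshold, per the "every year or two" guidance above.
REBOOT_THRESHOLD = timedelta(days=365)

def routers_due_for_reboot(uptimes):
    """Given {router_name: uptime as timedelta}, return the names of
    routers whose uptime meets or exceeds the reboot threshold.

    How uptimes are gathered (SNMP, NETCONF, etc.) is out of scope;
    this only illustrates the scheduling check itself.
    """
    return sorted(name for name, up in uptimes.items() if up >= REBOOT_THRESHOLD)

# Example: an uptime like core01.tor1's (~2 years, 11 months) is flagged.
uptimes = {
    "core01.tor1": timedelta(days=1065),
    "core02.tor1": timedelta(days=120),
}
print(routers_due_for_reboot(uptimes))  # ['core01.tor1']
```

In practice the flagged routers would then be drained (routes steered away, as in the timeline above) and reloaded during a maintenance window rather than rebooted blindly.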
Please let me know directly if you require further details, or have any questions, or concerns regarding either this event, or this post-mortem. Once again I'd like to extend my apologies for any inconvenience caused by this event.