24-Oct-2021 - Core01.tor1 routing issues - Post-mortem



Post by porcupine »

Hi Folks,

During the morning hours of Sunday 24-Oct-2021, some customers experienced issues with routes transiting our core01.tor1 router, which was silently dropping traffic on a handful of paths through the device.

Our core01.tor1 router faces our Tata transit circuit, along with our peering on the Toronto Internet Exchange (TorIX). As such, customers experienced issues connecting to several external routes, whose traffic was being silently dropped by the device until the issue was reported and isolated.

The following post-mortem is a breakdown of the specific timing, events, and details surrounding this incident; all timestamps are in Eastern (local) time:

Timeline of Events:

8:50am - PC staff receive a report from a customer having difficulty reaching a remote IP address. Staff test; trace-routes make it out of the network but do not appear to reach the final target.
8:55am - Staff receive a second report of an unreachable IP address; this one involves a route seen over TorIX.
9:08am - Staff note that both trace-routes make it out of the network but appear to die before the expected destination(s). Since both trace-routes make it out, staff suspect the issue might be with a path back into the network, so the transit links over Tata and Level3 are gracefully shut down, one at a time.
9:14am - After bouncing the transit sessions above, staff note they can now connect to the IP that was previously unreachable over TorIX.
9:36am - A new customer reports an unreachable IP originating from Cogent's network. Staff consider the possibility that this is a remote outage, since no local issues have been identified, no other reports have come in, and the impacted customers are unable to provide diagnostic information or identify any other problem routes. Staff continue to investigate.
10:20am - PC staff verify that the CEF entries on all core routers appear to be intact. When checking the impacted IPs, all have normal CEF entries, "sh ip route" produces expected results, etc.
10:42am - As the investigation continues, PC staff begin to suspect the issue might not be remote, but localized to core01.tor1. Metrics are adjusted to steer downstream traffic away from core01.tor1 via other routes as a precaution.
11:24am - A tweet is posted reflecting the current status.
11:46am - BGP sessions between Tor1 and Tor2 which terminate on core01.tor1 are shut down as a precautionary measure.
11:58am - While researching the matter, PC staff discover that a platform (hardware) diagnostic CEF lookup on core01.tor1 produces different results than what the router reports for a normal diagnostic CEF lookup on the impacted IPs. Since this shouldn't be possible, staff search documentation to determine whether the results are real or the check syntax used was flawed.
12:12pm - A new customer reports issues with a specific remote IP address, with trace-routes included showing where the path stops responding in each direction (i.e., full diagnostic information).
12:15pm - The peering session carrying the impacted route is immediately shut down, redirecting traffic off that path (and resolving the issue for that path).
12:30pm - TorIX peering sessions are shut down on core01.tor1, as is the transit route to Tata. All traffic not originating directly from the router is now redirected, and core01.tor1 is effectively isolated, as the previous report has confirmed something is not right.
1:05pm - Since rebooting the router is known to clear the CEF tables, staff determine that rebooting the line-cards one at a time should identify which line-card is the source of the apparent CEF desync. PC staff begin testing by manually power-cycling line-cards one at a time, hoping to find a bad card. The issue remains reproducible after all line-cards have been rebooted; the only remaining card is the main supervisor, so staff decide the supervisor should be replaced.
2:05pm - Staff arrive at the Tor1 facility to physically inspect the chassis in case remote diagnostics missed anything. Nothing unusual is found, so staff travel to the Tor2 facility to fetch a spare supervisor.
4:11pm - Staff return to the Tor1 facility and mount the spare supervisor as a hot-spare in the router. The new supervisor does not behave in the expected manner; staff work to determine whether this is due to the current supervisor operating abnormally, or an issue with the replacement itself.
6:40pm - Staff cannot confirm why the replacement supervisor is not behaving as expected. It's decided not to risk causing another issue while more replacements are available, so staff head back to Tor2 to test the original replacement and pick up a second spare supervisor.
9:45pm - The second replacement supervisor arrives at Tor1 and behaves identically to the previously rejected replacement. Since this second replacement was tested extensively before leaving the Tor2 facility, staff determine that the in-service supervisor must be the one causing issues/operating in an abnormal state.
10:20pm - The core01.tor1 router is shut down, the second replacement supervisor is installed, and the router is brought back online.
10:50pm - The new supervisor is operating normally; basic diagnostic checks are run and no issues are observed. Checks against previously failing routes all return normal results.
11:04pm - Public peering routes are reintroduced to the router; no issues observed.
12:35am - Transit session with Tata is re-enabled on the router; no issues observed.
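
For context on the CEF checks referenced in the timeline, here is a minimal sketch of the kind of software-versus-hardware comparison involved. The exact syntax depends heavily on the platform and supervisor model (neither is stated above); the commands and the example address 192.0.2.1 are illustrative assumptions, not a record of the exact commands used during the incident:

```
! Software view: the routing table (RIB) and the software CEF (FIB)
! entry for a problem destination.
show ip route 192.0.2.1
show ip cef 192.0.2.1 detail

! Hardware view: what the forwarding hardware has actually programmed.
! This is platform-dependent; on a Catalyst 6500-class chassis, for example:
show mls cef lookup 192.0.2.1 detail
```

On a healthy router all of these agree on the next hop. A desync of the kind described above, where the platform-level lookup diverges from the normal CEF lookup, explains why traffic can be silently dropped even while "sh ip route" looks correct.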

Preventative Measures:
The immediate resolution to this issue was to test all of the line-card modules by power-cycling them one at a time, to see if the issue disappeared shortly after each power cycle completed. This did not isolate a faulty line-card, leading us to conclude the supervisor card must be at fault. The supervisor card was replaced with a new supervisor from our inventory of spares.
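
The per-card test described above amounts to one command per slot, followed by re-checking the previously failing routes once the card comes back online. The syntax is again platform-specific and the slot number below is hypothetical; on a Catalyst 6500-class chassis it would look roughly like:

```
! Power-cycle (reset) the line-card in slot 3, then re-test the
! previously failing routes once the card is back in service.
hw-module module 3 reset
```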

Longer term resolution: There are plans to replace the core01.tor1 router with a new router, which will not only provide additional capacity, but more importantly more processing power, which will help keep network convergence times low and reduce future issues. We expect to replace the core01.tor1 router in the immediate future.

Please let me know directly if you require further details, or have any questions or concerns regarding either this event or this post-mortem. Once again, I'd like to extend my apologies for any inconvenience caused by this event.
Myles Loosley-Millman
Priority Colo Inc.