Tor2 - Spring-2020 - Network stability issues

Post by **porcupine** » Sun Aug 16, 2020 5:14 am

Hi Folks,

As some customers will have noticed, over the past several months the tor2 facility has experienced several minor but still inconvenient routing blips. These have been the result of hardware faults, initially on the core03.tor2 core router (on 26-Apr-2020 and 13-Jul-2020), but then appearing on the core04.tor2 and dist04.tor2 routers in rapid succession (on 02-Aug-2020 and 04-Aug-2020 respectively).

Since only BGP customers are directly connected to the core routers, and very few customers are exclusively connected to the dist04.tor2 router, most customers in the tor2 facility experienced brief blips instead of prolonged outages. Initially, with only core03.tor2 affected, we were looking at the problem with a very narrow scope, assuming a hardware issue specific to that device. We had taken a number of measures to address it, including replacing the supervisor, preparing a standby/spare chassis, and loading the swapped-out supervisor/cards into that spare chassis for extended diagnostics.

When dist04.tor2 experienced a similar issue, and core04.tor2 rebooted unexpectedly 2 days later, we knew the issue was something bigger. With the issue now presenting on multiple unique pieces of hardware, spread across various line-card configurations, supervisor types, chassis types, etc., the additional incidents provided our troubleshooting efforts with considerably more information.

Based on this development, we've concluded the most probable common factor among the three pieces of hardware is the additional memory in the supervisor cards. When the cards were installed, additional memory had been purchased to bring all of the supervisors to their platform maximum capacity, to safeguard against potential capacity concerns down the road. All of the memory was sourced from the same supplier in the same purchase/batch (including a number of spare modules), and was installed into each of the supervisors in question. Given that dist04.tor2 complained about correctable ECC errors immediately prior to reloading, we're reasonably certain the supervisor memory is at fault.
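
For anyone curious how this kind of common-factor elimination works, here is a minimal sketch in Python. The three device names are real, but every attribute value below (chassis, supervisor, line-card layout, memory batch) is a made-up placeholder for illustration, not our actual inventory, and the `common_factors` helper is hypothetical. The idea is simply to keep only the attributes that every affected device shares:

```python
# Hypothetical inventory: all attribute values are illustrative placeholders,
# NOT the real hardware configuration of these routers.
failed_devices = {
    "core03.tor2": {"chassis": "chassis-A", "supervisor": "sup-X",
                    "line_cards": "layout-1", "memory_batch": "batch-1"},
    "core04.tor2": {"chassis": "chassis-A", "supervisor": "sup-Y",
                    "line_cards": "layout-2", "memory_batch": "batch-1"},
    "dist04.tor2": {"chassis": "chassis-B", "supervisor": "sup-X",
                    "line_cards": "layout-3", "memory_batch": "batch-1"},
}


def common_factors(devices):
    """Return the attribute/value pairs shared by every device."""
    attribute_sets = [set(attrs.items()) for attrs in devices.values()]
    if not attribute_sets:
        return {}
    # Intersect the (attribute, value) pairs across all affected devices.
    return dict(set.intersection(*attribute_sets))


print(common_factors(failed_devices))  # {'memory_batch': 'batch-1'}
```

With these placeholder values, the chassis and supervisor types each match on only two of the three devices, so the memory batch is the only attribute left standing. A shared attribute is a suspect rather than a proven cause, which is why the ECC errors mentioned above matter as corroborating evidence.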

Preventative Measures:
Going forward, several measures are being taken to address this issue:
#1 - We will be replacing the memory in all of the impacted (and spare) supervisors with new memory from a different vendor. While the original vendor has offered to replace the modules, we don't trust the source at this point in time.
#2 - We have already loaded secondary supervisors into the dist04.tor2 and core04.tor2 routers, running in hot-standby on each. We will be doing the same for the core03.tor2 and dist03.tor2 routers when the time comes. This was not done before, as it's a reasonably expensive option on this platform (in terms of hardware costs, and also in terms of slot consumption, power consumption, system capability, etc.), but at this point in time, we feel it's the best path forward.
#3 - The secondary supervisor in each router will be configured in SSO mode (aka Stateful Switch-Over, or "hot-standby" mode).
#4 - Each router will have the NSF (Non-Stop Forwarding) and NSR (Non-Stop Routing) configurations implemented, which should ensure that if a router fails over from its primary to its secondary supervisor, switching, BGP sessions, etc. are not reset, making the transition as seamless as possible.

We believe this will provide the most redundancy possible on the current platform, and increase network stability in the tor2 facility. This will involve several maintenance periods (as replacing memory and enabling NSF/NSR modes both require a chassis reboot), and we will be scheduling the maintenance on both dist04.tor2 and core04.tor2 immediately, with core03.tor2 to follow at a later date.

Last, but not least, we apologize for the delay in publishing this information. We always strive to put out reports ASAP after any service-impacting incidents. Unfortunately, when following up on this incident, we spent a considerable amount of time waiting on vendors for information, confirmations, etc., which delayed finalizing the planned preventative measures.