As some customers will have noticed, the Tor1 facility experienced an unexpected network incident on 09-Oct-2018, causing multiple brief periods of connectivity issues and elevated latency for many customers in the Tor1 facility. The following postmortem is a breakdown of the specific timing, events, and details surrounding the incident, in which the core01.tor1 router impacted connectivity to the Tor1 network. The events surrounding this incident are as follows; all timestamps are in Eastern Time:
Timeline of Events:
2:15pm - Core01.tor1 - Logs indicate a memory allocation error on the supervisor. Cause: memory fragmentation (indicating memory depletion).
2:15pm - Core02.tor1 - BGP error notification received from core01.tor1, BGP update is malformed and discarded.
2:16pm - Core01.tor1 - A fatal error is triggered attempting to insert an IPv4 prefix into the Cisco Express Forwarding (CEF) table.
2:16pm - Core01.tor1 - CEF is disabled due to fatal error (above).
2:16pm - Core01.tor1 - The BGP process realizes it is out of memory and attempts to re-initialize itself. Since BGP controls routing, this starts shutting down traffic to core01.tor1 (traffic from the distribution layer [dist01.tor1/dist02.tor1] continues to flow via core02.tor1, with some traffic getting trapped on core01.tor1 as it shuts down sessions).
2:16pm - Dist01.tor1 & Dist02.tor1 - Distribution layer routers/switches start to see BGP sessions to core01.tor1 flap. Sessions to each distribution router flap another half dozen times over the following 40 minutes (flapping approximately every 7-8 minutes).
2:17pm - Core02.tor1 - BGP triggers a critical allocation failure when trying to allocate memory for a route (due to fragmentation). This appears to be related to the sudden, repeated, and significant shift of routes caused by core01.tor1's unexpected behavior; core01.tor1 is aggressively flapping various internal interfaces/sessions at this point.
2:17pm - Core02.tor1 - BGP drops, then immediately re-establishes, multiple sessions with random peers. This is the result of the resource consumption involved in repeatedly setting up and tearing down sessions over a short period of time (from core01.tor1's flapping, not only of the direct interfaces, but also transit routes, ISLs, etc.).
2:20pm - Core02.tor1 - Router is observed passing BGP errors from itself to other connected peers (malformed updates), likely in relation to the above events. Malformed updates do not drop traffic/sessions, but do show as errors.
2:20pm - Core01.tor1 - BGP peers begin the process of shutting down, and CPU load has spiked. The IS-IS adjacency (L2 routing) between core01.tor1 and core02.tor1 disconnects/reconnects several times as a result of insufficient resources.
2:23pm [to ~2:50pm] - Core01.tor1 - The router has largely isolated itself from the network. After shutting down and restarting its BGP process, it cannot get new sessions fully established due to resource under-run. The router mostly flaps sessions up/down, as it runs out of resources/time before most sessions can stabilize, thus restarting the cycle. Most traffic is bypassing core01.tor1, because it doesn't remain online long enough for routes to establish and traffic to shift back.
2:25pm - PC staff start receiving automated alerts indicating various network issues; the monitoring server hangs off the core, so various checks fail as traffic is dropped. Initial investigation begins.
2:29pm - PC staff gain remote access to both core routers via out-of-band access; both systems are very sluggish due to obvious resource strain.
2:34pm - While the problem is being remotely discussed & diagnosed, PC staff are dispatched to the site as a precaution.
2:37pm - Core02.tor1 - Core02.tor1 loses all routes for 8 seconds as BGP sessions flap a second time.
2:42pm - Core02.tor1 - Low memory errors are recorded in log.
2:53pm - PC staff arrive on-site, and check the physical conditions in relation to the network core.
2:56pm - Core01.tor1 - PC staff working on core01.tor1 via out-of-band access determine it is best to simply soft-reload the device to flush out the memory, since it is already effectively offline and isolated. A reload command is issued.
2:58pm - Core01.tor1 - Powers down for reboot.
3:05pm - Core01.tor1 - Confirms power on after cold-start, basic card diagnostics have passed without error.
3:09pm - Core01.tor1 - BGP sessions are back up, core01.tor1 is now stable.
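The flap cadence described in the timeline (sessions dropping roughly every 7-8 minutes as the router exhausts its resources) is the kind of pattern that can be spotted quickly by measuring the gaps between adjacency-change log entries. Below is a minimal sketch of that idea; the syslog excerpt, its format, and the neighbor address are hypothetical illustrations, not actual core01.tor1 output:

```python
from datetime import datetime

# Hypothetical syslog excerpt (format, times, and neighbor address are
# illustrative only -- not actual core01.tor1 logs).
LOG = """\
14:16:10 %BGP-5-ADJCHANGE: neighbor 10.0.0.2 Down
14:16:40 %BGP-5-ADJCHANGE: neighbor 10.0.0.2 Up
14:23:55 %BGP-5-ADJCHANGE: neighbor 10.0.0.2 Down
14:24:30 %BGP-5-ADJCHANGE: neighbor 10.0.0.2 Up
14:31:20 %BGP-5-ADJCHANGE: neighbor 10.0.0.2 Down
"""

def flap_intervals(log: str, neighbor: str) -> list:
    """Return minutes between successive Down events for one neighbor."""
    downs = []
    for line in log.splitlines():
        if "ADJCHANGE" in line and neighbor in line and line.endswith("Down"):
            # First whitespace-separated field is the HH:MM:SS timestamp.
            downs.append(datetime.strptime(line.split()[0], "%H:%M:%S"))
    return [(b - a).total_seconds() / 60 for a, b in zip(downs, downs[1:])]

# Each interval here lands in the ~7-8 minute band noted in the timeline.
print(flap_intervals(LOG, "10.0.0.2"))
```

A real deployment would read from the monitoring server's syslog feed and alert when intervals become both short and regular, which is the signature of a resource-starved restart cycle rather than a one-off link event.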
This incident is the result of a sudden and critical memory under-run scenario, where the available system memory was unable to accommodate the router's operating needs. The free memory pools on both Tor1 core routers have been steadily diminishing, primarily due to IPv4 prefix de-aggregation and table growth, as would be expected of any BGP router on the Internet today. The free memory pool on both cores is closely monitored (and graphed) by PC staff. The rate of free memory depletion has been consistent over a period of years and was not projected to become problematic for several more months.
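For illustration, the kind of projection described above can be sketched as a simple linear fit over periodic free-memory samples; the sample data and threshold below are synthetic, not the actual figures from the Tor1 cores:

```python
# Illustrative sketch (not PC's actual monitoring code): given periodic
# samples of free memory, fit a linear depletion rate via least squares
# and project how long until the pool crosses a critical threshold.

def project_exhaustion(samples, threshold):
    """samples: list of (day, free_mb) pairs. Returns days from the last
    sample until free memory is projected to reach `threshold`, assuming
    a linear trend, or None if memory is not depleting."""
    n = len(samples)
    mean_x = sum(d for d, _ in samples) / n
    mean_y = sum(m for _, m in samples) / n
    slope = (sum((d - mean_x) * (m - mean_y) for d, m in samples)
             / sum((d - mean_x) ** 2 for d, _ in samples))
    if slope >= 0:
        return None  # flat or growing free pool: nothing to project
    last_day, last_free = samples[-1]
    return (threshold - last_free) / slope

# Synthetic data: starting at 500 MB free, losing ~2 MB/day.
data = [(d, 500 - 2 * d) for d in range(0, 120, 10)]
print(project_exhaustion(data, 100))  # -> 90.0 (days remaining)
```

The limitation this incident exposed is that a steady long-term trend says nothing about sudden spikes: a projection can show months of headroom while a burst of route churn exhausts the remaining pool in minutes.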
We have temporarily disabled some non-essential features on the core01.tor1 router to mitigate the immediate issue by reducing resource strain. A chunk of memory was also freed during the reload/reboot, which further stabilizes the system.
As the supervisor cards in both core routers are already at their maximum memory configuration, the resolution will be to replace the supervisor cards entirely. An upgrade of this nature was already planned before year-end; this unexpected event has pushed our time-frame forward. We will shortly announce a maintenance window to address core01.tor1's upgrade, which will be carried out promptly.
Please let me know directly if you require further details or have any questions or concerns regarding either this event or this postmortem. Once again, I'd like to extend my apologies for any inconvenience caused by this event.