03-May-2015 - core02.tor1 router unavailability - postmortem


Post by porcupine »

Hello Everyone,

The following postmortem is a breakdown of the specific timings, events, and details surrounding the incident on 03-May-2015, in which the core02.tor1 core router went offline at approximately 4:06AM Eastern time:

Timeline of Events:
4:06AM - SNMP logs from core02.tor1 indicate ISIS adjacency changes; some of the internal links between the core and distribution layers are flapping (representative log messages are shown below the timeline).
4:08AM - SNMP logs indicate that the BGP session between core01.tor1 and core02.tor1 has dropped due to timeout.
4:09AM - Nagios alerts PC staff via emergency pager to connectivity issues with the core routers; PC staff begin investigating.
4:10AM - 4:20AM - Issues are narrowed down to the core02.tor1 device: sessions drop, BGP peers begin to flap, the device cannot keep up with table loads, and a cascading effect occurs. Traffic external to core02.tor1 begins routing away from the device once its BGP sessions flap, as it is no longer a valid path.
4:26AM - Efforts to bring the device under control are proving ineffective. The core02.tor1 router is issued a soft reload/reboot command to recover it from its current state.
4:35AM - PC staff are dispatched to the site after the expected reload time has elapsed without results.
4:50AM - PC staff arrive on-site and start troubleshooting a presumed physical issue. It becomes apparent that the CF card housing core02.tor1's IOS image has failed. A spare CF card is written with the appropriate IOS image files and inserted into core02.tor1.
5:10AM - The core02.tor1 router comes back online successfully: routing reconverges, BGP sessions re-establish, and impacted BGP customers (those connected directly to core02.tor1) are back online.
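
For reference, the kind of messages seen in the 4:06AM-4:08AM window look roughly like the IOS syslog lines below. The neighbor names, addresses, and interfaces here are illustrative placeholders, not our actual logs:

  %CLNS-5-ADJCHANGE: ISIS: Adjacency to dist01.tor1 (TenGigabitEthernet1/1) Down, hold time expired
  %BGP-5-ADJCHANGE: neighbor 192.0.2.1 Down BGP Notification sent
  %BGP-3-NOTIFICATION: sent to neighbor 192.0.2.1 4/0 (hold time expired) 0 bytes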

Preventative Measures:
The incident in question occurred because BGP loads tables in a serial manner: while a table is loading, other updates (including keepalives from other peers) are not processed until the load completes. When there are too many routes to load in time, one BGP session can hit its hold-time limit and drop before another has finished loading, and the dropped session's table must then be reloaded in turn, causing a cascading failure. A rough timing sketch of this mechanism follows.
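
The sketch below is purely illustrative; the numbers are assumptions rather than measurements from our equipment (a full IPv4 table was roughly 550k routes as of May 2015, and the common IOS default BGP hold time is 180 seconds):

  t=0s       Peer A's session resets; its full table begins reloading serially.
  t=0-180s   The router works through Peer A's updates, so keepalives
             from Peer B go unprocessed.
  t=180s     Peer B's hold timer expires and its session drops; Peer B's
             table now also needs reloading, lengthening the backlog.
  t>180s     Each drop adds more load, so further sessions time out in
             turn - a cascading failure.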

As a result of this, we have performed some optimizations with regard to how routes are received, and reduced the table size on the core devices by approximately 20-30% (depending on the BGP session in question). This should reduce the resources, chiefly the processing time, needed to load those tables, and hopefully prevent any incidents like this going forward.
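
For the curious, the sketch below shows one common way this kind of trimming is done in IOS-style configuration: filtering out very long prefixes and capping the prefix count per session. The AS numbers, peer address, prefix-list name, and limits are hypothetical - a generic illustration rather than our exact change:

  ! Accept only prefixes of length /24 or shorter from this peer
  ip prefix-list TRANSIT-IN seq 5 permit 0.0.0.0/0 le 24
  !
  router bgp 64500
   neighbor 203.0.113.1 remote-as 64501
   neighbor 203.0.113.1 prefix-list TRANSIT-IN in
   ! Tear the session down if the peer offers more than 600k prefixes
   ! (a warning is logged at 90% of the limit)
   neighbor 203.0.113.1 maximum-prefix 600000 90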

We have also populated both CF disk slots on the device with appropriately imaged flash cards and boot configuration, allowing failover in the event of a CF card failure.
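
As a rough illustration, dual-slot boot failover on IOS platforms of this type is typically arranged with ordered boot statements; the slot names and image filename below are examples, not our exact configuration:

  ! Try the image on disk0: first; if that card has failed,
  ! fall back to the identical copy on disk1:
  boot system flash disk0:ios-image.bin
  boot system flash disk1:ios-image.bin
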
Myles Loosley-Millman
Priority Colo Inc.
myles@prioritycolo.com
http://www.prioritycolo.com