25-Mar-2009 core network incident post-mortem/Scheduled Main

porcupine
Site Admin
Posts: 711
Joined: Wed Jun 12, 2002 5:57 pm
Location: Toronto, Ontario

Post by porcupine »

Hello Everyone,

As promised, we have been diligently researching the connectivity issue that affected the core01.tor1 and core02.tor1 core routers on 25-Mar-2009, resulting in ~30 minutes of downtime for many PC customers.
As a result, we are scheduling maintenance on the core02.tor1 core router on 05-Apr-2009, between 11:00pm and 2:00am EST the following day. We will be taking the core02.tor1 router offline to change the IOS (Cisco operating system) image from the SX version to the SR version (a different variant of the same system).

Customers who are not directly/physically connected to the core02.tor1 router (aka non-BGP customers) should not be impacted by this maintenance, but should still treat this as a precautionary maintenance window, given the nature of this work in our core network.

Our findings, the sequence of events, and the intended fixes are as follows (for interested parties):


----- 25-Mar-09 - Order of Events (EST time-zone) -----
9:06pm - A desync between the RP and FIB tables presents itself on core01.tor1, and is logged
9:06pm - 9:20pm - core01.tor1 > core02.tor1 BGP sessions reset due to the core01.tor1 RP/FIB desync. BGP sessions re-establish without issue.
10:33pm - Monitoring indicates initial connectivity issues, investigation commences
10:34pm - core02.tor1 vanishes from the network. No crash logs are present, nor are any SNMP or other logs recorded internally or externally
10:37pm - Multiple network admins are dispatched to site to investigate immediately
10:40pm - Building personnel are called to the site to check HVAC/UPS for issues
10:44pm - Network admins note that core01.tor1 is still pingable, but not passing traffic
10:51pm - Building personnel are instructed to hard-reboot core02.tor1 after verifying other infrastructure is intact.
11:03pm - core01.tor1 is also hard-rebooted
11:07pm - core01.tor1 completes booting to OS, BGP sessions are re-established, service is restored
11:10pm onwards - Post-incident troubleshooting commences. core01.tor1 is taken down for emergency diagnostics (and an IOS upgrade), then brought back online later that night. Investigation continues
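
For anyone who wants to check the timings, here is a rough sketch that works out the elapsed minutes between entries in the timeline above. The helper names (to_minutes, minutes_between) are made up for illustration, and it assumes every entry falls on the same calendar day, which holds for the 25-Mar-09 events listed here.

```python
# Entries copied from the order of events above (all 25-Mar-09, EST).
EVENTS = [
    ("10:33pm", "monitoring indicates initial connectivity issues"),
    ("10:34pm", "core02.tor1 vanishes from the network"),
    ("10:51pm", "building personnel instructed to hard-reboot core02.tor1"),
    ("11:03pm", "core01.tor1 is also hard-rebooted"),
    ("11:07pm", "BGP sessions re-established, service restored"),
]


def to_minutes(clock: str) -> int:
    """Convert a 12-hour clock string such as '10:33pm' to minutes past midnight."""
    suffix = clock[-2:].lower()
    hours, minutes = (int(part) for part in clock[:-2].split(":"))
    # 12am is hour 0 and 12pm is hour 12, hence the modulo before the offset.
    hours = hours % 12 + (12 if suffix == "pm" else 0)
    return hours * 60 + minutes


def minutes_between(start: str, end: str) -> int:
    """Minutes elapsed between two clock strings on the same day (assumption)."""
    return to_minutes(end) - to_minutes(start)


if __name__ == "__main__":
    first = EVENTS[0][0]
    for clock, description in EVENTS:
        elapsed = minutes_between(first, clock)
        print(f"{clock:>8}  +{elapsed:2d} min  {description}")
```

Measured from the first monitoring alert (10:33pm) to service restoration (11:07pm), this gives 34 minutes, in line with the ~30 minutes of downtime quoted above.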


----- Determining causes -----
core01.tor1 - RP/FIB desync errors have been reported in the past as a result of poorly seated line-cards. We suspect that work done on the HVAC system on 24-Mar-2009, which temporarily lowered the temperature in the 818 suite, may have caused minor thermal contraction and expansion of said line-cards, shifting the cards within the chassis. Errors of this nature had not been recorded prior to this date.

core02.tor1 - We believe core02.tor1 was affected by a Cisco bug (Bug ID: "CSCsy01292"), given the lack of a crash info file and the lack of activity in *any* logs once the device disappeared. This is the only bug with matching symptoms (or lack thereof), and unfortunately it has no official resolution in our current/existing IOS version.


----- Going Forward -----
- core01.tor1 will be taken down to have all line-cards and the chassis cleaned/reseated.
- core02.tor1 will be taken down to upgrade from the SX to the SR IOS version. The "CSCsy01292" bug is not known to affect the SR IOS versions.
- core01.tor1 already had its IOS brought up to the latest version last Wednesday (25-Mar-09) (precautionary)
- IPv6 support has been removed from core02.tor1 until further notice (precautionary)
- Additional emergency physical access procedures have been added for building personnel (precautionary)

Work on the core01.tor1 router will be scheduled in the coming weeks, once various other processes complete, to minimize the impact to customers who are physically connected to it.

I hope this provides a satisfactory answer to anyone who has been worried about this incident or has been tracking it. If you have any questions, please do not hesitate to contact me directly.

Regards,
Myles Loosley-Millman
Priority Colo Inc.
myles@prioritycolo.com
http://www.prioritycolo.com

Post by porcupine »

Hi Guys,

Just to let everyone know, the network maintenance scheduled for 05-Apr-2009, 11:00pm - 2:00am EST, has concluded as per schedule.

I am pleased to report that the maintenance was completed successfully: full diagnostics were run on the core02.tor1 router (no issues found), and it has been upgraded to the SR IOS train without issue.

Regards,
Myles Loosley-Millman
Priority Colo Inc.
myles@prioritycolo.com
http://www.prioritycolo.com