Tor2 Network incident - 08-Jul-2016 - Postmortem

Post by **porcupine** » Fri Jul 08, 2016 9:44 pm

Hi Folks,

As you would expect, we have been reviewing log files, traffic patterns, and other identifying information to diagnose the connectivity incident experienced in our Tor2 facility on 08-Jul-2016. The events surrounding this incident are as follows (all timestamps are in Eastern Time):

08-Jul-2016 Incident:
- 1:30pm - PC monitoring system alerts that traffic on the RBS transport link has dropped to 0kbps. The emergency pager is alerted. Traffic is seamlessly traversing the CDS link as expected.
- 1:33pm - PC staff verify transceiver light levels in both facilities, and contact RBS to report an incident.
- 1:38pm - RBS confirms that there is an issue, either with the WDM platform, or a major fiber cut, but it’s too soon to provide any details. A ticket is logged in their system, and we’re instructed to check back later.
- 1:45pm - While monitoring the situation, PC staff notice a considerable increase in log entries on the core04.tor2 router, which had been left running additional (debug) logging since the previous network maintenances in November and December 2015.
- 1:50pm - PC staff notice that not all is well with the network. BGP sessions are starting to flap between the core01.tor1 router and the core03.tor2 and core04.tor2 routers. Staff begin attempting to isolate the cause.
- 1:55pm - PC staff elect to shut down the clearly flapping BGP sessions, as there are several paths of redundancy available, and this is clearly impacting traffic.
- 2:05pm - PC staff note that some traceroutes are still going via the paths which were shut down several minutes prior. Investigation continues.
- 2:15pm - Several more ISL circuits are shut down, and then returned to service one at a time, trying to isolate what is causing the flapping.
- 2:30pm - PC staff disable the link between the core01.tor1 router and CDS transport in an effort to clear adjacent ISIS neighbors, as this is where several traceroutes are dropping unexpectedly and redundancy is still available.
- 2:37pm - PC staff re-enable the link between the core01.tor1 router and CDS transport fiber, as shutting down this circuit significantly compounded the issue. Up until 2:30pm, undesirable activity was primarily isolated to random BGP sessions flapping, which created appreciable additional latency; shutting down the circuit created a widespread outage in tor2.
- 2:40pm - PC staff observe that full functionality has returned to the tor2 network after the previously downed link is restored, an unexpected result. Resetting a single ISL link should not have this effect, so we are clearly still dealing with a routing anomaly.
- 2:45pm - Network still appears to be fixed. A single customer is reporting issues; investigation to identify and isolate those issues begins.
- 2:53pm - Specific customers’ issues are discovered, and immediately resolved. Everybody on the network is back online, BGP sessions are no longer flapping, everything appears normal, and all parties are reporting normal. RBS transport/fiber is still offline.
- 2:55pm - Traffic is still flowing over the CDS transport circuit, as the RBS circuit is hard down. Follow-up with RBS continues. RBS reports that the outage is the result of a major collision that downed several hydro poles at the intersection of Woodbine and Apple Creek Road, taking down a significant portion of RBS fiber. Over 430 counts of fiber are damaged and being repaired one at a time.
- 9:45pm - RBS reports an ETA of “after midnight” for repair work on the damaged fiber. RBS call center and online presences report considerable outages, affecting customers across the board on their network. The CDS path and our Tor2 facility remain unaffected at this point.
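
For anyone who wants to spot this kind of BGP flapping in their own session logs, the idea can be sketched roughly as below. This is only an illustration, not our monitoring code: the `flapping_peers` function, the `(timestamp, peer, state)` event format, the 5-minute window, and the threshold of 3 are all assumptions made for the example, and the sample events are made up.

```python
from collections import defaultdict, deque


def flapping_peers(events, window_seconds=300, threshold=3):
    """Return peers with at least `threshold` down-transitions in any window.

    `events` is an iterable of (timestamp_seconds, peer, state) tuples,
    ordered by timestamp, where state is "up" or "down". This event format
    is hypothetical; real router logs would need to be parsed into it first.
    """
    recent_downs = defaultdict(deque)  # peer -> timestamps of recent "down" events
    flapping = set()
    for ts, peer, state in events:
        if state != "down":
            continue
        downs = recent_downs[peer]
        downs.append(ts)
        # Discard down-events that have aged out of the sliding window.
        while downs and ts - downs[0] > window_seconds:
            downs.popleft()
        if len(downs) >= threshold:
            flapping.add(peer)
    return flapping


# Made-up sample data: one peer drops three times in two minutes,
# the other drops three times but many minutes apart.
sample_events = [
    (0, "core03.tor2", "down"),
    (0, "core04.tor2", "down"),
    (10, "core03.tor2", "up"),
    (60, "core03.tor2", "down"),
    (70, "core03.tor2", "up"),
    (120, "core03.tor2", "down"),
    (1000, "core04.tor2", "down"),
    (2000, "core04.tor2", "down"),
]
print(flapping_peers(sample_events))
```

A sliding window per peer keeps a single link-down (such as the RBS cut itself) from being reported as a flap, while repeated drops on the same session inside the window are flagged.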

Resulting Actions & Measures:
- Today’s incident clearly reveals that the network changes made in November and December 2015 were ineffective at resolving the traffic issues/routing anomalies which occur when transport circuits go hard down in specific ways.
- Customers should note that during the maintenance on 19-Nov-2015 and the follow-up maintenance on 04-Dec-2015, we simulated a transport failure by manually shutting down the interfaces (several times per period) and let the simulated failure persist for many hours. PC staff were unable to replicate the outages, which led us to believe that we had resolved the routing anomaly. Clearly this is not the case, and the issue is alive and well.
- We will be scheduling an emergency maintenance (notice to come out shortly) for 10-Jul-2016 at 11:59pm, with a window until 3:00am. During this maintenance window, we will be removing the Tor2 facility from AS30176 and moving all of its devices to AS53999 (the ASN originally intended for this facility). ISIS routing between the two facilities will be removed, and the sessions between them will be established as straight eBGP sessions. This will not impact the virtual cross connect network (the VXC network), which was not impacted by this routing anomaly.
- During the aforementioned maintenance, since we need to reboot devices anyway (to change the default ASN on the routers), we will be upgrading the gear to the most recent available/applicable IOS. While we firmly believe the routing issues/anomalies we have experienced are related to the ISIS layer2 routing protocol, we believe updating IOS when the opportunity presents itself is the prudent course of action.
- Impacted customers will receive proactive SLA credits based on the manner in which their services were affected by this incident, as is standard policy for any major outage.
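
To make the planned ASN change a little more concrete: a BGP session is internal (iBGP) when both ends are in the same AS, and external (eBGP) when they are in different ASes. The short sketch below just illustrates that rule using the two ASNs mentioned above. The `session_type` helper and the constant names are hypothetical, and it assumes Tor1 stays in AS30176 after the maintenance.

```python
def session_type(local_asn, peer_asn):
    """Classify a BGP session by comparing the AS numbers of its two ends."""
    return "iBGP" if local_asn == peer_asn else "eBGP"


# Assumption for illustration: Tor1 remains in AS30176 throughout.
TOR1_ASN = 30176
TOR2_ASN_BEFORE = 30176  # Tor2 currently shares the AS with Tor1
TOR2_ASN_AFTER = 53999   # ASN originally intended for Tor2

print("Tor1 <-> Tor2 before maintenance:", session_type(TOR1_ASN, TOR2_ASN_BEFORE))
print("Tor1 <-> Tor2 after maintenance: ", session_type(TOR1_ASN, TOR2_ASN_AFTER))
```

With the two facilities in separate ASes, the inter-facility sessions no longer rely on a shared IGP between the sites, which is the point of removing ISIS between them.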

Please let me know directly if you require further details, or have any questions or concerns regarding either this event or this postmortem. Once again, I'd like to extend my apologies for any inconvenience caused by this event.