Tor2 Network incident - 30-Oct-2015 - Postmortem


Post by porcupine »

Hi Folks,

As you would expect, we have been researching, troubleshooting, and otherwise reviewing the log files, traffic patterns, and other diagnostic information regarding the connectivity incident experienced in our Tor2 facility on 30-Oct-2015. The information/events surrounding this incident are as follows; all timestamps are in Eastern Time:


30-Oct-2015:
- 8:31am - PC monitoring system alerts that traffic on the RBS transport link has dropped to 0kbps (a rough sketch of this style of rate check appears after this timeline). The emergency pager is alerted. Traffic is seamlessly traversing the CDS link as expected.
- 8:33am - PC staff begin diagnosing the RBS link. Both PC-RBS switches have light on the 10G wave transceivers, but both sides show the interface as down. This rules out a local optical/switch failure, so RBS support is contacted.
- 8:50am - PC staff reach RBS support. RBS support believes there has been a fiber cut at a POP in Concord, Ontario. PC is added to the existing RBS master support ticket.
- 9:03am - PC staff note that traffic is looping through the network in an unexpected manner. Traffic is still passing through the RBS switch, despite the 10G wave being down: it is forwarding from the RBS switch to the tor1 VXC platform, over the VXC platform to CDS, then back over CDS and the tor2 VXC platform to the tor2 RBS switch. No issues with this behaviour are noted/detected at the time; the exact time is recorded for later internal investigation.
- 11:10am - One customer in tor2 reports that they cannot contact their tor1 servers via their external/routed interfaces, though their servers are otherwise still online and VXC traffic is still passing. This condition is not visible from the PC monitoring servers (either Nagios or the primary SNMP monitor). This is when PC becomes aware of a service-impacting issue and troubleshooting begins. The customer notes this behaviour had been occurring for approximately 80 minutes prior to reporting.
- 11:30am - PC staff determine the missing/blocked traffic is very likely related to the strange routing on the RBS switch observed earlier. The 10GE interface between the RBS switch and the VXC switch is shut down to force traffic to take the proper path (through CDS directly via core01.tor1 > core03.tor2).
- 11:30am - vxc01.tor1 switch reports an unknown ISIS routing error. [This was not observed at the time, but was identified in later log analysis.]
- 11:35am - Traffic between the two facilities does not reconverge over the CDS link. Neither Nagios nor the primary SNMP monitor can access the majority of Tor2. For reasons unknown, dist04.tor2 is still accessible, leading to additional confusion.
- ~11:40am - After troubleshooting and expected fixes fail to resolve the routing issue, PC staff re-enable the 10GE port between the VXC switch and the RBS switch.
- ~11:40 - 11:45am - The CDS core01.tor1 > core03.tor2 BGP link is shut down to force traffic back over the RBS > VXC path, as traffic has not transitioned on its own. This works and brings ISIS connectivity back up, but impacts BGP customers in tor2. PC has a BGP session with CDS for transit in Tor2, but BGP customers in Tor2 still report being offline.
- ~12:10pm - 1:00pm - PC staff are troubleshooting the various routing anomalies, attempting to figure out why traffic is dropping, routing sessions are flapping, etc. Some quick temporary fixes are put in place. Directly routed customers are online, but BGP customers appear to be offline.
- ~1:05pm - PC determines CDS transit isn't accepting some of the customer prefixes. The BGP link between core01.tor1 > core03.tor2 is re-enabled to bring BGP customers back online. This affects the directly routed customers, who cannot push traffic on their external interfaces between the two facilities, but are otherwise online. BGP customers are once again online.
- 1:50pm - The RBS 10GE wave comes back online. BGP customers take a hit as traffic reconverges, ports are re-enabled, and temporary routing fixes are removed.
- 2:00pm - The event is resolved and customers are back online. Investigation continues.
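
For those interested in the style of check behind the 8:31am alert, the following is a minimal, hypothetical Python sketch of how a monitor can derive a link rate from two samples of an interface's ifHCInOctets counter and page when the rate falls to zero. The interface names, counter values, and polling interval are illustrative placeholders, not our actual monitoring code or data.

    # Hypothetical sketch only: derive a link rate from two octet-counter
    # samples and flag a link that has gone to 0 kbps. Not PC's monitoring code.

    COUNTER_MAX = 2 ** 64  # ifHCInOctets is a 64-bit counter


    def rate_kbps(first_octets: int, second_octets: int, interval_s: float) -> float:
        """Rate observed between two counter samples taken interval_s seconds apart."""
        delta = (second_octets - first_octets) % COUNTER_MAX  # tolerate counter wrap
        return delta * 8 / interval_s / 1000


    def check_link(name: str, first: int, second: int, interval_s: float = 60) -> str:
        """Return a Nagios-style status line for the observed link rate."""
        rate = rate_kbps(first, second, interval_s)
        if rate == 0:
            return f"CRITICAL: {name} at 0 kbps - page on-call"
        return f"OK: {name} at {rate:.0f} kbps"


    if __name__ == "__main__":
        # Illustrative sample values only: an idle link shows no counter movement.
        print(check_link("RBS 10G wave (tor2)", 1000000, 1000000))
        print(check_link("CDS transport (tor2)", 1000000, 46000000))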



Resulting Actions & Measures:

- PC staff have been in communication with CDS, and have had CDS repair errors in their prefix filters, to ensure that BGP customers are fully/appropriately visible from the tor2 CDS transit circuit (a rough sketch of this kind of prefix coverage check appears after this list). This has been communicated with full details to the impacted BGP customers directly.
- RBS support has been contacted, and additional inquiries have been made regarding the fiber cut. Ultimately, however, the fiber cut was a catalyst and not the core cause of this event. Fiber cuts happen, and our network was designed with this in mind (diverse routes, entrances, transport, carriers, etc.).
- After extensive investigation, we believe the initial routing incident was the result of non-VXC traffic making its way onto the VXC platform. This was isolated to a configuration error on the trunked ports (introduced when Rogers was upgraded to a 10G wave with a dedicated set of switches in July of 2015, and corrected on Saturday night amid our postmortem investigation).
- Despite extensive searches, we have been unable to locate any information regarding the ISIS error that occurred when the 10GE link between vxc01.tor1 and the RBS-tor1 switch was shut down. This error has not been encountered by any of the network operations community we are in contact with, and does not appear in the Cisco TAC support database (i.e. no previously known cases). Unfortunately this is a dead end; we believe it was the result of the traffic looping through the VXC in an unintended manner, and we do not expect it to recur.
- We will be scheduling a maintenance window to further test and attempt to recreate this event in a controlled manner at a later time/date (and obviously after-hours). We will reproduce the initial series of events to verify whether or not our network fixes/changes have been effective. This maintenance will be announced separately once the involved planning & scheduling has been completed.
- Since Tor2 has a mix of customers (directly routed, intra-facility, VXC-only, and transit-only), various customers were affected at various times and in various manners, while a few (i.e. VXC-only customers) were entirely unaffected. Unfortunately several of the changes fixed problems for one group of customers while impacting another. Customers will receive proactive SLA credits based on the impact they experienced.
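
With regards to the CDS prefix filter repair mentioned above, the following is a minimal, hypothetical Python sketch of the kind of coverage check that confirms every customer prefix is actually permitted by an upstream's prefix filter. The prefixes shown are RFC 5737 documentation ranges, not real customer or CDS data, and this illustrates the general idea rather than the exact method used.

    # Hypothetical sketch only: report customer prefixes that are not covered
    # by any entry in an upstream provider's prefix filter.
    import ipaddress


    def uncovered_prefixes(customer_prefixes, filter_entries):
        """Return the customer prefixes not contained in any filter entry."""
        allowed = [ipaddress.ip_network(entry) for entry in filter_entries]
        missing = []
        for prefix in customer_prefixes:
            net = ipaddress.ip_network(prefix)
            if not any(net.subnet_of(entry) for entry in allowed):
                missing.append(prefix)
        return missing


    if __name__ == "__main__":
        # Documentation-range examples only.
        customers = ["198.51.100.0/24", "203.0.113.0/24", "192.0.2.0/25"]
        upstream_filter = ["198.51.100.0/24", "192.0.2.0/24"]
        for prefix in uncovered_prefixes(customers, upstream_filter):
            print(f"Not permitted by upstream filter: {prefix}")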


Please let me know directly if you require further details, or have any questions or concerns regarding either this event or this postmortem. Once again, I'd like to extend my apologies for any inconvenience caused by this event.

Regards,
Myles Loosley-Millman
Priority Colo Inc.
myles@prioritycolo.com
http://www.prioritycolo.com