04-May-2011 & 06-May-2011 Network Incident - Postmortem


Post by porcupine »

Hello Everyone,

As you would expect, we have been diligently researching and formulating corrective plans regarding the connectivity-affecting incidents on 04-May-2011 and 06-May-2011. The following postmortem is a breakdown of the specific timing, events, and details surrounding these incidents. All times are posted in EST (Eastern Standard Time), as recorded in our log files:

Wednesday, 04-May-2011:

9:45pm - Nagios sends emergency pages indicating latency/packet loss to the core routers.
9:46pm - Priority Colo staff currently on-site begin investigating.
9:53pm - Administrative interfaces on the core routers are unresponsive due to CPU saturation. Unable to access the device, Priority Colo staff on-site perform an emergency reboot of the Core01.tor1 core router.
9:59pm - The Core01.tor1 router has not yet returned from its reboot, and the network has effectively become unresponsive. Priority Colo staff perform an emergency reboot of the Core02.tor1 core router.
10:03pm - Console cables are connected to Core01.tor1. The router is found in a loop, attempting to boot to a non-existent alternate supervisor engine.
10:05pm - Console cables are moved to Core02.tor1 to check why it has not returned to service. Core02.tor1 is found to be in the Cisco boot loader (rommon), failing to boot properly, and refusing manual boot instructions from the console.
10:05pm - 10:30pm - Priority Colo staff continue to troubleshoot the core routers' boot issue.
10:35pm - The boot issue is determined to be caused by boot variables not saving correctly and an incorrect IOS image being present on the compact flash card (staged for a yet-unscheduled software upgrade), with the supervisor attempting to boot the wrong image as a result. An illustrative configuration sketch follows this timeline.
10:40pm - The compact flash card is removed from the Core01.tor1 router and reconfigured on a staff console, removing all extraneous IOS image(s).
10:42pm - Repairs/corrections made to the compact flash card are successful; Core01.tor1 begins its boot sequence.
10:45pm - Core01.tor1 comes back online, re-establishes BGP sessions, and traffic begins routing normally once again. Excluding BGP customers on the Core02.tor1 router, all customers are back online at this time.
10:50pm - Core02.tor1's compact flash card is removed, and the disk is duplicated for full forensic investigation in an effort to determine the original cause of the network incident.
11:35pm - The compact flash card from Core02.tor1 is reconfigured and re-inserted into the router. Core02.tor1 begins booting properly.
11:37pm - Core02.tor1 comes back online, bringing all transit providers back online.
11:57pm - Core02.tor1 reports an "Unsupported module failure" in the Fa2 linecard, and forces a module power-down. The card is removed and replaced with a cold spare.
12:04am - Core02.tor1's Fa2 linecard configuration is restored from the previous night's backup; customers on Fa2 are brought back online.
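
For context on the fix above: the repair amounted to removing the stray image from the compact flash card and pointing the boot variables back at the intended IOS image. Below is a minimal sketch of that kind of cleanup on a Cisco IOS router; the disk0: device name and the image filenames are placeholders, not the exact values from our routers.

  ! Verify which image the supervisor is configured to boot
  show bootvar
  dir disk0:
  !
  ! Remove the extraneous image that was staged for a future upgrade
  ! (filename is a placeholder)
  delete disk0:staged-upgrade-image.bin
  !
  ! Point the boot variable at the intended image and save the config
  configure terminal
   no boot system
   boot system flash disk0:intended-ios-image.bin
   config-register 0x2102
   end
  write memory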


Friday, 06-May-2011:

7:13am - Nagios begins sending emergency pages indicating latency/packet loss to various locations within the network.
7:15am - Priority Colo staff begin investigating the packet loss/latency.
7:20am - Several Priority Colo staff are dispatched to the site as a precautionary measure; the exact cause of the latency is still unknown. Investigation continues.
7:30am - Considerably higher-than-normal traffic is visible on several internal network links destined to the dist02.tor1 distribution switch; investigation concentrates here.
7:34am - Priority Colo staff arrive onsite, and continue to investigate issues from local console(s).
7:38am - A single Fast-E port on the dist02.tor1 distribution switch is noted to have an exceedingly high number of inbound packets/sec; suspicions of a Distributed Denial of Service (DDoS) attack are quickly investigated and confirmed.
7:40am - The IPs that are under attack are null-routed; traffic on the network resumes normal flow as the backscatter subsides, and load on the supervisor engines returns to normal.
7:55am - We begin procedures to blackhole the impacted IPs upstream of our network, ensuring packets are dropped before they reach our network edge. An illustrative sketch of the null-route/blackhole configuration follows this timeline.
8:00am - 11:00am - Priority Colo's transit providers begin to remotely drop traffic to the impacted IPs; attacks reaching the network edge subside until they are no longer visible on the bandwidth graphs.
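
For those curious about the mechanics of the two steps above, the 7:40am null-route and the 7:55am upstream blackhole follow the standard destination-based remotely triggered black hole (RTBH) pattern. The sketch below is illustrative only, not our production configuration: the target address (a documentation IP), route tag, AS number, and blackhole community are placeholders, and the actual community value is dictated by each transit provider.

  ! Drop attack traffic locally by routing the targeted address to Null0
  ip route 192.0.2.10 255.255.255.255 Null0 tag 666
  !
  ! Advertise the same /32 to transit providers with their blackhole
  ! community so the traffic is discarded before reaching our edge
  route-map RTBH-TRIGGER permit 10
   match tag 666
   set community 65535:666
   set origin igp
  !
  router bgp 64500
   redistribute static route-map RTBH-TRIGGER
   ! (the provider-facing sessions must also send communities,
   !  e.g. "neighbor <provider-ip> send-community")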


Preventative Measures:

The incident on Friday confirmed previous suspicions that the catalyst for Wednesday's network failure was a Denial of Service attack. The following measures are being put into place to help prevent such occurrences in the future:

- The customer who unfortunately attracted this attack has been removed from our network, and we are assisting said customer in sourcing transit from a third party.
- Dist01.tor1, and Dist02.tor1 supervisor cards will be upgraded to Sup32-10GE supervisors, to provide additional processing power in the event of a Denial of Service attack.
- Policers will be set up on the supervisor cards to help protect the supervisor cards' CPU/administrative access during any future such attacks (a sketch of this type of policy follows this list).
- Additional QoS rules will be set up on the network core to ensure routing protocols receive the highest priority, so that internal links do not “flap” during heavy traffic/attacks.
- Internal network circuits will be upgraded, to provide more internal capacity and protection against such attacks.
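
To give a sense of what the policer item above looks like in practice, here is a minimal control-plane policing (CoPP) sketch in Cisco IOS. It is an illustration only, not our production policy: the access list, class names, and rate/burst values are placeholders that would be tuned per platform.

  ! Enable QoS processing on the platform
  mls qos
  !
  ! Identify management traffic destined to the supervisor itself
  ip access-list extended MGMT-TO-ROUTER
   permit tcp any any eq 22
  !
  class-map match-all COPP-MGMT
   match access-group name MGMT-TO-ROUTER
  !
  ! Rate-limit what may reach the supervisor CPU so a flood cannot
  ! starve administrative access or routing-protocol processing
  policy-map COPP-POLICY
   class COPP-MGMT
    police 512000 8000 conform-action transmit exceed-action drop
   class class-default
    police 1000000 16000 conform-action transmit exceed-action drop
  !
  control-plane
   service-policy input COPP-POLICY

The QoS item would be handled along the same lines: routing-protocol traffic (IP precedence 6) is classified separately and given priority on the internal links so that BGP and IGP sessions stay up during heavy traffic.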

I hope this provides a satisfactory level of planned mitigation and information to all affected customers. If you have any questions or concerns regarding this postmortem, please do not hesitate to contact me directly.

Regards,
Myles Loosley-Millman
Priority Colo Inc.
myles@prioritycolo.com
http://www.prioritycolo.com