Hi Folks,
As some customers noticed, the dist01.tor1 and dist02.tor1 distribution routers experienced an increase in dropped packets and network instability beginning around 12:15am on 24-Nov-2019. The following postmortem is a breakdown of the specific timing, events, and details surrounding this incident. All timestamps are in Eastern (local) time:
Timeline of Events:
12:11am - Inter-Switch Links (ISLs) between core01.tor1 > dist01.tor1 and core02.tor1 > dist02.tor1 begin timing out and re-initializing (aka flapping) intermittently.
12:15am - The monitoring server records several probe failures in a row and emails the emergency pager as a result (a generic sketch of this kind of consecutive-failure check appears just after the timeline).
12:18am - PC staff receive initial emergency page from monitoring systems and start investigating.
12:22am - PC staff post to the maintenance/emergency Twitter feed (http://www.twitter.com/pcolomaint) to confirm the issue with dist01.tor1 & dist02.tor1 is known and is being actively investigated.
12:29am - Several ISLs are manually disabled to try to reduce the impact of the flapping interfaces. Staff investigate the possibility that the flapping interfaces themselves are perpetuating the cycle (i.e. causing a resource under-run/cascading failure).
12:32am - The initial customer complaint noting routing issues on dist01.tor1 is received and responded to.
12:45am - Ports between core01.tor1 > dist01.tor1 are removed from the port-channel (making them direct point-to-point interfaces), in case the problem is some sort of port-channel-related issue/bug. The change has no effect; the ports continue to flap and the investigation continues.
12:48am - PC staff verify that something is overwhelming the CPU on both dist01.tor1 and dist02.tor1, but not on any other network device. A DDoS is suspected at this point, but staff cannot pinpoint a substantial increase in traffic on any interface.
12:57am - All counters are reset on the dist01.tor1 distribution router, in an effort to track down anomalous traffic patterns.
1:12am - Dist02.tor1 is intentionally isolated from the network to allow the CPU utilization to decrease and the router to stabilize internally, working under the theory that the flapping may still be causing a resource under-run.
1:13am - Dist02.tor1 is removed from isolation once the CPU utilization is < 10%, ~45 seconds after initial isolation.
1:15am - Dist02.tor1 CPU ramps back up to >95% utilization with only one ISL active, strongly suggesting the ISL flapping is not the primary problem, but a symptom.
1:20am - Debug mode is configured on Dist02.tor1 to manually capture packets hitting the supervisor and look for anomalous traffic.
1:21am - The Twitter thread is updated, as it has been ~1 hour since the last update.
1:28am - After a quick analysis of the packet captures, a discernible pattern is observed. The traffic is confirmed to be a DDoS attack that is not hitting any particular interface, but is instead hitting all routed interface addresses on the dist01.tor1/dist02.tor1 routers (network, broadcast, and gateway addresses, seemingly at random). The packets use protocols that should not impact the router (i.e. not actively used and not supposed to be listening, but clearly something is). A rough sketch of this kind of capture aggregation follows the notes below.
1:30am - Access-list rules are put in place to filter additional traffic types on the external interfaces [our transit/public peering circuits]; the ISLs stop flapping almost immediately and the attack appears to be mitigated.
1:45am - PC staff are reasonably certain the issue has been resolved, and post a confirmation to the Twitter thread indicating this.
[...]
4:20am - A second attack is detected, with some malicious traffic still making it past the filters. CPU load on dist01.tor1/dist02.tor1 rises to ~30% before dropping back below 25%.
4:45am - The second attack subsides; no impact to customer traffic is detected and no dropped connections are registered on any network device.
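As an aside, for anyone curious how the 12:15am page was generated: the monitoring server simply pages once several probes in a row have failed. Below is a minimal, generic sketch of that pattern, not our actual monitoring configuration; the hostname, pager address, thresholds, Linux-style ping flags, and local SMTP relay are all assumptions for illustration.

```python
#!/usr/bin/env python3
"""Generic consecutive-probe-failure alerting sketch (illustration only)."""
import smtplib
import subprocess
import time
from email.message import EmailMessage

TARGET = "dist01.tor1.example.net"   # hypothetical hostname
PAGER = "oncall-pager@example.net"   # hypothetical pager address
FAIL_THRESHOLD = 3                   # consecutive failures before paging
INTERVAL_SECONDS = 60                # probe interval

def probe(host):
    """Send a single ICMP probe (Linux-style ping flags); True if the host answers."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def page(host, failures):
    """Email the emergency pager via a local SMTP relay."""
    msg = EmailMessage()
    msg["Subject"] = f"EMERGENCY: {host} failed {failures} probes in a row"
    msg["From"] = "monitoring@example.net"   # hypothetical sender address
    msg["To"] = PAGER
    msg.set_content(f"{host} has not answered the last {failures} probes.")
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

def main():
    consecutive_failures = 0
    while True:
        if probe(TARGET):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures == FAIL_THRESHOLD:
                page(TARGET, consecutive_failures)
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    main()
```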
- Throughout this incident, PC staff continuously monitored CPU/memory utilization, hardware capacity indicators, interface counters, etc., searching for probable causes and attempting to understand and limit the impact as best as possible. While the ISLs were dropping repeatedly, each distribution router has several ISLs, so traffic was typically only returned as unreachable when multiple ISLs were down at the same time (or when traffic bounced from one flapping interface to another repeatedly). Generally speaking, this prevented the distribution routers and customer devices from being "hard down".
- After the attack subsided, the captured packets indicated that virtually all of the traffic originated from a single low-cost "developer cloud provider". We're not certain why a botnet would exist on a single cloud provider, but we intend to follow up accordingly.
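To make the 1:28am analysis a little more concrete, the sketch below shows the kind of aggregation performed against the supervisor captures: tally destination addresses, IP protocol numbers, and source /24s from an exported pcap. The filename, the use of scapy, and the /24 grouping are assumptions for illustration rather than the exact tooling used during the incident.

```python
#!/usr/bin/env python3
"""Rough sketch of the capture analysis (illustration only)."""
import ipaddress
from collections import Counter

from scapy.all import IP, rdpcap

# Hypothetical export of the packets captured in debug mode on the supervisor.
packets = rdpcap("supervisor-capture.pcap")

dst_counter = Counter()
proto_counter = Counter()
src_net_counter = Counter()

for pkt in packets:
    if IP not in pkt:
        continue
    ip = pkt[IP]
    dst_counter[ip.dst] += 1          # which routed addresses are being hit
    proto_counter[ip.proto] += 1      # which IP protocols are being used
    # Group sources by /24 to see whether they cluster in one provider's ranges.
    src_net_counter[ipaddress.ip_network(f"{ip.src}/24", strict=False)] += 1

print("Top destination addresses:")
for dst, count in dst_counter.most_common(10):
    print(f"  {dst}: {count}")

print("IP protocol numbers seen:")
for proto, count in proto_counter.most_common():
    print(f"  protocol {proto}: {count}")

print("Top source /24s:")
for net, count in src_net_counter.most_common(10):
    print(f"  {net}: {count}")
```

In our case, the destination tally showed network, broadcast, and gateway addresses being hit seemingly at random, while the sources clustered almost entirely in a single provider's address space.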
Preventative Measures:
Unfortunately, as with any Denial of Service attack, it's difficult to provide definitive answers due to the nature of the event. In this instance, no specific customer or resource was targeted. Although the attack traffic used a single protocol, it was massively distributed in both source and destination, and was not limited to a single device, IP range, etc. While we have added additional measures to our external Access List configuration to prevent a similar attack from impacting traffic, we are still considering other proactive measures that may be taken to help mitigate such threats in the future.
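For illustration only, the sketch below shows one way observed protocol numbers from a capture could be turned into candidate deny entries for an IOS-style extended access list. The ACL number and protocol numbers are placeholders, the exact syntax varies by platform, and the rules actually deployed on our external interfaces are not reproduced here.

```python
#!/usr/bin/env python3
"""Generate candidate deny lines for an IOS-style extended ACL (illustration only)."""

ACL_NUMBER = 150                      # placeholder ACL number
OBSERVED_PROTOCOLS = [47, 103, 132]   # placeholder IP protocol numbers from a capture

def acl_lines(acl_number, protocols):
    """Deny each unwanted protocol, then permit everything else."""
    lines = [f"access-list {acl_number} deny {proto} any any" for proto in protocols]
    lines.append(f"access-list {acl_number} permit ip any any")
    return lines

if __name__ == "__main__":
    for line in acl_lines(ACL_NUMBER, OBSERVED_PROTOCOLS):
        print(line)
```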
We will likely be upgrading the supervisor cards in the distribution routers in the near future to add processing headroom to the supervisors. While this does not directly mitigate or prevent attacks such as this one, the extra processing power helps buffer against similar scenarios. Any such upgrades will be performed after-hours, with plenty of advance notice, as always.
Please let me know directly if you require further details, or have any questions or concerns regarding either this event or this postmortem. Once again, I'd like to extend my apologies for any inconvenience caused.
Regards,