12-Jun-2024 - dist03.tor2 connectivity issue - Post-mortem



Post by porcupine

Hi Folks,

Around 9:20pm on Wednesday June 12th 2024, the dist03.tor2 distribution switch/router started experiencing sporadic connectivity issues, which impacted a seemingly random set of ports, protocols, etc.

While not known at the time, this was the result of a flood of broadcast traffic originating from within the network. Processing the broadcast packets put excessive strain on dist03.tor2's supervisor card, resulting in the unexpected behaviour.

The following is a breakdown of the specific timing, events, and details surrounding this incident. Timestamps are in Eastern (local) time:

Timeline of Events -- 12-Jun-2024:
9:18pm - 9:24pm - Monitoring system sends three [non-pager] emails indicating checks on internal services in Tor2 have failed.
9:28pm - 9:29pm - Monitoring system sends [non-pager] emails, for each of the above alerts, indicating all services "ok" (all clear).
9:30pm - PC staff notice the email alerts, and perform several checks, suspecting it may be normal activity (many internal system checks are known to periodically fail certain checks during normal operation).

9:33pm - Monitoring system sends another [non-pager] email, indicating a recovered service has failed again.
9:33pm - Staff make note in the internal chat channel, indicating something appears to be amiss inside Tor2's network.
9:33pm - 9:35pm - Several more email alerts for internal Tor2 services come from monitoring system.
9:38pm - Staff discover that two of Tor2's HVAC units triggered comms alarms, indicating the two units lost communication with each other. This raises immediate concern, as the HVAC units operate on a private, switching-only network off dist03.tor2 (no routing involved).

9:38pm - Internal call is made, requesting more eyes on the problem.
9:43pm - A post is made on Twitter, acknowledging something appears to be wrong with dist03.tor2.
9:48pm - Issues have been determined to only exist on ports served off dist03.tor2. Most ports are up & passing traffic, but random services/protocols are not getting through. Several impacted ports are bounced to see if that changes anything -- It does not.

9:49pm - Looking at the problem as a potential bug/hardware issue, as there is literally nothing in the log files indicating a problem, staff discuss forcing the active supervisor to switch over to the redundant supervisor. Currently there is little observable reduction in traffic originating from dist03.tor2, and < 25% of the monitored internal services behind dist03.tor2 are showing as down. Concerns that a forced switchover might make things worse lead to an immediate dispatch of staff to site.

10:00pm - Discussing options while in transit, staff contemplate the possibility this is not a hardware failure, but some sort of incredibly low volume DoS attack.

10:05pm - Staff notice several CPU utilization metrics look strange (only ~25% of dist03.tor2's CPU usage is accounted for, but CPU keeps hovering near 100%). The investigation pivots away from assuming it's a hardware fault.

10:10pm - The monitoring system sends out a series of alerts, indicating "ok" status for all impacted hosts/services; Staff are still en-route to the facility.

10:20pm - Staff arrive in the Tor2 facility.
10:20pm - 10:25pm - A series of alerts from the monitoring system indicate services are once again impacted.
10:28pm - IPv6 routing to dist03.tor2 is shut down, to see if this has any impact -- It does not.
10:32pm - Several debug options are enabled on dist03.tor2 & packets traversing the device are captured and analyzed, and a pattern is immediately apparent.

10:33pm - The pattern is narrowed down to a single customer port. The port is administratively disabled.
10:33pm - Monitoring system immediately starts sending alerts indicating services have returned to "ok" state.
10:35pm - The customer responsible for the offending port is contacted. The packets are originating from a catastrophically failed firmware update on a Cisco hardware firewall; the firewall is pushing ~750k packets/sec of broadcast traffic into the network, adversely impacting dist03.tor2's supervisor. This was not spotted during previous checks because none of the packets traversed internal links: the traffic was being dropped at the customer interface, so it did not show up on our internal weathermap or in volume alerts.

10:44pm - Twitter post is updated to indicate the issue with dist03.tor2 has been identified and temporarily resolved.
11:02pm - IPv6 to dist03.tor2 is turned back up, having determined the cause was not related to IPv6 traffic.
11:00pm - 11:40pm - Staff remain on-site while said customer repairs their failed hardware firewall. Customer traffic is monitored, to ensure they get their gear back online, without causing any further incidents.

11:45pm - Staff pack up, and head out for the night.
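
For anyone wanting to reproduce the two checks that finally cracked this (the unaccounted CPU at 10:05pm, and the per-port pattern at 10:32pm), here is a rough Python sketch. Everything in it is hypothetical: the function names, the port names, the process names, and the sample numbers are made up for illustration, and it works on pre-parsed data rather than on output from any particular device.

```python
from collections import Counter

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


def unaccounted_cpu(total_pct, per_process_pct):
    """Return the share of CPU load not attributed to any listed process."""
    return max(0.0, total_pct - sum(per_process_pct.values()))


def top_broadcast_ports(frames, n=3):
    """Count broadcast frames per ingress port and return the n busiest.

    `frames` is an iterable of (ingress_port, destination_mac) pairs,
    e.g. parsed from a packet capture (parsing is not shown here).
    """
    counts = Counter(
        port for port, dst_mac in frames if dst_mac.lower() == BROADCAST_MAC
    )
    return counts.most_common(n)


# Hypothetical figures shaped like the 10:05pm observation:
# CPU near 100%, but only ~25% attributable to named processes.
gap = unaccounted_cpu(99.0, {"proc_a": 15.0, "proc_b": 10.0})
print(f"unaccounted CPU: {gap:.0f}%")

# Hypothetical capture: one port dominates the broadcast frames.
frames = (
    [("Gi1/1", BROADCAST_MAC)] * 50
    + [("Gi1/2", BROADCAST_MAC)] * 2
    + [("Gi1/2", "00:11:22:33:44:55")] * 10
)
print(top_broadcast_ports(frames, n=1))
```

A large gap between total CPU and the per-process sum is a hint to look at packet handling rather than at a misbehaving process or failed hardware, and a single port dominating the broadcast count points straight at the source.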


Conclusions:
In hindsight, the initial misdiagnosis of the disruption appreciably extended the time to resolution; as they say, hindsight is 20/20.

Historically, our network has rarely had to contend with the impact of "bad" traffic originating from within the network, as typically the core/distribution layers have the benefit of scale. Normally, badly behaving customer equipment/networks will fail long before there is any observable impact within our network, due to both the configuration and the capacity of the equipment in question. As such, most filtering efforts concentrate on preventing external attacks from causing issues.

Considering this incident, we have taken several steps to help prevent/mitigate future incidents of this nature:
- Staff are actively contributing to, and taking, a refresher on diagnosing similar issues, to ensure we arrive at the correct conclusion more quickly.
- Additional monitoring has been configured for all distribution and core routers (in both Tor1 and Tor2) and fed directly to our pager, to ensure we're aware of any CPU-related issues before they present as a service issue.
- Policers & Rate-limiters that help protect the supervisor card(s) from being overwhelmed by unwanted traffic are having their values reviewed and adjusted in light of this event.
- Additional rate limiters and rules are being installed on all user facing ports to drop unknown unicast, and rate-limit how much broadcast & multicast traffic is allowed to pass. Alerts are being added to the rate-limiters to notify staff when traffic is dropped.
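
To illustrate the idea behind the per-port rate-limiters and their drop alerts, here is a minimal token-bucket sketch in Python. It is only a model of the concept, not our actual device configuration: the class, the function, and the 1,000 pps / 100 packet burst values are assumptions picked for the example, and only the 750k packets/sec input rate comes from the incident above.

```python
class TokenBucket:
    """Packets-per-second policer: each packet costs one token."""

    def __init__(self, rate_pps, burst):
        self.rate_pps = rate_pps      # tokens added per second
        self.burst = burst            # maximum tokens the bucket can hold
        self.tokens = float(burst)    # start with a full bucket
        self.last = 0.0               # timestamp of the previous packet
        self.dropped = 0              # running count of policed packets

    def allow(self, now):
        """Return True if a packet arriving at time `now` may pass."""
        elapsed = now - self.last
        self.last = now
        # Refill for the time elapsed, never beyond the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_pps)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        self.dropped += 1
        return False


def simulate(pps, seconds, bucket):
    """Feed evenly spaced packets into the bucket; return how many passed."""
    total = int(pps * seconds)
    passed = 0
    for i in range(total):
        if bucket.allow(i / pps):
            passed += 1
    return passed


# A 750k pps broadcast flood for 0.1s against an assumed 1,000 pps limit.
bucket = TokenBucket(rate_pps=1000, burst=100)
passed = simulate(pps=750_000, seconds=0.1, bucket=bucket)
print(f"passed={passed} dropped={bucket.dropped}")

# The drop counter is what an alert would key off.
if bucket.dropped > 0:
    print("ALERT: broadcast rate-limiter is dropping traffic on this port")
```

With these assumed values only a few hundred of the 75,000 packets get through (roughly the burst plus rate times duration), and the non-zero drop counter is the signal that would have notified staff within moments of the flood starting.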

As is our standard, we will be pro-actively issuing SLA credit(s) for customers whose services were impacted by this event.

Please let me know directly if you require further details, or have any questions or concerns regarding this event, this post-mortem, or the planned corrective measures. I'd like to extend apologies for the inconvenience caused by this incident.

Regards,
Myles Loosley-Millman
Priority Colo Inc.
myles@prioritycolo.com
http://www.prioritycolo.com