03-Oct-2014 - Dist03.tor2 switch unavailability - postmortem

Post by porcupine »

Hello Everyone,

The following postmortem breaks down the timings, events, and details surrounding the incident on 03-Oct-2014, in which the dist03.tor2 distribution switch went offline at approximately 4:10AM Eastern time for about five minutes:

03-Oct-2014 (Eastern Time):
4:10AM – SNMP logs indicate the ISLs between core03.tor2 and dist03.tor2 have lost carrier. The IS-IS adjacency between core03.tor2 and dist03.tor2 is lost.
4:10AM – SNMP logs indicate the ISLs between dist03.tor2 and dist04.tor2 have lost carrier and changed state to down.
4:11AM – Network monitoring system sends PC staff the first of a series of alerts, indicating a PDU (connected to dist03.tor2) is down. PC staff begin investigating.
4:11AM – 4:14AM – A number of automated pages originate from the monitoring server, indicating other devices/monitored servers in the tor2 facility are offline.
4:15AM – SNMP logs indicate that the ISLs between dist03.tor2 and dist04.tor2 have changed state to up, and traffic begins to flow again. It becomes evident that the issue involved dist03.tor2.
4:15AM – The monitoring server begins to report that the devices/monitored servers connected to dist03.tor2 are coming back online.
4:19AM – The last device the monitoring server saw as offline now shows as fully online. PC staff continue to investigate the cause of the incident and keep a close eye on dist03.tor2 in the process.
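
For anyone who wants to check the durations implied by the timeline above, here is a minimal sketch in Python. The event names and the helper `minutes_between` are made up for illustration. The timestamps are only accurate to the minute, so the results are approximate.

```python
from datetime import datetime

# Timestamps copied from the timeline above (03-Oct-2014, Eastern Time).
# The dictionary keys are illustrative labels, not names from any real system.
events = {
    "links_down": datetime(2014, 10, 3, 4, 10),     # ISLs lose carrier
    "first_alert": datetime(2014, 10, 3, 4, 11),    # first monitoring alert
    "links_up": datetime(2014, 10, 3, 4, 15),       # ISLs change state to up
    "all_recovered": datetime(2014, 10, 3, 4, 19),  # last device back online
}

def minutes_between(start, end):
    """Return the whole minutes elapsed between two named events."""
    delta = events[end] - events[start]
    return int(delta.total_seconds() // 60)

print("Time to first alert:", minutes_between("links_down", "first_alert"), "min")
print("Switch outage:", minutes_between("links_down", "links_up"), "min")
print("Time to full recovery:", minutes_between("links_down", "all_recovered"), "min")
```

By this reading, the switch itself was unreachable for about 5 minutes, and the last monitored device recovered about 9 minutes after the links first dropped.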

Preventative Measures:
Unfortunately, there were few hints as to the source of the incident in any of the log files. Dist03.tor2 writes to an extended local log, a remote SNMP server, and a remote syslog server. None of the entries indicated the cause of the reboot, although all of the logs captured output from the restart process, which shows that logging was functional at that time.

We have ruled out a number of factors (local/physical/serial console access, power issues, cooling issues, etc.) as a result of our investigation. Since the logs did not capture a specific cause, we have increased the log level output to the syslog server, and will be monitoring it closely over the coming weeks, looking for anything out of the ordinary.
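
As a rough illustration of what watching the syslog output for anything unusual could look like, here is a minimal sketch in Python. It is not the tooling actually in use. It assumes each message arrives with the standard syslog `<PRI>` prefix, where the severity is the PRI value modulo 8 and lower numbers are more severe. The function names, the threshold, and the sample lines are all hypothetical placeholders, not real log entries.

```python
import re

# Matches the leading "<PRI>" field of a raw syslog message.
PRI_RE = re.compile(r"^<(\d{1,3})>")

def severity(line):
    """Return the syslog severity (0-7) encoded in the PRI field, or None."""
    match = PRI_RE.match(line)
    if match is None:
        return None
    return int(match.group(1)) % 8  # PRI = facility * 8 + severity

def flag_lines(lines, host="dist03.tor2", max_severity=4):
    """Keep lines from `host` at warning level (4) or more severe."""
    flagged = []
    for line in lines:
        sev = severity(line)
        if sev is not None and sev <= max_severity and host in line:
            flagged.append(line)
    return flagged

# Placeholder messages for illustration only; these are NOT real log entries.
sample = [
    "<187>Oct  3 04:15:02 dist03.tor2 example: placeholder error message",
    "<190>Oct  3 04:15:03 dist03.tor2 example: placeholder informational message",
    "<187>Oct  3 04:15:04 dist04.tor2 example: placeholder error from another host",
]

for line in flag_lines(sample):
    print(line)
```

With the placeholder input, only the first line is printed: it is the only one that both comes from dist03.tor2 and carries a severity of warning or worse.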
Myles Loosley-Millman
Priority Colo Inc.
myles@prioritycolo.com
http://www.prioritycolo.com