08-Jul-2013 & 10-Jul-2013 Incident Report


Post by porcupine »

Hello Everyone,

As most of you are aware, on Monday July 8th, 2013, Toronto experienced the largest rainfall in recorded history (breaking the previous 24-hour rainfall record in a mere 7 hours). This resulted in a massive blackout, flooding, failure of the Enwave chilled water plant, and general havoc throughout the city. Priority Colo's 818 suite was impacted when our chilled water supply experienced a sharp and sustained drop in available capacity, causing a significant loss of cooling capacity. We believe these events are directly related to the second incident, which we experienced on July 10th.

Order of events:

Monday July 8th, 2013
5:57pm - Commercial power is lost throughout downtown Toronto; Enwave's chiller plant and island filtration plant/pumping station go offline.
6:12pm - Commercial power to the suite fails; UPS-A and UPS-B begin operating on battery power.
6:12pm - Emergency generators start, and power is diverted automatically. PC staff monitor the situation remotely.
6:19pm - Enwave starts emergency generators to provide backup power to the island pumping station; the generators are immediately shut down because the pump controls are flooded.
6:25pm - Storm conditions worsen. PC staff are in communication with building staff and are dispatched on-site to monitor the situation directly.
6:50pm - PC staff arrive on-site and begin efforts to address cooling issues. Chilled water supply over-temperature and low flow are detected.
7:00pm - Cold aisle temperatures increase to 84-92 degrees Fahrenheit. Announcements are posted to the emergency/maintenance Twitter feed.
7:20pm - Enwave brings multiple pumping stations and its chiller plant back online.
7:23pm - Commercial power is restored and power is diverted automatically; generators begin their cool-down operation.
7:30pm - Chilled water supply temperature peaks, as does suite temperature.
7:45pm - Chilled water supply temperature begins to drop; suite begins to cool from critical temperature.
8:30pm - City of Toronto instructs Enwave to shut down its pumping station, as contamination is entering the city's potable water supply. Temperatures in the suite increase.
10:10pm - Enwave brings 7,300 tons of additional emergency chilling capacity online; suite temperatures begin to drop again.
11:39pm - Enwave resumes normal operation of the chilled water loop, and temperature levels continue to drop.
12:15am (July 9th) - Chilled water supply has stabilized, and the suite continues to gradually cool.
1:00am (July 9th) - Suite temperatures have returned to normal, and an all-clear notice is sent out. Staff remain on-site for several hours to handle any customer concerns.

Wednesday July 10th, 2013
6:30am - Chilled water flow to the suite begins to decrease very slowly; supply temperature remains constant.
7:40am - PC staff are alerted to rising temperature in the suite and begin remote investigation.
7:45am - PC staff are dispatched to the site and remain in communication with building staff while troubleshooting the issue.
8:00am - Building staff enter the suite and place several fans for emergency cooling.
8:10am - PC staff arrive on-site and begin investigating the situation.
8:25am - Suite temperature peaks as building staff investigate the loss of flow. PC staff point to a suspected failure of the 8th floor chilled water booster pump.
8:40am - Building staff determine that a fuse in the control panel for the 8th floor booster pump has blown, taking the pump offline. The fuse is replaced and chilled water flow immediately increases.
9:00am - Temperature in the suite decreases to normal levels. PC staff remain on-site for several hours to assist customers and monitor the situation as appropriate.
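
For anyone who wants to work out the gaps between the July 10th events, the arithmetic can be sketched in a few lines of Python. This is only an illustration built from the timestamps listed above; the helper names (to_minutes, elapsed_from_start) are made up for this example and are not part of any Priority Colo tooling.

```python
# Timestamps copied from the July 10th timeline above; labels are abbreviated.
EVENTS = [
    ("6:30am", "chilled water flow begins to decrease"),
    ("7:40am", "PC staff alerted to rising temperature"),
    ("8:10am", "PC staff on-site"),
    ("8:25am", "suite temperature peaks"),
    ("8:40am", "fuse replaced, flow restored"),
    ("9:00am", "suite temperature back to normal"),
]


def to_minutes(clock: str) -> int:
    """Convert a time such as '6:30am' to minutes after midnight."""
    digits, suffix = clock[:-2], clock[-2:].lower()
    hours, minutes = (int(part) for part in digits.split(":"))
    hours = hours % 12 + (12 if suffix == "pm" else 0)  # 12am -> 0, 12pm -> 12
    return hours * 60 + minutes


def elapsed_from_start(events):
    """Return (minutes since the first event, label) for same-day events."""
    start = to_minutes(events[0][0])
    return [(to_minutes(clock) - start, label) for clock, label in events]


if __name__ == "__main__":
    for minutes, label in elapsed_from_start(EVENTS):
        print(f"+{minutes:3d} min  {label}")
```

By this arithmetic, 70 minutes passed between the flow beginning to decrease and staff being alerted to the rising temperature, and 150 minutes passed before the suite was back to normal.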

Summary:
Due to a variety of factors (power density, available roof space, economics, etc.), nearly every provider within 151 Front St. relies on the chilled water loop provided by the landlord for heat extraction from their suites. The building relies on Enwave to provide the chilled water service, which draws water from Lake Ontario and extracts cold via heat exchanger from the city tap water supply as it is pumped and processed through the island filtration plant. Virtually all providers are therefore at risk if this system fails. While there are multiple redundancies in place (in terms of pumps inside the building and within Enwave's loop, backup [non-lake water] chilling stations, etc.), it is obviously possible for such redundancies to become overwhelmed in a disaster situation.

Having said this, we've followed up with building management to determine the best course of action, and preventative measures going forward, as such incidents should be avoidable.

Enwave's Preventative Measures:
- Enwave has agreed to allow 151 Front to put a backup pump into the Enwave side of the chilled water loop. This will allow the building to independently pump chilled water through the Enwave loop (and from the Enwave cooling reserve) in the event of a failure on Enwave's system, providing up to 60 minutes of chilled water in the event of a catastrophe. 151 Front management has indicated this project will commence later this month, with completion expected in October of this year.

Allied Realty/151 Front St. Preventative Measures:
- 151 Front management has indicated that they have for some time been planning a 2N redundancy solution to address the potential of Enwave failing. The building has constructed a new hydro vault to allow for additional power capacity, and has purchased two new diesel generators and a new chiller plant, which will provide mechanical chilling of the building chilled water loop in the event of any issues within the Enwave cooling system. They have indicated a tentative start date of 07-Sep-2013, and are expecting completion by this year's end.
- 151 Front management has committed to installing early warning flow sensors in the booster pump that feeds the 8th floor south chilled water riser, to provide quick alerts in the event of future incidents.
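
To illustrate the kind of early warning a flow sensor could give for a slow decline like the one on July 10th, here is a rough sketch in Python. It is not how the building's sensors will actually be configured: the class name FlowDropAlert, the baseline of 100, the 10% drop threshold, and the three-consecutive-readings rule are all hypothetical values chosen for the example. The baseline is fixed rather than a rolling average, so that a very gradual decrease cannot drag the reference point down with it.

```python
class FlowDropAlert:
    """Flags a sustained drop in flow relative to a fixed, known-good baseline.

    All numbers here are hypothetical; real thresholds would depend on the
    pump, the sensor, and how noisy the readings are.
    """

    def __init__(self, baseline: float, drop_fraction: float = 0.10,
                 consecutive: int = 3):
        if baseline <= 0:
            raise ValueError("baseline must be positive")
        # Readings below this value count as "low".
        self.threshold = baseline * (1 - drop_fraction)
        self.consecutive = consecutive
        self.low_count = 0

    def update(self, flow: float) -> bool:
        """Feed one reading; return True once enough low readings occur in a row."""
        if flow < self.threshold:
            self.low_count += 1
        else:
            self.low_count = 0  # a healthy reading resets the streak
        return self.low_count >= self.consecutive


if __name__ == "__main__":
    # Simulated slow decline in arbitrary flow units (made-up data).
    readings = [100, 99, 97, 95, 92, 89, 88, 86, 80]
    alert = FlowDropAlert(baseline=100)
    for i, flow in enumerate(readings):
        if alert.update(flow):
            print(f"alert at reading {i}: flow={flow}")
            break
```

Requiring several low readings in a row is a simple way to avoid alerting on a single noisy sample, at the cost of a slightly later alert.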

Priority Colo's Preventative Measures:
- Priority Colo staff will continue to tweet updates of emergency and maintenance events to our Twitter feed ( http://www.twitter.com/pcolomaint/ ), as this has proven effective.
- We have purchased a pair of portable industrial cooling fans to ensure that, in the event of a cooling issue, resources to assist in cooling/air management are immediately available.
- Priority Colo is now also able to provide geographic and facility redundancy options, having just constructed a new, redundant facility in Markham, Ontario. This facility can provide sensitive clients with an alternate location and alternate infrastructure to what can be offered at 151 Front St., while still maintaining both redundant Layer 2 and redundant Layer 3 connections back to 151 Front St./existing gear.

Conclusions:
While we may have weathered the storm with some of the best possible outcomes, I firmly believe that we can still come away having learned some valuable lessons. While the root cause and the majority of failures were not within systems that Priority Colo exercises any direct control over, the involved vendors have openly acknowledged the issues at hand, and have committed to major changes which should provide the level of system resiliency we reasonably expect for the services in question. It is my sincere hope that the detailed description provided herein helps to answer the questions and alleviate the concerns which have formed during these incidents.

Customers with any questions or concerns regarding the events of this week, or the proposed measures, should contact me directly, as always.

Regards,
Myles Loosley-Millman
Priority Colo Inc.
myles@prioritycolo.com
http://www.prioritycolo.com