07-Nov-2010 - HVAC incident post-mortem

Announcements concerning Networking & Related News, Planned Outages, Anything which may affect your services.

Moderator: Admins

porcupine
Site Admin
Posts: 704
Joined: Wed Jun 12, 2002 5:57 pm
Location: Toronto, Ontario
Contact:

Post by porcupine »

Hello Everybody,

As many of you are aware, on 07-Nov-2010 Priority Colo’s 818 suite experienced issues with its HVAC systems. The suite reached unacceptably high temperatures, which impacted customer services, caused equipment to thermal throttle, and caused some customer equipment to shut down thermally.

Summary:
At approximately 11:30am EST, all three HVAC units in the 818 suite stopped delivering cold air to the suite. With nothing removing the excess heat, temperatures in the suite rapidly climbed to unacceptable levels. Temperatures stayed at very high levels for ~2.5 hours, and returned to typical (very cold) levels ~1 hour after that.

Short/Current Diagnosis:
- HVAC #3 is believed to have a faulty control/actuator on the chilled water valve. This is being actively investigated, and the part will be replaced as necessary.
- The firestop breakers on HVAC #1 and #2 were tripped by high return air temperature. The default firestop setting is 100 Deg. Fahrenheit (37.8 Deg. Celsius). The tripped firestops prevented the units from restarting, and misleading error text in the system log and on the units’ control panels disrupted emergency troubleshooting.
- HVAC #1 and #2 are having their actuators physically inspected. Wintech Air Systems, Liebert Canada, and the building management are all running diagnostics on all related systems in an attempt to isolate the initial cause of the elevated temperatures. Notably, the issue was specific to the 818 suite.
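For anyone who wants to check the firestop figure quoted above, here is a minimal sketch of the conversion and a trip check. The function names, the "at or above" comparison, and the sample readings are assumptions made for illustration; they are not taken from the units' actual controls.

```python
# Default firestop setting quoted in this post, in degrees Fahrenheit.
FIRESTOP_TRIP_F = 100.0


def fahrenheit_to_celsius(deg_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


def firestop_would_trip(return_air_c: float, trip_f: float = FIRESTOP_TRIP_F) -> bool:
    """Return True if a return air reading (in Celsius) is at or above the trip point.

    Assumption: the breaker trips at or above the setting. The real
    mechanism may behave differently.
    """
    return return_air_c >= fahrenheit_to_celsius(trip_f)


if __name__ == "__main__":
    print(round(fahrenheit_to_celsius(FIRESTOP_TRIP_F), 1))  # 37.8
    # 40.0 and 30.0 are made-up return air readings, not logged values.
    print(firestop_would_trip(40.0))  # True
    print(firestop_would_trip(30.0))  # False
```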

Future Changes:
- PC will be implementing a user-subscription facility mailing list. Current mailings (such as this) go to all customer contacts of a specific type (e.g. General/Tech), and have often done more harm than good when sending out reports of incidents in progress (influx of additional inquiries, panicked customer responses, demand for unavailable levels of detail, etc.). We believe an optional, customer-subscribed, facility-related mailing list will alleviate support load during any future incidents, hopefully allowing more rapid issue resolution. It should also help distribute information more efficiently, so that customers are kept in the loop and can react suitably in such scenarios.

Prevention:
- Of its three HVAC units, the 818 suite needs a minimum of 1 working unit to operate within overall tolerable levels, and requires 2 of the 3 units to maintain optimal temperature and airflow numbers. PC has obtained permission to permanently disconnect the actuators on one of the units, ensuring that it will receive a consistent supply of chilled water (maximum cooling for one unit) despite any issues with the unit controls. The same may be done on a second unit if deemed appropriate.
- HVAC #1 and HVAC #2 will have their “firestop” mechanisms permanently disabled. These mechanisms are not required by local fire code, nor are they necessary given our fire suppression configuration. HVAC #3 does not have a “firestop” mechanism.
- Additional high-amperage electrical outlets will be added near the suite’s main doors, to facilitate more rapid deployment of emergency cooling measures.
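The capacity figures in the first Prevention item (1 working unit for tolerable conditions, 2 of 3 for optimal) can be expressed as a small sketch. The function name and the state labels are hypothetical and only restate the thresholds given above.

```python
# Thresholds as stated in the Prevention section of this post.
TOTAL_UNITS = 3
MIN_TOLERABLE_UNITS = 1
MIN_OPTIMAL_UNITS = 2


def suite_cooling_state(working_units: int) -> str:
    """Classify suite cooling based on how many HVAC units are delivering cold air."""
    if not 0 <= working_units <= TOTAL_UNITS:
        raise ValueError(f"working_units must be between 0 and {TOTAL_UNITS}")
    if working_units >= MIN_OPTIMAL_UNITS:
        return "optimal"
    if working_units >= MIN_TOLERABLE_UNITS:
        return "tolerable"
    return "no cooling"


if __name__ == "__main__":
    for count in range(TOTAL_UNITS + 1):
        print(count, suite_cooling_state(count))
```

With one unit wired to always receive chilled water, the suite should stay in at least the "tolerable" state even if the controls on the other two units misbehave, provided chilled water is actually reaching the suite.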

Detailed step by step series of events:
10:38am - HVAC #3 logs indicate a rise in return air temperature.
10:41am - HVAC #2 logs indicate a rise in return air temperature.
10:48am - HVAC #1 logs indicate a rise in return air temperature.
11:31am - Various high temperature equipment alarms begin to generate alerts.
11:37am - Remote readings suggest that HVAC #1 and #2 have shut down. PC staff is immediately dispatched to the site.
11:50am - Security enters 818, doors are propped open for ventilation as requested.
11:53am - HVAC #2 is visually inspected, determined not to be actively cooling, and power cycled to clear any software conditions & alarms on the unit. No effect.
11:57am - PC staff arrive onsite and begin troubleshooting HVAC #1 and #2. Both units show a cryptic status message indicating “remote shutdown”.
12:05pm - Various attempts to diagnose and restart HVAC #1 and #2 fail. T-Fab Mechanical & Wintech Air Systems are emergency paged by remote staff to head to 151, as priority service calls.
12:10pm - Portable 48” fans are brought to the suite, to provide emergency cooling, and reject hot air into the surrounding hall space.
12:21pm - HVAC #3 is discovered to be misreporting its chilled water valve position. HVAC #3 reports a 100% open valve, however chilled water meter numbers indicate the valve is 0% open. HVAC #3 is manually restarted. No effect.
12:50pm - HVAC #3’s chilled water valve opens (cause unknown), and the unit begins to remove large volumes of heat from the room. The temperature in suite 818 begins to drop.
12:55pm - 1:10pm - Core network gear reports that inlet temperatures have returned to normal operating temperature range. Suite is still hot, but many devices are no longer in critical state.
1:50pm - 2:30pm - T-Fab Mechanical staff arrive, and the CW valve actuator is removed from HVAC #2 (CW valve now 100% open). HVAC #2 is put into manual diagnostic/test mode, and the main fan is set to “on”. No effect.
2:45pm - Wintech Air staff arrive and begin diagnosing HVAC #2. Attempts to start the unit fail. The unit is started by temporarily rewiring around the main control board, directly between the fan motor and the input power. With 2 of 3 HVAC units running at 100%, suite temperatures plummet back to normal ranges within minutes.
3:05pm - PC staff begin attending to customer inquiries regarding servers that need reboots, and other server specific problems. All reseller/hosting servers are rebooted as an emergency measure to clear present errors and issues.
3:00pm - 4:25pm - Wintech Air staff diagnose the remaining HVAC #1 (still offline), in an attempt to find the initial issue, and why the units will not restart properly.
4:25pm - 5:00pm - Toronto Fire receives a false report of fire in 818. 4 fire trucks arrive at 151 Front; repair & resolution is interrupted as a result.
5:15pm - A Wintech Air rep discovers that the “firestop” breakers in HVAC #1 and HVAC #2 have been tripped due to high heat. The firestop breakers are reset; both HVAC units are now operating properly. This explains why neither unit would start properly. Investigation into how both breakers could have tripped begins. Long-term investigation begins.
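The 12:21pm entry describes a unit reporting one valve position while the chilled water meter implied another. A cross-check of that kind could be sketched roughly as follows. The `ValveReading` type, its field names, and the 20% tolerance are all hypothetical; this is not how the units or the building's metering actually expose data.

```python
from dataclasses import dataclass


@dataclass
class ValveReading:
    """One paired observation for a unit's chilled water valve (hypothetical type)."""
    unit: str
    reported_open_pct: float  # valve position as reported by the unit's controls
    metered_open_pct: float   # valve position implied by the chilled water meter


def valve_mismatch(reading: ValveReading, tolerance_pct: float = 20.0) -> bool:
    """Return True if reported and metered positions differ by more than the tolerance.

    The 20% default is an arbitrary illustrative value, not a vendor figure.
    """
    difference = abs(reading.reported_open_pct - reading.metered_open_pct)
    return difference > tolerance_pct


if __name__ == "__main__":
    # Values from the 12:21pm entry: unit reports 100% open, meter suggests 0%.
    incident = ValveReading("HVAC #3", reported_open_pct=100.0, metered_open_pct=0.0)
    print(valve_mismatch(incident))  # True
```

The point of comparing against the meter is that it does not rely on the unit's own controls, which were the component misreporting in this case.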

Conclusions:
The 818 suite was built with a very reasonable level of HVAC capacity and redundancy in place. We realize that effectively turning one HVAC unit into an “always on” system results in a less granular level of control over suite temperatures, and that the suite will likely develop a few cold spots. Even so, we believe this will effectively mitigate the risk of such events recurring, and we feel this is the preferable option. PC staff are continuing to follow up with Liebert Canada (HVAC manufacturer), building operations, and our HVAC maintenance/service personnel in a long-term investigation attempting to further isolate the initial causes of this incident. We apologise for the inconvenience this has caused many customers, and are taking every reasonable step to reduce the possibility of such incidents occurring in the future.

Regards,
Myles Loosley-Millman
Priority Colo Inc.
myles@prioritycolo.com
http://www.prioritycolo.com

Post-Mortem update

Post by porcupine »

Hi Guys,

We have a significant update regarding this incident. Considerable new information has been made available to us upon the completion of our vendor investigations.

To Recap - Preventative Actions Taken:

- The actuator has been disabled on the HVAC #3 unit, which means that it will always provide cooling to the full extent of its capacity, regardless of facility conditions. There is sufficient heat load surrounding HVAC #3 to ensure that the surrounding area should never be over-cooled.
- The "firestat"/"firestop" breakers have been permanently removed from the HVAC #1 and HVAC #2 units, ensuring that they will never again disable the fan motors, which was a major contributor to this event.
- Airflow/perforated tiles in the 818 suite have been modified, to provide additional airflow to the "E", "F", and "G" aisles, by removing excessive airflow from the A/B/C aisles.
- APC blanking panels will be provided to all cabinet customers, to help further isolate the hot/cold aisles, in an effort to increase overall efficiency and facility resiliency. The hot aisles will become warmer, and the cold aisles will become colder. While this may be less comfortable for some people, customer equipment should be more comfortable, with less hot/cold air mixing. Cabinet customers will be contacted directly in the coming months for installation permission.
- We have set up a Twitter feed to provide more detailed and granular information during maintenance, non-impacting events, and future incidents/outages. This can be accessed using http://www.twitter.com/PColoMaint . Note: this is an additional source of information for customers desiring a more informal, granular view of facility events, not a replacement for any of the current sources/contact methods.

New Information - Initial Cause:

Building management isolated the initial cause of this incident during their investigations. In an email we received from management, the cause was attributed as follows:
- On Sunday morning, unknown to building management, a building/customer mechanical contractor was performing a modification to the end of the 8" center chilled water riser under the guise of routine maintenance work. The modification involved a new 3HP booster pump intended to stabilize the flow of water on that particular chilled water loop.
- When the pump was energized without any VSD (Variable Speed Drive), it caused a drop in loop pressure, which resulted in Priority Colo's unit experiencing an instant loss of chilled water flow. Receiving little/no chilled water, Priority Colo's suite began to overheat, until the HVAC units' firestat/firestop switches tripped.
- The contractor responded to the building's page and shut down the new pump, and flow to the 818 suite was stabilized. The pump was brought back online in a metered capacity, and remains in this mode to date. This will be fine-tuned over the coming weeks to adjust the chilled water flow levels to 818, which are currently lower than normal, but not impacting facility operations.

Prevention:
- 151 Front St. West management has admitted responsibility for the issue. The contractor in question has been suspended from working in the building, and their company will no longer be allowed to work in 151 Front St. West without a full written plan of action approved by building management, and a chaperone while working.

We are closing this incident as resolved. We now have a full explanation of the events which took place, have reasonable measures in place to help prevent this from occurring again, and have learned as much as we can reasonably expect from the events which took place.
Myles Loosley-Millman
Priority Colo Inc.
myles@prioritycolo.com
http://www.prioritycolo.com