08-14-2003 4:15pm Onwards, Continent-wide power issues
Posted: Fri Aug 15, 2003 8:45 am
Hello everyone,
Well, it seems this is just not a good month in general.
At 4:15PM EST, power failed across most of the northeastern states and most of Ontario (which contains more than half of Canada's population). The UPS systems kicked in immediately, as did some of the generators down at 151 Front Street. After about 45 minutes, one of the UPS chains died out completely, causing power failures to several servers and some of our switches/routing gear.
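(For the curious: the lesson here is that a battery chain buys you minutes, not hours, so you want to know the instant you're on battery. Below is a rough sketch of the kind of watchdog you could run against Network UPS Tools' `upsc` utility; the UPS name, poll interval, and alert messages are made up, so adjust for your own setup.)

```python
#!/usr/bin/env python
# Rough sketch of a UPS watchdog built on NUT's `upsc` command-line tool.
# The UPS identifier below ("rack1@localhost") is a hypothetical placeholder.
import subprocess
import sys
import time

UPS = "rack1@localhost"   # hypothetical NUT UPS name -- substitute your own
POLL_SECONDS = 30

def ups_status(ups):
    """Return the ups.status field from `upsc`, e.g. 'OL' (online) or 'OB' (on battery)."""
    out = subprocess.run(["upsc", ups], capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        key, _, value = line.partition(": ")
        if key == "ups.status":
            return value.strip()
    return "UNKNOWN"

def main():
    was_on_battery = False
    while True:
        status = ups_status(UPS)
        on_battery = "OB" in status.split()  # status is space-separated flags
        if on_battery and not was_on_battery:
            print("ALERT: %s is on battery -- start shedding load" % UPS, file=sys.stderr)
        elif was_on_battery and not on_battery:
            print("OK: %s back on line power" % UPS, file=sys.stderr)
        was_on_battery = on_battery
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()
```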
Equipment was then moved over to circuits on a second battery chain. Shortly after, another, unrelated battery chain in 151 Front reached critical temperatures and was shut down. At that point our Peer1 link disconnected, because this chain was powering the DC power plant that their core Cisco GSR 12000 router was on [on a side note, I asked why not run one supply on AC and one on DC instead of redundant DC supplies, and apparently mixing the two creates noise in the chassis]. Shortly thereafter, 360 Networks/Group Telecom's fiber died; apparently their equipment was shut down (the exact reason is unclear, as they're in a different location).
At this point, everything went down for several hours. Peer1 then decided to route us through a different router out at the 1 Yonge facility (as their POP-to-POP fiber was still up via their Cisco 3500 series switching gear). This got us back online for several more hours, but with a few hiccups: Peer1 lost their cable and wireless link down in NYC, and their Global Crossing link as well. As a result, significantly heavier traffic was going through their Toronto POP, and several switch reboots, route adjustments, and so on were needed.
During the wee hours of the morning, their NYC connectivity came back online, and routes began to return to normal. Additionally, DC power was restored to their router, and our connectivity through Peer1 returned to normal. After a few routing issues and some tweaking, all traffic is currently flowing smoothly out Peer1. The Istop network is currently 100% down; Peer1 is up, seemingly without too many issues.
Traffic will continue to route exclusively through Peer1's connectivity until the Istop network gets back online, at which point there will be some slight hiccups as the routes switch back over to their normal path. That's the long and the short of it; it should be interesting to see how this catastrophe pans out long-term.
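(If you want to watch the switchover happen from your end, a quick traceroute check will show which upstream your packets are leaving through. Here's a rough sketch; the next-hop IPs in it are placeholders, not our real ones.)

```python
#!/usr/bin/env python
# Quick-and-dirty check of which upstream outbound traffic is using, by
# looking for a known provider next-hop in the first few traceroute hops.
# All IPs below are hypothetical placeholders -- fill in the real next-hop
# addresses for your own circuits.
import re
import subprocess

UPSTREAMS = {
    "10.1.1.1": "Peer1",   # hypothetical Peer1 next-hop
    "10.2.2.1": "Istop",   # hypothetical Istop next-hop
}
TARGET = "4.2.2.1"  # any stable, always-reachable external host works

def current_upstream(target):
    out = subprocess.run(
        ["traceroute", "-n", "-m", "5", target],  # numeric output, max 5 hops
        capture_output=True, text=True,
    ).stdout
    for line in out.splitlines():
        for hop_ip in re.findall(r"\d+\.\d+\.\d+\.\d+", line):
            if hop_ip in UPSTREAMS:
                return UPSTREAMS[hop_ip]
    return "unknown"

if __name__ == "__main__":
    print("Outbound traffic is currently going via:", current_upstream(TARGET))
```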
For anyone whose server was on the first UPS string and got needlessly rebooted, we're sorry, and we hope you didn't lose any uptime records.