22-Mar-2005 - Core1 Flaps, 5:30PM, and 7:30PM EST

Post by **porcupine** » Tue Mar 22, 2005 8:33 pm

Hi guys,

As you may (or may not) have noticed, our Core1 router has flapped twice in the past three hours. The VIP #4 linecard is clearly failing (or at least its memory is), as some may remember it doing late last year (during the holiday season).

We had thought this was resolved with an IOS upgrade (it wasn't clear at the time that it was a hardware fault), which settled the problem for roughly 3.5 months, until today. It's clear that the hardware on the VIP4 interface card is dying, so we will be performing emergency maintenance tonight to swap out VIP4 for a new linecard. Customers can expect to see BGP re-routing around our TorIX connection through alternate paths during this period, as our TorIX connection passes through this VIP.

Please email admin@prioritycolo.com should you have any technical or general questions. We carry numerous onsite spares for this linecard, so there should not be any issues with this replacement.

Post by **porcupine** » Tue Mar 22, 2005 10:16 pm

Hi guys,

Well, Murphy's law, as always, proves correct.

As you will note in the maintenance posts, last night Core2 was replaced with a new switch/router with additional capacity, named MSFC01. Normally there are connections present between core1 > core2, core2 > sw01, core1 > sw01, and various others (but since most people are on sw01, that's the one that matters). This forms a basic triangle pattern if you draw it out on paper, the premise being that if any one link goes down, the other two still provide a path.
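For anyone who wants to sanity-check the triangle idea, here is a minimal Python sketch. It only models the three links named above as an undirected graph; the function names are made up for illustration, and it says nothing about how the real routers are configured.

```python
from collections import deque

# The three devices and links named in the post. msfc01 takes core2's
# place in the triangle after the swap; the labels are illustrative only.
NODES = {"core1", "core2", "sw01"}
TRIANGLE = [("core1", "core2"), ("core2", "sw01"), ("core1", "sw01")]


def reachable(links, start):
    """Return the set of nodes reachable from start over undirected links."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def survives_single_link_failure(links, nodes, start="sw01"):
    """True if every node stays reachable from start after any one link fails."""
    for failed in links:
        remaining = [link for link in links if link != failed]
        if reachable(remaining, start) != nodes:
            return False
    return True


if __name__ == "__main__":
    # Any single link can fail and sw01 still reaches both routers.
    print(survives_single_link_failure(TRIANGLE, NODES))
    # If both links into sw01 are gone, sw01 is cut off from everything.
    print(reachable([("core1", "core2")], "sw01"))
```

The point of the sketch is that the triangle only tolerates one failure at a time: lose both links that touch sw01 and customers on it are isolated, no matter how healthy the remaining link is.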

It would appear that last night's maintenance did not go off without a hitch as previously thought. There were problems with the link between msfc01 and sw01. Thus when core1 failed in this instance, msfc01 had no way to communicate with sw01, and customers on sw01 lost connectivity. While several customers have connectivity directly to msfc01, the vast majority do not. This was simply a configuration error, but what resulted produced a cascade effect:

While swapping in the new linecard for VIP4, the error log indicates that it did not seat correctly. This in turn crashed the VIP card being inserted, which then crashed the 7507 router. While the router failed over to the secondary RSP (redundant routing modules), this had no effect, as the other VIPs and sub-interfaces were disabled due to the bad connection on the bus. This resulted in core1 going offline until the trouble could be diagnosed, the card removed, and the router hard-booted.

Everything was back up at 9:01pm EST, after a rather troublesome 20-minute downtime. We are presently working to scan the configurations of core1, msfc01, and sw01 before re-attempting this maintenance.

We will be re-attempting the replacement of this VIP card within the next 2 hours, and will likely turn down the connections to core1 beforehand (in case of catastrophic failure), routing everyone out through the msfc01 router until the VIP has been replaced. At that point we will return the network to its regular configuration and redundancy.

I sincerely apologise for any inconvenience this may have caused, and hope that everyone can understand. PC's network has been operating with next to no downtime for over a year at this point, and unfortunately, human error does happen. We will continue to strive to add further diversity to our network (with a new 7206 VXR router that's scheduled to be deployed within the next month, as it was shipped a few days ago), and continue to provide the same quality of service I sincerely hope everyone enjoys and appreciates.

Regards,

Post by **porcupine** » Tue Mar 22, 2005 10:41 pm

We are now gracefully diverting traffic from core1 > msfc01 to ensure that if core1 goes down again while re-inserting the VIP card, we do not experience a bunch of BGP flaps. People will notice traffic routing exclusively over Peer1 shortly as we stop announcing our routes temporarily over NAC and Teleglobe.

Post by **porcupine** » Wed Mar 23, 2005 4:42 am

Just to conclude,

Everything is now back up and running 100%. On the second attempt, core1 accepted the new VIP #4 without any issues, at which point we were able to turn the BGP sessions back up and allow traffic to flow back over core1.

The connectivity between core1 > msfc01, msfc01 > sw01, and core1 > sw01 has all been restored, and the network is once again operating at full capacity, with the redundancy features in place.

Once again, I apologise for any inconvenience the outage last night caused.

Regards,