www5

Technical support for webhosting Resellers, and questions which directly relate to their services.

Moderator: Admins

sbrook
newbie
Posts: 9
Joined: Thu May 10, 2007 11:45 am

www5

Post by sbrook »

Seems that www5 has been up and down like a yo-yo over the last few days, and is gradually becoming more "down" than up!

What's happening guys?
porcupine
Site Admin
Posts: 703
Joined: Wed Jun 12, 2002 5:57 pm
Location: Toronto, Ontario

Re: www5

Post by porcupine »

Hello sbrook,

Indeed you are right. The new www5.pcdc.net server has been causing some pretty serious administrative frustration over the past week.

While all of the new reseller servers are incredibly under-loaded (as each server's individual status page clearly reflects), the www5.pcdc.net server has been crashing at random for unknown reasons. We have been trying a number of methods to narrow down the cause, including new kernels (it's now running the latest, compiled from scratch), monitoring the KVM output during crashes, and even a full hardware swap (the other night, when www5 crashed and we were standing just a few feet away from it, we pulled the drives and swapped them into a brand new, identical rig, obviously figuring the problem to be hardware related).

Unfortunately, at this time it seems that the problem is neither hardware nor configuration (all of the reseller servers are running identical OS configurations on identical hardware). We've set up a few measures to try to narrow down the issue, and hope to have it resolved shortly. When the server crashes it does dump output to the monitor, but the error message is so long that the useful details scroll off the screen, and the output that remains is of no value (you can't scroll or page up at that point).

We've just set up a serial console, sending the output to a console server, to try to grab the error messages so this can be properly diagnosed. Until then, all we can do is jump on the server as quickly as we see problems (which is why you see a bunch of 2-5 minute outages; that's how fast we're getting someone to a console to resolve it).
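
As a rough illustration of the idea (not the actual setup described here), the point of capturing console output elsewhere is that every line gets written to a log as it arrives, so nothing can scroll away. The sketch below is a minimal Python version under stated assumptions: `log_console` is a hypothetical helper name, and the source is assumed to be any byte stream that yields the console text line by line (for example a serial device opened as a file); here it is exercised with an in-memory stream only.

```python
import io
import time
from typing import BinaryIO, Callable, TextIO


def log_console(source: BinaryIO, sink: TextIO,
                clock: Callable[[], float] = time.time) -> int:
    """Copy every line from source to sink, prefixed with a timestamp.

    Returns the number of lines copied. Hypothetical helper for illustration.
    """
    count = 0
    # readline() returns b"" at end of stream, which stops the loop.
    for raw in iter(source.readline, b""):
        # Console output may contain stray bytes; replace rather than fail.
        text = raw.decode("ascii", errors="replace").rstrip("\r\n")
        sink.write(f"{clock():.3f} {text}\n")
        # Flush after each line so the log is current when the crash happens.
        sink.flush()
        count += 1
    return count


if __name__ == "__main__":
    # Stand-in for a console stream; the text here is made up for the demo.
    demo_source = io.BytesIO(b"first line\r\nsecond line\r\n")
    demo_sink = io.StringIO()
    log_console(demo_source, demo_sink)
    print(demo_sink.getvalue(), end="")
```

Flushing on every line is the important design choice: a crash can end the stream at any moment, and the last few lines before it are usually the ones that matter.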

Aside from that, if your business (or any other customer's) has been impacted, please don't hesitate to contact me directly (myles@), and we can process a credit for this month's service, as this obviously does not reflect what the new servers are intended to deliver :).

[moved to the reseller forum].
Myles Loosley-Millman
Priority Colo Inc.
myles@prioritycolo.com
http://www.prioritycolo.com
sbrook
newbie
Posts: 9
Joined: Thu May 10, 2007 11:45 am

Re: www5

Post by sbrook »

Thank you for the comprehensive update.

Been there, done that, got the bruises to show for it! And I absolutely know the "ooops, the error message went away" problems ... so very frustrating ... either a million errors occur after and push what you want out of the way, or as in this case, it's on the right hand invisible screen!

It certainly was smelling of hardware ... but if you've done a bulk hardware swap, then it sure isn't that!

Good luck! Don't need a credit yet, but what would have been much appreciated is a pro-active note in here to say "we're working on problems with www5" so I didn't have to feel like I was bugging you with this post. You guys are normally so good about all other notifications for maintenance.

Thanks!
porcupine
Site Admin
Posts: 703
Joined: Wed Jun 12, 2002 5:57 pm
Location: Toronto, Ontario

Re: www5

Post by porcupine »

Indeed, we did manage to trap an error on this morning's reboot from the serial setup, and the machine check exceptions do indicate it might be hardware after all. The mcelog is empty (go figure), so we've got nothing specific to go on, but the additional information from the console does show the motherboard model and various CPU/[memory?] bank locations, and states that "this is not a software problem, contact your hardware manufacturer".

We will be scheduling maintenance for tomorrow afternoon/evening (~5-10 minutes) to swap the motherboard with a newer revision that came in our latest parts batch this morning. Half of the servers of this spec run Kingston memory and the other half run Samsung, so if the motherboard swap has no effect, I've picked up another batch of memory and we'll swap this server over to Kingston on Sunday.

Normally we only post updates when we have a diagnosis or plan of action. Unfortunately, past experience has taught us that effectively telling people "we don't yet know what's wrong, how, or when we're going to have it fixed" is a very bad idea in general (hence why we've been working on and posting about this issue reactively, as incidents occur, instead of pro-actively). This is one of the few cases where the responses reflect the issues more than the general policy :).
sbrook
newbie
Posts: 9
Joined: Thu May 10, 2007 11:45 am

Re: www5

Post by sbrook »

OK ... I think I can understand why ... you get smart people pestering you a lot, and / or you get lots of backseat drivers. (quarterbacks in the stands!)