09-Dec-2019 - www2.pcdc.net server hardware failure

Post by **porcupine** » Fri Dec 13, 2019 5:18 am

Hi Folks,

As many shared/reseller hosting customers on the www2.pcdc.net server directly observed, the server in question experienced a significant outage, the result of a catastrophic failure of the server's RAID1 array, which hosts the operating system.

The following postmortem is a breakdown of the specific timing, events, and details surrounding this incident. All timestamps are in Eastern (local) time:

Timeline of Events:
2:09am - Monitoring system sends out an emergency page to staff, indicating HTTP on www2.pcdc.net is not responding.
2:11am - Staff log in to KVM over IP device to investigate. Server appears to have partially locked up.
2:15am - Server is rebooted, and starts booting back up.
2:25am - Server is back online, investigation begins as to why the server had locked up.
2:35am - Investigation points to a problem with the main operating system RAID array. Speculation is that when the failed drive dropped out of the array, something went wrong, resulting in the system getting rebooted.
2:55am - It's determined the drive is definitely bad. Decision is made to replace the drive immediately as opposed to waiting.
3:35am - Staff arrive on-site in Tor1 facility, locate and remove/replace the failed drive. Rebuild commences.
3:55am - RAID rebuild is > 25% complete, staff head home, no further intervention should be necessary at this point.
4:10am - Emergency pager goes off again, HTTP is once again down. It's not immediately caught, as staff are in transit.
4:20am - The emergency page is noticed, and staff log in to KVM over IP again to check on what is happening. It appears the server has locked up mid-rebuild. Remote diagnostics commence, as one drive has already been replaced (and the second can't possibly be replaced until the data has been mirrored onto the new SSD). Attempts to coax the drive into rebuilding begin.
4:45am - Staff give up on remote troubleshooting, and return to Tor1 facility to troubleshoot directly (with the previously pulled bad drive in hand, intent on possibly trying to rebuild the bad array with both bad drives, to at least get the system back online before proceeding to additional efforts).
5:03am - Staff arrive on-site and continue troubleshooting.
5:14am - Initial Twitter post is made to our maintenance/emergency Twitter feed (www.twitter.com/pcolomaint/), announcing that the www2.pcdc.net server has been pulled offline due to issues with the RAID. Troubleshooting continues.
8:30am - After several hours of troubleshooting, it's become clear both SSDs in the RAID1 set have critical issues in the hardware itself. Twitter post updated to confirm the server is down due to catastrophic hardware failure. Attempts to coax the drives to rebuild continue.
9:55am - After numerous failed attempts to coax the RAID into rebuilding, taking over half an hour for each attempt, all hope of rebuilding the existing array is lost. Twitter post is updated indicating server will have new RAID1 array installed, CentOS reinstalled, and data mirrored off backups as necessary. The RAID disks have already been replaced, and the process has already been started.
10:05am - While CPanel is installing, staff take the opportunity to leave the Tor1 facility and return to the office.
10:25am - OS has been successfully reinstalled on RAID1 consisting of two new drives. CPanel has been installed, services start coming back online.
10:59am - Twitter updated to confirm OS and CPanel have been reinstalled successfully. Work has already commenced on assessing the current state of the system, and configuring CPanel/WHM again to mirror the original configuration as closely as possible.
12:47pm - Tickets are filed with CPanel.net, when staff realize that CPanel backs up a slew of system files (specifically including its own configurations, settings, etc.) but there is no apparent tool/restoration method documented to restore said files. This can't be right...
1:27pm - CPanel responds to ticket, verifying that no such restore program exists. Yes, you read that properly.
1:27pm - CPanel staff point to a document with vague instructions on "settings copying tools" that aren't included in the backup routine, and only work on a functional server. It becomes very evident a slew of adjustments will have to be made by hand, hampering restoration efforts. Heated emails are exchanged as this is clearly a lazy cop-out on CPanel's part, and not what is expected of such a mature product. Additional ticket responses are continuously exchanged to quickly troubleshoot the inevitable issues that result from the lack of formal procedure/written documentation.
1:52pm - Twitter updated to verify that files/settings/etc. are still being restored by hand. Services are being checked & configured by hand, on a service by service basis.
3:19pm - Twitter updated to verify that all services, with the exception of MySQL, should be up and running in a normal manner.
4:13pm - Twitter updated to verify MySQL is once again online, and the system has been normalized. Users are asked to report if any additional issues are observed. Staff continue to work in the background to tweak and verify settings in WHM.
[...]
10-Dec-2019 - When reviewing the server log files, it's observed that a number of accounts have not backed up since the server was rebooted, due to previously hidden MySQL database errors in the given customer accounts. Customers are notified of this issue (which prevents their backups from completing), and offered three options: have our staff restore the MySQL databases to 08-Dec-2019's backup, be provided a root level backup in order to restore themselves, or correct the discovered issues through other means.

Several customers inquired about the redundancy/backups/etc. on the reseller servers, so a few quick notes to clarify questions received:
- All of the customer-facing drives on both reseller servers use hardware RAID1. The operating systems are on SSDs, while the home directories are on SATA disks. This provides a good mix of performance and space, where needed.
- The RAID arrays continuously check data for consistency on any data read. This is one of the primary methods for detecting drive errors/failures. When drive failures do occur, all of the chassis use hot-swap drive trays, and drives are typically replaced without any interruption to services.
- The RAID controllers also run weekly consistency checks, every Saturday morning @3am, where they compare each drive in the array, and if inconsistencies are found, they are corrected (or failures are logged/alerted/etc.). This would normally detect issues with any part of the drive, regardless of whether it is normally accessed (i.e., effectively checking idle space for errors as well).
- Several copies of both Daily and Weekly (3/ea) backups are stored on the server side. Backups are synchronized daily to an on-site backup server (which also retains 2 copies of monthly backups). Every week, the on-site backup server mirrors all of its backups to an off-site backup server (in our Tor2 facility, in Markham, Ontario).
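
For those curious what a weekly consistency check amounts to conceptually, here is a rough Python sketch. To be clear, this is only an illustration and not how our RAID controllers actually implement it (that happens in controller firmware). The function name, the block size, and the use of in-memory byte strings to stand in for the two mirror members are all assumptions made up for the example.

```python
from typing import List

# Hypothetical block size, picked only for illustration.
BLOCK_SIZE = 4096


def find_inconsistent_blocks(mirror_a: bytes, mirror_b: bytes,
                             block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the indices of blocks that differ between two RAID1 members.

    Every block is compared, whether or not it holds live data, which is
    why a full sweep can catch problems in rarely-read or idle space.
    """
    if len(mirror_a) != len(mirror_b):
        raise ValueError("mirror members must be the same size")

    mismatches = []
    for offset in range(0, len(mirror_a), block_size):
        block_a = mirror_a[offset:offset + block_size]
        block_b = mirror_b[offset:offset + block_size]
        if block_a != block_b:
            mismatches.append(offset // block_size)
    return mismatches


if __name__ == "__main__":
    # Two simulated three-block "drives", with one byte corrupted on the second.
    drive_a = bytes(BLOCK_SIZE * 3)
    drive_b = bytearray(drive_a)
    drive_b[5000] = 1  # falls inside block index 1
    print(find_inconsistent_blocks(drive_a, bytes(drive_b)))
```

Note that a comparison like this only says the two copies disagree. Deciding which copy is the good one, and correcting or alerting accordingly, is up to the controller.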

Preventative Measures:
- Both SSDs in the RAID1 that hosts the core operating system have been replaced with brand new drives. The RAID1 has been rebuilt, and consistency checked accordingly.
- A feature request was put in with CPanel to incorporate the WHM settings copying/backing up tool into the regular system level backups.

Please let me know directly if you require further details, or have any questions, or concerns regarding either this event, or this postmortem. Once again I'd like to extend my apologies for any inconvenience caused by this event.

Regards,