Yesterday, we carried out critical maintenance on a few of our Linux servers. During the procedure the services were offline, and a few of our customers experienced extended downtime.
So what exactly happened?
We needed to apply a critical kernel patch and run security updates on the server in the wake of recently disclosed Linux security vulnerabilities, which are currently being exploited on public-facing Internet systems worldwide. No active attacks on our servers have been reported so far, but our Security Engineers rated these vulnerabilities "Very Critical", so we had no choice but to perform immediate maintenance. This kind of emergency patching completes without issues about 99% of the time, with an estimated downtime of 10-25 minutes.
We applied the updates, but the server then had a kernel panic: an incorrectly patched binary corrupted kernel memory, and we were forced to reboot the server by contacting the NOC. At blade start-up, the POST diagnostics test the CPUs, DIMMs, HDDs, and adapter cards, and the NOC technicians reported that the POST messages showed a worrying filesystem error. As a result, the boot process was stuck. We connected to the server over IPMI and initiated a second reboot. During that reboot, before the GRUB bootloader came up, the server reported a filesystem error and dropped into fsck. We let the filesystem check run, which took several hours.
After the check, the server came back online. However, some corrupted server binaries were still present, so some essential services, including web services and email, refused to start. Rest assured that your data was safe throughout.
Our senior technicians began working on the issue with the highest priority. An ETA could not be determined in the initial stages, though our engineers kept working on the case. We brought the affected services back up one after another until all services were finally restored. As of now, the servers selected for patching are secured and the process is complete. We apologize for the inconvenience and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Hostcats, and we are continuously working to improve our systems. If your website is not functioning or you have any related issues, please open a support ticket with us.
We will be extending the same patching to our other Linux servers within this week. In the wake of this incident, we have formulated a plan that will be executed with virtually no downtime for those servers. Once that work begins, we will post status updates here, in this thread, as it develops.
Monday, July 31, 2017