Catastrophic Server Failure Case Study

The Challenge:

A new office location with a small server room, another MSP, inadequate cooling, and a server nearing end-of-life led to a catastrophic hardware failure that shut the business down.

The Outcome:

ARC expedited the procurement of a new server, then configured and implemented it within a week.

About the Client:

The client is a cost/power management company committed to helping Canadian companies control & improve their profitability by lowering ongoing expenses with respect to electricity, natural gas, and card payment processing.

They are the largest independent energy aggregator in Canada and have the expertise to help their clients save on electricity and natural gas, contain costs, or both.

They are also experts in credit card processing solutions. With more than $1 billion in credit card payments processed annually, they provide world-class solutions, world-class service, and world-class pricing.

Business Case:

ARC has been the MSP for the cost/power management company for roughly a decade. After moving office locations, the client set up their servers in a small server room with insufficient cooling. The problem did not become apparent until after the servers failed, when it was discovered that the building turned off its HVAC systems overnight and on weekends to conserve power. The servers were nearing end of life, and the stress of the overheating environment caused a catastrophic failure that shut the business down. Efforts to restart the server were only partially successful, and the server needed emergency replacement.

Business Solution:

When the server failed, ARC technicians first worked to restart it. The restart succeeded well enough for the business to resume operations. With the server running again, ARC technicians began an in-depth investigation and determined that the server was undergoing a catastrophic failure. They brought in a loaner server and transferred the workload to it. Once the loaner was up and running, the failing server was removed. While the loaner was being installed, arrangements were made to expedite the procurement of a new server. Because this was a business-critical emergency, shipping and configuring the new server were given the highest priority. Once configured, the new server was installed, the workloads were transferred to it, and connections were re-established. The loaner server was left in place as a shadow server to minimize the outage if an unforeseen problem occurred. After the new server had run for two weeks with no issues, the loaner was removed. ARC's ability to provide an immediate loaner kept the organization's business up and running while the new equipment was procured and installed.

“ARC’s quick response and ability to get us a loaner server quickly allowed us to get our business up and running again within a matter of days as opposed to weeks.

The failure analysis exposed an issue with our cooling system in the server room and let us take steps to monitor the temperature and prevent a critical overheating scenario from happening again.

We also implemented a replication strategy to ensure that, should we ever suffer a catastrophic failure again, we would be able to be back up and running in a matter of hours.”

