Server Down

We had to reboot a client’s server this lunchtime as part of a planned routine maintenance programme. The engineer I sent jokingly asked, “What happens if it does not restart?” Well, we found out.

The machine is new, having been onsite only since late spring this year. It has been rebooted a number of times since then, regular checks have been carried out on it without issue, and the monitoring software had reported no problems. The restart was at the request of an external support engineer, to clear down a set of logs for monitoring purposes.

Our onsite engineer put the Disaster Recovery (DR) plan into action as soon as he realised the issue was likely to be serious. By the time he reported the issue to me, a specialist server engineer was en route with a spare DR server on board. I then went to site to manage the problem, releasing my engineer to go on to other clients who were expecting him.

Now it is just a matter of time. We have a DR “back to work” target of 72 hours at this client, and we have a two-pronged recovery process running to meet (and beat) this target.

  • In one office we have set up a small temporary network and are recovering a server image (taken in the early hours of this morning) to the DR server. Once that completes, this box can be configured to take the broken server’s place on the network.
  • We are testing the client server and hope to repair the fault. If we can repair it, we will let it run in the morning.

This plan minimises downtime for the client and gives us two solutions to work on in parallel, rather than the more traditional process of repairing the broken equipment, then installing operating systems and applications, then recovering data from tape.

If we cannot effect a repair on the client server, we can use the server manufacturer’s warranty if it is a hardware issue, or use an image from the “in service” DR server to repair any software issues. The client can continue business using our DR server.

So we have just finished our pizza and it is back to work.