We had to reboot a client’s server this lunchtime as part of a planned routine maintenance programme. Jokingly, the engineer I sent asked, “What happens if it does not restart?” Well, we found out.
The machine is new, having only been onsite since late spring this year. It has been rebooted a number of times since then, regular checks have been carried out on it without issue, and the monitoring software had reported no problems. The restart was at the request of an external support engineer, to clear down a set of logs for monitoring purposes.
Our onsite engineer put the Disaster Recovery (DR) plan into action as soon as he realised the issue was likely to be serious. By the time he reported the issue to me, a specialist server engineer was en route with a spare DR server on board. I then went to site to manage the problem, releasing my engineer to go on to other clients who were expecting him.
Now it is just a question of time. We have a DR “back to work” target of 72 hours at this client, and we have a two-pronged recovery process running to meet (and beat) this target.
This plan minimises downtime for the client and gives us two solutions we can be working on, rather than the more traditional process of repairing broken equipment, then installing operating systems and applications, then recovering data from tape.
If we cannot effect a repair on the client’s server ourselves, we can call on the server manufacturer’s warranty if it is a hardware issue, or use an image from the “in service” DR server to repair any software issues. In the meantime, the client can continue business using our DR server.
So we have just finished our pizza and it is back to work.