A Case Study in Restore Nightmares

I’m learning to fly but I ain’t got wings
Coming down is the hardest thing

Well the good old days may not return
And the rocks might melt, and the sea may burn

Tom Petty

Read the first part of the story first.

A few hours passed, and Sasha felt more secure about starting the recovery of the ClearCase source files. The initial restore was successful (300 GB and 50 million files) and everyone was happy. Everyone but yours truly.

Me: How do you know the recovery was successful?

Sasha: The recovery software says so.

Me: Well, we know what that’s worth. How do I know all the files are OK?

Sasha: We will run FSCK on all the file systems and see what happens.

Me: That’s a good start, but it still does not tell me which files we lost, which data is corrupted, or whether anything changed. And by the way, how long will it take to run FSCK?

Sasha: These are excellent questions. I don’t have any idea. Let’s try and see what happens.

Backup and Restore Case Study

Running FSCK took a few more hours, but everything seemed OK. We tried to bring ClearCase up, but it refused to start. Surprisingly, the restore process had not restored all the file permissions and soft links. These had to be added manually.
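Missing permissions and soft links are the kind of damage that a "restore successful" message will not report. One way to catch it is to record this metadata before disaster strikes and compare it after the restore. The following is a minimal Python sketch of that idea, not what we actually ran; the function names `snapshot_metadata` and `diff_metadata` are made up for illustration, and it assumes both trees (or a saved snapshot of the original) are reachable from one machine.

```python
import os
import stat


def snapshot_metadata(root):
    """Map each relative path under root to (permission bits, symlink target or None)."""
    snapshot = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            info = os.lstat(full)  # lstat describes the link itself, not its target
            target = os.readlink(full) if stat.S_ISLNK(info.st_mode) else None
            rel = os.path.relpath(full, root)
            snapshot[rel] = (stat.S_IMODE(info.st_mode), target)
    return snapshot


def diff_metadata(before, after):
    """Return (paths missing after the restore, paths whose mode or link target changed)."""
    missing = sorted(set(before) - set(after))
    changed = sorted(p for p in before if p in after and before[p] != after[p])
    return missing, changed
```

In practice the "before" snapshot would have to be written somewhere that survives the failure, for example alongside the backup itself.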

I was still not sure that we hadn’t lost any information. With 300 modules, 20 million files and thousands of branches, it is really hard to know that nothing was lost. It could take months before someone tries to build an old package and finds out it is missing. With databases there are multiple integrity checks in place. With a version control system such as ClearCase it is not so simple.

When we called the ClearCase guys (also IBM), it turned out there is a secret script that validates the internal consistency of the setup. We decided to run the script and, at the same time, build all the main products on the compile servers. If all the compilation results came out identical to what we already had, we could be quite confident that we had at least the latest source available.
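The "identical compilation results" check can be automated by hashing every file in the old build output and in the fresh one, then listing the differences. Here is a rough Python sketch under that assumption; `hash_tree` and `compare_builds` are hypothetical names, and the directory layout is whatever your build produces. Keep in mind that many toolchains embed timestamps or paths in their output, so a byte-for-byte mismatch does not always mean the source is different.

```python
import hashlib
import os


def hash_tree(root):
    """Map each regular file's relative path under root to its SHA-256 hex digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # only file contents are compared here
            h = hashlib.sha256()
            with open(full, "rb") as f:
                # Read in 1 MiB chunks so large artifacts do not fill memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[os.path.relpath(full, root)] = h.hexdigest()
    return digests


def compare_builds(reference_dir, rebuilt_dir):
    """Report files that are missing, extra, or different in the rebuilt output."""
    ref = hash_tree(reference_dir)
    new = hash_tree(rebuilt_dir)
    return {
        "missing": sorted(set(ref) - set(new)),
        "extra": sorted(set(new) - set(ref)),
        "different": sorted(p for p in ref.keys() & new.keys() if ref[p] != new[p]),
    }
```

An empty report for every main product is the result you are hoping for; anything listed under "missing" or "different" is where to start digging.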

What we soon discovered was that we couldn’t do both at the same time. The hidden script was locking the files, so we had to run the build after the script. Of course, nobody at IBM could estimate how long the script would run. I sent all the developers home at this stage.

A couple of hours later everything seemed to be in order. I went back home and prepared for the holy day. Yom Kippur is the day of the year when Jews are supposed to ask for forgiveness for the evil they did to their fellow men. It is also a day for reflection. I felt this was a very appropriate opportunity.

Lessons Learned:

  • A backup without frequent restore exercises is like a pizza with no cheese
    • Just like High Availability never works in Passive Active mode
  • Everything starts from the requirements. IT is no different from R&D
    • Restore time is one example 🙂
  • If you can’t prove that the restore works, you have a serious problem
    • This is a hard one. Think: what will let you sleep well at night?
  • When a crisis happens, it is very nice to have a process in place.
    • Start from the phone list of key people
  • Trying to save money might cost you lots of money
    • There is usually a reason for high-end products
  • Trust no one
    • If they say they have a backup, ask them for the DRP plans
    • If they claim to have a backup, ask them to restore something
