A Case Study in Backup Nightmares

It was a month before Yom Kippur, the holiest day of the year. The IBM Central Storage server sent out a distress email to Sasha, our sys-admin: “My backup battery needs to be replaced. Three years have gone by and I’m going to depart this life in a month.”

“Time flies when you’re having fun”, thought Sasha. He was not very worried, as this was just the backup battery for the case where both the main power and the UPS fail. For the geek readers: the battery allows the cached files to be written from memory to the disk.

Our lovely sys-admin called up his IBM reseller, but unfortunately they were out of stock for this model and promised to call back. Three weeks later, the battery sent a reminder: “I’m moving to a better world in a week, please replace me!”

Sasha realized the reseller never called him back, which was not a huge surprise. He called them again. The reseller still didn’t have the right battery, but had a compatible model authorized by Big Blue. A special replacement ceremony was scheduled just two days before the holidays.

Replacing the battery went well. Strangely, it required turning the server off, but the beast came up again nicely. For exactly one minute. Then it completely crashed!

Then they called me, the poor director of development, who is also in charge of development infrastructure.

Me: What’s up?

Sasha: The main storage Shark is down.

Me: What are the implications?

Sasha: We don’t know. We think everything is down: SAP, ClearCase, ClearQuest, personal home directories.

Me: Where is the IT Director, Mr Wolf?

Sasha: On a two-week vacation in Italy.

Me: Where is the backup admin?

Sasha: He left the company one month ago, no replacement yet.

Me: I’m coming over.

When I came over, the place was in mayhem. The CIO had arrived, but he didn’t have a clue, technically, about what was going on.

Naturally, the first thing Sasha tried to do was to restore the storage from the backup tapes. He quickly found out that the index file for the tapes was stored on (pause here, brace yourself) the central server itself. Sasha blessed the idiot who had decided to put it there just because it was the fastest storage available.

“Never mind”, he thought, “I will restore the files manually”. Little did he know that the Tivoli restore software had a bug: if the tapes are delivered in the wrong order, the daemon crashes and all the restore processes have to be started from scratch.

I came over to poor Sasha.

Me: When will ClearCase be up and running?

Sasha: I don’t know. We are currently restoring just SAP, because we are afraid that a mistake in one restore will kill all the others.

Me: OK, SAP is more important, but when are you going to finish restoring it?

Sasha: I have no idea. At the current rate it will take around 22 hours.

Me: What do you mean, “I have no idea”? Don’t you have a disaster recovery plan (DRP) in place? Isn’t the recovery time the key to building the DRP setup?

Sasha: No. We can only measure the current rate and guess the future.

Me: Why is the rate so slow? The whole company is stuck and Yom Kippur is approaching.

Sasha laughed sadly. “We wanted to buy a backup system with five tapes, so restoring could be much quicker. But our CEO thought that this was much too expensive, because during normal backups only one tape is busy. He hates paying for idle equipment.”
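
For the geek readers again: “measure the current rate and guess the future” can be written down in a few lines. This is only a rough sketch, not anything we actually ran that night. The function name and the gigabyte figures are made up for illustration (chosen so that a single tape lands on the 22 hours Sasha quoted), and it assumes, optimistically, that restore speed scales linearly with the number of tapes working in parallel.

```python
def remaining_restore_hours(total_gb, restored_gb, elapsed_hours, tapes=1):
    """Extrapolate the remaining restore time from the rate observed so far.

    Assumption: the observed rate was measured with ONE tape, and adding
    tapes multiplies the rate (ideal linear scaling, rarely true in practice).
    """
    if restored_gb <= 0 or elapsed_hours <= 0:
        raise ValueError("need some measured progress to extrapolate")
    if tapes < 1:
        raise ValueError("need at least one tape")
    rate_per_tape = restored_gb / elapsed_hours   # GB per hour, one tape
    remaining_gb = total_gb - restored_gb
    return remaining_gb / (rate_per_tape * tapes)


# Hypothetical figures: 2300 GB to restore, 100 GB done after the first hour.
one_tape = remaining_restore_hours(2300, 100, 1.0, tapes=1)
five_tapes = remaining_restore_hours(2300, 100, 1.0, tapes=5)
print(f"one tape:   about {one_tape:.1f} hours left")
print(f"five tapes: about {five_tapes:.1f} hours left")
```

Under these assumed numbers the gap is roughly a full day of downtime versus an afternoon, which is the trade-off the CEO was weighing against idle equipment.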

To be continued…
