There’s a lively discussion underway on TidBITS Talk about backups. I posted a story about a nightmare I had trying to restore one of my servers, even though I had thorough and current backups that I had tested. Since TidBITS is Apple-oriented, I thought I might repost here for the rest of us.

-""-.,,.-""-.,,.-""-.,,.-""-.,,.-""-.,,.-""-.,.-""-”

A previous poster said:

We do the testing manually by restoring a dozen or so randomly selected files.

I once had a mission-critical server (Linux) backed up daily to Amazon Simple Storage Service (S3) using duplicity. I would periodically restore a few files just to make sure everything was working.
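For context, the setup looked roughly like this. The bucket name and paths below are placeholders, not the real ones, but the duplicity invocations are the standard form: a nightly incremental to S3, plus occasional spot-check restores of individual files.

```shell
# Sketch of the setup; bucket name and paths are placeholders.
# Nightly cron job: incremental backup of the server to S3.
duplicity incremental /srv s3://my-backup-bucket/server

# Periodic spot-check: restore a single file somewhere safe
# and compare it against the live copy.
duplicity restore --file-to-restore etc/fstab \
    s3://my-backup-bucket/server /tmp/restore-check/fstab
```

Single-file restores like the spot-check above are exactly the kind of limited testing the previous poster described.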

One day the server’s disk became corrupted (thanks, systemd!) and it wouldn’t boot. Because I had a full backup that was less than a day old, I decided the most expedient course was to just restore the entire server.

I started the restore. I knew it would be slow: the files were stored online at AWS. When I did test restores, it took a while (maybe 1-2 minutes per file restored), but I figured that was overhead from searching for the files through hundreds of incremental backups, overhead that wouldn’t apply to every. single. file. on a full restore.

I was wrong.

After 24 hours, the restore was less than 1% complete. Thinking that internet delays were the culprit, I downloaded the entire backup set, which required only a few hours. I then ran the restore locally. It didn’t speed up. It turns out that duplicity’s restores slow to a crawl when hundreds of incremental backups accumulate in a chain, and I was doing daily incrementals while rotating the backups (starting a new full backup) only once a year.
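In hindsight, the pileup was avoidable with duplicity’s own options. The flags below are real duplicity features; the monthly schedule, bucket name, and paths are just illustrative choices, not what I actually ran.

```shell
# Placeholder bucket and paths. Force a new full backup once a month,
# so no incremental chain grows longer than ~30 links.
duplicity --full-if-older-than 1M /srv s3://my-backup-bucket/server

# Keep only the three most recent full chains; delete older ones.
duplicity remove-all-but-n-full 3 --force s3://my-backup-bucket/server
```

With a monthly full, a restore walks at most a month of incrementals instead of a year’s worth.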

One minute per file doesn’t seem that bad, but for a server with right around a million files on it, that puts the total restore time in the neighborhood of two years.
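The arithmetic is worth spelling out, since the per-file number sounds so harmless. A back-of-the-envelope calculation (using my rough observed rate of one minute per file):

```python
# Restore-time estimate: ~1 minute per file, ~1,000,000 files.
minutes_per_file = 1
files = 1_000_000

total_minutes = minutes_per_file * files
days = total_minutes / 60 / 24   # minutes -> hours -> days
years = days / 365

print(f"{days:.0f} days, or about {years:.1f} years")
```

That works out to roughly 694 days, just under two years of continuous restoring.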

Two morals:

  1. Limited testing is a good thing, but there needs to be at least some full testing. I did do a full test restore, but only a month or so after I started using duplicity, and I didn’t repeat it after the incrementals had piled up. It is difficult, time-consuming, and expensive to do full disaster recovery drills regularly, but without doing so you risk being bitten by bugs that won’t show up with limited testing.

  2. Those of you who have backup strategies that use completely different tools to make multiple backup sets (Time Machine + CCC, for example) are doing a Very Good Thing™.

I had a bootable clone of the server that was only a few weeks out of date. I used that, and updated the likely places where data loss might occur using the duplicity backups. All was well except that it aged me a bit.

—2p
