There’s a lively discussion going on over at TidBITS Talk about backups. I posted a story about a nightmare I had trying to restore one of my servers, even though I had thorough, current, and tested backups. Since TidBITS is Apple-oriented, I thought I might repost here for the rest of us.
-""-.,,.-""-.,,.-""-.,,.-""-.,,.-""-.,,.-""-.,,.-""-
A previous poster said:
We do the testing manually by restoring a dozen, or so, randomly selected files.
I once had a mission-critical server (Linux) backed up daily to Amazon Simple Storage Service using duplicity. I would periodically restore a few files just to make sure everything was working.
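For anyone unfamiliar with duplicity, the daily job looks roughly like this. This is a reconstruction, not my actual script; the bucket name and paths are made up, and the exact S3 URL scheme varies between duplicity versions:

```shell
#!/bin/sh
# Hypothetical daily backup job (paths and bucket are examples).
# When a full backup already exists at the target, duplicity
# automatically makes an incremental against the newest chain.
SRC=/srv
DEST=s3://s3.amazonaws.com/my-backup-bucket/srv

duplicity "$SRC" "$DEST"
```

Run from cron once a day, this quietly appends one more incremental to the chain — which is exactly how I ended up with hundreds of them.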
One day the server’s disk became corrupted (thanks, systemd!) and it wouldn’t boot. Because I had a full backup that was less than a day old, I decided the most expedient course was to just restore the entire server.
I started the restore. I knew it would be slow, since the files were stored online at AWS. My test restores took a while (maybe 1-2 minutes per file restored), but I figured that was overhead from searching for the files through hundreds of incremental backups, overhead that wouldn’t apply to every. single. file. on a full restore.
I was wrong.
After 24 hours, the restore was less than 1% complete. Thinking that internet delays were the culprit, I downloaded the entire backup set which required only a few hours. I then ran the restore locally. It didn’t speed up. It turns out that duplicity really doesn’t like having hundreds of incremental backups accumulated, and I was doing daily incrementals while rotating the backups (starting a new full backup) only once a year.
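What I should have been doing, in hindsight, is forcing a new full chain regularly and pruning old chains. A sketch, using duplicity’s own options (bucket and paths are hypothetical, and the frequency is just an example):

```shell
#!/bin/sh
# Hypothetical rotation scheme (example bucket/paths).
SRC=/srv
DEST=s3://s3.amazonaws.com/my-backup-bucket/srv

# Start a fresh full backup whenever the newest full is older than
# one month, so no chain ever accumulates more than ~30 incrementals.
duplicity --full-if-older-than 1M "$SRC" "$DEST"

# Keep only the last 3 full chains; delete anything older.
duplicity remove-all-but-n-full 3 --force "$DEST"
```

With monthly fulls instead of yearly ones, a restore replays at most a few dozen incrementals rather than a few hundred.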
One minute per file doesn’t sound that bad, but for a server with right around a million files on it, that puts the restore time in the neighborhood of two years.
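The back-of-the-envelope math, for the curious:

```shell
#!/bin/sh
# Restore-time estimate: ~1 minute per file, ~1,000,000 files.
FILES=1000000
MINUTES_PER_FILE=1

TOTAL_MINUTES=$((FILES * MINUTES_PER_FILE))
DAYS=$((TOTAL_MINUTES / (60 * 24)))   # 694 days
YEARS=$((DAYS / 365))                 # ~1.9, truncated to 1 in integer math

echo "$DAYS days (roughly $YEARS-2 years)"
```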
Two morals:
- Limited testing is a good thing, but there needs to be at least some full testing. I did do a full test, but only a month or so after starting to use duplicity, and I never repeated it after the incrementals had built up. It is difficult, time-consuming, and expensive to run full disaster-recovery drills regularly, but without them you risk being bitten by bugs that limited testing won’t surface.
- Those of you with backup strategies that use completely different tools to make multiple backup sets (Time Machine + CCC, for example) are doing a Very Good Thing™.
In the end, I had a bootable clone of the server that was only a few weeks out of date. I booted from that, then used the duplicity backups to restore just the places where data was likely to have changed. All was well, except that it aged me a bit.
—2p