Protecting your Digital Ass(ets) Part 3: Recovering from Failure

There’s a common idiom that “an ounce of prevention is worth a pound of cure.” Even if you use the metric system, the idea is clear: a small amount of effort spent preparing for and preventing catastrophe can save you a much, much larger amount of effort later, when catastrophe strikes. That’s why, when it comes to protecting your digital assets, the best practices and recommendations outlined in Part 2 are so important.

But even when you’ve done everything right, things can still go wrong. Sometimes very, very wrong. We sincerely hope they never do, because that level of anxiety isn’t something to wish on anyone. But knowing what tools you can keep up your sleeve for when things do go wrong is incredibly useful.

Unfortunately, the best way to know how to recover from a variety of failures is experience. The more experience you have with it, the easier things are to fix, or to predict where things could be wrong. But that means you’d have to experience a lot of your own (or client) failures to get that experience and let’s be honest, that’s not what anyone wants!

The second best way to know how to recover from failure is to listen to and learn from those with experience. So before we dive into tools, here’s a story.

That one time…

It was week 5 of running DIT on a small independent film, and because of the small crew and tight budgets, I’d been living on location for the previous two and a half weeks, working 16 hour days, 6 days a week, running DIT out of a tent, and sleeping in the production trailer at night.

In hindsight, that was pretty stupid - I was exhausted.

But I was young, and because of the low capacity of storage media, long transfer times, and isolated location of the shoot, the producers and I had figured this would be the most expedient way of getting things done.

I’d just finished the day’s asset transfer, day 28 of 34 expected. Despite my pleas, and despite my trying to account for overage in my initial calculations, the ACs kept starting the camera 20 to 30 seconds before they needed to and leaving it running long after the take, burning many extra gigabytes of space, which at the time came at a bigger premium than it does today. I’d spec’d the primary storage RAID for 2x the shooting ratio, and an extra 50% on top of that for start-stop overhead, but it was almost full - I wouldn’t be able to fit another day’s RAWs on it.

After looking at how much we’d been shooting, I figured we could reclaim space by deleting the edit proxy files off of the RAID, since they were on additional external hard drives. I should be able to fit the next few days of RAWs and maybe just finish the shoot on that hardware without a problem.

At that point I’d been running tape backup copies every day, two duplicate tapes, to make sure that if anything happened to the RAID we’d be okay. But I was only keeping a single ‘live’ copy of the RAWs, which under normal circumstances would be fine. Of course, the tape drive had been acting a little funny: I wasn’t getting as much data on each tape as I thought I should, and each backup was taking a little longer to run than it should have. I’d made a note of that and was going to diagnose it once I was back at civilization and had an internet connection.

After double checking that the proxies were on external drives, I clicked into the window with the proxy folders on the RAID, selected them all, deleted them, and emptied the trash. Then I looked at how much free space was available.

It was now 95% free.

Well that can’t be right, unless I deleted the RAWs instead of the proxies.

…………. Oh, no….no...no...no...no.no.no.no ………….

It’s impossible to accurately describe the sinking feeling that comes with the realization that you’ve just made a really, really big mistake. Some combination of “I’m about to pass out” and “I think I’m going to be sick”, with maybe a dash of “reality doesn’t feel real.”

The feeling quickly passed as I remembered I had backups! I threw in the first tape and started to load the file index to start copying the RAWs back and fix… Error: This tape cannot be read.

Let me just try that again, it should- Error: This tape cannot be read.

Every single tape, both backup sets. Error: This tape cannot be read.

Catastrophic Failure?

A catastrophic failure is a failure from which there is no chance of recovery. In digital content creation terms, it usually means that you simultaneously have problems with your primary and all backup copies of your assets, and nothing can be done about it. Needless to say, usually you want to avoid catastrophic failure at all costs.

Fortunately, catastrophic errors are very close to impossible if you’re following the best practices for asset management. And many types of errors that on their face appear to be catastrophic aren’t; there are often things you can do to get a partial or full recovery of information.

That’s what happened to me after my human error deleted our primary copies, and hardware error prevented me from accessing the backups. I had an inkling that the tapes ‘might’ be okay, if I could find another tape drive. But the tape generation we were using was still relatively new, and we were the only ones in the state with that generation of drive. It was already 9pm and would be 10-11 before I could be back at civilization anyway, so anything that could be done tape wise would have to wait until the following day at the earliest.

So what did I do? As soon as I realized the predicament I was in, the first thing I did was shut everything down, especially the RAID unit I was storing the footage on. Why? Plan “c”: because when you delete a file on magnetic or solid state media, it’s still there until you overwrite it.

Every file that’s on a hard drive (or on solid state media) has two parts: the data and the metadata. The data is the actual contents of the file, and is stored in segments around the disk or media. The metadata is information about the data; in a modern file system this includes the file name, what folder it’s in, when it was created, and the ‘file pointer’ information - where all of the parts of the file are stored on the actual media.

Every file and folder’s metadata is stored in the file system portion of the disk or media, a chunk of space that’s reserved at the lowest address spaces of the disk/logical partition. This means the host computer (or camera, etc.) doesn’t have to read the whole disk to know what’s on it, just the very small file system portion. It references the file system to know where to look for a file’s data, and to know what parts of the media are ‘free space’ it can write to.

When you delete any file, two things happen: 1) the file pointer is flagged as ‘removed’, and 2) the global filesystem table containing information about which parts of the media are free to allocate new files to is updated, to add back the locations the file occupied. This is a simplification, but the important fact is that nothing is actually erased, until you write new data to the media, or the computer cleans up the file system.
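
To make those two steps concrete, here is a minimal sketch of a toy volume in Python. It is not how any real file system is implemented; the class names, the block layout, and the file names are all made up for illustration. The only point is that delete() flags the entry and frees the blocks without ever touching the stored data.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    name: str
    blocks: list[int]      # the "file pointer": which blocks hold the data
    deleted: bool = False  # flag set on delete; the entry itself is kept


@dataclass
class ToyVolume:
    blocks: list[bytes]
    entries: list[Entry] = field(default_factory=list)
    free: set[int] = field(default_factory=set)

    @classmethod
    def create(cls, block_count: int) -> "ToyVolume":
        return cls(blocks=[b""] * block_count, free=set(range(block_count)))

    def write(self, name: str, chunks: list[bytes]) -> None:
        used = []
        for chunk in chunks:
            index = min(self.free)       # take the lowest free block
            self.free.remove(index)
            self.blocks[index] = chunk   # only a write replaces old data
            used.append(index)
        self.entries.append(Entry(name, used))

    def delete(self, name: str) -> None:
        for entry in self.entries:
            if entry.name == name and not entry.deleted:
                entry.deleted = True             # 1) flag the pointer
                self.free.update(entry.blocks)   # 2) hand the blocks back
                # self.blocks is never touched here.


volume = ToyVolume.create(block_count=8)
volume.write("A001_C001.R3D", [b"raw-1", b"raw-2"])   # hypothetical clip name
volume.delete("A001_C001.R3D")
after_delete = list(volume.blocks[:2])   # data still sits in blocks 0 and 1

volume.write("proxy.mov", [b"new-1"])    # a new file reuses block 0
after_write = list(volume.blocks[:2])    # half of the old clip is now gone

print(after_delete)
print(after_write)
```

In this toy, the first print still shows both chunks of the deleted clip, and the second shows one of them replaced. That second write is exactly what shutting the RAID down was meant to prevent.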

By turning off the RAID, I could ensure that no more files were written to the disk to overwrite any of the data on the drive until I could get it all off.

After calling the producers and arranging to have another person come to run security at the location, I packed my equipment and ran back to my computer lab where I had access to far more tools (and could always get more from the internet!)

The first tool I tried was a set of “undelete” tools - programs that read the raw file system data and can usually recover the pointers to files and folders flagged for removal.

Unfortunately, the particular RAID enclosure we were using prevented these tools from running properly and I was limited to only a few dozen files that could be found by recovering file system pointers. This needed a lower level disk scan.

Low level disk scans bypass the file system and read the data stored on the media directly, using pattern matching algorithms to figure out what type of file each chunk of data belongs to, then copying it off the media to another storage device. Usually this results in lost metadata and lost directory structure, but at least the actual file contents are intact.

The general purpose tools at the time didn’t recognize R3D files, so I switched to RED’s own command line tool: REDUNDEAD. REDUNDEAD scanned the drive, found every R3D we’d shot, and dumped them all in a single folder. Sure, all of our footage (1800 or so files) was now all in one folder, but at least they were all there!

Low level disk scans take a while - it took the better part of 16 hours to get done. The first 8 hours went to freeing up enough space on a target device to copy things to, and the second 8 hours was how long it took for REDUNDEAD to comb through the device and copy the data off. I spent another 4 hours recreating the directory structures and getting the RAWs to a usable point for editing and storage. Then 8 hours copying it all back to the original RAID for storage.

28+ hours to fix a 5 second error. An ounce of prevention prevents a pound of cure.

In the end, we had 100% recovery. As it turned out, the backup tapes were actually 100% intact too, a fact we were only able to discover once we got a replacement drive, so even if plan “c” had failed, there was always plan “d”. Plan “e” was to edit and master the film from the proxy files. That would have been sad, but would have worked as an absolute last resort.

What are the options?

The best way to avoid a catastrophic failure is to have a disaster plan in place for when things go wrong. In our last two posts we’ve looked at the things you can do to reduce the chance of things going wrong, but sometimes things end up failing in ways you didn’t or couldn’t expect.

Usually plan “b” is to restore from backups, but what happens if they aren’t available? Or what if you’re dealing with original camera media so backups were never there to begin with? What if the problem is something like the major bug with Adobe Premiere CC 2017 version 11.1, that literally deleted some of the files it was linked to, and that linking to your backup files could result in the same files being deleted? What’s plan “c” for you?

In order to make any disaster plan, you need to know what options or tools are available. Here are a few everyone who works in storing and managing digital assets should know about, and in what situations they can work to recover files:

File System Recovery
Media Data Recovery (Software)
Professional Data Recovery Services (Hardware)
Cloud Backup Services

File System Recovery

File system recovery tools, or “undelete” tools, are the simplest of all of the disaster tools. They’re most useful in two specific scenarios: first, when you’ve done a quick format of media that still had necessary information on it, and second, when you’ve just deleted a file and haven’t given the operating system any time to add new files to the volume.

As I mentioned above, when you delete a file by emptying the trash (or doing an immediate delete), all you’re really doing is flagging the blocks occupied by those files as available, and changing the flags on the file system pointer to say “this file is deleted, feel free to overwrite this entry”.

When you do a quick format of a hard drive or other media, you’re doing something similar: flag all storage blocks as free, and start a new file system tree by rewriting the head of the file system (again this is a simplification that varies file system to file system). On solid state media this unfortunately erases the entire file system and all its pointers, but on magnetic media (hard drives and LTO tapes) previous versions of the file system usually aren’t overwritten until the space is needed. Quick formats don’t erase the data on a drive, just the first few bits of a file system.

When you delete a file or quick format the media, there’s a lot of time before the operating system actually reuses that file system space, and in that time the information about all of the files is still written on the media. File system recovery tools simply bypass what the file system says with respect to what files exist, and read the data directly to see if there are file entries marked as ‘deleted,’ or left over from a previous version of the file system.

The major advantage of these tools is speed: file recovery happens really quickly since it’s just reading a small part of the storage media. They usually don’t require any place to copy the file data to as well - instead the tool usually fixes the file system to restore access to the file (files) in place.

The catch with this of course is that you might not be able to get a 100% recovery - parts of the file system may already be overwritten, parts of the data on disk may be overwritten, and the file system may not be capable of an in-place restoration for a variety of reasons we won’t get into.
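
As a rough illustration of both the speed and the catch, here is a toy “undelete” pass in Python. The directory table, the file names, and the undelete() function are all hypothetical; real tools work on the file system’s actual on-disk structures. The sketch only looks at the small table of entries, restores a deleted entry in place when none of its blocks have been reused, and reports it as lost otherwise.

```python
# Toy directory table: each entry keeps its block list even after deletion.
entries = [
    {"name": "day27.R3D", "blocks": [0, 1], "deleted": True},
    {"name": "day28.R3D", "blocks": [2, 3], "deleted": True},
    {"name": "notes.txt", "blocks": [4], "deleted": False},
    {"name": "proxy.mov", "blocks": [2], "deleted": False},  # written after the delete
]


def undelete(entries):
    """Clear the 'deleted' flag in place where no live file reuses the blocks."""
    live_blocks = {b for e in entries if not e["deleted"] for b in e["blocks"]}
    restored, lost = [], []
    for entry in entries:
        if not entry["deleted"]:
            continue
        if live_blocks.isdisjoint(entry["blocks"]):
            entry["deleted"] = False      # in-place fix, no data is copied
            restored.append(entry["name"])
        else:
            lost.append(entry["name"])    # part of the data was overwritten
    return restored, lost


restored, lost = undelete(entries)
print("restored:", restored)
print("lost:", lost)
```

Here day27.R3D comes back because nothing has reused its blocks, while day28.R3D does not, because proxy.mov was written over block 2 after the deletion.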

In the cases where file system recovery will work, though, it’s essential to minimize the time between deletion (formatting) and running the recovery tool. All operating systems regularly do file system cleanups on attached drives, meaning that they may overwrite some of the pointers to deleted files as they trim and consolidate the pointers to live files within the file system’s structure. Which means that as soon as you realize files have been lost, you need to stop what you’re doing, unmount any affected drives, and get to work on recovery as soon as possible.

This is especially true on Mac computers since they tend to read and write more to disks than you’d expect. Finder is constantly storing little hidden files that contain thumbnails for a folder’s image or video files directly on the disk, as well as index information related to Spotlight searches. Which means that by leaving it mounted and simply exploring the other files on a drive you can accidentally start overwriting file system data making in-place recovery less and less likely.

If file system recovery tools don’t end up working for you, or don’t end up giving you a 100% restoration, the next step in working with restoring data is to run a media data recovery tool.

Media Data Recovery

Just like file system recovery tools read the raw file system data from the disk, media data recovery tools read the raw information on the disk itself to see if it contains chunks of data that it recognizes, like JPEG images, or H.264 video.
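
Here is a minimal sketch of that idea in Python, assuming the simplest possible case: JPEG data stored in one contiguous run. It searches a raw byte dump for the JPEG start-of-image and end-of-image markers and carves out whatever sits between them. The carve_jpegs() function and the fake “disk” are made up for illustration. Real tools read the device in chunks, understand many more formats, and have to cope with things like embedded thumbnails that contain their own end markers.

```python
JPEG_START = b"\xff\xd8\xff"   # start-of-image marker plus the next marker's first byte
JPEG_END = b"\xff\xd9"         # end-of-image marker


def carve_jpegs(raw: bytes) -> list[bytes]:
    """Return every byte run that looks like a complete JPEG in a raw dump."""
    found = []
    position = 0
    while True:
        start = raw.find(JPEG_START, position)
        if start == -1:
            break
        end = raw.find(JPEG_END, start + len(JPEG_START))
        if end == -1:
            break                      # a header with no end marker: truncated
        end += len(JPEG_END)
        found.append(raw[start:end])
        position = end
    return found


# A fake "disk": junk, one JPEG-shaped run, more junk, another JPEG-shaped run.
fake_jpeg_a = JPEG_START + b"\xe0picture-a" + JPEG_END
fake_jpeg_b = JPEG_START + b"\xe1picture-b" + JPEG_END
disk = b"\x00" * 32 + fake_jpeg_a + b"leftover" + fake_jpeg_b + b"\x00" * 16

carved = carve_jpegs(disk)
print(len(carved), "candidate files found")
```

Notice that nothing in the scan knows the files’ names or folders. That is why this kind of recovery usually hands you a flat pile of files with generic names.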

Media data recovery only works before the data is actually overwritten, which means it’s important to run it as soon as possible. This is especially true with solid state drives. While hard drives will only overwrite segments when it’s time to add new data, solid state drives will usually execute a trim operation in the background when you’re not writing to them, erasing the blocks marked for deletion since that speeds up future write operations. Once a trim is executed on SSD data, that data cannot be recovered. This is just another reason that backups and archives should be done to magnetic media (hard drives and LTO tapes) instead of solid state drives.

Since you’re bypassing the filesystem, and the tool has no idea what’s on a drive before it starts, these kinds of tools take a while to run. No, really, they take a long time to run. Essentially they have to read every bit of data on the disk and figure out what kind of file it is. As a result, running an analysis alone is only a hair faster than running an analysis with immediate recovery. The tools are limited by how fast they can read the data, and how long it takes to analyze small chunks. While a chunk is being analyzed it’s temporarily stored in RAM, so it’s easy to write that chunk out to a recovery disk while reading in the next chunk of data.

On small volumes, like CF cards or original camera media, running the analysis can let you know if the tool’s going to work before running the full recovery, while on very large volumes it’s usually best to simply run the tool if you’re pretty certain that you’ll get recovery from it, since otherwise you’re effectively doubling your time. If you choose the immediate recovery option, watch as the files are restored to ensure that it’s finding all of the expected data.

Which brings us to the first important thing to know about media data recovery: you need a second storage device to copy the data to, one that has more free space than the size of the drive you’re working with. The media recovery tool needs to be able to copy 100% of the data it finds somewhere else to access it, so it doesn’t overwrite what’s already on the disk.
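
A quick pre-flight check along these lines can save you from finding out halfway through a multi-hour scan that the target is too small. This is only a sketch: has_room_for_recovery() is a made-up helper, and the 8 TB source size is a placeholder number, not a figure from the story above. It uses Python’s standard shutil.disk_usage() to read the free space on the target volume.

```python
import shutil


def has_room_for_recovery(source_size_bytes: int, target_path: str) -> bool:
    """True if the target volume has more free space than the source is large."""
    free_bytes = shutil.disk_usage(target_path).free
    return free_bytes > source_size_bytes


# Hypothetical numbers: an 8 TB source, recovered to whatever volume holds ".".
source_size = 8 * 1000**4
print(has_room_for_recovery(source_size, "."))
```

Comparing against the full size of the source drive, rather than the size of the files you think you lost, is deliberately conservative: the tool may find far more than you expect.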

A second important note is that media data recovery tools built for consumers usually don’t work for professionals trying to recover data from their media drives. Simply put, the software scouring the drive looking for your files needs to understand what kind of files you’re using. It’s the only way it knows when it's found one! That’s why REDUNDEAD works so well on finding R3D files: it understands R3D data structures and knows how to reassemble the files!

Consumer media data recovery tools, especially the free or inexpensive ones, focus only on the files that are common on the average person’s computer: JPEGs, PNGs, TIFFs, MP3s, MP4 or other H.264 video, etc. They usually can’t identify things like ProRes or camera RAW formats.

More expensive and reliable tools do though. Our software of choice is the Data Rescue line of recovery tools by Prosoft Engineering - it’s still fairly inexpensive for professionals, but covers the majority of types of data we use (including ProRes). It doesn’t cover all of the different RAW formats though, so it’s important to research the tool you’re planning on using to see if it will work in your situation.

Media Data Recovery tools work best when the data you’re trying to recover is stored in contiguous blocks. A 1 megabyte JPEG image, for instance, requires 256 blocks of 4KB in size, or 16 blocks of 64KB in size. If these blocks are fragmented, meaning not stored directly one after the other, parts of the image may end up corrupted or missing. The tools can’t recognize the fragmented pieces they find later as being part of the earlier image, since they have no way of identifying where those pieces came from, or how they would fit together into a JPEG image.
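
The block arithmetic is easy to check for yourself. The two helper functions below are made up for illustration: one counts how many blocks a file needs at a given block size, and the other tests whether a list of block numbers forms one unbroken run, which is the situation a recovery tool is hoping for.

```python
def blocks_needed(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of blocks a file occupies, rounding the last partial block up."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes


def is_contiguous(block_numbers: list[int]) -> bool:
    """True if every block directly follows the one before it."""
    return all(b - a == 1 for a, b in zip(block_numbers, block_numbers[1:]))


ONE_MB = 1024 * 1024
print(blocks_needed(ONE_MB, 4 * 1024))    # 256
print(blocks_needed(ONE_MB, 64 * 1024))   # 16
print(is_contiguous([10, 11, 12, 13]))    # True: one clean run
print(is_contiguous([10, 11, 40, 41]))    # False: split into two fragments
```

In the fragmented case, a scan that starts at block 10 would pick up the first two blocks and then run straight into whatever happens to live in block 12.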

The easiest way to make sure that your data is stored as contiguous blocks is to start with new storage media. If you are going to reuse media, format it completely instead of deleting files. Then, treat your backup drives as if you cannot delete any of the files on them - a write once, read many (WORM) strategy. The problem when you delete files, especially small files, is that they leave little gaps of available blocks between other files. Typically the file system will try to fill these gaps as it gets full, which results in file fragmentation.

The more deletions and additions you do to a drive, the more fragmented it can become, especially if you’re working with drives that are nearly full. Properly prepared backup drives and media with contiguous data yield far better results in media data recovery than working or editing drives. If you’re following the best practice procedures from Part 2, you should be in pretty good shape when it comes to data recovery from your archive copies.

A side note here: if for whatever reason you can’t run data recovery software on the drive immediately, there are tools for both Windows and MacOS (and Linux) that will let you clone the data on the drive, bit for bit, and then access it as if it were on a virtual hard drive. This should ONLY be a last resort, but as a last resort it can be a way of quickly capturing what’s on the drive before losing physical access to it.
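
For the curious, the core of a bit-for-bit clone is nothing more exotic than reading the source from start to finish in chunks and writing every chunk to an image file. The sketch below shows that loop in Python, with a SHA-256 checksum taken along the way so the image can be verified later. clone_to_image() is a made-up function, and the demo copies a small temporary file rather than a real device. Pointing it at an actual drive would need a raw device path and administrator rights, and dedicated cloning tools also handle read errors, which this sketch does not.

```python
import hashlib
import os
import tempfile


def clone_to_image(source_path: str, image_path: str,
                   chunk_size: int = 4 * 1024 * 1024) -> str:
    """Copy source to image byte for byte; return the SHA-256 of what was read."""
    digest = hashlib.sha256()
    with open(source_path, "rb") as source, open(image_path, "wb") as image:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            image.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a stand-in file; the paths here are temporary and hypothetical.
with tempfile.TemporaryDirectory() as folder:
    source_path = os.path.join(folder, "card.bin")
    image_path = os.path.join(folder, "card.img")
    with open(source_path, "wb") as handle:
        handle.write(os.urandom(1024))
    source_hash = clone_to_image(source_path, image_path, chunk_size=256)
    with open(image_path, "rb") as handle:
        image_hash = hashlib.sha256(handle.read()).hexdigest()

print(source_hash == image_hash)
```

Because the copy ignores the file system entirely, deleted-but-not-overwritten data comes along for the ride, which is what lets recovery software work on the image later.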

Professional Data Recovery

When all else fails - and I mean, when ALL else fails - you may need to look at professional data recovery solutions. These services are usually quite expensive and have limited use cases when it comes to when they actually work.

To be clear, we’re not talking about taking your drive to a computer store or ‘local experts’ who are simply going to run the same software tools available to you. This is for times when you have substantial hardware failure or have overwritten data on magnetic media and absolutely need to get access to what was there.

So imagine you have a failed hard drive, RAID set, or flash media. The media won’t mount onto a computer, or a port on one of the drives burned out. Professional data recovery solutions specialize in fixing these kinds of problems, doing things like transplanting hard drive platters into new physical setups, replacing hard drive or solid state controllers, or otherwise removing damaged components and replacing them with working ones to regain access to damaged media. They may also be able to work around scratches or gouges on the physical media that have impeded operation.

This doesn’t always work though. Sometimes damage is too great. Sometimes the flash media itself has fried and so a new controller won’t fix it. To give yourself the best chances, your best bet here is to always use magnetic media for your archives - it simply retains more information and is less susceptible to full media failure. It’s not infallible, though, and its best recovery chances are when you’re dealing with some form of physical or electrical damage to the controllers or ports, since that leaves the data intact.

Magnetic media can even retain information that’s been overwritten. The magnetic patterns on a disk are slightly different when you’re overwriting existing data than when you’re writing to a new disk. Essentially, the difference between a “1” and a “0” on the drive is a measure of changing magnetic strength. As long as the change in magnetic field strength between one bit and another is above or below a specific threshold value, it’s considered to be one or the other. When you overwrite a “non change” with a “change”, it’s a slightly weaker change than overwriting a “change” with a “change”, just like overwriting a “change” with a “non change” will be slightly different magnetic pattern than overwriting a “non change” with another “non change”.

The drive’s controller manages these threshold values and only tells the computer whether it’s seeing a 1 or a 0, rather than the exact magnetic field strength. But by bypassing the controller that’s interpreting the signals, the technicians in professional data recovery services can essentially “see” what was there before and reconstruct the data by ‘subtracting’ the primary (current) data from the signal to recover the previous data from the noise.

This, by the way, is why any drive to be recycled by the US Department of Defense requires seven passes of random data to be written to it before disposal: essentially adding more and more noise to the real data to the point where it can’t be decoded. It’s also why zeroing out a hard drive can help get rid of persistent operating system errors that occasionally survive a reinstall.

So hard drive platters and magnetic tape can be cleaned and remounted, and data recovered from previous generations of data on the drive or tape. But once again, SSDs are a little more fickle. When they’re overwritten, or undergo a trim operation as mentioned, the blocks of data are actually fully reset and it’s impossible to see what data was there before. Which is great for security, but unless a solid state drive or external media card’s controller has failed, it’s essentially impossible to recover information from a damaged or overwritten SSD.

These services come at a price though, usually in the thousands of dollars per drive, though the specifics depend a lot on what kind of media you’re dealing with, whether it’s a RAID, what’s gone wrong, etc. Just know that the answer to how much is usually “a lot.” And how long will it take? “A while” - weeks to months, usually. So try to avoid putting yourself in a place where this is necessary! But it’s still a good absolute last resort option.

Cloud Backup Services

So we’ve seen three different ways of recovering from disaster, depending on how bad the damage is to your media. “But,” I hear you crying, “what if I don’t store the files on my drives?” Let’s talk about the pros and cons of cloud storage.

Pros: It’s not on your drive, or in your house. If something goes wrong, you don’t have to fix it. It also allows for syncing between computers and automatic versioning history.

Cons: For professional video files it’s extremely expensive. Most common services have data caps, your ISP upload rates are usually not great, and getting your data back can take some time.

Most of the common cloud services, like Dropbox, Box, OneDrive, or Google Drive are really, really great at dealing with small files. Most offer a level of versioning that’s fantastic for keeping project files safe - if someone makes a change or something’s overwritten, you can recover previous versions without worrying about trying to keep daily backups of all of your files. They also work for synchronizing assets between local computers. We strongly recommend this as a collaboration and backup solution for daily use files, for small workgroups.

The downside is that they tend to have either file size or storage limits, or both. That’s because they’re companies trying to make a profit, and giving you unlimited access to everything doesn’t usually make it cost effective for them. So when you’re dealing with the terabytes of data generated in a day on a video shoot, you’re going to run into problems with data storage limits, or at the very least into the speed bottleneck of trying to upload that much data.

If you do want to use web storage for RAW files though, and have a fast enough internet connection to make that happen, the best alternative is to skip the consumer friendly services and go straight to the web storage providers - companies with large, redundant data centers that specialize in storing data.

The two big titans here are Amazon S3 and Microsoft Azure, with Google Cloud Storage falling in somewhere behind. The costs for data storage and access are usually priced per gigabyte stored, and again for each gigabyte accessed, meaning that you’ll be charged monthly for the amount of data already uploaded to the cloud, and again every time you access that data.

Remember that these are primarily built as storage services for web services hosting databases, photo libraries, or compressed videos rather than client data storage. For an example about how much this can cost, right now, the Mystery Box archive is around 200 TB, so even at a simple low price of $0.01 / gigabyte per month (that’s low, right?) we’d be paying over $2,000 monthly, just to have it stored in the cloud. If we need to access it, we pay more.
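
If you want to run the same math on your own archive, it’s a one-line calculation. The function below is a made-up helper. The $0.01 per gigabyte rate is the illustrative figure from the paragraph above rather than any provider’s actual price, and access or download charges are left out entirely.

```python
def monthly_storage_cost(terabytes: float, price_per_gb_month: float,
                         gb_per_tb: int = 1000) -> float:
    """Storage-only monthly bill; access and download fees are not included."""
    return terabytes * gb_per_tb * price_per_gb_month


decimal_cost = monthly_storage_cost(200, 0.01)                  # 1 TB = 1000 GB
binary_cost = monthly_storage_cost(200, 0.01, gb_per_tb=1024)   # 1 TB = 1024 GB

print(f"${decimal_cost:,.0f} per month")
print(f"${binary_cost:,.0f} per month")
```

Whether you count a terabyte as 1000 or 1024 gigabytes, 200 TB lands at roughly two thousand dollars a month before you’ve read a single byte back.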

What’s worse is that the highest speed internet access we have at the office only lets us upload at 25 Mbps, meaning that uploading that data to the cloud would take around 740 days, with another 4 days for each additional terabyte as we generate it.
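
The upload estimate works the same way. This sketch assumes decimal terabytes and a connection that sustains its full rated speed around the clock with no protocol overhead, so real transfers will be slower. upload_days() is a made-up helper name.

```python
def upload_days(terabytes: float, megabits_per_second: float) -> float:
    """Days needed to upload at a sustained line rate, ignoring all overhead."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (megabits_per_second * 1e6)
    return seconds / 86_400


archive_days = upload_days(200, 25)
per_terabyte_days = upload_days(1, 25)

print(round(archive_days))          # 741
print(round(per_terabyte_days, 1))  # 3.7
```

Plug in your own upload speed and archive size before committing to a cloud-only plan; the result is often measured in months or years rather than days.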

Even products like Backblaze or Carbonite, which are great for consumers and single users, can end up costing video businesses huge amounts every month. For single seat editors with fast enough internet connections they may be the best option, but for even small video teams the costs can add up pretty quickly - again, our archive would cost us $1,000 / month for storage with Backblaze.

Maybe paying to store all of your data in the cloud is a great idea for your business. If you have that kind of money to spend on an offsite backup, go for it! Amazon even has secure hard drive arrays they can send you if you’re planning on uploading huge amounts of data at a time.

On the other hand, for small businesses of fewer than 20 people, it’s almost always more economical to be using tape backup solutions, since the price per terabyte of redundant tape storage pays for itself within a year of each terabyte stored. Like we said, though, cloud storage is fantastic for synchronizing and versioning the project files, as well as storing the associated proxies and other small media needed for editing. It’s a great balance of speed, efficiency, and cost, keeping our smaller video files online, while keeping our large RAWs and archive files offline and on more cost effective media.

I want to reiterate at the end of this how important it is to follow the best practices found in Part 2 for protecting your footage. Genuinely we hope that no one ever needs to go through data recovery in any way. But even if you’re doing everything right, there are times when random and unexpected things happen. That was made apparent in the Adobe Premiere error from a few months back, as well as in the story we started but never finished in Part 1.

So what happened to my friends with the checksums, where the computer told them everything was alright? I ended up on their set diagnosing the issue. And what I found was something that should not be possible: their camera, the Blackmagic URSA Mini, had transplanted the file system from the first C-Fast card of the day onto the second C-Fast card of the day, but only after they told the camera to eject the second card.

It meant that everything from the second half of the day played back fine on the camera, until it was ejected. From then on, all of the file pointers in the file system tried to tell the camera or the operating system that the files started in the wrong places. The camera couldn’t play it back, and QuickTime could only open the first file. But the file data for all the clips from the second half of the day was still intact on the card; only the file system pointers were bad. Fortunately for everyone involved, media data recovery recovered all of the files except one (which happened to be a bad take anyway).

I want to end this three-part series on protecting your digital assets with a plea to all independent producers and independent cinematographers who simply run their own DIT, or have interns or the cheapest individuals in the area handle their data. Please don’t. You’re taking a big gamble, and even if it’s paid off so far, when your number comes up (and it probably will), you’ll be in a big pickle if you don’t thoroughly understand how the technology works.

Knowing what tools are available can help recover from the unpredictable disasters and prevent total catastrophe, but those tools are even better in the hands of those who truly understand how they work, and how to use them.

Protect your data. Implement backups and archives. Spend the time, spend the money. Budget for it. Make it happen. Don’t wait until things go wrong to realize you can’t afford not to.

Written by Samuel Bilodeau, Head of Technology and Post Production