What is CDP?

Q: How is CDP (continuous data protection) different from traditional host-based replication software?

By Jacob Farmer

CDP is the latest three-letter acronym to bring mass confusion to the mass storage marketplace. “CDP” stands for continuous data protection, which presumably refers to data-protection technology that safeguards data on an ongoing basis, as opposed to running batch-based backup jobs once a day.

No one organization owns the definition, so it’s up to the marketing departments at the various vendors to haggle over what constitutes true CDP. So, depending on whom you ask, you might hear that replication and CDP are one and the same. You might also hear that replication is merely a component of CDP, or that replication and CDP are distinct technologies. In other words, everyone’s got his own definition, and in the meaningless world of three-letter storage technology acronyms, everyone is right.

However, since you asked, I’ll take this opportunity to put forward my own definition and then spend the next few weeks answering hate mail from every vendor pushing CDP products.

By my definition, CDP is a hybrid of replication, snapshot, and backup/restore technologies. Some products replicate data and then take snapshots of the replica to create point-in-time representations of the data. Others take snapshots first and then replicate. Take your pick. There are subtle advantages to each approach. The key is that the product offers a restore interface, much like that of traditional backup software.

The best way to understand my definition of CDP is to look at the shortcomings of traditional backup software, traditional replication, and traditional snapshots. Backup software is cumbersome: It results in unchanged data being backed up over and over again. With weekly full backups, a file that has not changed in five years might have been backed up 260 times (52 weeks times five years). Meanwhile, the backup system administrator complains about missing backup windows.

Replication, by contrast, moves data to a separate storage system as data changes. There are dozens of replication schemes on the market. Some are synchronous; others are asynchronous. Some work continually as files or blocks are updated; others send deltas in small batches, maybe every hour or on some other practical interval.

There are two main reasons why replication has not replaced traditional backup. First, any corruption or deletion of data on the primary storage gets propagated to the replica. Second, replication software does not offer granular restore capabilities such as those offered by backup software.
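The first shortcoming is easy to see in a toy model. The sketch below is illustrative only: the dictionaries stand in for file systems, and `replicate` is a hypothetical helper, not any vendor's API. It shows that a replica that faithfully mirrors the primary also mirrors its mistakes.

```python
# Toy model (assumption): a dict of file name -> contents stands in for a volume.
primary = {"report.doc": "v1", "budget.xls": "v1"}
replica = dict(primary)


def replicate(source, target):
    """Hypothetical helper: make the target identical to the source."""
    target.clear()
    target.update(source)


# An accidental deletion and a corruption happen on the primary...
del primary["report.doc"]
primary["budget.xls"] = "garbage"

# ...and the next replication pass faithfully copies both to the replica.
replicate(primary, replica)

# Neither the deleted file nor the good version of the corrupted file survives.
recoverable = "report.doc" in replica
```

After the replication pass, `recoverable` is false and the replica holds the corrupted spreadsheet, which is why a replica alone cannot stand in for a backup.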

There are dozens of ways to do snapshots, but they all more or less give you logical representations of your data at different points in time. Snapshots have not replaced backup software for three main reasons: First, snapshots do not protect you from a catastrophic hardware failure. If the disk array crashes, the snapshots as well as the primary representation of the data are gone. Second, snapshots lack a user-friendly restore mechanism. Many snapshot implementations require you to mount an entire volume on a dedicated server to pick off a select file or object. Some snapshot systems put files in hidden directories off the original directory, but this is still a bit cumbersome and requires a sophisticated user to retrieve the files. Third, few snapshot systems offer any meaningful retention policies. Usually, you are limited to a fixed number of snapshots and you cannot save them forever, at least not without other negative consequences.

Now imagine software that is smart enough to back up only block-level or transaction-level changes, much the same way that replication software does. In other words, you do a full backup once and thereafter you only back up the changes. You then combine the power of snapshot technology to save representations of your primary data at different points in time without consuming insane amounts of disk. Finally, you add a familiar point-and-click, drag-and-drop restore interface so that you can restore whatever you need from whatever point in time. As an added bonus, you might get advanced features such as mailbox- and message-level restore, bare-metal restore, remote site replication, etc. This is my definition of CDP.
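The combination described above can be sketched in a few lines. This is a minimal toy, not a description of any shipping product: the class name `ToyCdpStore`, its methods, and the integer timestamps are all assumptions made for illustration. It also assumes a fixed-size volume (blocks change but are never removed) and that recovery points are captured in increasing time order.

```python
class ToyCdpStore:
    """Hypothetical store: one full copy, then only changed blocks per recovery point."""

    def __init__(self, baseline_blocks):
        # The one and only full backup: block number -> block contents.
        self.baseline = dict(baseline_blocks)
        # Ordered list of (timestamp, {block number: new contents}).
        self.deltas = []
        # Last state seen, used to work out which blocks changed.
        self._last = dict(baseline_blocks)

    def capture(self, timestamp, current_blocks):
        """Record a recovery point; store only blocks that differ from the last one."""
        changed = {
            number: data
            for number, data in current_blocks.items()
            if self._last.get(number) != data
        }
        self.deltas.append((timestamp, changed))
        self._last = dict(current_blocks)
        return len(changed)

    def restore(self, timestamp):
        """Rebuild the volume as it looked at the latest recovery point <= timestamp."""
        image = dict(self.baseline)
        for captured_at, changed in self.deltas:
            if captured_at > timestamp:
                break
            image.update(changed)
        return image


# Usage: a three-block volume, two recovery points, one bad change.
volume = {0: "boot", 1: "mail-v1", 2: "docs-v1"}
store = ToyCdpStore(volume)

volume[1] = "mail-v2"
stored_at_10 = store.capture(10, volume)   # only block 1 is stored

volume[2] = "corrupted"
stored_at_20 = store.capture(20, volume)   # only block 2 is stored

restored = store.restore(10)               # roll back to before the corruption
```

Each recovery point costs only the blocks that changed, yet `restore` can still hand back a complete image from any captured point. A real product would wrap this in the file-, mailbox-, or message-level restore interface described above.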

I am in the minority in that I do not believe that data changes need to be recorded continuously, down to the individual transaction. Similarly, I do not believe that most people want or need to restore from any point in time down to the nanosecond. Rather, I believe that continuous simply means “more often than once per night.”

Moreover, I believe that the real value of CDP technology is not that it is continuous; the advantage is that it is super-efficient. CDP eliminates backup windows. It removes the system overhead of batch-based backups. It economizes on secondary disk storage by only storing deltas rather than a series of full backups and incremental backups. As an added bonus it allows more-granular recovery points.
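To put rough numbers on the disk-economy claim, here is a back-of-the-envelope comparison. Every figure in it is an assumption chosen for illustration (a 1,000 GB data set, 10 GB of changes per day, four weeks of retention, no compression, deduplication, or expiry), and both function names are hypothetical.

```python
def traditional_storage_gb(data_gb, daily_change_gb, weeks):
    """Assumed schedule: one full backup per week plus six daily incrementals."""
    fulls = weeks * data_gb
    incrementals = weeks * 6 * daily_change_gb
    return fulls + incrementals


def delta_only_storage_gb(data_gb, daily_change_gb, weeks):
    """Assumed schedule: one full backup ever, then only the daily changes."""
    return data_gb + weeks * 7 * daily_change_gb


# Illustrative inputs only; real change rates vary widely.
traditional = traditional_storage_gb(data_gb=1000, daily_change_gb=10, weeks=4)
delta_only = delta_only_storage_gb(data_gb=1000, daily_change_gb=10, weeks=4)
```

Under these assumptions, the traditional schedule consumes 4,240 GB of secondary disk while the delta-only approach consumes 1,280 GB, and the gap widens with every additional week retained.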

So, to answer your question: Conventional replication software is a component in CDP, but I think the marketing folks at the replication vendors who have jumped on the CDP bandwagon are taking too much poetic license in labeling their products as CDP.

In fact, it’s too bad the industry chose the acronym “CDP” to describe the next generation of backup/restore technology. In so doing, it failed to call attention to the shortcomings of traditional backup software and the true path to salvation. Meanwhile, the acronym invited all kinds of vendors to redefine their existing products as CDP, causing further confusion and sending the message to consumers that “CDP” is just a new term for old technology. Follow my definition. You will find that real solutions exist on the market, and they do work wonders!

Jacob Farmer is chief technology officer at Cambridge Computer (www.cambridgecomputer.com) in Waltham, MA. He can be contacted at jacobf@cambridgecomputer.com.

This article was originally published on September 01, 2006