
Wednesday, August 23, 2006

Notes on Continuous Data Protection (CDP)

I'm using my blog as a notebook again today. Today's topic is CDP.


Background
The benefit of CDP is that you can restore data to any arbitrary point in time - unlike Snapshot/PIT, where you might snapshot, say, once every 24 hours. With CDP, if you corrupt your database, you can restore it to its state just before the event, as opposed to Snap, where you may have to go back to its state 24 hours ago. So, CDP is for protecting against an application or operator error that corrupts your database. It kind of reminds me of the VMS workstation I had years ago. By default, VMS saved three revisions of each file (and you could make it more). Usually, it just wasted space, but occasionally I screwed up a file and was really glad I had the older rev.

Technology
The basic technology is similar to Copy-on-Write (COW) Snapshot. It can run in a Volume Manager, in the Array, or in an intermediate Switch/Virtualization Engine. With COW you allocate some amount of storage that is smaller than the original volume. On the first write to a block of the original volume/LUN after the snap is taken, the old data, as it existed at the time the snap was taken, is copied to this extra space; the new data is then written to the original volume/LUN as usual, and a log of writes gets updated. Reads to the original volume/LUN return the new data, as normal. The snap gets exposed as a new, Read-Only Vol/LUN. On a read of the snap, the log is checked. If reading blocks that have been updated, the old blocks are returned from the extra storage space. If reading blocks that have NOT been updated, they are read from the original Vol/LUN. So, the amount of extra space to allocate is a function of a) how long you want to keep the snap, and b) how much data your apps write to this volume. The downsides of COW are a) you need more storage space, and b) the performance impact of copying blocks on write commands.
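To make the COW read and write paths concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the volume is a plain in-memory list of blocks, and the class and method names (CowSnapshot, write, read_snapshot) are made up for this note, not taken from any product. A single dict stands in for both the extra space and the log.

```python
class CowSnapshot:
    """Toy copy-on-write snapshot of a volume held as a list of blocks.

    Hypothetical names; real products work on disk blocks, not Python lists.
    """

    def __init__(self, volume):
        self.volume = volume  # the live, writable volume/LUN
        self.saved = {}       # block index -> old data ("extra space" plus log)

    def write(self, index, data):
        # Only the first write to a block after the snap copies the old data.
        if index not in self.saved:
            self.saved[index] = self.volume[index]
        # The new data always goes to the original volume, as usual.
        self.volume[index] = data

    def read_snapshot(self, index):
        # Updated block: return the preserved copy from the extra space.
        if index in self.saved:
            return self.saved[index]
        # Untouched block: read it straight from the original volume.
        return self.volume[index]


volume = ["a", "b", "c"]
snap = CowSnapshot(volume)
snap.write(1, "B")
snap.write(1, "BB")  # second write to the same block: nothing more is copied
```

After these two writes the live volume holds "BB" in block 1, the snap still returns "b" for it, and only one block of extra space is in use. That is why the extra space can be smaller than the original volume.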


CDP is an enhancement. Instead of just copying the old data to the extra space and logging which blocks have been written, CDP timestamps the updated blocks and keeps every copy of blocks that get written multiple times. Then the administrator can ask for a snapshot as of any point in time, and on reads the CDP logic will find the copy of each block as it existed at that point in time. The downside is that a volume that gets lots of writes will require much more extra space. The performance impact is about the same as COW - a data copy and log update for each write.
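Here is the same kind of toy sketch for the CDP idea, again in Python and again under my own assumptions: blocks live in an in-memory list, timestamps are plain numbers that only ever increase, and the names (CdpVolume, write, read_as_of) are hypothetical. The only real change from COW is that every overwrite saves the old block along with the time of the overwrite.

```python
class CdpVolume:
    """Toy CDP volume: keeps every overwritten block with a timestamp.

    Hypothetical names; assumes write timestamps never go backwards.
    """

    def __init__(self, blocks):
        self.blocks = list(blocks)  # the live volume
        self.journal = {}           # index -> [(overwritten_at, old_data), ...]

    def write(self, index, data, now):
        # Unlike COW, save the old data on EVERY write, tagged with the time.
        self.journal.setdefault(index, []).append((now, self.blocks[index]))
        self.blocks[index] = data

    def read_as_of(self, index, when):
        # The block as of `when` is the data that the first overwrite
        # AFTER `when` pushed into the journal.
        for overwritten_at, old_data in self.journal.get(index, []):
            if overwritten_at > when:
                return old_data
        # No overwrite after `when`: the live block is still the right one.
        return self.blocks[index]


vol = CdpVolume(["a0", "b0"])
vol.write(0, "a1", now=10)
vol.write(0, "a2", now=20)
```

Reading block 0 as of time 5 gives "a0", as of time 15 gives "a1", and as of time 25 gives "a2". The journal now holds two old copies of block 0 where COW would hold one, which is where the much larger space requirement comes from.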

Who is doing CDP - The big players

EMC. EMC has two CDP products. The first is RecoverPoint, which is really just a feature of their host-based Replication Manager. I'm not sure if this means it runs in a host VM, or if it's just a host-based management front end for these data services running in the array. Need to look into that. Second, EMC bought the startup Kashya, which makes CDP that runs on their Invista suite, which in turn runs on intelligent switches/virtualization platforms (called Connectrix, based on McData). According to the Kashya website, their CDP also runs on the Cisco MDS switch and the IBM SAN Volume Controller, and CDP is a module in a suite of data services (replication, etc.).

Veritas Released CDP for Windows with 'Backup Exec 10d'. The interesting twist on this is it includes an interface for end-users to find old copies of files themselves. Claim it has a 'Google-like' interface. Can't find any info on CDP for any other platforms.

Netapp I saw a Byte&Switch article claiming Netapp got CDP with the acquisition of Alacritus in April 2005. Looking at their website, though, it looks like what Alacritus got them was VTL firmware to create their nearline/SATA backup product, and then they partner with Symantec to bundle Backup Exec for Windows and partner with IBM to bundle the Tivoli CDP product.

IBM Has a Tivoli product that runs on client workstations/PCs/Laptops. Does CDP, replication, and backup at the file level. Stores old copies of files over the network to a NAS product like the Netapp. Also, as stated above, the Kashya FW runs on their SAN Controller virtualization product. IBM probably has something for their servers as well.

Microsoft MS provides CDP through a product called Data Protection Manager. Overview description here. It runs on the CIFS file server but claims that it saves changes at the BYTE level. To minimize performance impact, it saves the changes locally on the file server but then asynchronously replicates those to a disk-based DPM Server (probably running VDS). An interesting feature is that client PCs can use the VSS API to request snapshots to retrieve their own files. Similar concept to Veritas Backup Exec, where end-users have their own interface.

HP HP released a product last April called Continuous Information Capture which runs under Oracle on Solaris or MS Exchange or SQL server. It includes a SW layer on the Oracle app server which sends changes to a 'Recovery Appliance' and the old data to the secondary storage device.
This is the Mendocino product rebranded.

Sun A search on sun.com/storage returned no CDP products but the search did return an 'ILM Vision' white paper where Sun states: "We envision an advanced state of information lifecycle management that involves both pervasiveness and importance. Our vision includes ...." and they go on to list CDP. Good luck Sun.

The Startups

Found a long list of startups doing CDP including Kashya (acquired by EMC), Mendocino, Revivio, StorActive, Topio, XOSoft, Zetta Systems, Mimosa.


Mendocino As above, provided the CIC product to HP. Website talks a lot about the need to be application aware. Working with Sybase and, presumably, Oracle on integrated database management tools that use underlying CDP APIs from Mendocino. Seems to be their unique angle on CDP.


Revivio Sells a CDP Appliance. Also pushing the application integration strategy. Talks about their SW Suite for Application Integration and modules for integrating with Oracle, Sybase, SQL, Exchange, Lotus Notes, etc.


StorActive, now called Atempo From a quick read of their website, it looks like they are doing CDP at file level for PCs. Also provides a direct client UI so end users can recover old versions of files on their own.


Topio Provides block-level SW suite for Remote Replication (asynch via IP), Snapshot, and claims replication is time-stamped. Not sure if they really do CDP though.


XOSoft Another application-aware CDP and availability SW suite. Supports Oracle, Exchange, SQL.


Mimosa CDP for MS Exchange. Website says they are a 'strong' partner with Microsoft. Interesting - their banner uses the Sun motorcycle jumper photo with the 'S' on the side - but no mention of Sun partnership.


Summary
All the big companies have CDP. Looks like CDP has become just another tool in the data management toolbox along with Remote replication, snapshot, etc. As such, it seems to be more of a 'feature' than a 'product'. There seem to be two basic approaches: 1) file CDP focused on desktops/notebooks with a direct end-user interface to recover old files; and 2) server CDP designed to run under databases with some amount of database integration to make it usable from database tools.