BitWackr Blog

Deduplication beyond backup

Backup Deduplication vs. Application Storage Deduplication


Since the DD200 announcement in June 2003, DataDomain (now an EMC company) has been at the forefront of backup deduplication technology and has made “deduplication” synonymous with “backup”.

Today, most products with “deduplication” in their name share a common characteristic: they are highly optimized for the backup application. Whatever logo your favorite deduplication product wears – DataDomain, IBM/Diligent, EMC, FalconStor, Atempo, Commvault, CA, Symantec or others – it counts on the data stream sent to the deduplication system being a backup data stream.

Data Deduplication

Block-based data deduplication is a technique that eliminates redundant blocks of data. In a typical deduplication operation, blocks of data are “fingerprinted” using a hashing algorithm that produces a unique, “shorthand” identifier for each block. Duplicate copies of blocks that have previously been fingerprinted are deduplicated, leaving only a single instance of each unique data block along with its corresponding fingerprint.

The fingerprints, along with their corresponding full data blocks, are indexed and retained (in compressed and encrypted form, if the deduplication product supports those functions).
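The mechanics described above can be sketched in a few lines. This is a minimal, illustrative model – fixed-length blocks, SHA-1 fingerprints, and an in-memory dictionary standing in for the fingerprint index – not any vendor's actual implementation:

```python
import hashlib

def deduplicate(data, block_size=4096):
    """Split data into fixed-length blocks, fingerprint each with SHA-1,
    and retain only one copy of each unique block."""
    store = {}    # fingerprint -> the single retained copy of the block
    recipe = []   # ordered fingerprints needed to reconstruct the stream
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        fp = hashlib.sha1(block).hexdigest()  # the block's "shorthand" identifier
        if fp not in store:                   # new, unique block: store it
            store[fp] = block
        recipe.append(fp)                     # a duplicate adds only a reference
    return store, recipe

def rehydrate(store, recipe):
    """Reassemble the original stream from the unique blocks."""
    return b"".join(store[fp] for fp in recipe)
```

Three identical 4 KB blocks in the input would occupy the store only once, with the recipe carrying three references to the same fingerprint.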

Saying this is easy. Doing it in the data path at speed with a “reasonable” amount of resources and with “reasonable” performance is the challenge.

Data deduplication, regardless of the application for which it is employed, has a number of common characteristics. First, data is ingested by the deduplication engine (in an appliance or in software running on a general-purpose server). The ingested data is operated on in blocks; some vendors use fixed-length blocks while others use variable-length blocks. Most vendors next run the blocks through a hashing engine that employs a cryptographic function (SHA-1, MD5, etc.) to produce the block’s hash fingerprint.

A “hash collision” would occur if two different blocks of data generated the same hash. Such an event could cause a fingerprint to be associated with the wrong block of data, resulting in an error when the block was retrieved. Although a hash collision is statistically possible, the risk is far smaller than the risks storage administrators live with every day. So we feel safe in saying that the hash fingerprint uniquely identifies a specific block of data.

Once the block fingerprint has been calculated, the deduplication engine has to compare the fingerprint against all the other fingerprints that have previously been generated to see whether this block is unique (new) or has been processed previously (a duplicate). It is the speed at which these index search and update operations are performed that is at the heart of a deduplication system’s throughput.

The reason is that, in an in-line deduplication engine, the amount of time it takes to make the decision whether a block is new and unique or is a duplicate that has to be deduplicated translates into latency. And latency is the enemy of deduplication performance. Some simple math tells us why.

Suppose a disk drive takes 5ms to access an index entry. In a serial process, this means that the index drive can process 200 index operations every second. If the I/Os to the deduplication system’s data store are 32KB in length, the upper limit of throughput would be 32KB x 200 operations per second, or about 6.4 MB/second.
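The arithmetic behind that ceiling is worth making explicit. Under the stated assumptions – a 5 ms seek per index access, one serial index lookup per I/O, and decimal (1000-based) units – the bound falls out directly:

```python
# Throughput ceiling when every I/O requires one serial on-disk index lookup.
seek_ms = 5                              # time to access one index entry
lookups_per_sec = 1000 / seek_ms         # 200 serial index operations/second
io_size_kb = 32                          # size of each I/O to the data store

# Each I/O is gated by one lookup, so throughput = I/O size x lookup rate.
throughput_mb_per_sec = io_size_kb * lookups_per_sec / 1000   # ~6.4 MB/s
```

Halving the seek time or batching lookups raises the ceiling proportionally, which is why index access time dominates in-line deduplication throughput.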

In practice, performance can be much lower still, since multiple I/Os are typically required to add new fingerprints and update reference counters. One alternative would be to make the hash table memory resident. But for any reasonable amount of capacity, the hash index will exceed the memory of most servers, forcing it back onto disk. Returning to the discussion of disk random-access performance – even assuming amazingly clever algorithms that reach exactly the right index entry in a single I/O – deduplication performance still leaves much to be desired.
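A back-of-the-envelope sizing shows why the index outgrows memory. The block size and per-entry overhead below are illustrative assumptions (an 8 KB average block and roughly 32 bytes per entry for a 20-byte SHA-1 fingerprint plus pointer and counter), not figures from any particular product:

```python
# Rough hash-index sizing under assumed parameters (not vendor numbers).
capacity_tb = 10        # deduplicated store capacity
avg_block_kb = 8        # assumed average block size
entry_bytes = 32        # ~20-byte SHA-1 fingerprint + pointer/counter

entries = capacity_tb * 1e12 / (avg_block_kb * 1e3)   # 1.25 billion entries
index_gb = entries * entry_bytes / 1e9                # 40 GB of index
```

Forty gigabytes of index for ten terabytes of data comfortably exceeded the RAM of typical servers of the day, which is what pushes the index onto disk and brings the seek-time bottleneck back into play.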

There has to be a better way. Indeed, there are several. While backup deduplication vendors search for new and innovative ways to circumvent the hash-index processing bottleneck, our research and development efforts have focused on optimizing hash generation, lookup and table update for unstructured application data.


Written by BitWackr

April 7, 2010 at 9:48 pm

Posted in Uncategorized

