BitWackr Blog

Deduplication beyond backup

Deduplicating Microsoft SharePoint Data


Microsoft SharePoint is a collaboration tool that helps improve business effectiveness through a combination of content management, information sharing and enterprise search. It provides IT professionals a platform and the tools needed to enable server administration and interoperability.

Microsoft SQL Server, a block storage-based database, is the engine that powers SharePoint.    

Exar’s Hifn Technology BitWackr reduces the capacity required to store Microsoft Windows Server 2003 and 2008 data – including SharePoint data – through a combination of data deduplication and data compression.

BitWackr data reduction is not designed to deduplicate backup data. Deduplicating backup data is application-specific and dependent upon the data formats produced by backup software products.

The BitWackr reduces the amount of unstructured (application) data retained by an enterprise in volumes that contain primary (first copy) data for which performance is not the highest priority – the type of data held in low-to-medium-activity SharePoint storage volumes, for example.

Data Compression

Data compression is a technique that re-encodes data so that it takes up less storage space. Compression works by finding repeating patterns of binary 0s and 1s, so the more patterns that can be found, the more the data can be compressed.

There are “lossy” and “lossless” forms of data compression. Lossy compression works on the assumption that data doesn’t have to be stored perfectly. Much information can simply be thrown away from images, video data, and audio data, and when uncompressed such data will still be of acceptable quality. Lossless data compression is used when the data has to be uncompressed exactly as it was before compression. If you compress a block and then decompress it, the block is not changed. The BitWackr employs lossless compression techniques to ensure that data integrity is maintained as data is compressed and decompressed.

The metric employed in data compression is the “compression ratio”, or the ratio of the size of the original uncompressed block to the size of the compressed block. For example, suppose a block of data before compression occupies 64 kilobytes (KB) of space. Using data compression, that block may be reduced in size to, say, 32 KB, halving the capacity required to store the data. In this case, data compression reduces the size of the block by a factor of two, resulting in a “compression ratio” of 2:1 or a “data reduction percentage” of 50%.
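The arithmetic in the 64 KB example can be sketched in a few lines of Python. The function names here are illustrative only and are not part of any BitWackr tool.

```python
def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Original size divided by compressed size (2.0 means 2:1)."""
    if compressed_size <= 0:
        raise ValueError("compressed_size must be positive")
    return original_size / compressed_size


def reduction_percentage(original_size: int, compressed_size: int) -> float:
    """Share of the original capacity that compression saved, in percent."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return 100.0 * (original_size - compressed_size) / original_size


KB = 1024
ratio = compression_ratio(64 * KB, 32 * KB)
saved = reduction_percentage(64 * KB, 32 * KB)
print(f"{ratio:g}:1 compression ratio, {saved:g}% data reduction")
# prints "2:1 compression ratio, 50% data reduction"
```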

Some data can be highly compressed while other data will compress very slightly or even not at all. The amount of compression experienced depends on the type of data and the compression algorithm employed.
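A quick way to see this for yourself is to compress two blocks of the same size, one highly repetitive and one random. The sketch below uses Python's zlib (DEFLATE) purely as a readily available lossless compressor. It is not the eLZS algorithm the BitWackr uses, and the sample data is made up.

```python
import os
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB, matching the example block size above


def compressed_size(block: bytes) -> int:
    """Size in bytes of the block after lossless zlib compression."""
    return len(zlib.compress(block))


# Repetitive text compresses well; random bytes have no patterns to exploit.
repetitive = (b"quarterly sales report " * 4096)[:BLOCK_SIZE]
random_block = os.urandom(BLOCK_SIZE)

for name, block in (("repetitive", repetitive), ("random", random_block)):
    size = compressed_size(block)
    print(f"{name}: {len(block)} -> {size} bytes")
```

The repetitive block shrinks to a small fraction of its original size, while the random block comes out no smaller (and usually a few bytes larger) than it went in.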

Data Deduplication

Data deduplication is a technique that eliminates redundant blocks of data. In a typical deduplication operation, blocks of data are “fingerprinted” using a hashing algorithm that produces an effectively unique identifier for each data block. These fingerprints, along with the blocks of data that produced them, are indexed, compressed and retained. Duplicate copies of data that have previously been fingerprinted are deduplicated, leaving only a single instance of each unique data block along with its corresponding fingerprint.

The fingerprints along with their corresponding full data representations in a compressed form are retained to enable reconstituting the deduplicated block when the data is retrieved.
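The fingerprint-and-reconstitute idea can be sketched as a tiny in-memory store. This is a minimal illustration under stated assumptions, not the BitWackr implementation: the class name, the 4 KB block size and the use of zlib are all hypothetical choices, and SHA-1 is used only because it is the hash this post names for the BitWackr hardware.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # hypothetical fixed block size for this sketch


class DedupStore:
    """Toy block store: one compressed copy per unique fingerprint."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> compressed block

    def write(self, data: bytes) -> list:
        """Store data block by block; return the list of fingerprints."""
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fingerprint = hashlib.sha1(block).digest()
            if fingerprint not in self.blocks:
                # First time this block is seen: keep a compressed copy.
                self.blocks[fingerprint] = zlib.compress(block)
            recipe.append(fingerprint)
        return recipe

    def read(self, recipe: list) -> bytes:
        """Reconstitute the original data from its fingerprints."""
        return b"".join(zlib.decompress(self.blocks[fp]) for fp in recipe)


store = DedupStore()
document = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
recipe = store.write(document)
print(len(recipe), "blocks written,", len(store.blocks), "unique blocks stored")
# prints "4 blocks written, 2 unique blocks stored"
```

Reading the data back with `store.read(recipe)` returns the original bytes, even though only two of the four blocks were physically stored.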

Some data can be aggressively deduplicated while other data will show little to no effect from deduplication. The level of deduplication experienced depends on the type of data being acted upon and the behavior of those storing the data.

The Sequence in which Data Reduction is Performed

The BitWackr combines compression and deduplication to reduce the capacity required to store data. Encryption is an administrator-selected option that can be invoked at the time a BitWackr volume is created.

In deduplication systems other than BitWackr, an incoming block is typically first hashed to extract its fingerprint. Next, in a second step, the block is compressed. Finally, in a third distinct step, the compressed block is encrypted. Note that compression always precedes encryption: encrypted output looks random, while compression works by exploiting patterns and so operates best on data with the least randomness.
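The ordering argument is easy to demonstrate. The sketch below compares the two orders on a repetitive block. The `toy_encrypt` function is an XOR against a random key stream, a stand-in chosen only because it makes the output look random. It is not AES and should not be used to protect anything. zlib likewise stands in for whatever compressor a given product uses.

```python
import os
import zlib


def toy_encrypt(data: bytes, key_stream: bytes) -> bytes:
    """XOR with a random key stream. Toy stand-in for a real cipher, NOT AES."""
    return bytes(a ^ b for a, b in zip(data, key_stream))


block = (b"status=open;owner=team-a;" * 400)[:8192]
key_stream = os.urandom(len(block))

# Conventional order: compress first, then encrypt the smaller result.
compress_then_encrypt = toy_encrypt(zlib.compress(block), key_stream)

# Reversed order: the compressor sees random-looking bytes and finds no patterns.
encrypt_then_compress = zlib.compress(toy_encrypt(block, key_stream))

print("original:             ", len(block), "bytes")
print("compress then encrypt:", len(compress_then_encrypt), "bytes")
print("encrypt then compress:", len(encrypt_then_compress), "bytes")
```

Compressing first keeps the full benefit of compression. Encrypting first leaves the compressor with nothing to work on, so the stored block is essentially as large as the original.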

Exar’s Hifn Technology DR1605 PCIe card (the hardware component of the BitWackr) performs SHA-1 hashing, eLZS compression and, if selected, AES-256 CBC encryption simultaneously in a single operation. The SHA-1 hash is used to determine whether the block being processed is unique or is a duplicate. If the block is determined to be unique, the compressed (and optionally encrypted) block is stored in the BitWackr Data Store. If the block is a duplicate, appropriate counters are updated and the next block of data is processed.
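The unique-or-duplicate decision can be sketched in software as follows. Several things here are assumptions made for illustration: the post does not say which counters are updated, so a per-fingerprint reference count is used as a plausible guess. zlib stands in for eLZS, the optional encryption step is left out, and the hash and compression run one after the other rather than in a single hardware pass.

```python
import hashlib
import zlib

data_store = {}  # SHA-1 fingerprint -> compressed block (stand-in for the Data Store)
ref_counts = {}  # SHA-1 fingerprint -> reference count (assumed counter, see above)


def process_block(block: bytes) -> str:
    """Hash and compress a block, then store it or count it as a duplicate."""
    fingerprint = hashlib.sha1(block).digest()
    compressed = zlib.compress(block)  # produced either way, as in a single-pass design

    if fingerprint in data_store:
        ref_counts[fingerprint] += 1  # duplicate: update the counter, store nothing
        return "duplicate"

    data_store[fingerprint] = compressed  # unique: keep the compressed block
    ref_counts[fingerprint] = 1
    return "unique"


results = [process_block(b) for b in (b"x" * 4096, b"y" * 4096, b"x" * 4096)]
print(results)
# prints "['unique', 'unique', 'duplicate']"
```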

By performing hashing and data transformation (compression and encryption) block operations simultaneously, the BitWackr reduces latency in the deduplication process. This is important because latency is the enemy of deduplication system performance. 

The Combined Effect of Deduplication and Compression

Data deduplication and compression work together to produce a combined data reduction effect. Depending upon the data, the BitWackr’s deduplication algorithms can yield data reduction on the order of 10 to 80 percent for unstructured application and SharePoint data. Compression works on every block that is stored, including blocks that do not deduplicate, shrinking the capacity required to store the data by as much as an additional 66 percent, so the combined data reduction and storage capacity savings over time can range as high as 90 percent (caution: your mileage may vary).
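One way to see how the two effects combine is to multiply the fractions of data that survive each stage. The sketch below assumes the two percentages are independent and that compression applies evenly to everything left after deduplication. That is a simplification, and the function name is illustrative.

```python
def combined_reduction(dedup_fraction: float, compression_fraction: float) -> float:
    """Overall reduction when compression acts on what deduplication leaves behind."""
    remaining = (1.0 - dedup_fraction) * (1.0 - compression_fraction)
    return 1.0 - remaining


# Best case quoted above: 80 percent deduplication, then 66 percent compression.
print(f"{combined_reduction(0.80, 0.66):.1%}")  # prints "93.2%"

# Low end of deduplication with the same compression.
print(f"{combined_reduction(0.10, 0.66):.1%}")  # prints "69.4%"
```

Under these assumptions the best case lands in the neighborhood of the 90 percent figure, and real data will usually fall short of the best case on one or both stages.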

To quantify the relative effects of deduplication and compression on overall data reduction, we use a BitWackr utility program to disaggregate total data reduction into its components. Using typical business data, our observations show that about two-thirds of total BitWackr data reduction stems from data deduplication, while the remaining one-third can be attributed to compression.
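The kind of breakdown described here can be sketched from three sizes: the logical data written, what remains after deduplication, and what remains after compression. This is not the BitWackr utility, whose workings are not described in this post. The function is hypothetical and the sample sizes are made up to reproduce a two-thirds and one-third split.

```python
def disaggregate(logical_size: int, after_dedup: int, after_compression: int):
    """Return (dedup_share, compression_share) of the total capacity saved."""
    total_saved = logical_size - after_compression
    if total_saved <= 0:
        return 0.0, 0.0
    dedup_saved = logical_size - after_dedup
    compression_saved = after_dedup - after_compression
    return dedup_saved / total_saved, compression_saved / total_saved


# Hypothetical volume: 1000 GB written, 400 GB after dedup, 100 GB after compression.
dedup_share, compression_share = disaggregate(1000, 400, 100)
print(f"deduplication: {dedup_share:.0%}, compression: {compression_share:.0%}")
# prints "deduplication: 67%, compression: 33%"
```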

Here’s a short YouTube video describing the BitWackr for Microsoft Windows Server 2008

And here’s another short YouTube video describing BitWackr advanced data reduction for SharePoint


Written by BitWackr

April 12, 2010 at 1:53 pm

