What is Data Deduplication?

As enterprise data storage needs grow, so does the need to retain that data for legal and business reasons. This is prompting IT professionals to ask whether their current data storage and backup strategies can keep up. Tapes, disks, or a combination of both are considered the most viable options for data storage and backup. As the volume and variety of data used at the enterprise level rise, data duplication is very likely to become a challenge.

Data deduplication is a technique used in backup to reduce storage needs by eliminating redundant data. For example, suppose a 20MB PowerPoint presentation is emailed to 10 people. If each of the 10 recipients stores the email in their mail directory, 200MB of storage is allocated. If each of those 10 recipients forwards it to 10 more people, another 100 copies (2,000MB) are stored, so about 2.2GB of storage is devoted to the same single PowerPoint file. Even with an incremental or differential backup scheme, the initial full backup copies all of that data, so the copies take up the same 2.2GB in the backup as well.

In this situation, data deduplication recognizes that the contents of the files are identical, stores only one copy of the file, and creates pointers to it for the rest. With this method of backup, all 110 recipients access the single 20MB copy, and roughly 2.2GB of storage is saved.
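
The single-copy-plus-pointers idea can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: the `SingleInstanceStore` class, its method names, the mailbox names, and the use of SHA-256 to detect identical content are all assumptions made for the example. It models only the first step of the scenario, ten mailboxes holding the same 20MB attachment.

```python
import hashlib


class SingleInstanceStore:
    """Hypothetical file-level deduplicating store: one physical copy per unique content."""

    def __init__(self):
        self.blobs = {}     # content hash -> file bytes (stored once)
        self.pointers = {}  # (mailbox, filename) -> content hash

    def save(self, mailbox, filename, data):
        digest = hashlib.sha256(data).hexdigest()
        # Keep the bytes only if this exact content has not been seen before.
        self.blobs.setdefault(digest, data)
        # Every saver gets a pointer, whether or not new bytes were stored.
        self.pointers[(mailbox, filename)] = digest

    def load(self, mailbox, filename):
        return self.blobs[self.pointers[(mailbox, filename)]]

    def logical_bytes(self):
        # What the users believe they are storing.
        return sum(len(self.blobs[d]) for d in self.pointers.values())

    def physical_bytes(self):
        # What is actually kept on disk.
        return sum(len(b) for b in self.blobs.values())


MB = 1024 * 1024
presentation = bytes(20 * MB)  # placeholder bytes standing in for a 20MB attachment

store = SingleInstanceStore()
for n in range(10):
    store.save(f"user{n}", "slides.ppt", presentation)

print(store.logical_bytes() // MB, "MB referenced")  # 200 MB referenced
print(store.physical_bytes() // MB, "MB stored")     # 20 MB stored
```

Each mailbox still reads back the full file through its pointer, while the store holds the content once.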

Data deduplication products can also detect changes below the file level. By identifying which blocks of a file have changed, they store only those blocks separately from the rest. So if those ten people each made different changes to the PowerPoint file they received, storage would hold only the changed blocks rather than ten complete copies of the presentation. The pointers created by the deduplication product let each person retrieve their own unique version, even though its blocks are stored separately.
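
A rough sketch of block-level deduplication follows, assuming fixed-size 4096-byte blocks and SHA-256 block hashes; the `BlockStore` class and all names in it are hypothetical, and real products may split data differently. Each saved version is kept as an ordered list of block hashes (the pointers), so ten edited versions share every unchanged block with the original.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


class BlockStore:
    """Hypothetical block-level store: each unique block is kept once."""

    def __init__(self):
        self.blocks = {}    # block hash -> block bytes
        self.versions = {}  # version name -> ordered list of block hashes

    def save(self, name, data):
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # store only unseen blocks
            recipe.append(digest)
        self.versions[name] = recipe

    def load(self, name):
        # Follow the pointers to rebuild exactly what was saved.
        return b"".join(self.blocks[d] for d in self.versions[name])


# A stand-in "presentation" made of 100 distinct blocks.
original = b"".join(i.to_bytes(2, "big") * (BLOCK_SIZE // 2) for i in range(100))

store = BlockStore()
store.save("original", original)

edited = {}
for n in range(10):
    version = bytearray(original)
    start = n * BLOCK_SIZE
    version[start:start + 4] = b"EDIT"  # each user overwrites a different block
    edited[f"user{n}"] = bytes(version)
    store.save(f"user{n}", edited[f"user{n}"])

# 11 versions x 100 blocks = 1100 blocks referenced,
# but only the 100 original blocks plus 10 changed ones are stored.
print(len(store.blocks))  # 110
```

Note that fixed-size blocks only catch in-place edits like these; an insertion that shifts the rest of the file would change every later block boundary.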

In short, a repeated sequence is not stored again; a reference to the first stored copy of that sequence is kept instead. This process is completely hidden from users and applications, so the data reads back exactly as it was written.

How deduplication works

Deduplication software splits the incoming data stream into segments, computes a unique identifier (typically a hash) for each segment, and compares it against the identifiers of previously stored segments. If an incoming segment duplicates data that is already stored, only a reference to the existing segment is recorded in the index. The segment itself is not stored again, which reduces the amount of storage that has to be allocated.

Deduplication technology types

There are two types of data deduplication in use:

  • Post-process deduplication: The data is first written to the target device as-is, and deduplication runs afterwards. The advantage is that even if the dedupe process is slow, no backup time is lost waiting for deduplication to occur. The drawback is that it is almost impossible to predict how long the deduplication process will take. However, many storage manufacturers now offer time-based (scheduled) deduplication software.
  • In-line deduplication: Hash calculations take place at the target device while the data is being written. If a duplicate is found, the new block of data is not written. The advantage is that duplicates never land on the target device, which can substantially reduce the storage needed there (how much depends on how redundant the data is). The drawback is that the hash calculations can slow down data writes.
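
The trade-off between the two types can be made concrete with a small sketch. The classes `InlineTarget` and `PostProcessTarget`, the `peak_segments` counter, and the six-segment sample stream are all hypothetical and exist only to illustrate the idea; SHA-256 is an assumed choice of hash. The in-line target hashes on every write and never holds a duplicate, while the post-process target writes everything first and deduplicates in a later pass.

```python
import hashlib


def fingerprint(segment):
    return hashlib.sha256(segment).hexdigest()


class InlineTarget:
    """Hashes each segment on the write path; duplicates are never written."""

    def __init__(self):
        self.index = {}         # fingerprint -> stored segment
        self.layout = []        # ordered fingerprints describing the stream
        self.peak_segments = 0  # most segments ever held on the device

    def write(self, segment):
        fp = fingerprint(segment)  # hashing cost is paid during the write
        if fp not in self.index:
            self.index[fp] = segment
        self.layout.append(fp)
        self.peak_segments = max(self.peak_segments, len(self.index))

    def read(self):
        return [self.index[fp] for fp in self.layout]


class PostProcessTarget:
    """Writes every segment as-is; a later pass removes the duplicates."""

    def __init__(self):
        self.staging = []       # raw segments exactly as written
        self.index = {}
        self.layout = []
        self.peak_segments = 0

    def write(self, segment):
        self.staging.append(segment)  # no hashing on the write path
        self.peak_segments = max(self.peak_segments, len(self.staging))

    def deduplicate(self):
        # Runs after the backup window, e.g. on a schedule.
        for segment in self.staging:
            fp = fingerprint(segment)
            self.index.setdefault(fp, segment)
            self.layout.append(fp)
        self.staging = []

    def read(self):
        return [self.index[fp] for fp in self.layout]


# Six segments, only three of them unique.
segments = [s * 4096 for s in (b"A", b"B", b"A", b"A", b"C", b"B")]

inline, post = InlineTarget(), PostProcessTarget()
for seg in segments:
    inline.write(seg)
    post.write(seg)
post.deduplicate()

print(inline.peak_segments, len(inline.index))  # 3 3
print(post.peak_segments, len(post.index))      # 6 3
```

Both targets end up storing three unique segments and can rebuild the original stream, but the post-process target needed room for all six segments before its deduplication pass ran.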

Uses of Data Deduplication

Deduplication is well suited to online backup operations, where it eliminates the repeated copying and storing of the same set of data files. As a result, enterprises of all sizes can have fast, reliable and cost-effective data backup and recovery.

Data Deduplication Benefits

Data deduplication offers its users a range of benefits:

  • Saves storage space in disk-to-disk backups: Enterprises see savings because fewer disks are needed for primary backup, or because the monthly charges for an off-site backup service go down.
  • Reduces cooling costs: As the number of disks decreases, so does the need for cooling.
  • Shrinks the hardware footprint: With less storage media needed, less floor and rack space is required.
  • Saves bandwidth: As less data is sent over the wire, bandwidth costs can be cut.
  • Can shorten data restoration time: With less data to transfer, recovery can be faster.



