openstack team mailing list archive

Re: Enabling data deduplication on Swift


Andi Abes asked:

> Doesn't that depend on the ratios of read vs write?
> In a read tilted environment (e.g. CDN's, image stores etc), being able to dedup at the block level in the
> relatively rare write case seems a boon. The simplification this could allow - performing localized dedup
> (i.e. each object server deduping just its local storage) seems worth while.

For the most part, deduplication has no impact on read performance. The same chunks will be fetched
whether or not they were deduplicated.
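
To make that concrete, here is a rough sketch in Python. It is a toy, not Swift code: ChunkStore, split_chunks
and the 4-byte chunk size are all made up for illustration. The only point is that a read walks the same
manifest of chunks either way, so the number of chunk fetches does not change when deduplication is turned on.

```python
import hashlib

CHUNK_SIZE = 4  # unrealistically small, purely for illustration


def split_chunks(data, size=CHUNK_SIZE):
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    """Hypothetical in-memory chunk store, with or without deduplication."""

    def __init__(self, dedup):
        self.dedup = dedup
        self.chunks = {}      # chunk key -> chunk bytes
        self.manifests = {}   # object name -> ordered list of chunk keys
        self.fetches = 0      # number of chunk reads served

    def put(self, name, data):
        keys = []
        for index, chunk in enumerate(split_chunks(data)):
            if self.dedup:
                # Identical chunks share one key, so they are stored once.
                key = hashlib.sha256(chunk).hexdigest()
            else:
                # Every chunk gets its own key, duplicates included.
                key = "%s/%d" % (name, index)
            self.chunks[key] = chunk
            keys.append(key)
        self.manifests[name] = keys

    def get(self, name):
        parts = []
        for key in self.manifests[name]:
            self.fetches += 1
            parts.append(self.chunks[key])
        return b"".join(parts)
```

Storing the same object in a ChunkStore(dedup=False) and a ChunkStore(dedup=True) gives fewer stored chunks in
the second, but get() performs one fetch per manifest entry in both.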

If you have a central metadata system (like GFS or HDFS), then deduplication can impair optimizing the location
of the chunks for streaming reads. But with hash-driven algorithms you face a trade-off with or without
deduplication: you either place the entire object on one server, which will preclude parallelizing the fetch,
or you distribute the object's chunks to multiple servers, which will impair the efficiency of a slow
streaming read.
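
The two hash-driven options can be sketched as follows. This is only an illustration under simplifying
assumptions: pick_server uses a plain hash-modulo mapping, which is not Swift's actual ring, and the function
and server names are hypothetical.

```python
import hashlib


def pick_server(key, servers):
    """Map a key to a server by hashing it (simplified modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]


def place_whole_object(name, chunk_fingerprints, servers):
    """Option 1: hash the object name, so every chunk lands on one server.

    A streaming read talks to a single server, but the fetch cannot be
    spread across servers.
    """
    server = pick_server(name, servers)
    return {fp: server for fp in chunk_fingerprints}


def place_per_chunk(chunk_fingerprints, servers):
    """Option 2: hash each chunk fingerprint, so chunks spread out.

    Chunks can be fetched in parallel, but even a slow sequential read
    has to visit many servers.
    """
    return {fp: pick_server(fp, servers) for fp in chunk_fingerprints}
```

In neither case does the placement function look at whether a chunk is shared with another object, which is
why deduplication does not change the picture here.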

Because distributed deduplication relies on fingerprinted chunks, it has the advantage of allowing unrestricted
chunk caching, which is the real solution to optimizing reads of extremely popular data.
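
As a minimal sketch of why fingerprints make caching easy: a chunk's fingerprint is derived from its content,
so a cached entry can never go stale and any node can cache any chunk without invalidation traffic. The
ChunkCache class below, its LRU eviction policy, and the fetch callback are assumptions for illustration, not
part of Swift.

```python
import hashlib
from collections import OrderedDict


class ChunkCache:
    """Hypothetical LRU cache of chunks keyed by SHA-256 fingerprint."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch            # callable: fingerprint -> chunk bytes
        self.entries = OrderedDict()  # fingerprint -> chunk, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, fingerprint):
        if fingerprint in self.entries:
            # Cache hit: no freshness check needed, content cannot change.
            self.entries.move_to_end(fingerprint)
            self.hits += 1
            return self.entries[fingerprint]
        self.misses += 1
        chunk = self.fetch(fingerprint)
        # The fingerprint doubles as an integrity check on the fetched data.
        if hashlib.sha256(chunk).hexdigest() != fingerprint:
            raise ValueError("chunk does not match its fingerprint")
        self.entries[fingerprint] = chunk
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return chunk
```

A popular chunk that appears in many objects is served from the same cache entry no matter which object the
reader asked for.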