openstack team mailing list archive

Re: Swift block-level deduplication

 

Eoghan Glynn asked:

> My question is whether either or both of these approaches involve active client participation in enabling duplicate chunk detection?

> One could see a spectrum ranging between:

> 1. Client actively breaks the object into chunks, selects the
>    hashing algorithm, calculates fingerprint and then only uploads
>    if Swift reports that fingerprint is unknown.

> 2. Client determines which objects are worth deduping, maybe has
>    some influence on chunk size and/or hashing, but fingerprint
>    calculation is all handled internally by Swift.

> 3. Client is entirely uninvolved, deduplication is handled
>    transparently in the object storage layer and enabled either
>    globally or per-container.

The versioning/dedup ring we are working on at Nexenta will support both #1 and #3. I'll be presenting on this at the Summit.

The ultimate goal of distributed dedup is scenario #1. Only the client software can determine the optimum chunk boundaries,
and transferring over the network *before* doing deduplication means that the only savings you get from dedup are in
disk storage and disk bandwidth. Network bandwidth is far more likely to be the bottleneck.
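
To make scenario #1 concrete, here is a minimal sketch of the client-side flow: chunk, fingerprint, ask whether the fingerprint is known, and upload only if it is not. Everything in it is an assumption for illustration. `ChunkStore` is an in-memory stand-in for the storage side, `has_chunk`/`put_chunk` are hypothetical names and not Swift API calls, and fixed-size chunking with SHA-256 is simply the easiest choice to show, not the optimum boundaries a real client would pick.

```python
import hashlib

CHUNK_SIZE = 4  # deliberately tiny so the example is easy to follow (assumption)


class ChunkStore:
    """In-memory stand-in for the storage cluster (hypothetical, not a Swift API)."""

    def __init__(self):
        self.chunks = {}         # fingerprint -> chunk bytes
        self.bytes_received = 0  # payload bytes that actually crossed the "network"

    def has_chunk(self, fingerprint):
        # Cheap lookup: only the fingerprint travels, not the chunk payload.
        return fingerprint in self.chunks

    def put_chunk(self, fingerprint, data):
        self.bytes_received += len(data)
        self.chunks[fingerprint] = data


def split_chunks(data, size=CHUNK_SIZE):
    """Fixed-size chunking; a real client could choose smarter boundaries."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def upload(store, data, size=CHUNK_SIZE):
    """Upload only chunks the store has not seen; return the object's manifest."""
    manifest = []
    for chunk in split_chunks(data, size):
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if not store.has_chunk(fingerprint):
            store.put_chunk(fingerprint, chunk)
        manifest.append(fingerprint)
    return manifest


def download(store, manifest):
    """Reassemble an object from its ordered list of chunk fingerprints."""
    return b"".join(store.chunks[fp] for fp in manifest)
```

With this sketch, uploading `b"aaaabbbbaaaa"` sends only 8 of the 12 payload bytes, because the third chunk repeats the first. That network saving is exactly what is lost when chunking happens after the transfer.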

But you have to support #3 for compatibility. You cannot expect that clients will be ready to do these steps the second you
deploy your new solution. You can only make the option available.

One very viable solution is to deploy a storage proxy as a VM on the same physical host as the client. The "bandwidth" between
the client VM and the proxy VM is not worth saving, and you can apply the dedup algorithm in the proxy rather than in the client
itself. You don't get ideal chunking, but you do get a much easier deployment model.


