openstack team mailing list archive

Re: Swift Consistency Guarantees?

 

Some general notes for consistency and swift (all of the below assumes
3 replicas):

Objects:

  When swift PUTs an object, it attempts to write to all 3 replicas
and only returns success if 2 or more replicas were written
successfully.  A newly created object has fairly strong
read-after-create consistency.  The only case where this does not
hold is when all of the devices that hold the object are unavailable.
When an object is PUT on top of an existing object, eventual
consistency can come into play in failure scenarios, because some
replicas may still hold the older version.  This is very similar to
S3's consistency model.  It is also important to note that if a
device is not available for a new replica to be written to, swift
will attempt to write that replica to a handoff node.
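
To make the write rule above concrete, here is a rough sketch in
Python.  It is only a toy model and not swift's actual proxy code: the
put_object function, the node names and the write callback are all
made up for illustration, and the real handoff selection is driven by
the ring rather than a plain list.

```python
from typing import Callable, Iterable, List


def put_object(primaries: List[str], handoffs: Iterable[str],
               write: Callable[[str], bool], quorum: int = 2) -> bool:
    """Try to place one replica per primary node, falling back to handoffs.

    Hypothetical model: `write(node)` returns True if the replica was
    stored on that node.  The PUT succeeds only if at least `quorum`
    replicas were written (2 of 3 in the description above).
    """
    spare = iter(handoffs)
    written = 0
    for node in primaries:
        if write(node):
            written += 1
            continue
        # Primary unavailable: try handoff nodes until one accepts the replica.
        for handoff in spare:
            if write(handoff):
                written += 1
                break
    return written >= quorum


# One primary is down, so its replica lands on the handoff node instead.
down = {"node-b"}
result = put_object(["node-a", "node-b", "node-c"], ["handoff-1"],
                    write=lambda node: node not in down)
print(result)
```

With two primaries down and no handoff available, only one replica
can be written and the same call reports failure.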

  When swift GETs an object, by default it will return the first
object it finds from any available replica.  Using the X-Newest
header requires swift to compare the timestamps of the available
replicas and only serve the one with the most recent timestamp.  If
the only available replica holds an older version of the object, that
version will be returned, but in practice this would be quite an edge
case.
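
The read behaviour described above can be sketched the same way.
Again this is a simplified model under my own assumptions, not the
proxy implementation: Replica and get_object are hypothetical names,
and a real request would simply carry the X-Newest header.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    node: str
    timestamp: float        # timestamp of the PUT that produced this copy
    available: bool = True


def get_object(replicas: List[Replica], newest: bool = False) -> Optional[Replica]:
    """Pick the replica a GET would be served from in this toy model."""
    reachable = [r for r in replicas if r.available]
    if not reachable:
        return None                      # every replica is down: the GET fails
    if newest:
        # X-Newest behaviour: compare timestamps across reachable replicas.
        return max(reachable, key=lambda r: r.timestamp)
    return reachable[0]                  # default: first replica found


replicas = [Replica("node-a", 100.0), Replica("node-b", 200.0),
            Replica("node-c", 200.0)]
default_hit = get_object(replicas)
newest_hit = get_object(replicas, newest=True)

# Edge case from the text: both updated replicas are down.
replicas[1].available = False
replicas[2].available = False
stale_hit = get_object(replicas, newest=True)
```

In the last call only the older copy is reachable, so it is what gets
returned even with newest=True.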

Container Listings:

  When an object is PUT into swift, each object server that a replica
is written to is also assigned one of the container servers to
update.  On the object server, after the replica is successfully
written, an attempt is made to update the listing on its assigned
container server.  If that update fails, it is queued locally (this
is called an async pending) to be applied out of band by another
process.  The container updater process continually looks for these
async pendings, attempts to make each update, and removes it
from the queue when successful.  There are many reasons that a
container update can fail (failed device, timeout, heavily used
container, etc.).  Thus container listings are eventually consistent
in all cases (which is also very similar to S3).
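
As a rough illustration of the async pending flow described above
(a sketch only: FakeContainerServer, ObjectServer and their methods
are invented names, and the real queue lives on disk rather than in
memory):

```python
from collections import deque
from typing import Deque, List


class FakeContainerServer:
    """Stand-in for a container server that may be unreachable."""

    def __init__(self) -> None:
        self.up = True
        self.listing: List[str] = []

    def update(self, name: str) -> bool:
        if not self.up:
            return False                 # failed device, timeout, etc.
        self.listing.append(name)
        return True


class ObjectServer:
    """Object server with a local queue of failed container updates."""

    def __init__(self, container: FakeContainerServer) -> None:
        self.container = container
        self.async_pendings: Deque[str] = deque()

    def put(self, name: str) -> None:
        # The replica itself is assumed to be written already; only the
        # container listing update is modelled here.
        if not self.container.update(name):
            self.async_pendings.append(name)

    def run_updater_once(self) -> None:
        # One pass over the queue: retry each pending update exactly once.
        for _ in range(len(self.async_pendings)):
            name = self.async_pendings.popleft()
            if not self.container.update(name):
                self.async_pendings.append(name)


container = FakeContainerServer()
server = ObjectServer(container)

container.up = False
server.put("photo.jpg")          # object stored, listing update queued
listing_before = list(container.listing)

container.up = True
server.run_updater_once()        # the out-of-band pass catches up
listing_after = list(container.listing)
```

Between the PUT and the updater pass the object exists but is missing
from the listing, which is the consistency window in question.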

Consistency Window:

For objects, the biggest factor that determines the consistency window
is object replication time.  In general this is pretty quick for even
large clusters, and we are always working on making this better.  If
you want to limit consistency windows for objects, then you want to
make sure you isolate the chances of failure as much as possible.  By
setting up your zones to be as isolated as possible (separate power,
network, physical locality, etc.) you minimize the chance that there
will be a consistency window.

For containers, the biggest factor that determines the consistency
window is disk IO for the sqlite databases.  In recent testing, basic
SATA hardware can handle somewhere in the range of 100 PUTs per second
(for smaller containers) to around 10 PUTs per second for very large
containers (millions of objects) before async pendings start stacking
up and you begin to see consistency issues.  With better hardware (for
example RAID 10 of SSD drives), it is easy to get 400-500 PUTs per
second with containers that have a billion objects in them.  It is also
a good idea to run your container/account servers on separate hardware
from the object servers.  After that, the same things that apply to
object servers also apply to the container servers.
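
For a back-of-the-envelope feel of how async pendings stack up, here
is a very simple linear model.  It is my own simplification, not a
measurement: it assumes a constant PUT rate and a constant rate of
updates the container can absorb, and the function names are made up.

```python
def backlog_after(put_rate: float, update_capacity: float,
                  seconds: float, backlog: float = 0.0) -> float:
    """Pending updates after `seconds` at a steady PUT rate.

    Assumes the queue grows by (put_rate - update_capacity) per second
    and can never drop below zero.
    """
    return max(0.0, backlog + (put_rate - update_capacity) * seconds)


def drain_time(backlog: float, put_rate: float,
               update_capacity: float) -> float:
    """Seconds until the queue is empty, or infinity if it never drains."""
    spare = update_capacity - put_rate
    if spare <= 0:
        return float("inf")
    return backlog / spare


# 150 PUTs/s for one minute against a container that sustains 100/s.
pending = backlog_after(put_rate=150, update_capacity=100, seconds=60)

# Load then drops to 50 PUTs/s; how long until listings catch up?
catch_up = drain_time(pending, put_rate=50, update_capacity=100)
print(pending, catch_up)
```

In this model the consistency window only closes once the incoming
rate falls below what the container disks can keep up with.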

All that said, please don't just take my word for it, and test it for
yourself :)

--
Chuck




On Fri, Jan 20, 2012 at 2:18 PM, Nikolaus Rath <Nikolaus@xxxxxxxx> wrote:
> Hmm, but if there are e.g. 4 replicas, two of which are up-to-date but
> offline, and two available but online, swift would serve the old version?
>
> -Niko
>
>
> On 01/20/2012 03:06 PM, Chmouel Boudjnah wrote:
>> As Stephen mentioned, if there is only one replica left Swift would not
>> serve it.
>>
>> Chmouel.
>>
>> On Fri, Jan 20, 2012 at 1:58 PM, Nikolaus Rath <Nikolaus@xxxxxxxx> wrote:
>>
>>     Hi,
>>
>>     Sorry for being so persistent, but I'm still not sure what happens if
>>     the 2 servers that carry the new replica are down, but the 1 server that
>>     has the old replica is up. Will GET fail or return the old replica?
>>
>>     Best,
>>     Niko
>>
>>     On 01/20/2012 02:52 PM, Stephen Broeker wrote:
>>     > By default there are 3 replicas.
>>     > A PUT Object will return after 2 replicas are done.
>>     > So if all nodes are up then there are at least 2 replicas.
>>     > If all replica nodes are down, then the GET Object will fail.
>>     >
>>     > On Fri, Jan 20, 2012 at 11:21 AM, Nikolaus Rath <Nikolaus@xxxxxxxx> wrote:
>>     >
>>     >     Hi,
>>     >
>>     >     So if an object update has not yet been replicated on all
>>     nodes, and all
>>     >     nodes that have been updated are offline, what will happen?
>>     Will swift
>>     >     recognize this and give me an error, or will it silently
>>     return the
>>     >     older version?
>>     >
>>     >     Thanks,
>>     >     Nikolaus
>>     >
>>     >
>>     >     On 01/20/2012 02:14 PM, Stephen Broeker wrote:
>>     >     > If a node is down, then it is ignored.
>>     >     > That is the whole point about 3 replicas.
>>     >     >
>>     >     > On Fri, Jan 20, 2012 at 10:43 AM, Nikolaus Rath <Nikolaus@xxxxxxxx> wrote:
>>     >     >
>>     >     >     Hi,
>>     >     >
>>     >     >     What happens if one of the nodes is down? Especially if that
>>     >     node holds
>>     >     >     the newest copy?
>>     >     >
>>     >     >     Thanks,
>>     >     >     Nikolaus
>>     >     >
>>     >     >     On 01/20/2012 12:33 PM, Stephen Broeker wrote:
>>     >     >     > The X-Newest header can be used by a GET Operation to
>>     ensure
>>     >     that
>>     >     >     all of the
>>     >     >     > Storage Nodes (3 by default) are queried for the
>>     latest copy of
>>     >     >     the Object.
>>     >     >     > The COPY Object operation already has this functionality.
>>     >     >     >
>>     >     >     > On Fri, Jan 20, 2012 at 9:12 AM, Nikolaus Rath <Nikolaus@xxxxxxxx> wrote:
>>     >     >     >
>>     >     >     >     Hi,
>>     >     >     >
>>     >     >     >     No one able to further clarify this?
>>     >     >     >
>>     >     >     >     Does swift offer there read-after-create
>>     consistence like
>>     >     >     >     non-us-standard S3? What are the precise syntax and
>>     >     semantics of
>>     >     >     >     X-Newest header?
>>     >     >     >
>>     >     >     >     Best,
>>     >     >     >     Nikolaus
>>     >     >     >
>>     >     >     >
>>     >     >     >     On 01/18/2012 10:15 AM, Nikolaus Rath wrote:
>>     >     >     >     > Michael Barton <mike-launchpad@xxxxxxxxxxxxxxxx> writes:
>>     >     >     >     >> On Tue, Jan 17, 2012 at 4:55 PM, Nikolaus Rath <Nikolaus@xxxxxxxx> wrote:
>>     >     >     >     >>> Amazon S3 and Google Storage make very
>>     explicit (non-)
>>     >     >     consistency
>>     >     >     >     >>> guarantees for stored objects. I'm looking for
>>     a similar
>>     >     >     >     documentation
>>     >     >     >     >>> about OpenStack's Swift, but haven't had much
>>     success.
>>     >     >     >     >>
>>     >     >     >     >> I don't think there's any documentation on
>>     this, but
>>     >     it would
>>     >     >     >     probably
>>     >     >     >     >> be good to write up.  Consistency in Swift is very
>>     >     similar
>>     >     >     to S3.
>>     >     >     >     >> That is, there aren't many non-eventual consistency
>>     >     guarantees.
>>     >     >     >     >>
>>     >     >     >     >> Listing updates can happen asynchronously
>>     (especially
>>     >     under
>>     >     >     >     load), and
>>     >     >     >     >> older versions of files can show up in requests
>>     (deletes
>>     >     >     are just a
>>     >     >     >     >> new "deleted" version of the file).
>>     >     >     >     >
>>     >     >     >     > Ah, ok. Thanks a lot for stating this so explicitly.
>>     >     There seems
>>     >     >     >     to be a
>>     >     >     >     > lot of confusion about this, now I can at least
>>     point
>>     >     people to
>>     >     >     >     > something.
>>     >     >     >     >
>>     >     >     >     >> Swift can generally be relied on for
>>     read-after-write
>>     >     >     consistency,
>>     >     >     >     >> like S3's regions other than the US
>>     Standard region.
>>     >     >      The reason
>>     >     >     >     >> S3 in US Standard doesn't have this guarantee
>>     is because
>>     >     >     it's more
>>     >     >     >     >> geographically widespread - something Swift
>>     isn't good at
>>     >     >     yet.  I can
>>     >     >     >     >> imagine we'll have the same limitation when we
>>     get there.
>>     >     >     >     >
>>     >     >     >     > Do you mean read-after-create consistency? Because
>>     >     below you
>>     >     >     say about
>>     >     >     >     > read-after-write:
>>     >     >     >     >
>>     >     >     >     >>> - If I receive a (non-error) response to a PUT
>>     >     request, am I
>>     >     >     >     guaranteed
>>     >     >     >     >>> that the object will be immediately included
>>     in all
>>     >     object
>>     >     >     >     listings in
>>     >     >     >     >>> every possible situation?
>>     >     >     >     >>
>>     >     >     >     >> Nope.
>>     >     >     >     >
>>     >     >     >     > ..so is there such a guarantee for PUTs of *new*
>>     objects
>>     >     >     (like S3 non
>>     >     >     >     > us-classic), or does "can generally be relied
>>     on" just
>>     >     mean
>>     >     >     that the
>>     >     >     >     > chances for new puts are better?
>>     >     >     >     >
>>     >     >     >     >> Also like S3, Swift can't make any strong
>>     guarantees
>>     >     about
>>     >     >     >     >> read-after-update or read-after-delete consistency.
>>     >      We do
>>     >     >     have an
>>     >     >     >     >> "X-Newest" header that can be added to GETs and
>>     HEADs to
>>     >     >     make the
>>     >     >     >     >> proxy do a quorum of backend servers and return the
>>     >     newest
>>     >     >     available
>>     >     >     >     >> version, which greatly improves these, at the
>>     cost of
>>     >     latency.
>>     >     >     >     >
>>     >     >     >     > That sounds very interesting. Could you give
>>     some more
>>     >     >     details on what
>>     >     >     >     > exactly is guaranteed when using this header?
>>     What happens
>>     >     >     if the
>>     >     >     >     server
>>     >     >     >     > having the newest copy is down?
>>     >     >     >     >
>>     >     >     >     >>> - If the swift server loses an object, will the
>>     >     object name
>>     >     >     >     still be
>>     >     >     >     >>> returned in object listings? Will attempts to
>>     >     retrieve it
>>     >     >     result
>>     >     >     >     in 404
>>     >     >     >     >>> errors (as if it never existed) or a different
>>     error?
>>     >     >     >     >>
>>     >     >     >     >> It will show up in listings, but give a 404
>>     when you
>>     >     attempt to
>>     >     >     >     >> retrieve it.  I'm not sure how we can improve that
>>     >     with Swift's
>>     >     >     >     >> general model, but feel free to make suggestions.
>>     >     >     >     >
>>     >     >     >     > From an application programmers point of view, it
>>     >     would be very
>>     >     >     >     helpful
>>     >     >     >     > if lost objects could be distinguished from
>>     non-existing
>>     >     >     object by a
>>     >     >     >     > different HTTP error. Trying to access a
>>     non-existing
>>     >     object may
>>     >     >     >     > indicate a bug in the application, so it would
>>     be nice to
>>     >     >     know when it
>>     >     >     >     > happens.
>>     >     >     >     >
>>     >     >     >     > Also, it would be very helpful if there was a
>>     way to list
>>     >     >     all lost
>>     >     >     >     > objects without having to issue HEAD requests
>>     for every
>>     >     >     stored object.
>>     >     >     >     > Could this information be added to the XML and JSON
>>     >     output of
>>     >     >     >     container
>>     >     >     >     > listings? Then an application would have the
>>     chance to
>>     >     >     periodically
>>     >     >     >     > check for lost data, rather than having to
>>     handle all lost
>>     >     >     objects at
>>     >     >     >     > the instant they're required.
>>     >     >     >     >
>>     >     >     >     >
>>     >     >     >     > I am working on a swift backend for S3QL
>>     >     >     >     > (http://code.google.com/p/s3ql/), a program that
>>     exposes
>>     >     >     online cloud
>>     >     >     >     > storage as a local UNIX file system. To prevent data
>>     >     >     corruption, there
>>     >     >     >     > are two requirements that I'm currently
>>     struggling to
>>     >     >     provide with the
>>     >     >     >     > swift backend:
>>     >     >     >     >
>>     >     >     >     > - There needs to be a way to reliably check if
>>     one object
>>     >     >     (holding the
>>     >     >     >     >   file system metadata) is the newest version.
>>     >     >     >     >
>>     >     >     >     >   The S3 backend does this by requiring storage
>>     in the non
>>     >     >     us-classic
>>     >     >     >     >   regions and using list-after-create
>>     consistency with a
>>     >     >     marker object
>>     >     >     >     >   that has a "generation number" of the metadata
>>     >     >     embedded in its
>>     >     >     >     >   name.
>>     >     >     >     >
>>     >     >     >     >   I'm not yet sure if this would work with swift
>>     as well
>>     >     >     (the google
>>     >     >     >     >   storage backend just relies on the strong
>>     >     read-after-write
>>     >     >     >     >   consistency).
>>     >     >     >     >
>>     >     >     >     > - The file system checker needs a way to
>>     identify lost
>>     >     objects.
>>     >     >     >     >
>>     >     >     >     >   Here the S3 backend just relies on the durability
>>     >     >     guarantee that
>>     >     >     >     >   effectively no object will ever be lost.
>>     >     >     >     >
>>     >     >     >     >   Again, I'm not sure how to implement this for
>>     swift.
>>     >     >     >     >
>>     >     >     >     >
>>     >     >     >     > Any suggestions?
>>     >     >     >     >
>>     >     >     >     >
>>     >     >     >     >
>>     >     >     >     > Best,
>>     >     >     >     >
>>     >     >     >     >    -Nikolaus
>>     >     >     >     >
>>     >     >     >
>>     >     >     >
>>     >     >     >       -Nikolaus
>>     >     >     >
>>     >     >     >     --
>>     >     >     >      »Time flies like an arrow, fruit flies like a
>>     Banana.«
>>     >     >     >
>>     >     >     >      PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6  02CF
>>     A9AD B7F8
>>     >     >     AE4E 425C
>>     >     >     >
>>     >     >     >     _______________________________________________
>>     >     >     >     Mailing list: https://launchpad.net/~openstack
>>     >     >     >     Post to     : openstack@xxxxxxxxxxxxxxxxxxx
>>     >     >     >     Unsubscribe : https://launchpad.net/~openstack
>>     >     >     >     More help   : https://help.launchpad.net/ListHelp
>>     >     >     >
>>     >     >     >
>>     >     >
>>     >     >
>>     >     >
>>     >     >
>>     >
>>     >
>>     >
>>     >
>>
>>
>>
>>
>
>

