openstack team mailing list archive
Message #04262
Identical Timestamps
Examining the Proxy and Object Server code, I believe there is a problem
when two Proxies attempt to update the same object at exactly the same
time (i.e. the two proxies have identical timestamps for the transaction).
For PUT, POST and DELETE the Object Server will rename the temporary
file to <timestamp>.[data|ts]
even if <timestamp> already exists. However in no case will the
container update be done unless
orig_timestamp is missing or less than the new timestamp.
So a concurrent PUT and DELETE will:
* Result in *both* the .data and .ts file being created, and *neither*
being deleted as "old".
Since ".ts" sorts later than ".data", the delete will be
effective for subsequent GETs.
* The container will be updated by each Object Server *once*, but
different Object Servers
may receive the concurrent transactions in varying orders. The
Container Server will end up
with the Object as perceived by the first transaction on the last
Object Server (essentially
arbitrary).
Even if all three Object Servers perform the two transactions in the
same order, the result can
be an object that is effectively deleted on the Object Servers but still
listed in the Container.
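To make the PUT/DELETE case concrete, here is a toy model of the behaviour described above. It is only a sketch of my reading, not the Object Server code: the class ObjectServerModel and its methods are hypothetical names, and timestamps are assumed to be fixed-width strings so that string comparison matches numeric order.

```python
class ObjectServerModel:
    """Toy model of one Object Server's directory for a single object."""

    def __init__(self):
        self.files = set()           # e.g. {"1325000000.00000.data"}
        self.container_updates = []  # (method, timestamp) sent to the container

    @staticmethod
    def _timestamp(filename):
        # "1325000000.00000.data" -> "1325000000.00000"
        return filename.rsplit(".", 1)[0]

    def handle(self, method, timestamp):
        ext = ".ts" if method == "DELETE" else ".data"
        orig_timestamp = max(
            (self._timestamp(f) for f in self.files), default=None)
        # The rename happens unconditionally.
        self.files.add(timestamp + ext)
        # Only strictly older files are removed as "old".
        self.files = {f for f in self.files
                      if self._timestamp(f) >= timestamp}
        # The container is updated only if orig_timestamp is missing or older.
        if orig_timestamp is None or orig_timestamp < timestamp:
            self.container_updates.append((method, timestamp))

    def effective_file(self):
        # The file name that sorts last decides what a later GET sees.
        return max(self.files) if self.files else None


T = "1325000000.00000"

first = ObjectServerModel()    # receives the PUT first
first.handle("PUT", T)
first.handle("DELETE", T)

second = ObjectServerModel()   # receives the DELETE first
second.handle("DELETE", T)
second.handle("PUT", T)

print(sorted(first.files), first.container_updates)
print(sorted(second.files), second.container_updates)
```

In this model both servers end up holding both files with the .ts file winning, yet the first server reports a PUT to the container while the second reports a DELETE.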
With two concurrent PUTs, the latter PUT renames the tempfile to
<timestamp>.data, but only the original transaction updates the Container.
This can result in different versions of Objects on different servers,
and almost certainly will result
in the etag held by the Object Server not being in sync with the etag
held for the Object by the
Container Server.
Changing the test "orig_timestamp <" to "orig_timestamp <=" does not
really solve the problem; it will just make it harder to catch.
This is because while any one Object Server will now be consistent in
terms of its interactions with
the Container Server you could still have two Proxy Servers submit
updates to Object X with
Timestamp Y and have *both* succeed, and have some of the Object Servers
have the object as
put from Proxy A while others will have it as put by Proxy B.
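The difference between the two tests can be sketched as follows. This is a simplified model under my own assumptions, not the actual server code: put() and run() are hypothetical helpers, each Object Server is reduced to a dict, and the etag strings are placeholders.

```python
def put(server, timestamp, etag, inclusive):
    """Apply one PUT to a toy Object Server (a plain dict)."""
    orig_timestamp = server.get("timestamp")
    # The rename to <timestamp>.data always happens, so the last writer wins.
    server["timestamp"] = timestamp
    server["etag"] = etag
    if inclusive:
        should_update = orig_timestamp is None or orig_timestamp <= timestamp
    else:
        should_update = orig_timestamp is None or orig_timestamp < timestamp
    if should_update:
        # What this server tells the Container Server about the object.
        server["container_etag"] = etag
    return server


Y = "1325000000.00000"  # identical timestamp issued by both proxies


def run(order, inclusive):
    server = {}
    for etag in order:
        put(server, Y, etag, inclusive)
    return server


# Current test ("<"): the file and the container update disagree.
strict_result = run(["etag-from-A", "etag-from-B"], inclusive=False)

# Proposed test ("<="): each server is self-consistent, but two servers
# that saw the PUTs in opposite orders hold different objects.
server_1 = run(["etag-from-A", "etag-from-B"], inclusive=True)
server_2 = run(["etag-from-B", "etag-from-A"], inclusive=True)

print(strict_result)
print(server_1)
print(server_2)
```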
I believe the intent was for the Auditor to catch this by comparing the
etag for each Object in the
Container DB with the actual etag. But I can find no code that
references the etag in the Container DB.
The Object Auditor compares the calculated MD5 versus the etag stored as
metadata for that file.
If the auditor were to cross-validate the etag AND the check in the
object servers was changed from "orig_timestamp <" to "orig_timestamp <="
then the result would be eventual consistency.
There would be a period of reduced resiliency before the incorrectly
updated Object Servers were
repaired by the Auditor, but given the extremely low frequency of
identical timestamps this would
probably be acceptable.
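A rough sketch of the cross-validation I have in mind is below. The function audit_object() is hypothetical; in particular, how the auditor would obtain the etag from the Container DB is left open, so it is simply passed in as an argument here.

```python
import hashlib


def audit_object(data, metadata_etag, container_etag):
    """Return a list of problems found for one object replica.

    data           -- the bytes stored in the <timestamp>.data file
    metadata_etag  -- the etag stored as metadata for that file
    container_etag -- the etag recorded for the object in the Container DB
    """
    problems = []
    # The check the Object Auditor already performs today.
    if hashlib.md5(data).hexdigest() != metadata_etag:
        problems.append("data does not match metadata etag")
    # The additional cross-check proposed above.
    if metadata_etag != container_etag:
        problems.append("metadata etag does not match container etag")
    return problems


payload_a = b"object as put by Proxy A"
payload_b = b"object as put by Proxy B"
etag_a = hashlib.md5(payload_a).hexdigest()
etag_b = hashlib.md5(payload_b).hexdigest()

# A replica holding Proxy A's version while the container lists Proxy B's.
print(audit_object(payload_a, etag_a, etag_b))
# A replica that agrees with the container.
print(audit_object(payload_b, etag_b, etag_b))
```

A replica that fails only the second check is internally intact, so it would need to be replaced from a replica that matches the container rather than quarantined as corrupt.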
However such a solution would still leave a problem. At most one file
can be <timestamp>.data.
If the "retain old versions" option is enabled then the older of the two
timestamp X versions cannot
be retained. If Swift were used as a Document retention system this
would be very undesirable.
If a put is successful the version put should be retained even if it is
not the most recent version.
Supporting full versioning for clients would be a relatively easy
enhancement for Swift, but not if at most one revision at time X can
be retained.
A better solution would be to ensure that only one
<timestamp>.<extension> file can be created,
which involves recognizing that the timestamp is being used as an
increasing version number (albeit not
a monotonically increasing version number).
Without relying on a single server for a given object the best way to
create a unique version number
would be to extend <timestamp> with something that would be different
for two Proxy servers putting
two different versions of the Object, albeit at the same instant:
* A proxy ID. This would preferably be a short configured ID, but any IP
address assigned to the Proxy server
could serve as a unique extension.
* The md5 hash of the payload, or potentially n bits of it.
My reading of the current code is that either extension would co-exist
with the current unextended timestamps; an extended timestamp would simply
sort later than an unextended timestamp with the identical base.
No change would be required to the GET/HEAD logic, only to PUT/POST/DELETE.
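Both extensions could look something like the sketch below. The function names, the "_" separator and the number of hash digits are my own assumptions for illustration. The separator matters: it must sort after "." for the extended names to sort later than the unextended ones (a "-" would sort earlier).

```python
import hashlib


def normalize(timestamp):
    # Fixed-width form, so that string order matches numeric order.
    return "%016.05f" % timestamp


def extend_with_proxy_id(timestamp, proxy_id):
    # proxy_id: a short configured ID unique to each Proxy server.
    return "%s_%s" % (normalize(timestamp), proxy_id)


def extend_with_md5(timestamp, payload, hex_digits=8):
    # Use the first hex_digits hex characters of the payload's md5 hash.
    digest = hashlib.md5(payload).hexdigest()
    return "%s_%s" % (normalize(timestamp), digest[:hex_digits])


now = 1325000000.0
base = normalize(now)

from_proxy_a = extend_with_proxy_id(now, "a1")
from_proxy_b = extend_with_proxy_id(now, "b2")
by_hash = extend_with_md5(now, b"version one of the object")

names = sorted([from_proxy_b + ".data", base + ".ts",
                from_proxy_a + ".data", base + ".data"])
for name in names:
    print(name)
```

With the proxy ID, two proxies can never produce the same file name. With the hash, two proxies produce the same name only when they put identical content, in which case it does not matter which one wins.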