yahoo-eng-team team mailing list archive
Message #83665
[Bug 1884596] Re: image import copy-image will start multiple importing threads due to race condition
Reviewed: https://review.opendev.org/743597
Committed: https://git.openstack.org/cgit/openstack/glance/commit/?id=3f6e349d0853a9746d0d744bc3eb0b2baa1ddff9
Submitter: Zuul
Branch: master
commit 3f6e349d0853a9746d0d744bc3eb0b2baa1ddff9
Author: Dan Smith <dansmith@xxxxxxxxxx>
Date: Tue Jul 28 09:02:13 2020 -0700
Implement time-limited import locking
This attempts to provide a time-based import lock that is dependent
on the task actually making progress. While the task is copying
data, the task message is updated, which in turn touches the task
updated_at time. The API will break any lock after 30 minutes of
no activity on a stalled or dead task. The import taskflow will
check at each step whether it has lost the lock and/or whether its
task status has changed, and will abort if so.
The logic in more detail:
1. API locks the image by task-id before we start the task thread and
before we return
2. Import thread will check the task-id lock on the image every time it
tries to modify the image, and if it has changed, will abort
3. The data pipeline will heartbeat the task every minute by updating
the task.message (bonus: we get some status)
4. If the data pipeline heartbeat ever finds the task state to be changed
from the expected 'processing' it will abort
5. On task revert or completion, we drop the task-id lock from the image
6. If something ever gets stuck or dies, the heartbeating will stop
7. If the API gets a request for an import where the lock is held, it
will grab the task by id (in the lock) and check the state and age.
If the age is sufficiently old (no heartbeating) and the state is
either 'processing' or terminal, it will mark the task as failed,
steal the lock, and proceed.
Lots of logging is added throughout whenever we encounter unexpected situations.
Closes-Bug: #1884596
Change-Id: Icb3c1d27e9a514d96fca7c1d824fd2183f69d8b3
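As a reading aid (not part of the commit), here is a minimal Python sketch of the lock-check and heartbeat pattern the steps above describe. The repository objects and the 'os_glance_import_task' property name are assumptions for illustration, not Glance's exact interfaces:

    import time

    # Numbers from the commit message: heartbeat roughly every minute, and
    # the API may break a lock after 30 minutes of silence.
    HEARTBEAT_INTERVAL = 60
    LOCK_STALE_AFTER = 30 * 60


    class LockLostError(Exception):
        """Raised when another request has taken over our task-id lock."""


    def check_lock(image_repo, image_id, my_task_id):
        """Step 2: abort if the image's lock property no longer names our task."""
        image = image_repo.get(image_id)
        # 'os_glance_import_task' is an assumed property name for illustration.
        if image.extra_properties.get('os_glance_import_task') != my_task_id:
            raise LockLostError('task %s lost the lock on image %s'
                                % (my_task_id, image_id))


    def heartbeat(task_repo, task_id, bytes_copied):
        """Steps 3-4: touch task.message so updated_at advances; abort if our
        state was changed out from under us (e.g. the lock was stolen)."""
        task = task_repo.get(task_id)
        if task.status != 'processing':
            raise LockLostError('task %s is now %s' % (task_id, task.status))
        task.message = 'Copied %d bytes so far' % bytes_copied
        task_repo.save(task)


    def lock_is_stale(task, now=None):
        """Step 7: the API may only break a lock whose task has gone silent."""
        now = time.time() if now is None else now
        return (now - task.updated_at.timestamp()) > LOCK_STALE_AFTER

The key property is that progress keeps the task's updated_at fresh, so a lock can only be stolen from a task that has genuinely stopped heartbeating.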
** Changed in: glance
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to Glance.
https://bugs.launchpad.net/bugs/1884596
Title:
image import copy-image will start multiple importing threads due to
race condition
Status in Glance:
Fix Released
Bug description:
I'm filing this bug a little prematurely because Abhi and I didn't get
a chance to fully discuss it. However, looking at the code and the
behavior I'm seeing due to another bug (1884587), I feel rather
confident.
Especially in a situation where glance is running on multiple control
plane nodes (i.e. any real-world situation), I believe there is a race
condition whereby two closely-timed requests to copy an image to a
store will result in two copy operations in glance proceeding in
parallel. I believe this to be the case due to a common "test-and-set
that isn't atomic" error.
In the API layer, glance checks that an import copy-to-store operation
isn't already in progress here:
https://github.com/openstack/glance/blob/e6db0b10a703037f754007bef6f56451086850cd/glance/api/v2/images.py#L167
And if that passes, it proceeds to set up the task as a thread here:
https://github.com/openstack/glance/blob/e6db0b10a703037f754007bef6f56451086850cd/glance/api/v2/images.py#L197
which may start running immediately or sometime in the future. Once
running, that code updates a property on the image to indicate that
the task is running here:
https://github.com/openstack/glance/blob/e6db0b10a703037f754007bef6f56451086850cd/glance/async_/flows/api_image_import.py#L479-L484
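Reduced to its essentials, that API path is a check-then-act with no atomic update in between. The sketch below is illustrative only, not Glance's actual code:

    class Conflict(RuntimeError):
        """Stand-in for the HTTP 409 the API would return."""


    def import_to_store(image, store, spawn_task):
        # 1. Check: does the image already claim a copy to this store is running?
        importing = image.extra_properties.get('os_glance_importing_to_stores', '')
        if store in importing.split(','):
            raise Conflict('import to %s already in progress' % store)

        # 2. Act: hand the work to a task thread, which only records the store
        #    as "importing" once it actually starts executing.
        spawn_task(image, store)

        # The gap between the check above and the property update inside the
        # task thread is the race window: a second, closely-timed request sees
        # no in-progress marker and spawns a duplicate copy.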
Between those two events, if another API user makes the same request,
glance will not realize that a thread is already running to complete
the initial task and will start another. In a situation where a user
spawns a thousand new instances to a thousand compute nodes in a
single operation where the image needs copying first, it's highly
plausible to have _many_ duplicate glance operations going, impacting
write performance on the rbd cluster at the very least.
As evidence that this can happen, we see an abnormally extended race
window because of the aforementioned bug (1884587) where we fail to
update the property that indicates the task is running. In a test, we
see a large number of these import threads get started, followed by a cascade of
failures when they fail to update that image property, implying that
many such threads are running. If this situation is allowed to happen
when the property does *not* fail to update, I believe we would end up
with glance copying the image to the destination in multiple threads
simultaneously. That is much harder to simulate in practice in a
development environment, but the other bug makes it happen every time
since we never update the image property to prevent it and thus the
window is long.
Abhi also brought up the case where if this race occurs on the same
node, the second attempt *may* actually start copying the partial
image in the staging directory to the destination, finish early, and
then mark the image as "copied to $store" such that nova will attempt
to use the partial image immediately, resulting in a corrupted disk
and various levels of failure after that. Note that it's not clear if
that's really possible or not, but I'm putting it here so the glance
gurus can validate.
The use of the os_glance_importing_to_stores property to "lock" a copy
to a particular store is good, except that the way that list is
updated means the loser of the race described above has nothing to
check after its update to see that it lost. I don't see any checks in
the persistence layer to ensure that an UPDATE to the row with this
property doesn't already have a given store in it, or to do any kind
of merge. This also leads me to worry that two parallel requests to
copy an image to two different stores may clobber each other's list of
stores-in-progress, and potentially also the final list of stores at
rest. This is just conjecture at this point; I just haven't seen
anywhere that this situation is accounted for.
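One way to give the race loser something to check, sketched here purely as an illustration (the table and column names are assumed from Glance's image_properties schema, and this is not a proposed patch), is to fold the test into the UPDATE itself so only one concurrent request can claim a given store:

    from sqlalchemy import text


    def claim_store(session, image_id, store):
        """Atomically append `store` to the in-progress list unless present."""
        result = session.execute(
            text(
                # Putting the precondition in the WHERE clause makes the UPDATE
                # itself the test-and-set; only one concurrent request can match.
                # (String concatenation syntax varies by database backend.)
                "UPDATE image_properties "
                "SET value = CASE WHEN value = '' THEN :store "
                "                 ELSE value || ',' || :store END "
                "WHERE image_id = :image_id "
                "  AND name = 'os_glance_importing_to_stores' "
                "  AND (',' || value || ',') NOT LIKE :pattern"
            ),
            {'image_id': image_id, 'store': store,
             'pattern': '%,' + store + ',%'},
        )
        return result.rowcount == 1

A caller that gets False back knows another request already claimed the store and can return a conflict instead of starting a duplicate copy.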
To manage notifications about this bug go to:
https://bugs.launchpad.net/glance/+bug/1884596/+subscriptions