[Bug 1784020] [NEW] Shared storage providers are not supported and will break things if used
Public bug reported:
https://review.openstack.org/#/c/560459/ in Rocky changed the libvirt
driver such that if the compute node provider is in a shared storage
provider aggregate relationship (that is, in the same aggregate as a
resource provider that has DISK_GB inventory and the
MISC_SHARES_VIA_AGGREGATE trait), the compute node provider won't report
DISK_GB inventory.
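For illustration, here is a minimal Python sketch of that detection
logic (not the actual libvirt driver code; the data structures are
assumptions made up for the example):

    def compute_node_reports_disk_gb(compute_rp, providers_by_aggregate):
        """Return True if the compute node provider should report DISK_GB.

        compute_rp: dict with 'aggregates', a list of aggregate UUIDs.
        providers_by_aggregate: dict mapping aggregate UUID -> list of
            provider dicts, each with 'inventories' and 'traits'.
        """
        for agg_uuid in compute_rp['aggregates']:
            for rp in providers_by_aggregate.get(agg_uuid, []):
                if ('DISK_GB' in rp['inventories']
                        and 'MISC_SHARES_VIA_AGGREGATE' in rp['traits']):
                    # A sharing provider owns the disk, so the compute
                    # node provider must not also report DISK_GB.
                    return False
        return True

    # Example: a shared storage provider in the same aggregate suppresses
    # DISK_GB reporting on the compute node provider.
    shared = {'inventories': {'DISK_GB': 1000},
              'traits': {'MISC_SHARES_VIA_AGGREGATE'}}
    cn = {'aggregates': ['agg1']}
    assert not compute_node_reports_disk_gb(cn, {'agg1': [shared]})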
There are at least two major issues with this:
1. On upgrade from Queens, any existing allocations against the compute
node provider's DISK_GB inventory will prevent removal of the DISK_GB
inventory from the compute node provider during the
update_available_resource periodic task, since placement refuses to
delete inventory that still has allocations against it. In other words,
we have no data migration routine in place to move DISK_GB allocations
from the compute node provider to the shared storage provider in Rocky.
A sketch of what such a routine might look like follows.
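This hypothetical routine is only a sketch of the missing piece; the
endpoint, token and UUIDs are assumptions, and the request/response
shapes follow placement API microversion 1.12:

    import requests

    PLACEMENT = 'http://placement.example.com'  # assumed endpoint
    HEADERS = {
        'X-Auth-Token': 'ADMIN_TOKEN',          # assumed credential
        'OpenStack-API-Version': 'placement 1.12',
    }

    def move_disk_gb(consumer_uuid, cn_rp_uuid, shared_rp_uuid):
        """Rewrite one consumer's allocations so its DISK_GB moves from
        the compute node provider to the shared storage provider."""
        url = '%s/allocations/%s' % (PLACEMENT, consumer_uuid)
        body = requests.get(url, headers=HEADERS).json()
        allocs = body['allocations']
        disk = allocs.get(cn_rp_uuid, {}).get('resources', {}).pop(
            'DISK_GB', None)
        if disk is None:
            return  # nothing to migrate for this consumer
        # Drop the per-provider generations from the GET response, and any
        # provider left with no resources; the PUT body takes neither.
        put_allocs = {rp: {'resources': a['resources']}
                      for rp, a in allocs.items() if a['resources']}
        put_allocs.setdefault(shared_rp_uuid, {'resources': {}})
        put_allocs[shared_rp_uuid]['resources']['DISK_GB'] = disk
        requests.put(url, headers=HEADERS, json={
            'allocations': put_allocs,
            'project_id': body['project_id'],
            'user_id': body['user_id'],
        }).raise_for_status()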
2. During a move operation, we move the instance's allocations from the
source compute node provider to the migration record, then go through
the scheduler to pick a dest host for the instance and allocate
resources against the dest host (and, where applicable, a shared storage
provider). So:
a) The DISK_GB allocation from the instance to the shared storage
provider is deleted for a short window of time during scheduling until
we pick a dest host.
https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/tasks/migrate.py#L57
b) If the cold migration fails or is reverted, we delete the
allocations created by the scheduler and move the allocations from the
migration record (against the source node provider) back to the
instance. But because we failed to move the instance's DISK_GB
allocation against the sharing provider to the migration record in the
first place, that DISK_GB allocation is lost when copying the
allocations back to the instance on revert/failure (a minimal sketch of
the loss follows below):
https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/manager.py#L4155
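Here is the whole lost-allocation sequence from 2a and 2b boiled down to
plain dicts standing in for placement records (the provider names are
made up):

    SRC_CN = 'src-compute-rp'
    SHARED = 'shared-storage-rp'

    # Before the move: the instance holds VCPU/MEMORY_MB on the compute
    # node provider and DISK_GB on the sharing provider.
    instance_allocs = {
        SRC_CN: {'VCPU': 2, 'MEMORY_MB': 2048},
        SHARED: {'DISK_GB': 20},
    }

    # 2a: conductor swaps the instance's allocations to the migration
    # record, but only against the source compute node provider; the
    # sharing provider's DISK_GB is simply deleted.
    migration_allocs = {SRC_CN: instance_allocs.pop(SRC_CN)}
    instance_allocs.pop(SHARED)  # DISK_GB allocation is lost here

    # 2b: on revert/failure the migration record's allocations are copied
    # back to the instance -- there is no DISK_GB entry left to restore.
    instance_allocs = migration_allocs
    assert SHARED not in instance_allocs  # the disk usage has leaked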
--
We could also have issues with forced live migration:
https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/tasks/live_migrate.py#L109
and evacuate:
https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L868
since both bypass the scheduler altogether, so we're potentially not
handling shared provider allocations there either. A sketch of what
those paths would need to do follows.
Also, we don't have *any* shared storage provider CI jobs set up. A
start to that is here:
https://review.openstack.org/#/c/586363/
But that's just a single-node job at the moment, and we'd need a multi-
node shared storage CI job to really say we support shared storage
providers as a feature in nova.
** Affects: nova
Importance: High
Status: Triaged
** Tags: libvirt placement rocky-rc-potential shared-storage