yahoo-eng-team team mailing list archive
Message #88844
[Bug 1939920] Re: Compute node deletes itself if rebooted without DNS
** No longer affects: nova
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1939920
Title:
Compute node deletes itself if rebooted without DNS
Status in OpenStack Nova Compute Charm:
New
Bug description:
Reproduced on: bionic-queens, focal-wallaby
A normally running nova-compute service with instances can have its DB
severely damaged when its FQDN changes due to external factors that
may be beyond the operator's control and always have some chance of
happening, such as a network outage or a DNS server issue.
What happens is that the code at [0] deletes the compute node entry in
the nova.compute_nodes table because the FQDN is "different" when such
an external problem happens. In fact, it changes from
"juju-b93c20-bq-6.maas" to "juju-b93c20-bq-6", whereas the short name
"juju-b93c20-bq-6" saved in the "host" field of the nova.compute_nodes
table is unchanged. I believe the "host" field could be used to
prevent this issue.
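A minimal sketch of the kind of check suggested above, comparing only
the short host name so that a missing domain suffix does not make the
node look like a different machine. The helper names short_name and
is_same_host are hypothetical and are not part of nova.

```python
def short_name(name: str) -> str:
    """Return the label before the first dot, lower-cased."""
    return name.split(".", 1)[0].lower()


def is_same_host(reported_name: str, stored_host: str) -> bool:
    """Hypothetical guard: treat two names as the same machine when their
    short host names match, even if the domain suffix differs or is gone."""
    return short_name(reported_name) == short_name(stored_host)


# Values taken from the bug description.
print(is_same_host("juju-b93c20-bq-6.maas", "juju-b93c20-bq-6"))
print(is_same_host("juju-b93c20-bq-6", "juju-b93c20-bq-6"))
```

A check like this would not help if two different machines shared a
short name in different domains, so it is only an illustration of the
idea, not a proposed fix.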
So because the FQDN is different, the nova-compute service believes it
is a different service and that the old registered one is an orphan,
and then a cascading series of mistakes follows:
1) Deletes itself from the nova.compute_nodes table
*2) Deletes the allocations from the old resource provider in nova_api/placement.allocations
*3) Deletes the resource provider in nova_api/placement.resource_providers
4) Registers a new compute node in nova.compute_nodes
5) Registers a new empty resource provider in nova_api/placement.resource_providers
* In queens my compute service was successfully able to perform those
steps, but in wallaby I got the following errors, under the same
circumstances.
2021-08-13 19:37:08.636 3300 DEBUG nova.scheduler.client.report [req-c36eee09-c105-4877-b33b-76944f7ace89 - - - - -] Cannot delete allocation for ['581fdcc1-0a47-4dc4-8598-a6ae4fb13a9f'] consumer in placement as consumer does not exist delete_allocation_for_instance /usr/lib/python3/dist-packages/nova/scheduler/client/report.py:2100
2021-08-13 19:37:08.685 3300 ERROR nova.scheduler.client.report [req-c36eee09-c105-4877-b33b-76944f7ace89 - - - - -] [req-a52e1950-e3a3-4985-bc61-9080ba41afcb] Failed to delete resource provider with UUID fcbe200d-bf36-49d4-822a-0f11be3cc392 from the placement API. Got 409: {"errors": [{"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n Unable to delete resource provider fcbe200d-bf36-49d4-822a-0f11be3cc392: Resource provider has allocations. ", "request_id": "req-a52e1950-e3a3-4985-bc61-9080ba41afcb"}]}.
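The five steps and the 409 failure can be replayed with a toy
in-memory model. This is only a sketch built from the behaviour
described in this report; it is not nova or placement code, and the
class ToyDeployment, the function replay and the flag
provider_delete_fails_with_allocations are all hypothetical names.

```python
import uuid


class ProviderCreationFailed(Exception):
    """Toy stand-in for nova.exception.ResourceProviderCreationFailed."""


class ToyDeployment:
    """In-memory stand-in for the three tables named in the report."""

    def __init__(self, provider_delete_fails_with_allocations: bool):
        # True models the 409 seen in wallaby, False the queens behaviour.
        self.strict = provider_delete_fails_with_allocations
        self.compute_nodes = {}  # node name -> compute node uuid
        self.providers = {}      # provider uuid -> provider name
        self.allocations = {}    # instance -> provider uuid

    def boot_instance(self, instance: str, node_name: str) -> None:
        self.allocations[instance] = self.compute_nodes[node_name]

    def start_compute(self, reported_name: str) -> None:
        # Every node registered under another name is treated as an orphan.
        for old_name in [n for n in self.compute_nodes if n != reported_name]:
            old_uuid = self.compute_nodes.pop(old_name)           # step 1
            if self.strict and old_uuid in self.allocations.values():
                continue  # steps 2-3 fail, the old provider is left behind
            self.allocations = {i: p for i, p in self.allocations.items()
                                if p != old_uuid}                 # step 2
            self.providers.pop(old_uuid, None)                    # step 3
        if reported_name not in self.compute_nodes:
            new_uuid = str(uuid.uuid4())
            self.compute_nodes[reported_name] = new_uuid          # step 4
            if reported_name in self.providers.values():
                raise ProviderCreationFailed(reported_name)
            self.providers[new_uuid] = reported_name              # step 5


def replay(strict: bool):
    """Replay the FQDN flip from the report; return (deployment, error)."""
    fqdn, short = "juju-b93c20-bq-6.maas", "juju-b93c20-bq-6"
    deployment, error = ToyDeployment(strict), None
    try:
        deployment.start_compute(fqdn)        # normal first start
        deployment.boot_instance("vm1", fqdn)
        deployment.start_compute(short)       # restart without DNS
        deployment.boot_instance("vm2", short)
        deployment.start_compute(fqdn)        # restart with DNS restored
    except ProviderCreationFailed as exc:
        error = exc
    return deployment, error
```

Calling replay(False) ends with both instances present but no
allocations left, while replay(True) ends with a
ProviderCreationFailed error on the final restart because the stale
provider still holds the original name.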
The series of cascading issues continues: after step (5) above the
node behaves "normally", so the customer creates more instances. When
the node is later restarted, it reverts to its old FQDN and the
problem repeats, though a bit differently in queens and wallaby:
wallaby: It fails to re-create the resource provider, as it had not
successfully deleted the old one.
nova.exception.ResourceProviderCreationFailed: Failed to create
resource provider juju-f61af6-fw-8.maas. Therefore at this point it is
no longer able to create instances on this node because Placement will
no longer report it as a candidate (as it is not registered with its
new compute_node uuid).
queens: It repeats steps 1-5, so new VMs get their allocations deleted
as well, and the node is functional after another restart with its
FQDN restored.
So in queens it is usable after FQDN is restored, while in wallaby it
is not, and in both cases DB surgery is needed to fix all
inconsistencies.
In the end, this issue is very annoying and it causes a lot of
inconsistencies in the DB that need to be repaired through DB surgery,
for such an external problem that is sometimes beyond control and has
some chance of happening.
I've seen this happen many times with customers but hadn't been able
to pinpoint the root cause. I used to notice many allocation issues
(more specifically, instances running without allocations) only a long
time after the FQDN problem had happened. By then the customer had
already made many different changes to restore functionality, unaware
that allocations were inconsistent. This later caused other problems,
such as not being able to properly create instances, as a consequence
of the missing allocation entries in the
nova_api/placement.allocations DB table.
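The inconsistency described above (instances running without
allocations) comes down to a set difference. The sketch below assumes
the operator has already collected two lists of instance UUIDs, one of
running instances and one of allocation consumers; how those lists are
obtained is left out, and the function name find_allocation_drift is
hypothetical.

```python
def find_allocation_drift(running_instances, allocation_consumers):
    """Compare two collections of instance UUIDs.

    Returns a pair of sorted lists:
    - instances that are running but have no allocation
    - allocation consumers with no matching running instance
    """
    running = set(running_instances)
    consumers = set(allocation_consumers)
    return sorted(running - consumers), sorted(consumers - running)


# Illustrative identifiers only, not taken from a real deployment.
missing, stale = find_allocation_drift(
    ["vm-a", "vm-b", "vm-c"],
    ["vm-b", "vm-old"],
)
print("running without allocations:", missing)
print("allocations without instance:", stale)
```

Running a comparison like this right after any FQDN change would
surface the damage before further instances are created on the node.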
[0] https://github.com/openstack/nova/blob/b0099aa8a28a79f46cfc79708dcd95f07c1e685f/nova/compute/manager.py#L9997
Steps to reproduce:
===================
Variation 1
~~~~~~~~~~~
- edit /etc/hosts
- add your IP, FQDN and hostname similar to example below
10.5.0.134 juju-b93c20-bq-6.maas5 juju-b93c20-bq-6
Edit the FQDN to make it slightly different (in this example the
correct value was maas; I changed it to maas5)
- restart nova-compute service
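Why the /etc/hosts edit in Variation 1 changes the FQDN but not the
short name: in an /etc/hosts entry the first name after the address is
the canonical name and the remaining names are aliases. The sketch
below only parses the example line to make that split visible; the
helper parse_hosts_line is hypothetical and does not query the system
resolver.

```python
def parse_hosts_line(line: str):
    """Split one /etc/hosts entry into (ip, canonical_name, aliases)."""
    fields = line.split("#", 1)[0].split()  # drop any trailing comment
    if len(fields) < 2:
        raise ValueError(f"not a hosts entry: {line!r}")
    return fields[0], fields[1], fields[2:]


# The edited line from the reproduction steps.
ip, canonical, aliases = parse_hosts_line(
    "10.5.0.134 juju-b93c20-bq-6.maas5 juju-b93c20-bq-6"
)
print(canonical)  # the name that changed (maas -> maas5)
print(aliases)    # the short name, which stays the same
```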
Variation 2
~~~~~~~~~~~
- edit your network configuration to change from DHCP to a static IP; make sure not to include DNS or a gateway, just the IP and subnet mask
- reboot node
To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-nova-compute/+bug/1939920/+subscriptions