yahoo-eng-team team mailing list archive

[Bug 1939920] Re: Compute node deletes itself if rebooted without DNS

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Rodrigo Barbieri <1939920@xxxxxxxxxxxxxxxxxx>
Date: Wed, 11 May 2022 17:10:28 -0000
Reply-to: Bug 1939920 <1939920@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx
** No longer affects: nova

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1939920

Title:
  Compute node deletes itself if rebooted without DNS

Status in OpenStack Nova Compute Charm:
  New

Bug description:
  Reproduced on: bionic-queens, focal-wallaby

  A normally running nova-compute service with instances can have its DB
  records badly damaged when its FQDN changes due to external factors
  that may be beyond the operator's control and always have some chance
  of happening, such as a network outage or a DNS server issue.

  What happens is that the code at [0] deletes the compute node entry in
  the nova.compute_nodes table because the FQDN is "different" when such
  an external problem happens. In this case, it changes from
  "juju-b93c20-bq-6.maas" to "juju-b93c20-bq-6", whereas the short
  hostname "juju-b93c20-bq-6" is unchanged and is saved in the "host"
  field of the nova.compute_nodes table. I believe this could be used to
  prevent this issue.
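
  As an illustration of the idea above, here is a minimal sketch of a
  guard that compares only the short hostname. It is not nova code: the
  function names short_name and looks_like_same_host are hypothetical,
  and it assumes the stored "host" value and the newly reported node
  name differ only by their domain suffix. Such a check would not suit
  setups where one service manages several differently named nodes.

```python
def short_name(name: str) -> str:
    """Return the first DNS label of a hostname or FQDN, lowercased."""
    return name.split(".", 1)[0].lower()


def looks_like_same_host(db_host: str, reported_nodename: str) -> bool:
    """Return True when both names share the same first DNS label.

    Hypothetical guard: db_host stands for the "host" field stored in
    nova.compute_nodes, reported_nodename for the name seen after the
    restart. Only the part before the first dot is compared.
    """
    return short_name(db_host) == short_name(reported_nodename)
```

  With the names from this report,
  looks_like_same_host("juju-b93c20-bq-6", "juju-b93c20-bq-6.maas")
  is true, so a cleanup routine using such a guard could skip the
  deletion instead of treating the old entry as an orphan.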

  So because the FQDN is different, the nova-compute service believes it
  is a different service and that the previously registered one is an
  orphan, and then a cascading series of mistakes follows:

  1) Deletes itself from the nova.compute_nodes table
  *2) Deletes the allocations from the old resource provider in nova_api/placement.allocations
  *3) Deletes the resource provider in nova_api/placement.resource_providers
  4) Registers a new compute node in nova.compute_nodes
  5) Registers a new empty resource provider in nova_api/placement.resource_providers

  * In queens my compute service successfully performed those steps,
  but in wallaby I got the following errors under the same
  circumstances:

  2021-08-13 19:37:08.636 3300 DEBUG nova.scheduler.client.report [req-c36eee09-c105-4877-b33b-76944f7ace89 - - - - -] Cannot delete allocation for ['581fdcc1-0a47-4dc4-8598-a6ae4fb13a9f'] consumer in placement as consumer does not exist delete_allocation_for_instance /usr/lib/python3/dist-packages/nova/scheduler/client/report.py:2100
  2021-08-13 19:37:08.685 3300 ERROR nova.scheduler.client.report [req-c36eee09-c105-4877-b33b-76944f7ace89 - - - - -] [req-a52e1950-e3a3-4985-bc61-9080ba41afcb] Failed to delete resource provider with UUID fcbe200d-bf36-49d4-822a-0f11be3cc392 from the placement API. Got 409: {"errors": [{"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n Unable to delete resource provider fcbe200d-bf36-49d4-822a-0f11be3cc392: Resource provider has allocations.  ", "request_id": "req-a52e1950-e3a3-4985-bc61-9080ba41afcb"}]}.
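
  The five steps listed above can be modelled with a toy in-memory
  sketch. This is not nova or placement code: the ToyCloud class, its
  methods and the dictionaries standing in for the three tables are all
  hypothetical, and it models only the queens behaviour where every
  step succeeds.

```python
import uuid


class ToyCloud:
    """Toy stand-in for the tables touched by steps 1-5 (single host)."""

    def __init__(self):
        self.compute_nodes = {}       # node name -> compute node UUID
        self.resource_providers = {}  # compute node UUID -> node name
        self.allocations = {}         # instance ID -> compute node UUID

    def start_compute(self, reported_name):
        """Simulate a service start that reports `reported_name`."""
        # Any stored node with a different name is treated as an orphan.
        for name in list(self.compute_nodes):
            if name != reported_name:
                node_uuid = self.compute_nodes.pop(name)          # step 1
                for inst, rp in list(self.allocations.items()):
                    if rp == node_uuid:
                        del self.allocations[inst]                # step 2
                self.resource_providers.pop(node_uuid, None)      # step 3
        # Register a fresh node and an empty provider if none matches.
        if reported_name not in self.compute_nodes:
            node_uuid = str(uuid.uuid4())
            self.compute_nodes[reported_name] = node_uuid         # step 4
            self.resource_providers[node_uuid] = reported_name    # step 5
        return self.compute_nodes[reported_name]

    def boot_instance(self, node_name, instance_id):
        """Record an allocation for an instance against a node."""
        self.allocations[instance_id] = self.compute_nodes[node_name]
```

  Starting the toy service as "juju-b93c20-bq-6.maas", booting one
  instance, and then starting it again as "juju-b93c20-bq-6" leaves the
  allocations dictionary empty even though the instance would still be
  running on the host.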

  The cascade continues: after step (5) above the node behaves
  "normally", so the customer creates more instances. When the node is
  later restarted, it reverts to its old FQDN and the problem repeats,
  although a bit differently in queens and wallaby:

  wallaby: It fails to re-create the resource provider, as it had not
  successfully deleted the old one:
  nova.exception.ResourceProviderCreationFailed: Failed to create
  resource provider juju-f61af6-fw-8.maas
  At this point it is no longer possible to create instances on this
  node because Placement will no longer report it as a candidate (as it
  is not registered with its new compute_node uuid).

  queens: It repeats steps 1-5, so new VMs get their allocations deleted
  as well, and the node is functional after another restart with its
  FQDN restored.

  So in queens the node is usable after the FQDN is restored, while in
  wallaby it is not, and in both cases DB surgery is needed to fix all
  inconsistencies.

  In the end, this issue is very annoying and it causes a lot of
  inconsistencies in the DB that need to be repaired through DB surgery,
  for such an external problem that is sometimes beyond control and has
  some chance of happening.

  I've seen this happen many times with customers but hadn't been able
  to pinpoint the root cause, because I would only notice a lot of
  allocation issues (more specifically, instances running without
  allocations) long after the FQDN problem had happened. By then the
  customer had already made many different changes to restore
  functionality, unaware that allocations were inconsistent, and later
  hit other problems, such as not being able to properly create
  instances, as a consequence of the missing entries in the
  nova_api/placement.allocations DB table.
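
  The "instances running without allocations" symptom can be checked
  with a simple set difference. The sketch below assumes the operator
  has already exported two lists by whatever means are available: the
  UUIDs of running instances and the consumer IDs present in the
  allocations table. The function and parameter names are hypothetical.

```python
def instances_missing_allocations(running_instance_uuids,
                                  allocation_consumer_uuids):
    """Return the sorted instance UUIDs that have no allocation entry.

    Both arguments are plain iterables of UUID strings; how they are
    obtained is outside the scope of this sketch.
    """
    missing = set(running_instance_uuids) - set(allocation_consumer_uuids)
    return sorted(missing)
```

  Any UUID returned here corresponds to an instance that is running but
  has no matching allocation entry, which is the inconsistency
  described above.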

  
  [0] https://github.com/openstack/nova/blob/b0099aa8a28a79f46cfc79708dcd95f07c1e685f/nova/compute/manager.py#L9997

  Steps to reproduce:
  ===================

  Variation 1
  ~~~~~~~~~~~
  - edit /etc/hosts
  - add your IP, FQDN and hostname, similar to the example below

  10.5.0.134 juju-b93c20-bq-6.maas5 juju-b93c20-bq-6

  Edit the FQDN to make it slightly different (in this example the
  correct domain was maas; I changed it to maas5)

  - restart nova-compute service

  Variation 2
  ~~~~~~~~~~~
  - edit your network configuration to change from DHCP to a static IP, making sure not to include DNS or a gateway, just the IP and subnet mask
  - reboot node
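
  Before and after either variation, the names the operating system
  resolves can be inspected with a small sketch like the one below. It
  uses only the Python standard library and is not necessarily the
  lookup that nova or the hypervisor driver performs; the name_report
  helper is hypothetical.

```python
import socket


def name_report():
    """Return the local hostname, the resolved FQDN, and a domain flag."""
    hostname = socket.gethostname()
    fqdn = socket.getfqdn()
    return {
        "hostname": hostname,
        "fqdn": fqdn,
        # False when the resolver returns a name without any dot,
        # for example when only the bare hostname is available.
        "fqdn_has_domain": "." in fqdn,
    }
```

  If the "fqdn" value changes between runs while "hostname" stays the
  same, the node is in the situation described in this report.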

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-nova-compute/+bug/1939920/+subscriptions
References

[Bug 1939920] [NEW] Compute node deletes itself if rebooted without DNS
From: Rodrigo Barbieri, 2021-08-13