Message #87074
[Bug 1941861] Re: Retired and disabled compute service preventing start of all compute services after upgrade (TooOldComputeService exception)
One of the goals of utils.raise_if_old_compute() is exactly to detect
that you have old computes in your environment. Based on the logs, you had
a really old compute record. You could not delete it because it predates
placement, so there is no corresponding resource provider in
the placement database. I would say this is all expected. Nova does not
support (and never has supported) environments with controllers on version N
while a compute is older than N-1. Hence your state was basically
unsupported. Therefore I am marking this bug as Invalid. Feel free to
reopen it if you disagree.
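To make the check concrete, here is a minimal sketch of the idea, not the actual nova code. The constant value 52 and the reported minimum of 35 come from the log in the bug description; the function signature and the stand-in exception class are assumptions made for illustration only.

```python
# Simplified illustration only; NOT the real nova.utils.raise_if_old_compute().
# 52 is the "oldest supported service level" quoted in the log for Victoria.
OLDEST_SUPPORTED_SERVICE_VERSION = 52


class TooOldComputeService(Exception):
    """Stand-in for nova.exception.TooOldComputeService."""


def raise_if_old_compute(service_versions):
    """Raise if any recorded compute service is below the supported floor.

    service_versions is a hypothetical input: one integer service version
    per nova-compute record. Nova reads these from its own database.
    """
    versions = list(service_versions)
    if not versions:
        return  # no compute records, nothing to check
    minimum = min(versions)
    if minimum < OLDEST_SUPPORTED_SERVICE_VERSION:
        raise TooOldComputeService(
            f"minimum compute service level is {minimum}, "
            f"oldest supported is {OLDEST_SUPPORTED_SERVICE_VERSION}"
        )
```

With this model, a single stale record at version 35 is enough to make the check fail, even if every live compute reports 53.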
In your environment, regardless of the utils.raise_if_old_compute()
check, the RPC between controllers and computes was pinned to a very
old version because the old, inactive compute was still recorded in the DB.
This probably caused performance issues or even bugs. The right way to
fix it is to remove the old compute, as you did. You are now in a much
cleaner state than before.
Regarding "get_minimum_version_all_cells does not check disabled
services": a disabled compute still exists and might be
re-enabled at any time. If we ignored its version, the
RPC would be pinned to a higher version, and once the disabled compute
was re-enabled it would not be able to communicate because the RPC
version would be too high for it. So no, we cannot ignore disabled computes. If a
compute is not needed by a deployment, it should be deleted, not just
disabled.
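The pinning argument can be shown with a toy model. This is only a sketch: the ComputeService class, the rpc_pin() and can_understand() helpers, and the host names are all hypothetical, and real RPC version negotiation in nova is more involved. The versions 35 and 53 are the ones mentioned in the report.

```python
from dataclasses import dataclass


@dataclass
class ComputeService:
    """Hypothetical, reduced view of a compute service record."""
    host: str
    version: int
    disabled: bool = False


def rpc_pin(services, include_disabled=True):
    """Return the version the RPC layer would be capped at (the minimum)."""
    candidates = [
        s.version for s in services if include_disabled or not s.disabled
    ]
    return min(candidates)


def can_understand(service, pin):
    """A service can only handle messages at or below its own version."""
    return service.version >= pin


services = [
    ComputeService("compute-1", 53),
    ComputeService("compute-2", 53),
    ComputeService("compute-old", 35, disabled=True),
]

# Counting disabled services keeps the pin low enough for everyone.
safe_pin = rpc_pin(services)
# Ignoring them raises the pin above what the disabled host can handle.
unsafe_pin = rpc_pin(services, include_disabled=False)

print(safe_pin, all(can_understand(s, safe_pin) for s in services))
print(unsafe_pin, can_understand(services[2], unsafe_pin))
```

In this model the only way to raise the pin safely is to delete the old record, which matches the advice above.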
** Tags added: compute upgrade
** Changed in: nova
Status: New => Invalid
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1941861
Title:
Retired and disabled compute service preventing start of all compute
services after upgrade (TooOldComputeService exception)
Status in OpenStack Compute (nova):
Invalid
Bug description:
Description
===========
After we upgraded our cluster to wallaby on Ubuntu 20.04, the compute services were down on all compute servers, as well as nova-scheduler and nova-conductor.
When checking the log, the error was:
2021-08-26 19:26:20.932 59357 CRITICAL nova [req-a8c9b7b2-f4d2-4b68-aeb2-8fb96e9ffb20 - - - - -] Unhandled error: nova.exception.TooOldComputeService: Current Nova version does not support computes older than Victoria but the minimum compute service level in your system is 35 and the oldest supported service level is 52.
2021-08-26 19:26:20.932 59357 ERROR nova Traceback (most recent call last):
2021-08-26 19:26:20.932 59357 ERROR nova File "/usr/bin/nova-conductor", line 10, in <module>
2021-08-26 19:26:20.932 59357 ERROR nova sys.exit(main())
2021-08-26 19:26:20.932 59357 ERROR nova File "/usr/lib/python3/dist-packages/nova/cmd/conductor.py", line 45, in main
2021-08-26 19:26:20.932 59357 ERROR nova server = service.Service.create(binary='nova-conductor',
2021-08-26 19:26:20.932 59357 ERROR nova File "/usr/lib/python3/dist-packages/nova/service.py", line 264, in create
2021-08-26 19:26:20.932 59357 ERROR nova utils.raise_if_old_compute()
2021-08-26 19:26:20.932 59357 ERROR nova File "/usr/lib/python3/dist-packages/nova/utils.py", line 1101, in raise_if_old_compute
2021-08-26 19:26:20.932 59357 ERROR nova raise exception.TooOldComputeService(
2021-08-26 19:26:20.932 59357 ERROR nova nova.exception.TooOldComputeService: Current Nova version does not support computes older than Victoria but the minimum compute service level in your system is 35 and the oldest supported service level is 52.
2021-08-26 19:26:20.932 59357 ERROR nova
However, we had upgraded all compute servers, so this was a bit unexpected.
The only suspect from "openstack compute service list" was a line with
an old "Updated At" time, but it was on a retired server (no longer in
DNS, and not in the host list). Trying to remove it fails with an
error (which I could report separately):
root@controller:~# openstack compute service delete 21
Failed to delete compute service with ID '21': Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'ValueError'> (HTTP 500) (Request-ID: req-4390ef9c-3a45-44b2-92bc-db1923ffc83a)
1 of 1 compute services failed to delete.
root@controller:~#
logs:
2021-08-27 10:50:10.059 95744 ERROR nova.api.openstack.wsgi [req-4390ef9c-3a45-44b2-92bc-db1923ffc83a 798bbc2bfd3e4123804ea493d6bf2197 0096b32340674f4cb9101354ba6a454c - 4820ac059d2f4f56a4e02d68982b9e71 4820ac059d2f4f56a4e02d68982b9e71] Unexpected exception in API method: ValueError: No such provider 3f7b9fde-9100-471f-bb3b-b65c607b5f84
2021-08-27 10:50:10.059 95744 ERROR nova.api.openstack.wsgi Traceback (most recent call last):
2021-08-27 10:50:10.059 95744 ERROR nova.api.openstack.wsgi File "/usr/lib/python3/dist-packages/nova/api/openstack/wsgi.py", line 658, in wrapped
2021-08-27 10:50:10.059 95744 ERROR nova.api.openstack.wsgi return f(*args, **kwargs)
2021-08-27 10:50:10.059 95744 ERROR nova.api.openstack.wsgi File "/usr/lib/python3/dist-packages/nova/api/openstack/compute/services.py", line 286, in delete
2021-08-27 10:50:10.059 95744 ERROR nova.api.openstack.wsgi self.placementclient.delete_resource_provider(
2021-08-27 10:50:10.059 95744 ERROR nova.api.openstack.wsgi File "/usr/lib/python3/dist-packages/nova/scheduler/client/report.py", line 2257, in delete_resource_provider
2021-08-27 10:50:10.059 95744 ERROR nova.api.openstack.wsgi provider_uuids = self._provider_tree.get_provider_uuids_in_tree(
2021-08-27 10:50:10.059 95744 ERROR nova.api.openstack.wsgi File "/usr/lib/python3/dist-packages/nova/compute/provider_tree.py", line 288, in get_provider_uuids_in_tree
2021-08-27 10:50:10.059 95744 ERROR nova.api.openstack.wsgi return self._find_with_lock(
2021-08-27 10:50:10.059 95744 ERROR nova.api.openstack.wsgi File "/usr/lib/python3/dist-packages/nova/compute/provider_tree.py", line 439, in _find_with_lock
2021-08-27 10:50:10.059 95744 ERROR nova.api.openstack.wsgi raise ValueError(_("No such provider %s") % name_or_uuid)
2021-08-27 10:50:10.059 95744 ERROR nova.api.openstack.wsgi ValueError: No such provider 3f7b9fde-9100-471f-bb3b-b65c607b5f84
2021-08-27 10:50:10.059 95744 ERROR nova.api.openstack.wsgi
2021-08-27 10:50:10.062 95744 INFO nova.api.openstack.wsgi [req-4390ef9c-3a45-44b2-92bc-db1923ffc83a 798bbc2bfd3e4123804ea493d6bf2197 0096b32340674f4cb9101354ba6a454c - 4820ac059d2f4f56a4e02d68982b9e71 4820ac059d2f4f56a4e02d68982b9e71] HTTP exception thrown: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'ValueError'>
Then:
- Removing the utils.raise_if_old_compute() line from the code allowed the
service on that machine to start
- I then checked the database directly, and the only occurrence of "35"
related to services was indeed that disabled/undeletable service.
After a "[nova]> update services set version = 53 where id = 21;" all
services started.
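For readers hitting the same error, a query along the following lines can locate the offending record before anything is changed. This is a sketch under stated assumptions: it runs against an in-memory SQLite table with a reduced, assumed subset of the services columns, and the host names are made up. Only the id 21, the version 35, and the floor of 52 come from this report; the real table lives in the nova database and should be inspected with the deployment's own tooling.

```python
import sqlite3

OLDEST_SUPPORTED = 52  # floor quoted in the TooOldComputeService message

# Stand-in for the nova database; column set is a simplified assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE services ('
    ' id INTEGER PRIMARY KEY, host TEXT, "binary" TEXT,'
    ' version INTEGER, disabled INTEGER, deleted INTEGER)'
)
conn.executemany(
    "INSERT INTO services VALUES (?, ?, ?, ?, ?, ?)",
    [
        (3, "controller", "nova-conductor", 53, 0, 0),
        (21, "retired-host", "nova-compute", 35, 1, 0),  # stale record
        (22, "compute-1", "nova-compute", 53, 0, 0),
    ],
)

# List every non-deleted compute record below the floor, disabled or not.
rows = conn.execute(
    'SELECT id, host, version, disabled FROM services'
    ' WHERE "binary" = ? AND deleted = 0 AND version < ?'
    ' ORDER BY version',
    ("nova-compute", OLDEST_SUPPORTED),
).fetchall()

for service_id, host, version, disabled in rows:
    print(f"id={service_id} host={host} version={version} disabled={disabled}")
```

Note that the filter deliberately does not exclude disabled rows, since those are counted in the minimum version as well.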
Steps to reproduce
==================
Hard to reproduce on a new installation, since it requires a cluster with
some history of old machines and, at some point, a retired server whose
compute service record was left in the database.
Expected result
===============
-> "get_minimum_version_all_cells" does not check disabled services
-> cluster services start normally
and/or
-> information from the log on *where* is the problematic service
-> ability to delete it from commandline
Actual result
=============
-> services do check disabled services on start
-> log message is cryptic: "35" appears nowhere in openstack commands describing services (it's just an internal number; openstack commands apparently convert it to a *date*)
-> Even guessing the issue, it's impossible to delete the service
Environment
===========
cloud-archive:wallaby on standard 20.04 server:
root@controller:~# dpkg -l | grep nova
ii nova-api 3:23.0.1-0ubuntu1~cloud0 all OpenStack Compute - API frontend
rc nova-cert 2:15.1.5-0ubuntu1~cloud0 all OpenStack Compute - certificate management
ii nova-common 3:23.0.1-0ubuntu1~cloud0 all OpenStack Compute - common files
ii nova-conductor 3:23.0.1-0ubuntu1~cloud0 all OpenStack Compute - conductor service
rc nova-consoleauth 2:19.2.0-0ubuntu1~cloud0 all OpenStack Compute - Console Authenticator
ii nova-novncproxy 3:23.0.1-0ubuntu1~cloud0 all OpenStack Compute - NoVNC proxy
rc nova-placement-api 2:19.2.0-0ubuntu1~cloud0 all OpenStack Compute - placement API frontend
ii nova-scheduler 3:23.0.1-0ubuntu1~cloud0 all OpenStack Compute - virtual machine scheduler
ii nova-spiceproxy 3:23.0.1-0ubuntu1~cloud0 all OpenStack Compute - spice html5 proxy
rc python-nova 2:18.2.0-0ubuntu2~cloud0 all OpenStack Compute Python 2 libraries
ii python-novaclient 2:13.0.0-0ubuntu1~cloud0 all client library for OpenStack Compute API - Python 2.7
ii python3-nova 3:23.0.1-0ubuntu1~cloud0 all OpenStack Compute Python 3 libraries
ii python3-novaclient 2:17.4.0-0ubuntu1~cloud0 all client library for OpenStack Compute API - 3.x