yahoo-eng-team team mailing list archive
Message #76949
[Bug 1815697] [NEW] [upgrade_levels]compute=auto grinds the API response times when a cell is down
Public bug reported:
A lot of my notes are in https://review.openstack.org/#/c/591657/ where
I was testing a down cell on a devstack deployment.
To simulate a down cell, I changed the database_connection value for the
cell1 cell to be an invalid IP (192.0.0.1) and then restarted
devstack@n-api.service.
With the default devstack configuration, the service hung while trying to
respond to a simple GET / request to list versions. The problem appears to
be that every nova.compute.api.API object, one of which is created per
route handler in each API worker (2 workers in my case), tries to get the
minimum nova-compute service version across all cells:
https://github.com/openstack/nova/blob/0bed18ffbb46c4f2d0ec87e64a39188c165398eb/nova/compute/api.py#L261
https://github.com/openstack/nova/blob/0bed18ffbb46c4f2d0ec87e64a39188c165398eb/nova/compute/rpcapi.py#L373
https://github.com/openstack/nova/blob/0bed18ffbb46c4f2d0ec87e64a39188c165398eb/nova/compute/rpcapi.py#L395
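To make the fan-out concrete, here is a toy model of the pattern described above. It is only a sketch: FakeCells, FakeComputeAPI and build_handlers are hypothetical stand-ins, not nova's real classes, and the service version 35 is an arbitrary placeholder. It counts how many cross-cell lookups one worker triggers when every route handler builds its own API object.

```python
class FakeCells:
    """Hypothetical stand-in for the cross-cell service version query."""

    def __init__(self, versions, down_cells=()):
        self.versions = versions          # {cell name: min service version}
        self.down_cells = set(down_cells)
        self.lookups = 0                  # number of cross-cell scans so far

    def get_minimum_version(self):
        self.lookups += 1
        # A down cell contributes nothing. In the real service this is
        # the point where the database connection retries would block.
        reachable = [version for cell, version in self.versions.items()
                     if cell not in self.down_cells]
        return min(reachable, default=0)


class FakeComputeAPI:
    """Hypothetical stand-in for the per-handler compute API object."""

    def __init__(self, cells, version_cap="auto"):
        if version_cap == "auto":
            # Each object repeats the lookup; nothing is shared.
            self.min_version = cells.get_minimum_version()
        else:
            # Assumption: a pinned cap needs no cross-cell lookup.
            self.min_version = None


def build_handlers(cells, count, version_cap="auto"):
    """Create one API object per route handler, as one worker would."""
    return [FakeComputeAPI(cells, version_cap) for _ in range(count)]


cells = FakeCells({"cell1": 35}, down_cells={"cell1"})
build_handlers(cells, count=31)
auto_lookups = cells.lookups

pinned_cells = FakeCells({"cell1": 35}, down_cells={"cell1"})
build_handlers(pinned_cells, count=31, version_cap="rocky")
pinned_lookups = pinned_cells.lookups

print(auto_lookups, pinned_lookups)
```

With 'auto', the model performs one lookup per handler, and with the only cell down the minimum comes back as 0. With a pinned cap it performs none, which is the behavior the rocky workaround in this report relies on.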
This is a snip of the API log while waiting for the GET / response:
http://paste.openstack.org/show/744983/
As a result, I got this unhelpful client-side error:
http://paste.openstack.org/show/744984/
I know that's where the failure was because I was also getting this:
Feb 13 00:09:57 downcell devstack@n-api.service[14623]: DEBUG
nova.compute.rpcapi [None req-53ebccae-d210-4b14-af5c-02775f3d36e8 None
None] Not caching compute RPC version_cap, because min service_version
is 0. Please ensure a nova-compute service has been started. Defaulting
to current version. {{(pid=14625) _determine_version_cap
/opt/stack/nova/nova/compute/rpcapi.py:410}}
The minimum nova-compute service version isn't cached in nova-api when
running under uwsgi anyway, for which I reported bug 1815692.
I worked around the issue by setting [upgrade_levels]/compute=rocky, but
that's probably not something we want to rely on: the option can be set to
'auto' so that the code calculates the version cap for us. The catch is
that 'auto' can hang the API workers when a cell is down.
Also note that the default database max_retries and retry_interval are
both 10, which means each API object that hits this takes 100 seconds to
time out, per route handler, per API worker. I count 31 route handlers
that create an API object, so by default that's 3100 seconds, or about
52 minutes, per worker on startup.
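The estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch using the figures quoted in this report; it assumes the lookups run one after another and that each one exhausts every retry before giving up. The constant and function names are made up for illustration.

```python
# Figures quoted in the report; the names are illustrative only.
RETRY_COUNT = 10        # database connection attempts before giving up
RETRY_INTERVAL = 10     # seconds between attempts
ROUTE_HANDLERS = 31     # route handlers that create an API object


def startup_delay_seconds(retry_count, retry_interval, handlers):
    """Worst-case serial delay for one API worker, in seconds."""
    per_object = retry_count * retry_interval   # 100 s per API object
    return per_object * handlers


total = startup_delay_seconds(RETRY_COUNT, RETRY_INTERVAL, ROUTE_HANDLERS)
print(f"{total} s, about {total / 60:.0f} min per worker")
```

Lowering either the retry count or the interval shrinks the delay linearly, but it does not remove the per-handler multiplication.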
** Affects: nova
Importance: Medium
Status: Confirmed
** Tags: api cells performance
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1815697
Title:
[upgrade_levels]compute=auto grinds the API response times when a cell
is down
Status in OpenStack Compute (nova):
Confirmed