yahoo-eng-team team mailing list archive

[Bug 1815697] [NEW] [upgrade_levels]compute=auto grinds the API response times when a cell is down

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Matt Riedemann <mriedem.os@xxxxxxxxx>
Date: Wed, 13 Feb 2019 00:56:29 -0000
Reply-to: Bug 1815697 <1815697@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Public bug reported:

A lot of my notes are in https://review.openstack.org/#/c/591657/ where
I was testing a down cell on a devstack deployment.

To simulate a down cell, I changed the database_connection value for the
cell1 cell to be an invalid IP (192.0.0.1) and then restarted
devstack@n-api.service.

With the default devstack configuration, the service hung while trying
to respond to a simple GET / request to list versions. The problem
appears to be that every route handler creates its own
nova.compute.api.API object (in each API worker, of which I have 2), and
each of those objects tries to get the minimum nova-compute service
version across all cells:

https://github.com/openstack/nova/blob/0bed18ffbb46c4f2d0ec87e64a39188c165398eb/nova/compute/api.py#L261

https://github.com/openstack/nova/blob/0bed18ffbb46c4f2d0ec87e64a39188c165398eb/nova/compute/rpcapi.py#L373

https://github.com/openstack/nova/blob/0bed18ffbb46c4f2d0ec87e64a39188c165398eb/nova/compute/rpcapi.py#L395
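
To illustrate the pattern, here is a toy model. It is not nova's code:
FakeCell, ToyComputeAPI and the version number 35 are made up for the
example. It only shows that when the cross-cell lookup runs in the
constructor, an unreachable cell is queried once per route handler.

```python
# Toy model only; none of these names come from nova.

class FakeCell:
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.lookups = 0

    def min_compute_version(self):
        self.lookups += 1
        # A real unreachable database would block here while retrying.
        # 35 is an arbitrary placeholder version.
        return 35 if self.reachable else None


class ToyComputeAPI:
    def __init__(self, cells):
        # The lookup runs at construction time, once per object.
        versions = [cell.min_compute_version() for cell in cells]
        known = [v for v in versions if v is not None]
        self.min_version = min(known) if known else 0


cells = [FakeCell("cell0"), FakeCell("cell1", reachable=False)]
# One API object per route handler (31 is the count from this report).
handlers = [ToyComputeAPI(cells) for _ in range(31)]
down_cell_lookups = cells[1].lookups
print(down_cell_lookups)  # 31
```

In the toy the failed lookup returns immediately; in the real service
each one waits on the database connection retries.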

This is a snip of the API log while waiting for the GET / response:

http://paste.openstack.org/show/744983/

As a result I got this unhelpful client side error:

http://paste.openstack.org/show/744984/

I know that's where the failure was because I was also getting this:

Feb 13 00:09:57 downcell devstack@n-api.service[14623]: DEBUG
nova.compute.rpcapi [None req-53ebccae-d210-4b14-af5c-02775f3d36e8 None
None] Not caching compute RPC version_cap, because min service_version
is 0. Please ensure a nova-compute service has been started. Defaulting
to current version. {{(pid=14625) _determine_version_cap
/opt/stack/nova/nova/compute/rpcapi.py:410}}
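
The message suggests that a minimum version of 0 is treated as "unknown"
and is not cached, so the next caller repeats the lookup. A minimal
sketch of that behaviour follows; ToyVersionCache and its lookup
argument are hypothetical names, and this is my reading of the log
message rather than a copy of what rpcapi.py does.

```python
# Hypothetical sketch of "do not cache a minimum version of 0".

class ToyVersionCache:
    def __init__(self, lookup):
        self._lookup = lookup  # callable returning the minimum version
        self._cached = None
        self.lookups = 0

    def get(self):
        if self._cached is not None:
            return self._cached
        self.lookups += 1
        version = self._lookup()
        if version == 0:
            # 0 means no compute service was found: return it uncached,
            # so the next call performs the lookup again.
            return version
        self._cached = version
        return version


unknown = ToyVersionCache(lambda: 0)
known = ToyVersionCache(lambda: 35)  # 35 is an arbitrary placeholder
for _ in range(3):
    unknown.get()
    known.get()

print(unknown.lookups, known.lookups)  # 3 1
```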

When running under uwsgi, nova-api doesn't cache the minimum
nova-compute service version anyway; I reported that separately as
bug 1815692.

I worked around the issue by setting [upgrade_levels]/compute=rocky,
but that's probably not something we want to rely on: we should be able
to set it to 'auto' and have the code calculate the version for us. As
it stands, 'auto' can hang the API workers when a cell is down.

Also note that the default database max_attempts and retry_interval are
both 10, which means each API object that hits this takes 100 seconds to
time out, per route handler, per API worker. I count 31 route handlers
that create an API object, so by default that's 3100 seconds, or about
52 minutes, per worker on startup.
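
The arithmetic behind that estimate, as a quick check. The variable
names are illustrative, and the calculation assumes the lookups run one
after another and that each attempt costs only the retry interval.

```python
# Values quoted in this report; names are illustrative.
max_attempts = 10     # connection attempts before giving up
retry_interval = 10   # seconds between attempts
route_handlers = 31   # handlers that create an API object

seconds_per_object = max_attempts * retry_interval
seconds_per_worker = seconds_per_object * route_handlers
minutes_per_worker = seconds_per_worker / 60

print(seconds_per_object, seconds_per_worker, round(minutes_per_worker, 1))
# 100 3100 51.7
```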

** Affects: nova
     Importance: Medium
         Status: Confirmed


** Tags: api cells performance

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1815697

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1815697/+subscriptions