yahoo-eng-team team mailing list archive

[Bug 1844929] Re: grenade jobs failing due to "Timed out waiting for response from cell" in scheduler

 

As Melanie correctly stated, we need to merge the Train backport of the
bugfix for the problem to disappear from grenade jobs, because those
jobs run a train -> master upgrade.

** Changed in: nova
       Status: Confirmed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1844929

Title:
  grenade jobs failing due to "Timed out waiting for response from cell"
  in scheduler

Status in grenade:
  Invalid
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) queens series:
  New
Status in OpenStack Compute (nova) rocky series:
  New
Status in OpenStack Compute (nova) stein series:
  New
Status in OpenStack Compute (nova) train series:
  In Progress

Bug description:
  Seen here:

  https://zuul.opendev.org/t/openstack/build/d53346210978403f888b85b82b2fe0c7/log/logs/screen-n-sch.txt.gz?severity=3#2368

  Sep 22 00:50:54.174385 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: WARNING nova.context [None req-1929039e-1517-4326-9700-738d4b570ba6 tempest-AttachInterfacesUnderV243Test-2009753731 tempest-AttachInterfacesUnderV243Test-2009753731] Timed out waiting for response from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90

  Looks like something is causing timeouts reaching cell1 during grenade
  runs. The only errors I see in the rabbit logs are these for the uwsgi
  (API) servers:

  =ERROR REPORT==== 22-Sep-2019::00:35:30 ===

  closing AMQP connection <0.1511.0> (217.182.141.188:48492 -> 217.182.141.188:5672 - uwsgi:19453:72e08501-61ca-4ade-865e-f0605979ed7d):

  missed heartbeats from client, timeout: 60s

  --

  It looks like we don't have mysql logs in this grenade run; maybe we
  need a fix like this somewhere for grenade:

  https://github.com/openstack/devstack/commit/f92c346131db2c89b930b1a23f8489419a2217dc

  logstash shows 1101 hits in the last 7 days, since Sept 17 actually:

  http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Timed%20out%20waiting%20for%20response%20from%20cell%5C%22%20AND%20tags%3A%5C%22screen-n-sch.txt%5C%22&from=7d

  The hits are in both the check and gate queues, and all are failures.
  It also appears to show up only on fortnebula and OVH nodes, primarily
  fortnebula. I wonder if there is a performance/timing issue: if those
  nodes are slower, we may not be waiting for something during the
  grenade upgrade before proceeding.
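
A rough way to check the per-node pattern described above against a
downloaded screen-n-sch.txt log is to tally the warning by hostname and
cell UUID. The following is a minimal sketch, not part of nova or
grenade: the function name count_cell_timeouts and the regular
expression are hypothetical, and the field layout (timestamp, hostname,
service[pid]) is assumed from the single sample line quoted in this
report.

```python
import re
from collections import Counter

# Assumed layout, taken from the one warning quoted in this bug report:
#   <Mon> <day> <time> <hostname> nova-scheduler[<pid>]: WARNING nova.context ...
TIMEOUT_RE = re.compile(
    r"^\w{3} +\d+ [\d:.]+ (?P<host>\S+) nova-scheduler\[\d+\]: "
    r"WARNING nova\.context .*"
    r"Timed out waiting for response from cell (?P<cell>[0-9a-f-]{36})"
)


def count_cell_timeouts(lines):
    """Return a Counter keyed by (hostname, cell UUID) for matching warnings."""
    hits = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            hits[(match.group("host"), match.group("cell"))] += 1
    return hits


# The warning from this report, joined back into a single line.
SAMPLE = (
    "Sep 22 00:50:54.174385 ubuntu-bionic-ovh-gra1-0011664420 "
    "nova-scheduler[18043]: WARNING nova.context "
    "[None req-1929039e-1517-4326-9700-738d4b570ba6 "
    "tempest-AttachInterfacesUnderV243Test-2009753731 "
    "tempest-AttachInterfacesUnderV243Test-2009753731] "
    "Timed out waiting for response from cell "
    "8acfb79b-2e40-4e1c-bc3d-d404dac6db90"
)

if __name__ == "__main__":
    for (host, cell), count in count_cell_timeouts([SAMPLE]).items():
        print(host, cell, count)
```

The hostname in the sample embeds the node provider ("ovh"), so grouping
by hostname is one way to see whether hits cluster on particular
providers. That naming convention is inferred from this one line only.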

To manage notifications about this bug go to:
https://bugs.launchpad.net/grenade/+bug/1844929/+subscriptions

