
yahoo-eng-team team mailing list archive

[Bug 1844929] Re: grenade jobs failing due to "Timed out waiting for response from cell" in scheduler

 

It seems the problem is still happening even after the fix merged to
master.

Here is a recent occurrence:
https://7ad29d1b700c1da60ae0-1bae5319fe4594ade335a46ad1c3bcc9.ssl.cf2.rackcdn.com/717083/5/check/neutron-grenade-multinode/2be9497/logs/screen-n-sch.txt

** Changed in: nova
       Status: Fix Released => Confirmed

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1844929

Title:
  grenade jobs failing due to "Timed out waiting for response from cell"
  in scheduler

Status in grenade:
  Invalid
Status in OpenStack Compute (nova):
  Confirmed
Status in OpenStack Compute (nova) queens series:
  New
Status in OpenStack Compute (nova) rocky series:
  New
Status in OpenStack Compute (nova) stein series:
  New
Status in OpenStack Compute (nova) train series:
  In Progress

Bug description:
  Seen here:

  https://zuul.opendev.org/t/openstack/build/d53346210978403f888b85b82b2fe0c7/log/logs/screen-n-sch.txt.gz?severity=3#2368

  Sep 22 00:50:54.174385 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: WARNING nova.context [None req-1929039e-1517-4326-9700-738d4b570ba6 tempest-AttachInterfacesUnderV243Test-2009753731 tempest-AttachInterfacesUnderV243Test-2009753731] Timed out waiting for response from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90
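
  To see how often this warning appears in a given screen-n-sch.txt, and
  on which host and for which cell, something like the following rough
  sketch could be used. The regular expression is modeled only on the
  single WARNING line quoted above (as it appears unwrapped in the raw
  log); the names CELL_TIMEOUT_RE and count_cell_timeouts are made up
  for illustration and are not part of nova or grenade.

```python
import re
from collections import Counter

# Assumption: the raw log line looks like the one quoted in this bug,
# i.e. "<timestamp> <host> nova-scheduler[<pid>]: WARNING nova.context ...".
CELL_TIMEOUT_RE = re.compile(
    r"^(?P<timestamp>\w{3} +\d+ [\d:.]+) (?P<host>\S+) nova-scheduler\[\d+\]: "
    r"WARNING nova\.context .*Timed out waiting for response from cell "
    r"(?P<cell>[0-9a-f-]{36})"
)


def count_cell_timeouts(lines):
    """Count cell-timeout warnings per cell UUID and per host name."""
    per_cell = Counter()
    per_host = Counter()
    for line in lines:
        match = CELL_TIMEOUT_RE.search(line)
        if match:
            per_cell[match.group("cell")] += 1
            per_host[match.group("host")] += 1
    return per_cell, per_host


# The warning from this bug, joined back into one line.
SAMPLE = (
    "Sep 22 00:50:54.174385 ubuntu-bionic-ovh-gra1-0011664420 "
    "nova-scheduler[18043]: WARNING nova.context "
    "[None req-1929039e-1517-4326-9700-738d4b570ba6 "
    "tempest-AttachInterfacesUnderV243Test-2009753731 "
    "tempest-AttachInterfacesUnderV243Test-2009753731] "
    "Timed out waiting for response from cell "
    "8acfb79b-2e40-4e1c-bc3d-d404dac6db90"
)

per_cell, per_host = count_cell_timeouts([SAMPLE, "some unrelated line"])
```

  Passing an open file object for a downloaded log instead of the sample
  list should work the same way, since the function only iterates over
  lines.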

  Looks like something is causing timeouts reaching cell1 during grenade
  runs. The only errors I see in the rabbit logs are these for the uwsgi
  (API) servers:

  =ERROR REPORT==== 22-Sep-2019::00:35:30 ===

  closing AMQP connection <0.1511.0> (217.182.141.188:48492 -> 217.182.141.188:5672 - uwsgi:19453:72e08501-61ca-4ade-865e-f0605979ed7d):

  missed heartbeats from client, timeout: 60s

  --

  It looks like we don't have mysql logs in this grenade run; maybe we
  need a fix like this somewhere for grenade:

  https://github.com/openstack/devstack/commit/f92c346131db2c89b930b1a23f8489419a2217dc

  logstash shows 1101 hits in the last 7 days, since Sept 17 actually:

  http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Timed%20out%20waiting%20for%20response%20from%20cell%5C%22%20AND%20tags%3A%5C%22screen-n-sch.txt%5C%22&from=7d
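
  The logstash link above carries its search URL-encoded in the fragment
  part of the URL. As a small sketch using only the Python standard
  library, the query and time window can be decoded like this (the
  variable names are illustrative only):

```python
from urllib.parse import parse_qs, urlsplit

LOGSTASH_URL = (
    "http://logstash.openstack.org/#dashboard/file/logstash.json"
    "?query=message%3A%5C%22Timed%20out%20waiting%20for%20response"
    "%20from%20cell%5C%22%20AND%20tags%3A%5C%22screen-n-sch.txt%5C%22"
    "&from=7d"
)

# Everything after "#" is the fragment; the dashboard parameters follow
# the "?" inside that fragment rather than in the normal query string.
fragment = urlsplit(LOGSTASH_URL).fragment
_, _, raw_params = fragment.partition("?")

# parse_qs percent-decodes the values and returns a list per key.
params = parse_qs(raw_params)
query = params["query"][0]
window = params["from"][0]

print(query)
print(window)
```

  The decoded query matches the warning message text and restricts the
  search to files tagged screen-n-sch.txt, over a 7 day window.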

  The hits are in both the check and gate queues, and all are failures.
  It also appears to show up only on fortnebula and OVH nodes, primarily
  fortnebula. I wonder if there is a performance/timing issue: if those
  nodes are slower, we may not be waiting for something during the
  grenade upgrade before proceeding.

To manage notifications about this bug go to:
https://bugs.launchpad.net/grenade/+bug/1844929/+subscriptions

