yahoo-eng-team team mailing list archive
Message #80104
[Bug 1844929] [NEW] grenade jobs failing due to "Timed out waiting for response from cell" in scheduler
Public bug reported:
Seen here:
https://zuul.opendev.org/t/openstack/build/d53346210978403f888b85b82b2fe0c7/log/logs/screen-n-sch.txt.gz?severity=3#2368
Sep 22 00:50:54.174385 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: WARNING nova.context [None req-1929039e-1517-4326-9700-738d4b570ba6 tempest-AttachInterfacesUnderV243Test-2009753731 tempest-AttachInterfacesUnderV243Test-2009753731] Timed out waiting for response from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90
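To check whether the timeouts all hit the same cell, the warnings in a downloaded scheduler log can be tallied per cell UUID. The following is only a minimal sketch: the helper name `count_cell_timeouts` and the regular expression are illustrative assumptions, not part of nova or grenade, and it assumes each warning sits on a single unwrapped line, as it does in the original log file.

```python
import re
from collections import Counter

# Hypothetical helper, not part of nova: matches the scheduler warning
# quoted above and captures the UUID of the cell that did not respond.
CELL_TIMEOUT_RE = re.compile(
    r"Timed out waiting for response from cell "
    r"(?P<cell>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
)


def count_cell_timeouts(lines):
    """Return a Counter mapping cell UUID -> number of timeout warnings."""
    counts = Counter()
    for line in lines:
        match = CELL_TIMEOUT_RE.search(line)
        if match:
            counts[match.group("cell")] += 1
    return counts


# The warning from this bug report, joined back onto one line.
SAMPLE = (
    "Sep 22 00:50:54.174385 ubuntu-bionic-ovh-gra1-0011664420 "
    "nova-scheduler[18043]: WARNING nova.context [None "
    "req-1929039e-1517-4326-9700-738d4b570ba6 "
    "tempest-AttachInterfacesUnderV243Test-2009753731 "
    "tempest-AttachInterfacesUnderV243Test-2009753731] Timed out waiting "
    "for response from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90"
)

if __name__ == "__main__":
    for cell, hits in count_cell_timeouts([SAMPLE]).items():
        print(cell, hits)
```

Any iterable of lines works as input, so an open file object for a decompressed screen-n-sch.txt can be passed directly.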
Looks like something is causing timeouts reaching cell1 during grenade
runs. The only errors I see in the rabbit logs are these for the uwsgi
(API) servers:
=ERROR REPORT==== 22-Sep-2019::00:35:30 ===
closing AMQP connection <0.1511.0> (217.182.141.188:48492 -> 217.182.141.188:5672 - uwsgi:19453:72e08501-61ca-4ade-865e-f0605979ed7d):
missed heartbeats from client, timeout: 60s
--
It looks like we don't have MySQL logs in this grenade run; maybe we
need a fix like this somewhere for grenade:
https://github.com/openstack/devstack/commit/f92c346131db2c89b930b1a23f8489419a2217dc
logstash shows 1101 hits in the last 7 days (actually all since Sept 17),
in both the check and gate queues, all on failed jobs:
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Timed%20out%20waiting%20for%20response%20from%20cell%5C%22%20AND%20tags%3A%5C%22screen-n-sch.txt%5C%22&from=7d
It also appears to only show up on fortnebula and OVH nodes, primarily
fortnebula. I wonder if there is a performance/timing issue: if those
nodes are slower, we may not be waiting for something during the grenade
upgrade before proceeding.
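The logstash link above carries its search as a percent-encoded string in the URL fragment, which is hard to read. As a small sketch using only the Python standard library (the function name `logstash_query` is made up for illustration), the query and time window can be decoded like this:

```python
from urllib.parse import parse_qs, urlsplit

# The logstash link from this bug report.
URL = (
    "http://logstash.openstack.org/#dashboard/file/logstash.json"
    "?query=message%3A%5C%22Timed%20out%20waiting%20for%20response"
    "%20from%20cell%5C%22%20AND%20tags%3A%5C%22screen-n-sch.txt%5C%22"
    "&from=7d"
)


def logstash_query(url):
    """Return (query, window) decoded from a logstash dashboard link."""
    # Everything after '#' is the fragment; the dashboard parameters
    # follow the '?' inside that fragment, not in the normal URL query.
    fragment = urlsplit(url).fragment
    _, _, params = fragment.partition("?")
    decoded = parse_qs(params)  # parse_qs also undoes the %XX escapes
    return decoded["query"][0], decoded["from"][0]


if __name__ == "__main__":
    query, window = logstash_query(URL)
    print(query)
    print(window)
```

The decoded query searches for the warning message text, restricted to entries tagged screen-n-sch.txt, over a 7-day window.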
** Affects: nova
Importance: High
Status: Confirmed
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1844929
Follow ups
- [Bug 1844929] Fix included in openstack/nova rocky-eol
  From: OpenStack Infra, 2022-11-11
- [Bug 1844929] Fix included in openstack/nova queens-eol
  From: OpenStack Infra, 2022-11-11
- [Bug 1844929] Re: grenade jobs failing due to "Timed out waiting for response from cell" in scheduler
  From: Elod Illes, 2020-11-24
- [Bug 1844929] Re: grenade jobs failing due to "Timed out waiting for response from cell" in scheduler
  From: Elod Illes, 2020-11-24
- [Bug 1844929] Re: grenade jobs failing due to "Timed out waiting for response from cell" in scheduler
  From: Balazs Gibizer, 2020-04-15
- [Bug 1844929] Re: grenade jobs failing due to "Timed out waiting for response from cell" in scheduler
  From: Balazs Gibizer, 2020-04-15
- [Bug 1844929] Re: grenade jobs failing due to "Timed out waiting for response from cell" in scheduler
  From: melanie witt, 2020-04-10
- [Bug 1844929] Re: grenade jobs failing due to "Timed out waiting for response from cell" in scheduler
  From: melanie witt, 2020-04-10
- [Bug 1844929] Re: grenade jobs failing due to "Timed out waiting for response from cell" in scheduler
  From: OpenStack Infra, 2020-04-09
- [Bug 1844929] Re: grenade jobs failing due to "Timed out waiting for response from cell" in scheduler
  From: Matt Riedemann, 2019-12-20