yahoo-eng-team team mailing list archive
Message #82302
[Bug 1844929] Re: grenade jobs failing due to "Timed out waiting for response from cell" in scheduler
It seems the problem is still happening even after the fix merged to
master.
Here is a recent occurrence:
https://7ad29d1b700c1da60ae0-1bae5319fe4594ade335a46ad1c3bcc9.ssl.cf2.rackcdn.com/717083/5/check/neutron-grenade-multinode/2be9497/logs/screen-n-sch.txt
** Changed in: nova
Status: Fix Released => Confirmed
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1844929
Title:
grenade jobs failing due to "Timed out waiting for response from cell"
in scheduler
Status in grenade:
Invalid
Status in OpenStack Compute (nova):
Confirmed
Status in OpenStack Compute (nova) queens series:
New
Status in OpenStack Compute (nova) rocky series:
New
Status in OpenStack Compute (nova) stein series:
New
Status in OpenStack Compute (nova) train series:
In Progress
Bug description:
Seen here:
https://zuul.opendev.org/t/openstack/build/d53346210978403f888b85b82b2fe0c7/log/logs/screen-n-sch.txt.gz?severity=3#2368
Sep 22 00:50:54.174385 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: WARNING nova.context [None req-1929039e-1517-4326-9700-738d4b570ba6 tempest-AttachInterfacesUnderV243Test-2009753731 tempest-AttachInterfacesUnderV243Test-2009753731] Timed out waiting for response from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90
Looks like something is causing timeouts reaching cell1 during grenade
runs. The only errors I see in the rabbit logs are these for the uwsgi
(API) servers:
=ERROR REPORT==== 22-Sep-2019::00:35:30 ===
closing AMQP connection <0.1511.0> (217.182.141.188:48492 -> 217.182.141.188:5672 - uwsgi:19453:72e08501-61ca-4ade-865e-f0605979ed7d):
missed heartbeats from client, timeout: 60s
--
It looks like we don't have MySQL logs in this grenade run; maybe we
need a fix like this somewhere for grenade:
https://github.com/openstack/devstack/commit/f92c346131db2c89b930b1a23f8489419a2217dc
Logstash shows 1101 hits in the last 7 days, all of them since Sept 17:
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Timed%20out%20waiting%20for%20response%20from%20cell%5C%22%20AND%20tags%3A%5C%22screen-n-sch.txt%5C%22&from=7d
The hits are in both the check and gate queues, and all are failures.
It also appears to show up only on fortnebula and OVH nodes, primarily
fortnebula. I wonder if there is a performance/timing issue: if those
nodes are slower, we may not be waiting for something during the
grenade upgrade before proceeding.
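To check a downloaded screen-n-sch.txt for the same warning and see
which nodes it appears on, something like the following could be used.
This is only a rough sketch, not part of nova or grenade: the function
name count_cell_timeouts is made up for illustration, and it assumes
the journal-style line prefix seen in the excerpt above, where the
hostname is the fourth whitespace-separated field.

```python
import re
from collections import Counter

# Matches the warning text quoted in this bug and captures the cell UUID.
CELL_TIMEOUT = re.compile(
    r"Timed out waiting for response from cell (?P<cell>[0-9a-f-]{36})"
)


def count_cell_timeouts(lines):
    """Return (hits per hostname, hits per cell UUID) for the warning."""
    by_host = Counter()
    by_cell = Counter()
    for line in lines:
        match = CELL_TIMEOUT.search(line)
        if match is None:
            continue
        # Assumption: prefix is "Mon DD HH:MM:SS.ffffff hostname ...",
        # so the hostname is the fourth whitespace-separated field.
        fields = line.split()
        host = fields[3] if len(fields) > 3 else "unknown"
        by_host[host] += 1
        by_cell[match.group("cell")] += 1
    return by_host, by_cell
```

Passing an open file object (for example, open("screen-n-sch.txt"))
as lines works, since the function only iterates over it. The
hostnames in the first counter include the provider name (as in
ubuntu-bionic-ovh-gra1-0011664420), which should be enough to see
whether the hits cluster on particular node providers.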
To manage notifications about this bug go to:
https://bugs.launchpad.net/grenade/+bug/1844929/+subscriptions