Message #95084
[Bug 2092297] [NEW] Nova-compute service state flapping down/up
Public bug reported:
We're running Antelope (2023.1) in two environments, test and production.
The issue manifests in both.
The nova-compute service state started to fail, or rather to flap down/up,
at seemingly random (but frequent and shortening) intervals, complaining
about Rabbit connectivity. No other services have issues with Rabbit. Once
the nova-compute container is restarted on a specific compute host, the
issue seems to be solved; however, after a few days it starts recurring
more and more often, with the interval gradually shortening.
When the service goes down we observe the following log sequence
(communication between nova and rabbit, filtered for a specific id to trace
a single thread). Within each occurrence, entries are listed newest first:
Initial connection to rabbit:
Nov 29, 2024 @ 11:12:18.471 info controller-1 rabbit <0.17767.1487> connection <0.17767.1487> (172.16.4.52:33664 -> 172.16.4.22:5672) has a client-provided name: nova-compute:7:671a3304-8303-4e01-a1ab-3990e1869a63
Nov 29, 2024 @ 11:12:18.471 info controller-1 rabbit <0.17767.1487> connection <0.17767.1487> (172.16.4.52:33664 -> 172.16.4.22:5672 - nova-compute:7:671a3304-8303-4e01-a1ab-3990e1869a63): user 'openstack' authenticated and granted access to vhost '/'
11 days of silence.
First occurrence:
Dec 10, 2024 @ 13:16:42.395 info controller-1 rabbit <0.16505.2454> connection <0.16505.2454> (172.16.4.52:34228 -> 172.16.4.22:5672 - nova-compute:7:671a3304-8303-4e01-a1ab-3990e1869a63): user 'openstack' authenticated and granted access to vhost '/'
Dec 10, 2024 @ 13:16:40.775 info controller-1 rabbit <0.16505.2454> connection <0.16505.2454> (172.16.4.52:34228 -> 172.16.4.22:5672) has a client-provided name: nova-compute:7:671a3304-8303-4e01-a1ab-3990e1869a63
Dec 10, 2024 @ 13:16:40.681 INFO compute-36 nova-compute [671a3304-8303-4e01-a1ab-3990e1869a63] Reconnected to AMQP server on 172.16.4.22:5672 via [amqp] client with port 34228.
Dec 10, 2024 @ 13:16:39.669 ERROR compute-36 nova-compute [671a3304-8303-4e01-a1ab-3990e1869a63] AMQP server on 172.16.4.22:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
Dec 10, 2024 @ 13:16:34.610 error controller-1 rabbit <0.17767.1487> closing AMQP connection <0.17767.1487> (172.16.4.52:33664 -> 172.16.4.22:5672 - nova-compute:7:671a3304-8303-4e01-a1ab-3990e1869a63):
Second occurrence about 5 hours later; after that the interval shortens.
Dec 10, 2024 @ 18:03:43.279 info controller-1 rabbit <0.31431.2484> connection <0.31431.2484> (172.16.4.52:42630 -> 172.16.4.22:5672 - nova-compute:7:671a3304-8303-4e01-a1ab-3990e1869a63): user 'openstack' authenticated and granted access to vhost '/'
Dec 10, 2024 @ 18:03:42.639 info controller-1 rabbit <0.31431.2484> connection <0.31431.2484> (172.16.4.52:42630 -> 172.16.4.22:5672) has a client-provided name: nova-compute:7:671a3304-8303-4e01-a1ab-3990e1869a63
Dec 10, 2024 @ 18:03:42.545 INFO compute-36 nova-compute [671a3304-8303-4e01-a1ab-3990e1869a63] Reconnected to AMQP server on 172.16.4.22:5672 via [amqp] client with port 42630.
Dec 10, 2024 @ 18:03:41.534 ERROR compute-36 nova-compute [671a3304-8303-4e01-a1ab-3990e1869a63] AMQP server on 172.16.4.22:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: amqp.exceptions.RecoverableConnectionError: <RecoverableConnectionError: unknown error>
Dec 10, 2024 @ 18:03:41.063 error controller-1 rabbit <0.16505.2454> closing AMQP connection <0.16505.2454> (172.16.4.52:34228 -> 172.16.4.22:5672 - nova-compute:7:671a3304-8303-4e01-a1ab-3990e1869a63):
This id still seems to error to this day: 700+ log entries today. Each
time the connection is closed, nova complains about rabbit being
unreachable and reconnects after a few attempts.
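To put numbers on the shortening interval, the gaps between consecutive
"closing AMQP connection" events for one client could be computed from the
exported log lines. The following is only a rough sketch, not something we
ran as part of this report: the helper names (close_times, gaps) are
hypothetical, the timestamp layout is assumed to match the excerpts above,
and month-name parsing assumes an English locale. The two sample lines are
the close events quoted above.

```python
import re
from datetime import datetime

# Timestamp layout assumed from the log excerpts in this report.
TS_FORMAT = "%b %d, %Y @ %H:%M:%S.%f"
TS_PATTERN = re.compile(r"^(\w{3} \d{1,2}, \d{4} @ \d{2}:\d{2}:\d{2}\.\d{3})")

# Client-provided connection name used to trace a single thread.
CLIENT = "nova-compute:7:671a3304-8303-4e01-a1ab-3990e1869a63"


def close_times(lines, connection_name):
    """Return sorted timestamps of connection-close events for one client."""
    times = []
    for line in lines:
        if "closing AMQP connection" not in line or connection_name not in line:
            continue  # skip reconnects, auth messages and other clients
        match = TS_PATTERN.match(line.strip())
        if match:
            times.append(datetime.strptime(match.group(1), TS_FORMAT))
    return sorted(times)


def gaps(times):
    """Return the time deltas between consecutive events."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


sample_lines = [
    "Dec 10, 2024 @ 18:03:41.063 error controller-1 rabbit <0.16505.2454> "
    "closing AMQP connection <0.16505.2454> (172.16.4.52:34228 -> "
    "172.16.4.22:5672 - nova-compute:7:671a3304-8303-4e01-a1ab-3990e1869a63):",
    "Dec 10, 2024 @ 13:16:40.681 INFO compute-36 nova-compute "
    "[671a3304-8303-4e01-a1ab-3990e1869a63] Reconnected to AMQP server on "
    "172.16.4.22:5672 via [amqp] client with port 34228.",
    "Dec 10, 2024 @ 13:16:34.610 error controller-1 rabbit <0.17767.1487> "
    "closing AMQP connection <0.17767.1487> (172.16.4.52:33664 -> "
    "172.16.4.22:5672 - nova-compute:7:671a3304-8303-4e01-a1ab-3990e1869a63):",
]

times = close_times(sample_lines, CLIENT)
for gap in gaps(times):
    print(gap)
```

Fed with the full set of close events for this id, the printed gaps should
make the shrinking interval visible directly.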
Nova container build: 2023.1 commit 47428f6caf503b94583dac614b59971f60a0ba9c
Rabbit version: 3.11.28 on Erlang 25.3.2.12
Hypervisor: libvirt + kvm
Storage: ceph
Network: OVN
** Affects: nova
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2092297
Title:
Nova-compute service state flapping down/up
Status in OpenStack Compute (nova):
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2092297/+subscriptions