[Bug 1912379] Re: Neutron causes systemd to hang on Linux guests with SELinux disabled
[Expired for neutron because there has been no activity for 60 days.]
** Changed in: neutron
Status: Incomplete => Expired
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1912379
Title:
Neutron causes systemd to hang on Linux guests with SELinux disabled
Status in neutron:
Expired
Bug description:
We have observed an issue, present at least in the Ussuri release,
where a Linux guest VM's systemd process ends up hung and consuming
100% CPU if SELinux is disabled or set to permissive.
So far we have only verified this issue with CentOS 7 and 8 guests,
as those are the only Linux distributions we use.
We believe we have tracked the issue to Neutron, possibly more
specifically to the way remote security group rules are processed
and/or to an issue with inter-server communication when SELinux is
disabled in the guest.
We have observed the same behavior on multiple deployments, whether
an all-in-one deployment with LVM-backed Cinder volumes or a
multinode deployment with a Ceph backend. What we have learned /
observed so far is the following:
If SELinux is disabled/permissive in the guest VM AND the "default"
security group contains the rule created by "openstack security
group rule create --remote-group default default", the systemd
process spikes to 100% CPU usage (all cores) in the guest VM, and a
reboot is required to clear the issue. The issue recurs several
hours after the reboot. The problem exists whether or not other VMs
exist on the network and whether or not traffic is being passed;
simply having the ability to pass traffic causes the issue. Further,
in the test scenario where this issue was discovered, which led us
to dig further into the cause, network performance between VMs with
this configuration was poor and latency between the VMs on the
network was high. The setup was a web server and a MySQL server, and
queries had 10-second runtimes when called from the web server but
executed in milliseconds when run directly on the MySQL server.
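For reference, a minimal sketch of the suspect configuration,
assuming the project's default security group is named "default"
(the create command is the one quoted above; rule IDs and output
will differ per deployment):

# add the remote-group rule that appears to trigger the hang
openstack security group rule create --remote-group default default
# confirm the rule is present
openstack security group rule list default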
If the rule created by "openstack security group rule create
--remote-group default default" is removed from the security group,
the problem does not recur after a reboot.
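A sketch of how the offending rule can be located and removed,
assuming the same group name; <rule-id> is a placeholder for the ID
reported by the list command:

# find the rule whose remote security group points back at "default"
openstack security group rule list default
# delete it by ID
openstack security group rule delete <rule-id>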
Likewise, if SELinux is enabled in the guest, everything works fine.
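For completeness, the guest-side SELinux state can be checked with
the standard CentOS tools (note that setenforce only toggles between
enforcing and permissive at runtime; a guest booted with SELinux
fully disabled must be reconfigured via /etc/selinux/config and
rebooted):

getenforce    # prints Enforcing, Permissive, or Disabled
sestatus      # detailed SELinux status
setenforce 1  # switch a permissive guest to enforcing at runtime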
We also ran strace against the systemd process in the guest VMs
while the CPUs were pegged; all VMs exhibiting this behavior
appeared to be stuck in a perpetual "wait" state. strace output from
the seemingly hung systemd process:
epoll_pwait(4, [], 1024, 196, NULL, 8) = 0
epoll_pwait(4, [], 1024, 443, NULL, 8) = 0
epoll_pwait(4, [], 1024, 49, NULL, 8) = 0
epoll_pwait(4, [], 1024, 500, NULL, 8) = 0
epoll_pwait(4, [], 1024, 447, NULL, 8) = 0
epoll_pwait(4, [], 1024, 52, NULL, 8) = 0
This goes on over and over, and every once in a while what appears
to be a JSON-RPC exchange shows up in the middle of it:
read(9, "\1\0\0\0\0\0\0\0", 1024) = 8
write(13, "{\"id\":92,\"jsonrpc\":\"2.0\",\"method"..., 248) = 248
epoll_ctl(4, EPOLL_CTL_MOD, 13, {EPOLLIN, {u32=13, u64=13}}) = 0
epoll_pwait(4, [{EPOLLIN, {u32=13, u64=13}}], 1024, 335, NULL, 8) = 1
read(13, "{\"jsonrpc\":\"2.0\",\"method\":\"job\","..., 2048) = 376
futex(0xa153e4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0xa153e0, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1
epoll_pwait(4, [{EPOLLIN, {u32=9, u64=9}}], 1024, 135, NULL, 8) = 1
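For anyone trying to reproduce the capture: a sketch of an strace
invocation that should produce output like the above, assuming
systemd is PID 1 in the guest (the exact flags we used are not
recorded here, and attaching to PID 1 requires root):

# attach to systemd, follow forks, and write the trace to a file
strace -f -p 1 -o /tmp/systemd.trace
# or watch live for ten seconds
timeout 10 strace -f -p 1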
All setups on which we are observing this behavior were deployed
with kolla-ansible and are running the latest Ussuri release, with
KVM as the hypervisor on all of them.
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1912379/+subscriptions