
yahoo-eng-team team mailing list archive

[Bug 1912379] [NEW] Neutron causes systemd to hang on Linux guests with SELinux disabled

Public bug reported:

We have observed an issue, present at least in the Ussuri release, where
a Linux guest VM's systemd process ends up hung and utilizing 100% CPU
if SELinux is disabled or set to permissive.

So far we have only verified this issue with CentOS 7 and 8 guests, as
those are the only Linux distributions we use.

We believe we have tracked the issue to something in Neutron, possibly
more specifically to the way remote security group rules are processed,
and/or to an issue with inter-server communication when SELinux is
disabled in the guest.

We have observed the same behavior on multiple deployments, whether an
all-in-one deployment with LVM-backed Cinder volumes or a multinode
deployment with a Ceph backend.  What we have learned / observed so far
is the following:

If SELinux is disabled or permissive in the guest VM AND the "default"
security group contains the rule created by "openstack security group
rule create --remote-group default default", the systemd process in the
guest VM spikes to 100% CPU usage (all cores) and requires a reboot to
clear.  The issue recurs several hours after the reboot.  The problem
exists whether or not other VMs exist on the network and/or traffic is
being passed; simply having the ability to pass traffic is enough to
trigger it.  Further, in the test scenario where the issue was first
discovered, and which led us to dig further into the cause, network
performance between VMs with this configuration was poor and latency
between the VMs on the network was high.  The setup was a web server /
MySQL server pair, and queries that execute in milliseconds when run
directly on the MySQL server would take 10 seconds when called from the
web server.
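
For completeness, this is roughly how a guest ends up in the SELinux
state that triggers the problem; the commands below are illustrative of
what we do inside the CentOS guests, not an exact transcript:

getenforce        # reports Enforcing / Permissive / Disabled
setenforce 0      # switch to permissive until the next boot
sed -i 's/^SELINUX=.*/SELINUX=permissive/' /etc/selinux/config   # persist the change across reboots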

If the rule created by "openstack security group rule create
--remote-group default default" is removed from the security group, the
problem does not recur after a reboot.
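
For reference, a sketch of how the offending rule can be located and
removed with the openstack CLI (the rule ID is a placeholder, not a
value from our deployment):

openstack security group rule list default      # note the ID of the ingress rule whose remote security group is "default"
openstack security group rule delete <RULE_ID>  # delete that rule using the ID from the previous command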

Likewise, if SELinux is enabled in the guest, everything works fine.

We also ran strace on the systemd process in the guest VMs while the
CPUs were pegged, and all VMs exhibiting this behavior appeared to be
stuck in a perpetual "wait" state.  strace output from a seemingly hung
systemd process:

epoll_pwait(4, [], 1024, 196, NULL, 8)  = 0	
epoll_pwait(4, [], 1024, 443, NULL, 8)  = 0	
epoll_pwait(4, [], 1024, 49, NULL, 8)   = 0	
epoll_pwait(4, [], 1024, 500, NULL, 8)  = 0	
epoll_pwait(4, [], 1024, 447, NULL, 8)  = 0	
epoll_pwait(4, [], 1024, 52, NULL, 8)   = 0

This repeats over and over, and every once in a while in the middle of
it we see what appears to be a JSON-RPC request:

read(9, "\1\0\0\0\0\0\0\0", 1024)       = 8	
write(13, "{\"id\":92,\"jsonrpc\":\"2.0\",\"method"..., 248) = 248	
epoll_ctl(4, EPOLL_CTL_MOD, 13, {EPOLLIN, {u32=13, u64=13}}) = 0	
epoll_pwait(4, [{EPOLLIN, {u32=13, u64=13}}], 1024, 335, NULL, 8) = 1	
read(13, "{\"jsonrpc\":\"2.0\",\"method\":\"job\","..., 2048) = 376	
futex(0xa153e4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0xa153e0, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1
epoll_pwait(4, [{EPOLLIN, {u32=9, u64=9}}], 1024, 135, NULL, 8) = 1
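
For reference, a similar trace can be captured inside an affected guest
with something along these lines (the exact flags are illustrative, not
necessarily what we used):

strace -f -tt -p 1 -o /tmp/systemd.strace   # attach to systemd (PID 1), follow threads, timestamp each syscall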

All setups on which we are observing this behavior were deployed with
kolla-ansible and are running the latest Ussuri release.  KVM is the
hypervisor on all of them.

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1912379

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1912379/+subscriptions

