
yahoo-eng-team team mailing list archive

[Bug 1912379] Re: Neutron causes systemd to hang on Linux guests with SELinux disabled

 

[Expired for neutron because there has been no activity for 60 days.]

** Changed in: neutron
       Status: Incomplete => Expired

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1912379

Title:
  Neutron causes systemd to hang on Linux guests with SELinux disabled

Status in neutron:
  Expired

Bug description:
  We have observed an issue that exists at least in the Ussuri release
  where a linux guest VM will have its systemd process end up hung and
  utilize 100% CPU if SELinux is disabled or set to permissive.

  As of now we have only verified this issue with CentOS 7 and 8
  guests, as those are the only Linux distributions we use.

  We believe we have tracked the issue to something in Neutron and
  possibly more specifically to the way remote security group rules are
  processed and/or to an issue with inter-server communication when
  SELinux is disabled in the guest.

  We have observed the same behavior on multiple deployments whether it
  is an all-in-one deployment with LVM backed cinder volumes or a
  multinode deployment with a ceph backend.  What we have learned /
  observed so far is the following:

  If SELinux is disabled or permissive in the guest VM AND the "Default"
  security group contains the rule created by "openstack security group
  rule create --remote-group default default", the systemd process in
  the guest VM spikes to 100% CPU usage (all cores) and a reboot is
  required to clear it.  The issue recurs several hours after the
  reboot.  The problem exists whether or not other VMs exist on the
  network and/or traffic is being passed; simply having the ability to
  pass traffic triggers it.  Further, in the test scenario where the
  issue was first discovered (which led us to dig further into the
  cause), network performance between VMs with this configuration was
  poor and latency between the VMs on the network was high.  That setup
  was a web server and a MySQL server: queries had 10-second runtimes
  when issued from the web server but executed in milliseconds when run
  directly on the MySQL server.

  If the rule created by "openstack security group rule create
  --remote-group default default" is removed from the security group,
  the problem does not recur after a reboot.

  Likewise, if SELinux is enabled in the guest, everything works fine.

  We also ran strace on the systemd process in the guest VMs while the
  CPUs were pegged; all VMs exhibiting this behavior appeared to be
  stuck in a perpetual "wait" state.  strace output from the seemingly
  hung systemd process:

  epoll_pwait(4, [], 1024, 196, NULL, 8)  = 0	
  epoll_pwait(4, [], 1024, 443, NULL, 8)  = 0	
  epoll_pwait(4, [], 1024, 49, NULL, 8)   = 0	
  epoll_pwait(4, [], 1024, 500, NULL, 8)  = 0	
  epoll_pwait(4, [], 1024, 447, NULL, 8)  = 0	
  epoll_pwait(4, [], 1024, 52, NULL, 8)   = 0
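  The loop above can be flagged mechanically when triaging strace
  captures from many guests.  A minimal sketch (the helper name and
  the run-length threshold are our choices, not part of any tool):

```python
import re

# Hedged sketch: detect the busy-wait pattern in strace output -- a
# long run of epoll_pwait calls that all time out with no ready events
# (empty readiness list, return value 0), as in the capture above.
EPOLL_TIMEOUT = re.compile(r"epoll_pwait\(\d+, \[\], .*\)\s+=\s+0")

def looks_busy(lines, threshold=5):
    """Return True if `threshold` consecutive strace lines are
    zero-return epoll_pwait calls (timeouts with no ready fds)."""
    run = 0
    for line in lines:
        if EPOLL_TIMEOUT.search(line):
            run += 1
            if run >= threshold:
                return True
        else:
            run = 0
    return False

trace = [
    "epoll_pwait(4, [], 1024, 196, NULL, 8)  = 0",
    "epoll_pwait(4, [], 1024, 443, NULL, 8)  = 0",
    "epoll_pwait(4, [], 1024, 49, NULL, 8)   = 0",
    "epoll_pwait(4, [], 1024, 500, NULL, 8)  = 0",
    "epoll_pwait(4, [], 1024, 447, NULL, 8)  = 0",
    "epoll_pwait(4, [], 1024, 52, NULL, 8)   = 0",
]
print(looks_busy(trace))  # True for the loop shown above
```

  Note that lines where epoll_pwait returns actual events (a non-empty
  readiness list) do not match, so an ordinary event loop is not
  flagged.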

  This goes on over and over, and every once in a while in the middle
  of it we see what appears to be a JSON-RPC exchange:

  read(9, "\1\0\0\0\0\0\0\0", 1024)       = 8	
  write(13, "{\"id\":92,\"jsonrpc\":\"2.0\",\"method"..., 248) = 248	
  epoll_ctl(4, EPOLL_CTL_MOD, 13, {EPOLLIN, {u32=13, u64=13}}) = 0	
  epoll_pwait(4, [{EPOLLIN, {u32=13, u64=13}}], 1024, 335, NULL, 8) = 1	
  read(13, "{\"jsonrpc\":\"2.0\",\"method\":\"job\","..., 2048) = 376	
  futex(0xa153e4, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0xa153e0, FUTEX_OP_SET<<28|0<<12|FUTEX_OP_CMP_GT<<24|0x1) = 1	
  epoll_pwait(4, [{EPOLLIN, {u32=9, u64=9}}], 1024, 135, NULL, 8) = 1

  All setups on which we observe this behavior were deployed with
  kolla-ansible and are running the latest Ussuri release; all use KVM
  as the hypervisor.

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1912379/+subscriptions

