yahoo-eng-team team mailing list archive

[Bug 2024205] [NEW] [OVN] Hash Ring nodes removed when "periodic worker" is killed

 

Public bug reported:

Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=2213910

In the ML2/OVN driver we set a signal handler for SIGTERM to remove the
hash ring nodes upon service exit [0]. However, while investigating a
customer bug we identified that if an unrelated Neutron worker is killed
(such as the "periodic worker" in this case), that process could remove
the entries from the ovn_hash_ring table for that hostname.

If this happens on all controllers, the ovn_hash_ring table is rendered
empty and OVSDB events are no longer processed by ML2/OVN.
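
To illustrate the failure mode, here is a minimal in-memory sketch. The
table layout, the helper names and the controller names are hypothetical
stand-ins, not Neutron's actual code; the only behaviour taken from the
report is that the exit handler removes every entry for the hostname,
whichever worker receives the signal.

```python
# Hypothetical in-memory stand-in for the ovn_hash_ring table.
# Each row is (node_uuid, hostname); all names are illustrative only.
import uuid


def register_node(table, hostname):
    """Add one hash ring node for a worker running on `hostname`."""
    node_uuid = str(uuid.uuid4())
    table.append((node_uuid, hostname))
    return node_uuid


def remove_nodes_by_hostname(table, hostname):
    """Exit-handler behaviour as described in the report:
    drop every row for the host, not only the dying worker's row."""
    table[:] = [row for row in table if row[1] != hostname]


def simulate_periodic_worker_killed(controllers, workers_per_host=2):
    """Register nodes on every controller, then kill one worker on each."""
    table = []
    for host in controllers:
        for _ in range(workers_per_host):
            register_node(table, host)
    # One unrelated worker per controller gets SIGTERM; its handler
    # wipes all nodes registered under that hostname.
    for host in controllers:
        remove_nodes_by_hostname(table, host)
    return table


remaining = simulate_periodic_worker_killed(["ctrl-0", "ctrl-1", "ctrl-2"])
print(len(remaining))
```

With one worker killed on each controller the simulated table ends up
empty, which matches the state described above.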

Proposed solution:

This LP proposes to make this more reliable: instead of removing the
nodes from the ovn_hash_ring table on exit, we would mark them as
offline. That way, if a worker dies the nodes will remain registered in
the table and the heartbeat thread will set them as online again on the
next beat. If the service is properly stopped the heartbeat won't be
running and the nodes will be seen as offline by the Hash Ring manager.
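
A rough sketch of the proposed behaviour follows. It assumes that
"offline" is represented by ageing a last-heartbeat timestamp past a
timeout; that representation, the 60 second timeout and every name in
the code are assumptions made for illustration, not the actual patch.

```python
# Sketch of "mark offline, let the heartbeat revive". All names hypothetical.
from dataclasses import dataclass

HEARTBEAT_TIMEOUT = 60  # seconds; assumed value for illustration


@dataclass
class Node:
    node_uuid: str
    hostname: str
    updated_at: float  # time of the last heartbeat, in seconds


def touch_nodes(table, hostname, now):
    """Heartbeat: refresh every node of this host, marking it online."""
    for node in table:
        if node.hostname == hostname:
            node.updated_at = now


def set_nodes_offline(table, hostname, now):
    """Exit handler: age the nodes past the timeout instead of deleting them."""
    for node in table:
        if node.hostname == hostname:
            node.updated_at = now - HEARTBEAT_TIMEOUT - 1


def active_nodes(table, now):
    """Nodes the ring manager would consider online."""
    return [n for n in table if now - n.updated_at <= HEARTBEAT_TIMEOUT]


table = [Node("a", "ctrl-0", 0.0), Node("b", "ctrl-0", 0.0)]
set_nodes_offline(table, "ctrl-0", now=10.0)        # a worker was killed
offline_count = len(active_nodes(table, now=10.0))  # nothing online yet
touch_nodes(table, "ctrl-0", now=40.0)              # next heartbeat
online_count = len(active_nodes(table, now=40.0))   # nodes are back
print(offline_count, online_count, len(table))
```

In this sketch the rows never leave the table, so a surviving heartbeat
can bring them back online; if the whole service is stopped, nothing
touches them and they stay offline.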

As a note, upon the next startup of the service the nodes matching the
server hostname will be removed from the ovn_hash_ring table and added
again accordingly as Neutron workers are spawned [1].
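
The startup step can be pictured roughly as below. This is only a
sketch: the dictionary rows, the `cleanup_and_register` helper and the
sample hostnames are hypothetical, and in the real service the rows are
added as each worker is spawned rather than in a single call.

```python
# Hypothetical startup sequence; names are illustrative only.
import uuid


def cleanup_and_register(table, hostname, worker_count):
    """Drop stale rows for `hostname`, then add one fresh row per worker."""
    table[:] = [row for row in table if row["hostname"] != hostname]
    new_ids = []
    for _ in range(worker_count):
        node_uuid = str(uuid.uuid4())
        table.append({"node_uuid": node_uuid, "hostname": hostname})
        new_ids.append(node_uuid)
    return new_ids


table = [
    {"node_uuid": "stale-1", "hostname": "ctrl-0"},  # left over, offline
    {"node_uuid": "other-1", "hostname": "ctrl-1"},  # another controller
]
fresh = cleanup_and_register(table, "ctrl-0", worker_count=3)
print(len(table))
```

Rows belonging to other hostnames are left untouched, so stale offline
entries are only cleaned up by the host that owns them.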

[0] https://github.com/openstack/neutron/blob/cbb89fdb1414a1b3a8e8b3a9a4154ef627bb9d1a/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L295-L296
[1] https://github.com/openstack/neutron/blob/cbb89fdb1414a1b3a8e8b3a9a4154ef627bb9d1a/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L316

** Affects: neutron
     Importance: High
     Assignee: Lucas Alvares Gomes (lucasagomes)
         Status: Confirmed


** Tags: ovn

** Changed in: neutron
       Status: Fix Committed => Confirmed

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2024205

Title:
  [OVN] Hash Ring nodes removed when "periodic worker" is killed

Status in neutron:
  Confirmed



