yahoo-eng-team team mailing list archive
Message #92505
[Bug 2024205] [NEW] [OVN] Hash Ring nodes removed when "periodic worker" is killed
Public bug reported:
Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=2213910
In the ML2/OVN driver we set a signal handler for SIGTERM to remove the
hash ring nodes upon service exit [0]. During the investigation of a
customer bug, we identified that if an unrelated Neutron worker is
killed (such as the "periodic worker" in this case), that process could
remove the entries from the ovn_hash_ring table for that hostname.
If this happens on all controllers, the ovn_hash_ring table is rendered
empty and OVSDB events are no longer processed by ML2/OVN.
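To illustrate why an empty table stops event processing, here is a
minimal sketch of hash-based event dispatch. It is not Neutron's
implementation: the function names and node names are hypothetical, and
it assumes each event is handled only by the node its key hashes to.

```python
import hashlib
from bisect import bisect_left
from typing import List, Optional


def _position(value: str) -> int:
    # Stable hash so every process maps a key to the same ring position.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


def owner_for(event_key: str, nodes: List[str]) -> Optional[str]:
    """Return the node responsible for event_key, or None if no nodes exist."""
    if not nodes:
        # Empty ring: no node claims the event, so it is never processed.
        return None
    ring = sorted((_position(n), n) for n in nodes)
    positions = [pos for pos, _ in ring]
    # First node at or after the key's position, wrapping around the ring.
    idx = bisect_left(positions, _position(event_key)) % len(ring)
    return ring[idx][1]
```

With at least one registered node, owner_for always returns a node; with
none, every event is left unclaimed.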
Proposed solution:
This LP proposes to make this more reliable: instead of removing the
nodes from the ovn_hash_ring table on exit, we would mark them as
offline. That way, if a worker dies, the nodes will remain registered
in the table and the heartbeat thread will set them as online again on
the next beat. If the service is properly stopped, the heartbeat won't
be running and the nodes will be seen as offline by the Hash Ring
manager.
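The difference between removing and marking offline can be sketched
with an in-memory stand-in for the ovn_hash_ring table. This is an
illustration only, not the actual patch: the class, method names,
TIMEOUT value, and the choice to represent "offline" as a very old
timestamp are all assumptions.

```python
OFFLINE = float("-inf")  # assumption: offline == timestamp far in the past
TIMEOUT = 60.0           # hypothetical liveness window, in seconds


class HashRingTable:
    """In-memory stand-in for the ovn_hash_ring table (hypothetical)."""

    def __init__(self):
        # (hostname, node_uuid) -> timestamp of the last heartbeat
        self.rows = {}

    def add_node(self, hostname, node_uuid, now):
        self.rows[(hostname, node_uuid)] = now

    def remove_nodes(self, hostname):
        # Behaviour described in the report: rows for the host are deleted.
        self.rows = {k: v for k, v in self.rows.items() if k[0] != hostname}

    def mark_offline(self, hostname):
        # Proposed behaviour: keep the rows, only flag them as offline.
        for key in self.rows:
            if key[0] == hostname:
                self.rows[key] = OFFLINE

    def heartbeat(self, hostname, now):
        # The heartbeat can only refresh rows that still exist.
        for key in self.rows:
            if key[0] == hostname:
                self.rows[key] = now

    def active_nodes(self, now):
        return [k for k, ts in self.rows.items() if now - ts <= TIMEOUT]
```

In this model, calling remove_nodes() when a single worker is killed
leaves nothing for the next heartbeat() to refresh, whereas after
mark_offline() the next heartbeat() brings the host's nodes back online.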
As a note, upon the next startup of the service, the nodes matching the
server hostname will be removed from the ovn_hash_ring table and added
again accordingly as Neutron workers are spawned [1].
[0] https://github.com/openstack/neutron/blob/cbb89fdb1414a1b3a8e8b3a9a4154ef627bb9d1a/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L295-L296
[1] https://github.com/openstack/neutron/blob/cbb89fdb1414a1b3a8e8b3a9a4154ef627bb9d1a/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py#L316
** Affects: neutron
Importance: High
Assignee: Lucas Alvares Gomes (lucasagomes)
Status: Confirmed
** Tags: ovn
** Changed in: neutron
Status: Fix Committed => Confirmed
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2024205
Title:
[OVN] Hash Ring nodes removed when "periodic worker" is killed
Status in neutron:
Confirmed
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2024205/+subscriptions