yahoo-eng-team team mailing list archive
Message #81418
[Bug 1860436] Re: [ovn] Agent liveness checks are flaky and report false positives
Reviewed: https://review.opendev.org/703612
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=18410097f23a8e3d9cf33393b47d8b1a91020e4a
Submitter: Zuul
Branch: master
commit 18410097f23a8e3d9cf33393b47d8b1a91020e4a
Author: Daniel Alvarez <dalvarez@xxxxxxxxxx>
Date: Tue Jan 21 14:26:22 2020 +0100
[ovn] Agent liveness - allow time to propagate checks
Right now neutron-server bumps the nb_cfg parameter in the NB_Global
table. The new value needs to be propagated by northd to SB_Global,
processed by the agents, and written back into SB_Global.
neutron-server then has to process those updates, but unfortunately
it checks straight away, and the value it reads is often
behind the expected value.
All this results in frequent false positives that show agents as dead
when they are not.
This patch relaxes the checks by allowing a difference of 1
between the read and expected values.
Change-Id: Id91481b690ad569c5dcfa5bd404f497f591d729d
Closes-Bug: 1860436
Signed-off-by: Daniel Alvarez <dalvarez@xxxxxxxxxx>
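The relaxed comparison described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: ChassisView, is_caught_up and NB_CFG_TOLERANCE are hypothetical names, and the only thing taken from the commit message is the idea of tolerating a difference of 1 between the expected and the read value.

```python
from dataclasses import dataclass

# Hypothetical constant: the commit message only says "a difference of 1".
NB_CFG_TOLERANCE = 1


@dataclass
class ChassisView:
    """Illustrative stand-in for the value neutron-server reads for one agent."""
    name: str
    nb_cfg: int


def is_caught_up(expected_nb_cfg: int, chassis: ChassisView,
                 tolerance: int = NB_CFG_TOLERANCE) -> bool:
    """Return True if the chassis value is at most `tolerance` behind."""
    lag = expected_nb_cfg - chassis.nb_cfg
    return lag <= tolerance


if __name__ == "__main__":
    # Values taken from the log excerpt in the bug description.
    chassis = ChassisView(name="chassis-a", nb_cfg=36915)
    print(is_caught_up(36916, chassis))               # relaxed check
    print(is_caught_up(36916, chassis, tolerance=0))  # strict check
```

With tolerance=0 the function behaves like the strict check that produced the false positives; with the default it accepts a chassis that is one bump behind.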
** Changed in: neutron
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1860436
Title:
[ovn] Agent liveness checks are flaky and report false positives
Status in neutron:
Fix Released
Bug description:
The way that the networking-ovn mech driver performs health checks on
agents reports false positives due to race conditions:
1) neutron-server increments nb_cfg in the NB_Global table from X to X+1
2) neutron-server almost immediately checks all the Chassis rows to see if they have written (X+1). [1]
3) neutron-server processes the updates from each agent from X to X+1
*Most* of the time, the condition in step 2 doesn't hold, so
the timestamp is not updated. The result is that after 60 seconds
(the agent timeout default value), the agent is shown as dead. Sometimes
3) happens before 2), so the timestamp gets updated and all is fine, but
this is not the normal case:
1) Bump of nb_cfg
2020-01-21 11:35:59.534 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] XXX nb_cfg = 36915
2020-01-21 11:35:59.538 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] XXX nb_cfg = 36916
2) Check of each chassis ext_id against our new bumped nb_cfg:
2020-01-21 11:35:59.539 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.540 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.541 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.542 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.543 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.544 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
2020-01-21 11:35:59.546 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
3) Processing updates [2] in the ChassisEvent (some are even older!)
2020-01-21 11:35:59.546 30 INFO networking_ovn.ovsdb.ovsdb_monitor [req-1906156e-a089-4bde-b9bc-c9f4f9655a3d - - - - -] XXX chassis update: 36915
2020-01-21 11:35:59.548 29 INFO networking_ovn.ovsdb.ovsdb_monitor [req-072386aa-87e9-486c-bb6f-3dd2bdc038bd - - - - -] XXX chassis update: 36915
2020-01-21 11:35:59.556 32 INFO networking_ovn.ovsdb.ovsdb_monitor [req-efa34cac-2296-4d30-b153-9630b0309fcd - - - - -] XXX chassis update:
2020-01-21 11:35:59.556 27 INFO networking_ovn.ovsdb.ovsdb_monitor [req-91f7d181-bfa3-4646-9814-bb680d011081 - - - - -] XXX chassis update:
2020-01-21 11:35:59.557 25 INFO networking_ovn.ovsdb.ovsdb_monitor [req-420e5a25-13e4-4da6-8277-8a3a1028c9e9 - - - - -] XXX chassis update:
2020-01-21 11:35:59.756 30 INFO networking_ovn.ovsdb.ovsdb_monitor [req-1906156e-a089-4bde-b9bc-c9f4f9655a3d - - - - -] XXX chassis update: 36916
2020-01-21 11:35:59.778 29 INFO networking_ovn.ovsdb.ovsdb_monitor [req-072386aa-87e9-486c-bb6f-3dd2bdc038bd - - - - -] XXX chassis update: 36916
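To quantify how far behind the chassis values are at check time, the "YYY" debug lines above can be parsed with a short script. This is only a sketch: the "YYY Global nb_cfg = ... chassis nb_cfg = ..." format is the ad-hoc debug output shown in this report, not a standard networking-ovn log format, and check_lags is a hypothetical helper name.

```python
import re
from typing import List

# Matches the message part of the ad-hoc "YYY" debug lines shown above.
CHECK_RE = re.compile(r"YYY Global nb_cfg = (\d+) chassis nb_cfg = (\d+)")


def check_lags(log_text: str) -> List[int]:
    """Return (global nb_cfg - chassis nb_cfg) for every check line found."""
    return [int(global_cfg) - int(chassis_cfg)
            for global_cfg, chassis_cfg in CHECK_RE.findall(log_text)]


# Message portions of two of the check lines from the excerpt above
# (timestamps and request IDs omitted for brevity).
SAMPLE = """\
YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
YYY Global nb_cfg = 36916 chassis nb_cfg = 36915
"""

if __name__ == "__main__":
    print(check_lags(SAMPLE))
```

Because the pattern is searched anywhere in the text, the full log lines (with timestamps and request IDs) can be fed to the function unchanged.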
IMO, we need to space the bump of nb_cfg [2] and the check [3] in time,
as the NB_Global change needs to be propagated to the SB, processed
by all agents, and then sent back to neutron-server, which needs to process
the JSON and update the internal tables. So even if it's fast,
most of the time it is not fast enough.
Another solution is to allow a difference of '1' to update timestamps.
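One possible way to "space" the bump and the check is to poll for a short while after the bump instead of checking once. The sketch below is an assumption about how that could look, not how networking-ovn does it: wait_until_caught_up and the read_chassis_nb_cfg callback are hypothetical, and the timeout and interval values are arbitrary.

```python
import time
from typing import Callable, Dict


def wait_until_caught_up(read_chassis_nb_cfg: Callable[[], Dict[str, int]],
                         expected: int,
                         timeout: float = 1.0,
                         interval: float = 0.05) -> Dict[str, bool]:
    """Poll until every chassis reports `expected` or `timeout` expires.

    `read_chassis_nb_cfg` is a caller-supplied function returning a mapping
    of chassis name to the nb_cfg value currently seen for that chassis.
    Returns a mapping of chassis name to whether it had caught up.
    """
    deadline = time.monotonic() + timeout
    while True:
        values = read_chassis_nb_cfg()
        status = {name: value >= expected for name, value in values.items()}
        if all(status.values()) or time.monotonic() >= deadline:
            return status
        time.sleep(interval)


if __name__ == "__main__":
    # Simulated reader: the chassis lags for two reads, then catches up.
    reads = {"count": 0}

    def fake_reader() -> Dict[str, int]:
        reads["count"] += 1
        return {"chassis-a": 36916 if reads["count"] >= 3 else 36915}

    print(wait_until_caught_up(fake_reader, 36916, interval=0.01))
```

The trade-off is that the check now blocks for up to `timeout`, whereas allowing a difference of 1 keeps the check immediate at the cost of accepting a slightly stale value.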
[0] https://opendev.org/openstack/networking-ovn/src/branch/master/networking_ovn/ml2/mech_driver.py#L1093
[1] https://opendev.org/openstack/networking-ovn/src/branch/master/networking_ovn/ml2/mech_driver.py#L1098
[2] https://github.com/openstack/networking-ovn/blob/bf577e5a999f7db4cb9b790664ad596e1926d9a0/networking_ovn/ml2/mech_driver.py#L988
[3] https://github.com/openstack/networking-ovn/blob/6302298e9c4313f1200c543c89d92629daff9e89/networking_ovn/ovsdb/ovsdb_monitor.py#L74
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1860436/+subscriptions