yahoo-eng-team team mailing list archive

[Bug 1860436] Re: [ovn] Agent liveness checks are flaky and report false positives

 

Reviewed:  https://review.opendev.org/703612
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=18410097f23a8e3d9cf33393b47d8b1a91020e4a
Submitter: Zuul
Branch:    master

commit 18410097f23a8e3d9cf33393b47d8b1a91020e4a
Author: Daniel Alvarez <dalvarez@xxxxxxxxxx>
Date:   Tue Jan 21 14:26:22 2020 +0100

    [ovn] Agent liveness - allow time to propagate checks
    
    Right now neutron-server bumps the nb_cfg parameter in the NB_Global
    table. The new value needs to be propagated by northd to SB_Global,
    processed by the agents, and written back into SB_Global.
    The write-back then has to be processed by neutron-server, but
    unfortunately the server checks straight away, and often the value
    it reads is behind the expected value.
    
    All this results in frequent false positives showing dead agents
    when they are not.
    
    This patch relaxes the checks by allowing a difference of 1
    between the read and expected values.
    
    Change-Id: Id91481b690ad569c5dcfa5bd404f497f591d729d
    Closes-Bug: 1860436
    Signed-off-by: Daniel Alvarez <dalvarez@xxxxxxxxxx>


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1860436

Title:
  [ovn] Agent liveness checks are flaky and report false positives

Status in neutron:
  Fix Released

Bug description:
  The way the networking-ovn mech driver performs health checks on
  agents reports false positives due to a race condition:

  1) neutron-server increments nb_cfg in the NB_Global table from X to X+1
  2) neutron-server almost immediately checks all the Chassis rows to see if they have written (X+1). [1]
  3) neutron-server processes the updates from each agent from X to X+1
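
  The ordering problem in the three steps above can be reproduced with a
  small deterministic model. This is only a sketch: FakeServer, its
  attributes, and run() are hypothetical names and do not mirror the
  networking-ovn code. It just shows that a strict equality check made
  before the agent's update is processed sees a stale value.

```python
class FakeServer:
    """Toy model of neutron-server's view of one chassis (hypothetical)."""

    def __init__(self, nb_cfg):
        self.nb_cfg = nb_cfg            # value written to NB_Global
        self.chassis_nb_cfg = nb_cfg    # last value seen from the agent
        self.timestamp_updated = False

    def bump(self):
        # Step 1: increment nb_cfg from X to X+1.
        self.nb_cfg += 1

    def check(self):
        # Step 2: strict comparison against the freshly bumped value.
        self.timestamp_updated = self.chassis_nb_cfg == self.nb_cfg
        return self.timestamp_updated

    def process_agent_update(self, value):
        # Step 3: the agent's write-back finally reaches the server.
        self.chassis_nb_cfg = value


def run(order):
    """Bump once, then apply "check" / "update" steps in the given order."""
    server = FakeServer(nb_cfg=36915)
    server.bump()
    for step in order:
        if step == "check":
            server.check()
        elif step == "update":
            server.process_agent_update(server.nb_cfg)
    return server.timestamp_updated
```

  In this model, run(["check", "update"]) leaves the timestamp stale,
  while run(["update", "check"]) refreshes it.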

  Most of the time, the condition in step 2 does not hold, so the
  timestamp is not updated. As a result, after 60 seconds (the default
  agent timeout), the agent is shown as dead. Sometimes step 3 happens
  before step 2, so the timestamp gets updated and all is fine, but
  this is not the normal case:

  
  1) Bump of nb_cfg
  2020-01-21 11:35:59.534 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] XXX nb_cfg = 36915
  2020-01-21 11:35:59.538 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] XXX nb_cfg = 36916

  
  2) Check of each chassis ext_id against our new bumped nb_cfg: 
  2020-01-21 11:35:59.539 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   chassis nb_cfg = 36915
  2020-01-21 11:35:59.540 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   chassis nb_cfg = 36915
  2020-01-21 11:35:59.541 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   chassis nb_cfg = 36915
  2020-01-21 11:35:59.542 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   chassis nb_cfg = 36915
  2020-01-21 11:35:59.543 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   chassis nb_cfg = 36915
  2020-01-21 11:35:59.544 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   chassis nb_cfg = 36915
  2020-01-21 11:35:59.546 28 INFO networking_ovn.ml2.mech_driver [req-26facb52-1a67-4897-93d2-fa09ec91a0eb 0295049dc9ac49dfa4a6cd909808ce16 d974caee8132421190dc790ebad401cc - default default] YYY Global nb_cfg = 36916   chassis nb_cfg = 36915

  
  3) Processing updates [2] in the ChassisEvent (some are even older!)
  2020-01-21 11:35:59.546 30 INFO networking_ovn.ovsdb.ovsdb_monitor [req-1906156e-a089-4bde-b9bc-c9f4f9655a3d - - - - -] XXX chassis update: 36915
  2020-01-21 11:35:59.548 29 INFO networking_ovn.ovsdb.ovsdb_monitor [req-072386aa-87e9-486c-bb6f-3dd2bdc038bd - - - - -] XXX chassis update: 36915
  2020-01-21 11:35:59.556 32 INFO networking_ovn.ovsdb.ovsdb_monitor [req-efa34cac-2296-4d30-b153-9630b0309fcd - - - - -] XXX chassis update:
  2020-01-21 11:35:59.556 27 INFO networking_ovn.ovsdb.ovsdb_monitor [req-91f7d181-bfa3-4646-9814-bb680d011081 - - - - -] XXX chassis update:
  2020-01-21 11:35:59.557 25 INFO networking_ovn.ovsdb.ovsdb_monitor [req-420e5a25-13e4-4da6-8277-8a3a1028c9e9 - - - - -] XXX chassis update:
  2020-01-21 11:35:59.756 30 INFO networking_ovn.ovsdb.ovsdb_monitor [req-1906156e-a089-4bde-b9bc-c9f4f9655a3d - - - - -] XXX chassis update: 36916
  2020-01-21 11:35:59.778 29 INFO networking_ovn.ovsdb.ovsdb_monitor [req-072386aa-87e9-486c-bb6f-3dd2bdc038bd - - - - -] XXX chassis update: 36916

  IMO, we need to space the bump of nb_cfg [2] and the check [3] in time,
  as the NB_Global change needs to be propagated to the SB, processed
  by all agents, and then sent back to neutron-server, which needs to
  process the JSON and update its internal tables. So even if it's fast,
  most of the time it is not fast enough.
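
  A minimal sketch of spacing the bump and the check in time, assuming
  hypothetical bump() and check() callables and a delay chosen by the
  operator. None of these names come from networking-ovn, and the report
  does not say how long the delay would need to be.

```python
import time


def bump_then_check(bump, check, delay_seconds, sleep=time.sleep):
    """Bump nb_cfg, wait for propagation, then compare (hypothetical helper).

    bump() must return the new expected nb_cfg; check(expected) must return
    True when the chassis has caught up. sleep is injectable for testing.
    """
    expected = bump()
    # Give northd, the agents and the server's event handling time to run.
    sleep(delay_seconds)
    return check(expected)
```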

  Another solution is to allow a difference of '1' to update timestamps.
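
  The relaxed comparison could look roughly like the sketch below.
  is_agent_alive and its tolerance parameter are hypothetical names used
  for illustration, not the actual patch.

```python
def is_agent_alive(expected_nb_cfg, chassis_nb_cfg, tolerance=1):
    """Treat a chassis as up to date if it lags by at most `tolerance`.

    Hypothetical helper: a lag of 0 is the strict check, a lag of 1 covers
    the case where the latest bump has not been propagated back yet.
    """
    lag = expected_nb_cfg - chassis_nb_cfg
    return 0 <= lag <= tolerance
```

  With the values from the logs above (global nb_cfg = 36916, chassis
  nb_cfg = 36915), the strict check (tolerance=0) fails while a tolerance
  of 1 accepts the chassis.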
   

  [0] https://opendev.org/openstack/networking-ovn/src/branch/master/networking_ovn/ml2/mech_driver.py#L1093
  [1] https://opendev.org/openstack/networking-ovn/src/branch/master/networking_ovn/ml2/mech_driver.py#L1098
  [2] https://github.com/openstack/networking-ovn/blob/bf577e5a999f7db4cb9b790664ad596e1926d9a0/networking_ovn/ml2/mech_driver.py#L988
  [3] https://github.com/openstack/networking-ovn/blob/6302298e9c4313f1200c543c89d92629daff9e89/networking_ovn/ovsdb/ovsdb_monitor.py#L74

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1860436/+subscriptions

