yahoo-eng-team team mailing list archive

[Bug 2017748] Re: [SRU] OVN: ovnmeta namespaces missing during scalability test causing DHCP issues

 

This bug was fixed in the package neutron - 2:20.5.0-0ubuntu2.1~cloud0
---------------

 neutron (2:20.5.0-0ubuntu2.1~cloud0) focal-yoga; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 neutron (2:20.5.0-0ubuntu2.1) jammy; urgency=medium
 .
   * Under heavy load, OVN metadata notifications can be held up
     leading to ovsdb-server merging insert and update notifications.
     This can lead to metadata port being missing for some VMs which
     breaks connectivity, e.g. missing DHCP leases. (LP: #2017748)
     - d/p/lp2017748-handle-creation-of-Port_Binding-with-chassis-set.patch


** Changed in: cloud-archive/yoga
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2017748

Title:
  [SRU] OVN:  ovnmeta namespaces missing during scalability test causing
  DHCP issues

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive antelope series:
  Won't Fix
Status in Ubuntu Cloud Archive bobcat series:
  Won't Fix
Status in Ubuntu Cloud Archive caracal series:
  Fix Released
Status in Ubuntu Cloud Archive dalmatian series:
  Fix Released
Status in Ubuntu Cloud Archive epoxy series:
  Fix Released
Status in Ubuntu Cloud Archive yoga series:
  Fix Released
Status in Ubuntu Cloud Archive zed series:
  Won't Fix
Status in neutron:
  New
Status in neutron ussuri series:
  Fix Released
Status in neutron victoria series:
  New
Status in neutron wallaby series:
  New
Status in neutron xena series:
  New
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Focal:
  In Progress
Status in neutron source package in Jammy:
  Fix Released
Status in neutron source package in Noble:
  Fix Released
Status in neutron source package in Oracular:
  Fix Released
Status in neutron source package in Plucky:
  Fix Released

Bug description:
  [Impact]

  During scalability tests where extreme load is generated by creating thousands
  of VMs all at the same time, some VMs fail to get a DHCP lease and cannot be
  pinged or sshed to after deployment.

  The ovnmeta namespaces for networks that the VMs were created in are missing.
  The following lines are present in neutron-ovn-metadata-agent.log:

  2024-02-29 03:33:18.297 1080866 INFO neutron.agent.ovn.metadata.agent [-] Port 9a75c431-42c4-47bf-af0d-22e0d5ee11a8 in datapath 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 bound to our chassis
  2024-02-29 03:33:18.306 1080866 DEBUG neutron.agent.ovn.metadata.agent [-] There is no metadata port for network 3be5f44d-39de-4c38-a77f-06c0d9ee42b0 or it has no MAC or IP addresses configured, tearing the namespace down if needed _get_provision_params /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:494

  Under extreme load, the metadata port information has sometimes not yet been
  propagated by OVN to the Southbound database; this normally arrives as an
  update notification. When PortBindingChassisEvent is triggered in
  ovn-metadata-agent, it only looks for update notifications. Finding none, it
  has no metadata port or IP information, so it fails, logs the message above,
  and tears down the ovnmeta namespace for that VM.

  Eventually ovsdb-server catches up, merges the insert and update
  notifications, and sends them out as an insert notification, which
  PortBindingChassisEvent currently ignores, so the metadata is never applied
  to the VM.

  This is a race condition. It doesn't happen under normal conditions, as the
  metadata would simply be delivered as an update notification.

  The fix is to also listen for insert notifications, and act on them.
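
  As a rough illustration of the behaviour described above, the following
  sketch models the decision the agent has to make for each Port_Binding
  notification. It is not the neutron patch: PortBindingRow, should_provision,
  the handle_create flag and the event name strings are all hypothetical names
  invented for this example. It only shows why a row that arrives as an insert
  with the chassis already set is missed unless insert events are handled.

```python
from dataclasses import dataclass
from typing import Optional

# Event type names chosen for this sketch; not taken from the neutron patch.
ROW_CREATE = "create"
ROW_UPDATE = "update"


@dataclass
class PortBindingRow:
    """Hypothetical stand-in for a Southbound Port_Binding row."""
    logical_port: str
    chassis: Optional[str]


def should_provision(event: str, row: PortBindingRow,
                     old_chassis: Optional[str], our_chassis: str,
                     handle_create: bool = True) -> bool:
    """Return True if the agent should provision metadata for this row."""
    if row.chassis != our_chassis:
        return False
    if event == ROW_UPDATE:
        # Usual path: the chassis column changed to our chassis.
        return old_chassis != our_chassis
    if event == ROW_CREATE:
        # Merged insert+update: the row arrives already bound to our chassis.
        # With handle_create=False (the old behaviour) this case is ignored.
        return handle_create
    return False
```

  With handle_create=False the merged insert is dropped, which corresponds to
  the failure mode described above; with the default it is acted on.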

  [Test Case]

  This can't be reproduced in the lab, even after many attempts.

  A user sees this issue daily in production, where they run a scalability test
  every night, in which they create a new tenant, create all necessary resources
  (networks, subnets, routers, load balancers, etc.) and start several thousand
  VMs. They then audit the deployment and verify that everything deployed
  correctly.

  Most days there are a small number of VMs that are unreachable, and those VMs
  have the following messages in neutron-ovn-metadata-agent.log:

  2024-02-29 03:33:18.306 1080866 DEBUG neutron.agent.ovn.metadata.agent
  [-] There is no metadata port for network
  3be5f44d-39de-4c38-a77f-06c0d9ee42b0 or it has no MAC or IP addresses
  configured, tearing the namespace down if needed _get_provision_params
  /usr/lib/python3/dist-packages/neutron/agent/ovn/metadata/agent.py:494
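
  When auditing a run, the network IDs named in these messages can be pulled
  out of the log with a short script. The sketch below is only an assumption
  about how one might do that: affected_networks and NO_METADATA_PORT are
  hypothetical names, and it assumes each log entry is on a single line as in
  the agent log itself. The same DEBUG message can also be logged for networks
  that legitimately have no metadata port, so matches are candidates to check,
  not proof of this bug.

```python
import re

# Matches the agent's "no metadata port" message and captures the network UUID.
NO_METADATA_PORT = re.compile(
    r"There is no metadata port for network\s+"
    r"(?P<network>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
)


def affected_networks(lines):
    """Return the sorted, de-duplicated network UUIDs named in matching lines."""
    found = set()
    for line in lines:
        match = NO_METADATA_PORT.search(line)
        if match:
            found.add(match.group("network"))
    return sorted(found)
```

  For example, passing an open file handle for neutron-ovn-metadata-agent.log
  to affected_networks returns the list of networks to inspect.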

  There are test packages available in:

  https://launchpad.net/~mruffell/+archive/ubuntu/sf375454-updates

  Some previous test packages have been running in the user's test environment for
  several months, with zero metadata namespace issues since rollout. We issued
  the user a hotfix and it has been running in production for the past month
  and they have also had zero metadata namespace issues since rollout.

  When this enters -proposed, it will be verified in the user's production
  environment and subject to their nightly runs of their scalability tests, with
  the results collected after a week or so of runs. After that we should be
  confident the -proposed packages fix the issue.

  Additionally, runs will be done with charmed-openstack-tester between
  -updates and -proposed to see if there are any differences in test
  execution.

  [Where problems could occur]

  We are changing ovn-metadata-agent in neutron, and any issues would be limited
  to ovn-metadata-agent only. ovn-metadata-agent will now listen for both
  insert and update notifications from ovsdb-server, instead of only update
  notifications as before. It shouldn't impact any existing functionality.

  If a regression were to occur, it would affect attaching metadata namespaces to
  newly created VMs, preventing them from getting their initial metadata URL /
  DHCP lease / IP address information and causing connectivity issues for
  those VMs. It shouldn't impact any existing VMs.

  There are no workarounds if a regression were to occur, other than to downgrade
  the package.

  [Other info]

  This was fixed upstream by:

  commit a641e8aec09c1e33a15a34b19d92675ed2c85682
  From: Terry Wilson <twilson@xxxxxxxxxx>
  Date: Fri, 15 Dec 2023 21:00:43 +0000
  Subject: Handle creation of Port_Binding with chassis set
  Link: https://opendev.org/openstack/neutron/commit/a641e8aec09c1e33a15a34b19d92675ed2c85682

  This patch landed in Caracal. It is also applicable to Zed, Antelope and
  Bobcat, but it depends on the following commit:

  commit 6801589510242affc78497660d34377603774074
  From: Jakub Libosvar <libosvar@xxxxxxxxxx>
  Date: Thu, 21 Sep 2023 19:40:36 +0000
  Subject: ovn-metadata: Refactor events
  Link: https://opendev.org/openstack/neutron/commit/6801589510242affc78497660d34377603774074

  After some discussion, we (mruffell, brian-haley, hopem) decided that it would
  be too much of a regression risk to backport "ovn-metadata: Refactor events"
  to Zed, Antelope and Bobcat, so we marked those series "Won't Fix".

  The user is on yoga, so Brian Haley wrote a new backport that does not
  depend on "ovn-metadata: Refactor events". It is the following commit in
  neutron yoga:

  commit 952e960414e7c15d4d4351bf2300ce53a69e4051
  From: Terry Wilson <twilson@xxxxxxxxxx>
  Date: Tue, 20 Aug 2024 10:20:52 -0500
  Subject: Handle creation of Port_Binding with chassis set
  Link: https://opendev.org/openstack/neutron/commit/952e960414e7c15d4d4351bf2300ce53a69e4051

  This is what we are suggesting for SRU to jammy / yoga.

  There is a low chance of an upgrade regression for users going from yoga -> zed
  -> antelope -> bobcat -> caracal (fixed), because users are unlikely to run
  heavy stress tests during a series upgrade and would more likely run them
  once they land on caracal.

  If we have to, we will consider zed, antelope and bobcat in the future, but
  for now, yoga only.

  == ORIGINAL DESCRIPTION ==

  Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=2187650

  During a scalability test it was noted that a few VMs were having
  issues being pinged (2 out of ~5000 VMs in the test conducted). After
  some investigation it was found that the VMs in question did not
  receive a DHCP lease:

  udhcpc: no lease, failing
  FAIL
  checking http://169.254.169.254/2009-04-04/instance-id
  failed 1/20: up 181.90. request failed

  And the ovnmeta- namespaces for the networks that the VMs were booting
  from were missing. Looking into the ovn-metadata-agent.log:

  2023-04-18 06:56:09.864 353474 DEBUG neutron.agent.ovn.metadata.agent
  [-] There is no metadata port for network
  9029c393-5c40-4bf2-beec-27413417eafa or it has no MAC or IP addresses
  configured, tearing the namespace down if needed _get_provision_params
  /usr/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py:495

  Apparently, when the system is under stress (scalability tests) there
  are some edge cases where the metadata port information has not yet
  been propagated by OVN to the Southbound database. When the
  PortBindingChassisEvent event is handled and tries to find either
  the metadata port or the IP information on it (which is updated by
  ML2/OVN during subnet creation), it cannot be found and the agent fails
  silently with the error shown above.

  Note that running the same tests with less concurrency did not trigger
  this issue, so it only happens when the system is overloaded.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/2017748/+subscriptions


