yahoo-eng-team team mailing list archive
  
    Message #95049
  
 [Bug 2017748] Re: [SRU] OVN: ovnmeta namespaces missing during scalability test causing DHCP issues
  
Hi Mauricio,
I have uploaded a new debdiff to address your comment. It now uses
version 2:20.5.0-0ubuntu2.1 and adds DEP-3 headers. As for the 'Test
Case' section, I have been trying to find a reproducer for the past two
weeks, but all attempts have failed. It is quite difficult to reproduce
the behavior where ovsdb merges insert and update notifications.
The essence of this fix patch is to address the problem through a retry
mechanism.
Without this fix patch:
1, MetadataAgent#start calls sync, and then provision_datapath is called to trigger the initial creation of the metadata namespace.
2, Only a ROW_UPDATE event matched by PortBindingChassisCreatedEvent can trigger provision_datapath a second time.
3, No further ROW_UPDATE event matched by PortBindingChassisCreatedEvent can trigger provision_datapath a third time, because of the 'not old.chassis' condition discussed in comment [2].
class PortBindingChassisCreatedEvent(PortBindingChassisEvent):
    def __init__(self, metadata_agent):
        events = (self.ROW_UPDATE,)
        super(PortBindingChassisCreatedEvent, self).__init__(metadata_agent, events)

    def match_fn(self, event, row, old):
        return (row.chassis[0].name == self.agent.chassis and not old.chassis)

As said in comment[2]:
What this means is that our PortBindingUpdatedEvent (or
PortBindingChassisCreatedEvent) which looks for "update" events don't
fire when we get a Port_Binding "create" that has the chassis field set.
Yes, that's because match_fn includes the condition 'not old.chassis', as the following steps show:
1, openstack port create --network private --fixed-ip subnet=private_subnet $PORT_NAME
When creating a port, we can see an insert event via 'ovsdb monitor', but nothing appears in the neutron-metadata-agent log, since PortBindingChassisCreatedEvent doesn't monitor ROW_INSERT.
2, openstack server add port cirros-0.4.0-054427 $PORT_NAME
When adding a port to a VM, we can see an update event via 'ovsdb monitor' and log entries from neutron-metadata-agent as well, because it meets the condition (row.chassis[0].name == self.agent.chassis and not old.chassis).
3, openstack port set $PORT_NAME --fixed-ip subnet=private_subnet,ip-address=192.168.21.$((RANDOM % 255 + 1))
When updating a port's IP, we can see an update event via 'ovsdb monitor', but nothing appears in the neutron-metadata-agent log, because it doesn't meet the condition (and not old.chassis).
In other words, the current condition (ROW_UPDATE and
row.chassis[0].name == self.agent.chassis and not old.chassis) only
gives provision_datapath one chance to run. If an issue occurs with
ovsdb at that moment, subsequent ovsdb update events like step 3 above
will not give provision_datapath another chance to run.
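To make the three steps above concrete, here is a minimal sketch that replays them against a simplified model of the current condition. It does not use ovsdbapp; the PortBinding and Chassis classes, the event name strings, and the host name "compute-1" are hypothetical stand-ins, and step 3 assumes (as described above) that the old row still has its chassis set.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Placeholder event names for this sketch, not the ovsdbapp constants.
ROW_INSERT = "insert"
ROW_UPDATE = "update"


@dataclass
class Chassis:
    name: str


@dataclass
class PortBinding:
    """Hypothetical stand-in for a Port_Binding row."""
    chassis: List[Chassis] = field(default_factory=list)


def matches_current(event: str, row: PortBinding,
                    old: Optional[PortBinding], my_chassis: str) -> bool:
    """Model of the pre-patch condition: ROW_UPDATE only, chassis newly set."""
    if event != ROW_UPDATE:
        return False
    if not row.chassis or old is None:
        return False
    return row.chassis[0].name == my_chassis and not old.chassis


HOST = "compute-1"  # hypothetical chassis name
bound = PortBinding(chassis=[Chassis(HOST)])
unbound = PortBinding()

# (label, event, new row, old row) for steps 1 to 3 above.
steps = [
    ("port create", ROW_INSERT, unbound, None),
    ("server add port", ROW_UPDATE, bound, unbound),
    ("port set --fixed-ip", ROW_UPDATE, bound, bound),
]
results = [matches_current(ev, row, old, HOST) for _, ev, row, old in steps]
for (label, *_), fired in zip(steps, results):
    print(f"{label}: provision_datapath triggered = {fired}")
```

In this model only the second step matches, which is the single chance described above.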
The current fix patch [1] changes the condition from
ROW_UPDATE and row.chassis[0].name == self.agent.chassis and not
old.chassis
to
(ROW_INSERT or ROW_UPDATE) and row.chassis[0].name == self.agent.chassis
and not old.chassis
Thus, now provision_datapath has two chances to run, making this patch
act as a retry mechanism, and of course it can solve the problem.
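As a rough illustration of the changed condition, the sketch below compares a ROW_UPDATE-only matcher with one that also accepts ROW_INSERT, for the case from comment [2] where the row is created with the chassis already set. This is a simplified model and not the code from [1]: make_match, the event name strings, and the host name are hypothetical, and it assumes an insert carries no previous row, so the old chassis is treated as unset.

```python
from types import SimpleNamespace

# Placeholder event names for this sketch, not the ovsdbapp constants.
ROW_INSERT, ROW_UPDATE = "insert", "update"


def make_match(events):
    """Build a matcher that only considers the given event types."""
    def match(event, row, old, my_chassis):
        if event not in events:
            return False
        if not row.chassis:
            return False
        # Assumption: an insert has no previous row, so old may be None
        # and its chassis is then treated as unset.
        old_chassis = getattr(old, "chassis", None)
        return row.chassis[0].name == my_chassis and not old_chassis
    return match


before_patch = make_match((ROW_UPDATE,))
after_patch = make_match((ROW_INSERT, ROW_UPDATE))

HOST = "compute-1"  # hypothetical chassis name
bound = SimpleNamespace(chassis=[SimpleNamespace(name=HOST)])
unbound = SimpleNamespace(chassis=[])

# Merged notification: the row arrives as an insert with chassis already set.
merged = (ROW_INSERT, bound, None, HOST)
before_result = before_patch(*merged)
after_result = after_patch(*merged)
print("before patch:", before_result, "after patch:", after_result)

# The ordinary update path still matches with the wider event set.
update_result = after_patch(ROW_UPDATE, bound, unbound, HOST)
```

Under these assumptions the merged insert is ignored by the old matcher and picked up by the new one, while the normal update path keeps working.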
Anyway, since I cannot reproduce the problem, I can only say that, in
theory, this patch can solve the problem as described above. Even if it
cannot solve the problem, it is theoretically harmless, and we have also
verified it with charmed-openstack-tester. I hope this helps.
Thanks a lot.
[1] https://review.opendev.org/c/openstack/neutron/+/926656
[2] https://bugs.launchpad.net/ubuntu/+source/neutron/+bug/2017748/comments/6
** Changed in: neutron (Ubuntu Jammy)
       Status: Incomplete => In Progress
** Changed in: neutron (Ubuntu Focal)
       Status: Won't Fix => In Progress
-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2017748
Title:
  [SRU] OVN:  ovnmeta namespaces missing during scalability test causing
  DHCP issues
Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive antelope series:
  New
Status in Ubuntu Cloud Archive bobcat series:
  New
Status in Ubuntu Cloud Archive caracal series:
  Fix Released
Status in Ubuntu Cloud Archive dalmation series:
  Fix Released
Status in Ubuntu Cloud Archive epoxy series:
  Fix Released
Status in Ubuntu Cloud Archive yoga series:
  New
Status in Ubuntu Cloud Archive zed series:
  Won't Fix
Status in neutron:
  New
Status in neutron ussuri series:
  Fix Released
Status in neutron victoria series:
  New
Status in neutron wallaby series:
  New
Status in neutron xena series:
  New
Status in neutron package in Ubuntu:
  Fix Released
Status in neutron source package in Focal:
  In Progress
Status in neutron source package in Jammy:
  In Progress
Status in neutron source package in Noble:
  Fix Released
Status in neutron source package in Oracular:
  Fix Released
Status in neutron source package in Plucky:
  Fix Released
Bug description:
  [Impact]
  ovnmeta- namespaces are intermittently missing, and the affected VMs then cannot be reached
  [Test Case]
  Not able to reproduce this easily, so I ran charmed-openstack-tester; the result is below:
  ======                                                                                     
  Totals                                                                                     
  ======                                                                                     
  Ran: 469 tests in 4273.6309 sec.                                                           
   - Passed: 398                                                                             
   - Skipped: 69                                                                             
   - Expected Fail: 0                                                                        
   - Unexpected Success: 0                                                                   
   - Failed: 2                                                                               
  Sum of execute time for each test: 4387.2727 sec. 
  The 2 failed tests
  (tempest.api.object_storage.test_account_quotas.AccountQuotasTest and
  octavia_tempest_plugin.tests.scenario.v2.test_traffic_ops.TrafficOperationsScenarioTest)
  are not related to the fix.
  [Where problems could occur]
  These patches affect the OVN metadata agent on compute nodes.
  VM connectivity could be affected by this patch when OVN is used.
  Binding a port to a datapath could be affected.
  [Others]
  == ORIGINAL DESCRIPTION ==
  Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=2187650
  During a scalability test it was noted that a few VMs were having
  issues being pinged (2 out of ~5000 VMs in the test conducted). After
  some investigation it was found that the VMs in question did not
  receive a DHCP lease:
  udhcpc: no lease, failing
  FAIL
  checking http://169.254.169.254/2009-04-04/instance-id
  failed 1/20: up 181.90. request failed
  And the ovnmeta- namespaces for the networks that the VMs were booting
  from were missing. Looking into the ovn-metadata-agent.log:
  2023-04-18 06:56:09.864 353474 DEBUG neutron.agent.ovn.metadata.agent
  [-] There is no metadata port for network
  9029c393-5c40-4bf2-beec-27413417eafa or it has no MAC or IP addresses
  configured, tearing the namespace down if needed _get_provision_params
  /usr/lib/python3.9/site-packages/neutron/agent/ovn/metadata/agent.py:495
  Apparently, when the system is under stress (scalability tests) there
  are some edge cases where the metadata port information has not yet
  been propagated by OVN to the Southbound database. When the
  PortBindingChassisEvent event is handled and tries to find either
  the metadata port or the IP information on it (which is updated by
  ML2/OVN during subnet creation), it cannot be found and the handler
  fails silently with the error shown above.
  Note that running the same tests with less concurrency did not
  trigger this issue, so it only happens when the system is overloaded.
To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/2017748/+subscriptions