yahoo-eng-team team mailing list archive

[Bug 1815989] Re: OVS drops RARP packets by QEMU upon live-migration causes up to 40s ping pause in Rocky

 

Reviewed:  https://review.opendev.org/c/openstack/nova/+/602432
Committed: https://opendev.org/openstack/nova/commit/a62dd42c0dbb6b2ab128e558e127d76962738446
Submitter: "Zuul (22348)"
Branch:    master

commit a62dd42c0dbb6b2ab128e558e127d76962738446
Author: Stephen Finucane <stephenfin@xxxxxxxxxx>
Date:   Fri Apr 30 12:51:35 2021 +0100

    libvirt: Delegate OVS plug to os-vif
    
    os-vif 1.15.0 added the ability to create an OVS port during plugging
    by specifying the 'create_port' attribute in the 'port_profile' field.
    By delegating port creation to os-vif, we can rely on its 'isolate_vif'
    config option [1] that will temporarily configure the VLAN to 4095
    (0xfff), which is reserved for implementation use [2] and is used by
    neutron as a dead VLAN [3]. By doing this, we ensure VIFs are plugged
    securely, preventing guests from accessing other tenants' networks
    before the neutron OVS agent can wire up the port.
    
    This change requires a little dance as part of the live migration flow.
    Since we can't be certain the destination host has a version of os-vif
    that supports this feature, we need to use a sentinel to indicate when
    it does. Typically we would do so with a field in
    'LibvirtLiveMigrateData', such as the 'src_supports_numa_live_migration'
    and 'dst_supports_numa_live_migration' fields used to indicate support
    for NUMA-aware live migration. However, doing this prevents us
    backporting this important fix since o.vo changes are not backportable.
    Instead, we (somewhat evilly) rely on the free-form nature of the
    'VIFMigrateData.profile_json' string field, which stores JSON blobs and
    is included in 'LibvirtLiveMigrateData' via the 'vifs' attribute, to
    transport this sentinel. This is a hack but is necessary to work around
    the lack of a free-form "capabilities" style dict that would allow us to
    do backportable fixes to live migration features.
    
    Note that this change has the knock-on effect of modifying the XML
    generated for OVS ports: when hybrid plug is false, they will now be of
    type 'ethernet' rather than 'bridge' as before. This explains the larger
    than expected test damage but should not affect users.
    
    [1] https://opendev.org/openstack/os-vif/src/tag/2.4.0/vif_plug_ovs/ovs.py#L90-L93
    [2] https://en.wikipedia.org/wiki/IEEE_802.1Q#Frame_format
    [3] https://answers.launchpad.net/neutron/+question/231806
    
    Change-Id: I11fb5d3ada7f27b39c183157ea73c8b72b4e672e
    Depends-On: Id12486b3127ab4ac8ad9ef2b3641da1b79a25a50
    Closes-Bug: #1734320
    Closes-Bug: #1815989
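
The sentinel approach described in the commit message can be illustrated with a minimal sketch. This is not the code from the nova change: the key name `dst_supports_os_vif_delegation` and both helper functions are hypothetical, chosen only to show how a boolean flag could travel inside a free-form JSON string field such as 'VIFMigrateData.profile_json' without altering the versioned object's schema.

```python
import json

# Hypothetical key name, used for illustration only; the real change may
# use a different key.
SENTINEL_KEY = "dst_supports_os_vif_delegation"


def mark_delegation_supported(profile_json: str) -> str:
    """Return a copy of the JSON blob with the sentinel flag set."""
    profile = json.loads(profile_json) if profile_json else {}
    profile[SENTINEL_KEY] = True
    return json.dumps(profile)


def delegation_supported(profile_json: str) -> bool:
    """Report whether the peer set the sentinel flag in the JSON blob."""
    profile = json.loads(profile_json) if profile_json else {}
    return bool(profile.get(SENTINEL_KEY, False))


# A host that does not know about the feature leaves the blob untouched,
# so the flag is simply absent and the check falls back to False.
old_host_blob = json.dumps({"some_existing_key": "value"})
new_host_blob = mark_delegation_supported(old_host_blob)

print(delegation_supported(old_host_blob))
print(delegation_supported(new_host_blob))
```

Because an absent key reads as "not supported", a host running older code needs no changes, which is what would make such a flag usable in a backport.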


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1815989

Title:
  OVS drops RARP packets by QEMU upon live-migration causes up to 40s
  ping pause in Rocky

Status in neutron:
  In Progress
Status in OpenStack Compute (nova):
  Fix Released
Status in os-vif:
  Invalid

Bug description:
  This issue is well known, and there were previous attempts to fix it,
  like this one

  https://bugs.launchpad.net/neutron/+bug/1414559

  
  This issue still exists in Rocky and gets worse. In Rocky, nova compute, nova libvirt and neutron ovs agent all run inside containers.

  So far the only simple fix I have is to increase the number of RARP
  packets QEMU sends after live-migration from 5 to 10. To be complete,
  the nova change (not merged) proposed in the above-mentioned activity
  does not work.

  I am creating this ticket hoping to get up-to-date (for Rocky and
  onwards) expert advice on how to fix this in nova-neutron.

  
  For the record, below are the timestamps from my test, comparing when the neutron ovs agent "activates" the VM port with when the RARP packets are seen by tcpdump on the compute node. 10 RARP packets are sent by the (recompiled) QEMU, 7 are seen by tcpdump, and the 2nd last packet barely made it through.

  openvswitch-agent.log:

  2019-02-14 19:00:13.568 73453 INFO
  neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent
  [req-26129036-b514-4fa0-a39f-a6b21de17bb9 - - - - -] Port
  57d0c265-d971-404d-922d-963c8263e6eb updated. Details: {'profile': {},
  'network_qos_policy_id': None, 'qos_policy_id': None,
  'allowed_address_pairs': [], 'admin_state_up': True, 'network_id':
  '1bf4b8e0-9299-485b-80b0-52e18e7b9b42', 'segmentation_id': 648,
  'fixed_ips': [

  {'subnet_id': 'b7c09e83-f16f-4d4e-a31a-e33a922c0bac', 'ip_address': '10.0.1.4'}
  ], 'device_owner': u'compute:nova', 'physical_network': u'physnet0', 'mac_address': 'fa:16:3e:de:af:47', 'device': u'57d0c265-d971-404d-922d-963c8263e6eb', 'port_security_enabled': True, 'port_id': '57d0c265-d971-404d-922d-963c8263e6eb', 'network_type': u'vlan', 'security_groups': [u'5f2175d7-c2c1-49fd-9d05-3a8de3846b9c']}
  2019-02-14 19:00:13.568 73453 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-26129036-b514-4fa0-a39f-a6b21de17bb9 - - - - -] Assigning 4 as local vlan for net-id=1bf4b8e0-9299-485b-80b0-52e18e7b9b42

   
  tcpdump for rarp packets:

  [root@overcloud-ovscompute-overcloud-0 nova]# tcpdump -i any rarp -nev
  tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes

  19:00:10.788220 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46
  19:00:11.138216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46
  19:00:11.588216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46
  19:00:12.138217 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46
  19:00:12.788216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46
  19:00:13.538216 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46
  19:00:14.388320 B fa:16:3e:de:af:47 ethertype Reverse ARP (0x8035), length 62: Ethernet (len 6), IPv4 (len 4), Reverse Request who-is fa:16:3e:de:af:47 tell fa:16:3e:de:af:47, length 46
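
To make the timing easier to read, the following sketch computes how far each captured RARP packet is from the agent's "Port ... updated" log line. It uses only the timestamps shown above. It assumes that the agent log and tcpdump share the same clock, and that the log line is a reasonable proxy for when the agent started processing the port; neither assumption is confirmed by the report, and the script does not determine which packets were actually forwarded.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S.%f"

# Time of the "Port ... updated" line in openvswitch-agent.log above.
agent_update = datetime.strptime("19:00:13.568", FMT)

# Timestamps of the seven RARP packets captured by tcpdump above.
rarp_seen = [
    "19:00:10.788220",
    "19:00:11.138216",
    "19:00:11.588216",
    "19:00:12.138217",
    "19:00:12.788216",
    "19:00:13.538216",
    "19:00:14.388320",
]


def offsets_ms(packets, reference):
    """Return each packet's offset from the reference time in milliseconds."""
    return [
        (datetime.strptime(p, FMT) - reference) / timedelta(milliseconds=1)
        for p in packets
    ]


offsets = offsets_ms(rarp_seen, agent_update)
before = [o for o in offsets if o < 0]
after = [o for o in offsets if o >= 0]

for stamp, off in zip(rarp_seen, offsets):
    print(f"{stamp}  {off:+9.1f} ms relative to the agent log line")
print(f"{len(before)} captured before the log line, {len(after)} after")
```

Under those assumptions, only the last captured packet falls after the agent log line, and the one before it precedes the log line by roughly 30 ms.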

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1815989/+subscriptions

