
yahoo-eng-team team mailing list archive

[Bug 1702635] [NEW] SR-IOV: sometimes a port may hang in BUILD state

 

Public bug reported:

Scenario:

1) The vfio-pci driver is used for VFs.
2) Two ports are created in neutron with binding:vnic_type 'direct'.
3) VMs are spawned and deleted on two compute nodes using the
   pre-created ports.
4) A given neutron port may therefore be bound to different compute
   nodes at different moments.
5) For some reason (probably a bug, but not the one reported here),
   vfio-pci does not properly handle VF reset after VM deletion. To the
   sriov agent it looks as if the port's MAC is still mapped to a PCI
   slot on the node, although the port is no longer bound to that node.
6) The sriov agent requests port info from the server with
   get_devices_details_list() but does not pass 'host' in the
   parameters.
7) In this case the neutron server sets the port to BUILD, even though
   it may be bound to another host:

    def _get_new_status(self, host, port_context):
        port = port_context.current
        if not host or host == port_context.host:
            new_status = (n_const.PORT_STATUS_BUILD if port['admin_state_up']
                          else n_const.PORT_STATUS_DOWN)
            if port['status'] != new_status:
                return new_status

8) After processing, the agent notifies the server with
   update_device_list(), and this time it does pass the 'host'
   parameter.
9) The server detects the mismatch between the port's host and the
   agent's host and does not update the port's status.
10) The port stays in BUILD state.

A simple fix would be to pass the host at step 6. The server would then
see the host mismatch and would not set the port's status to BUILD.
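
The sequence in steps 6-10 and the effect of the proposed fix can be
reproduced with a small self-contained model. This is only a sketch, not
neutron code: FakeServer, run_agent and the dict-based port records are
hypothetical stand-ins, and only the condition inside _get_new_status is
taken from the snippet above. The host check in update_device_list is an
assumption that models the behaviour described in step 9.

```python
PORT_STATUS_BUILD = "BUILD"
PORT_STATUS_DOWN = "DOWN"
PORT_STATUS_ACTIVE = "ACTIVE"


class FakeServer:
    """Toy stand-in for the server side of the two RPC calls (hypothetical)."""

    def __init__(self, ports):
        # ports: device id -> dict with 'host', 'status', 'admin_state_up'
        self.ports = ports

    def _get_new_status(self, host, port):
        # Same condition as in the report: a missing host passes the check,
        # so the port is treated as if it were bound to the calling agent.
        if not host or host == port["host"]:
            new_status = (PORT_STATUS_BUILD if port["admin_state_up"]
                          else PORT_STATUS_DOWN)
            if port["status"] != new_status:
                return new_status
        return None

    def get_devices_details_list(self, devices, host=None):
        # Steps 6-7: look up each device and possibly change its status.
        for device in devices:
            port = self.ports[device]
            new_status = self._get_new_status(host, port)
            if new_status:
                port["status"] = new_status
        return [dict(self.ports[d], device=d) for d in devices]

    def update_device_list(self, devices_up, host):
        # Steps 8-9 (assumed model): on a host mismatch the status is
        # left untouched, so a port set to BUILD earlier stays in BUILD.
        for device in devices_up:
            port = self.ports[device]
            if port["host"] != host:
                continue
            port["status"] = PORT_STATUS_ACTIVE


def run_agent(server, agent_host, stale_devices, send_host):
    """Replay steps 6 and 8 for devices the agent wrongly believes are local."""
    host = agent_host if send_host else None
    server.get_devices_details_list(stale_devices, host=host)
    server.update_device_list(stale_devices, host=agent_host)


if __name__ == "__main__":
    for send_host in (False, True):
        # The port is really bound to node-2; the agent runs on node-1.
        ports = {"port-a": {"host": "node-2",
                            "status": PORT_STATUS_ACTIVE,
                            "admin_state_up": True}}
        run_agent(FakeServer(ports), "node-1", ["port-a"], send_host)
        print(send_host, ports["port-a"]["status"])
```

In this model, omitting the host at step 6 leaves the port in BUILD,
while passing it leaves the port's status as it was on the node it is
really bound to.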

** Affects: neutron
     Importance: Medium
     Assignee: Oleg Bondarev (obondarev)
         Status: Confirmed


** Tags: sriov-pci-pt

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1702635

Title:
  SR-IOV: sometimes a port may hang in BUILD state

Status in neutron:
  Confirmed

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1702635/+subscriptions

