
yahoo-eng-team team mailing list archive

[Bug 1822884] [NEW] live migration fails due to port binding duplicate key entry in post_live_migrate

 

Public bug reported:

We are converting a site from RDO to OSA. At this stage all control
nodes and network nodes are running OSA (Rocky), some compute nodes are
running RDO (Queens), some are running RDO (Rocky), and the remaining
are running OSA (Rocky).

We are attempting to Live Migrate VMs from the RDO (Rocky) nodes to OSA
(Rocky) before reinstalling the RDO nodes as OSA (Rocky).

When Live Migrating between RDO nodes we see no issues; similarly, when
migrating between OSA nodes we see no issues. However, Live Migrating
RDO -> OSA fails with the below error on the target.

2019-01-24 13:33:11.701 85926 INFO nova.network.neutronv2.api [req-3c4ceac0-7c12-428d-9f63-7c1d6879be4e a3bee416cf67420995855d602d2bccd3 a564613210ee43708b8a7fc6274ebd63 - default default] [instance: 8ecbfc14-2699-4276-a577-18ed6a662232] Updating port 419c18e1-05c0-44c3-9a97-8334d0c15cc1 with attributes {'binding:profile': {}, 'binding:host_id': 'cc-compute24-kna1'}
2019-01-24 13:33:59.357 85926 ERROR nova.network.neutronv2.api [req-3c4ceac0-7c12-428d-9f63-7c1d6879be4e a3bee416cf67420995855d602d2bccd3 a564613210ee43708b8a7fc6274ebd63 - default default] [instance: 8ecbfc14-2699-4276-a577-18ed6a662232] Unable to update binding details for port 419c18e1-05c0-44c3-9a97-8334d0c15cc1: InternalServerError: Request Failed: internal server error while processing your request.

Digging further into the logs reveals an issue with duplicate keys:

2019-02-01 09:48:10.268 11854 ERROR oslo_db.api [req-152bce20-8895-4238-910c-b26fde44913d e7bbce5e15994104bdf5e3af68a55b31 a894e8109af3430aa7ae03e0c49a0aa0 - default default] DB exceeded retry limit.: DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, u"Duplicate entry '5bedceef-1b65-4b19-a6d4-da3595adaf61-cc-compute24-kna1' for key 'PRIMARY'") [SQL: u'UPDATE ml2_port_bindings SET host=%(host)s, profile=%(profile)s, vif_type=%(vif_type)s, vif_details=%(vif_details)s WHERE ml2_port_bindings.port_id = %(ml2_port_bindings_port_id)s AND ml2_port_bindings.host = %(ml2_port_bindings_host)s'] [parameters: {'profile': '{}', 'vif_type': 'unbound', 'ml2_port_bindings_host': u'cc-compute06-kna1', 'vif_details': '', 'ml2_port_bindings_port_id': u'5bedceef-1b65-4b19-a6d4-da3595adaf61', 'host': u'cc-compute24-kna1'}] (Background on this error at: http://sqlalche.me/e/gkpj)

This is confirmed when reviewing the ml2_port_bindings table:

MariaDB [neutron]> select * from ml2_port_bindings where port_id = '5bedceef-1b65-4b19-a6d4-da3595adaf61';
+--------------------------------------+-------------------+----------+-----------+---------------------------------------------------------------------------+---------------------------------------+----------+
| port_id                              | host              | vif_type | vnic_type | vif_details                                                               | profile                               | status   |
+--------------------------------------+-------------------+----------+-----------+---------------------------------------------------------------------------+---------------------------------------+----------+
| 5bedceef-1b65-4b19-a6d4-da3595adaf61 | cc-compute06-kna1 | ovs      | normal    | {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true} | {"migrating_to": "cc-compute24-kna1"} | ACTIVE   |
| 5bedceef-1b65-4b19-a6d4-da3595adaf61 | cc-compute24-kna1 | ovs      | normal    | {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true} |                                       | INACTIVE |
+--------------------------------------+-------------------+----------+-----------+---------------------------------------------------------------------------+---------------------------------------+----------+
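
The collision makes sense once you note that the table's primary key is
the (port_id, host) pair, as the "Duplicate entry" value above shows:
the UPDATE from the traceback tries to move the source row onto the
target host, where an INACTIVE row already exists. As an illustration
only, the failing statement can be reproduced directly against the
neutron database; a minimal sketch, with placeholder connection details
and the change rolled back afterwards:

import pymysql

# Placeholder connection details for the neutron database.
conn = pymysql.connect(host="localhost", user="neutron",
                       password="<secret>", db="neutron")
PORT = "5bedceef-1b65-4b19-a6d4-da3595adaf61"

try:
    with conn.cursor() as cur:
        # Simplified form of the UPDATE in the oslo_db traceback: move the
        # source-host row onto the target host.  Because an INACTIVE row for
        # (PORT, 'cc-compute24-kna1') already exists, this violates the
        # (port_id, host) primary key.
        cur.execute("UPDATE ml2_port_bindings SET host = %s "
                    "WHERE port_id = %s AND host = %s",
                    ("cc-compute24-kna1", PORT, "cc-compute06-kna1"))
except pymysql.err.IntegrityError as exc:
    print(exc)  # (1062, "Duplicate entry '...-cc-compute24-kna1' for key 'PRIMARY'")
finally:
    conn.rollback()
    conn.close()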

The exception is not caught and handled, and the VM is stuck in the
migrating state. According to OpenStack, the VM is still on the source
compute node, whilst libvirt/virsh believes it to be on the target.
Forcing the VM state to active keeps the VM available; however,
rebooting it will result in an ERROR state (this is resolved by
destroying the VM in virsh on the target, forcing it back to the active
state, and rebooting). Nor can another migration be attempted, due to
the error state in the DB (this can be fixed by manually removing the
inactive port binding and clearing the profile from the active
binding).
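
For reference, a minimal sketch of that manual cleanup against the
neutron database (connection details and the port UUID are
placeholders, and the empty profile value is an assumption; back up the
database before touching it):

import pymysql

# Placeholder connection details for the neutron database.
conn = pymysql.connect(host="localhost", user="neutron",
                       password="<secret>", db="neutron")
PORT = "5bedceef-1b65-4b19-a6d4-da3595adaf61"

with conn.cursor() as cur:
    # Remove the stale INACTIVE binding left behind on the target host.
    cur.execute("DELETE FROM ml2_port_bindings "
                "WHERE port_id = %s AND status = 'INACTIVE'", (PORT,))
    # Clear the leftover {"migrating_to": ...} profile on the ACTIVE
    # binding ('' here to match the other rows; '{}' may also be valid).
    cur.execute("UPDATE ml2_port_bindings SET profile = '' "
                "WHERE port_id = %s AND status = 'ACTIVE'", (PORT,))
conn.commit()
conn.close()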

In discussions with both mnasser and sean-k-mooney, it is understood
that there are two distinct live migration flows with regard to port
binding (the new flow is sketched after this list):

1) the "old" flow: the port binding is deactivated on the source before being activated on the target, meaning only one entry in the ml2_port_bindings table at any time, at the expense of added network outage during live migration
2) the "new" flow: an inactive binding is added on the target, before the old binding is removed and the new one activated
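
For context, the new flow corresponds to the multiple-port-bindings API
that Neutron exposes from Rocky onwards. As a rough sketch only (not
the code nova actually runs; the endpoint URL and token are
placeholders), the sequence in terms of raw REST calls looks like this:

import requests

NEUTRON = "http://controller:9696/v2.0"      # placeholder endpoint
HEADERS = {"X-Auth-Token": "<admin-token>"}  # placeholder admin token


def new_flow(port_id, source_host, dest_host):
    # Before the migration: create an INACTIVE binding on the destination,
    # giving the second row seen in ml2_port_bindings during migration.
    requests.post("%s/ports/%s/bindings" % (NEUTRON, port_id),
                  json={"binding": {"host": dest_host}}, headers=HEADERS)

    # After the migration: activate the destination binding; neutron
    # flips the source binding to INACTIVE as part of the activation.
    requests.put("%s/ports/%s/bindings/%s/activate"
                 % (NEUTRON, port_id, dest_host), headers=HEADERS)

    # Cleanup: delete the now-inactive binding on the source host,
    # leaving a single row again.
    requests.delete("%s/ports/%s/bindings/%s"
                    % (NEUTRON, port_id, source_host), headers=HEADERS)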

By monitoring the ml2_port_bindings table during live migrations (see
the sketch after this list), we can see that:

1. RDO -> RDO uses the old flow (only one entry in the ml2_port_bindings table at all times)
2. OSA -> OSA uses the new flow (two entries, which are cleaned up)
3. RDO -> OSA uses the new flow (two entries, which are not cleaned up)
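
One way to observe this is simply to poll the table while a migration
runs; a minimal sketch, with the database connection details as
placeholders:

import time

import pymysql

# Placeholder connection details; autocommit so each select sees fresh rows.
conn = pymysql.connect(host="localhost", user="neutron",
                       password="<secret>", db="neutron", autocommit=True)


def watch_bindings(port_id, interval=1.0):
    with conn.cursor() as cur:
        while True:
            cur.execute("SELECT host, status, profile FROM ml2_port_bindings "
                        "WHERE port_id = %s", (port_id,))
            print(cur.fetchall())
            time.sleep(interval)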

This is unexpected, as even in the RDO -> RDO case both nodes are
Rocky, and so the new flow should be in use.

(https://git.openstack.org/cgit/openstack/nova-specs/tree/specs/queens/approved/neutron-new-port-binding-api.rst?h=refs/changes/80/375580/16)

Adding more debug statements to nova/network/neutronv2/api.py, I can
isolate where the code paths diverge between the RDO - OSA and
OSA - OSA cases.

if port_migrating or teardown:
    LOG.info("[BRUCE](A-3): setup_networks_on_host: port_migrating(%s) or teardown(%s)", port_migrating, teardown)
    search_opts = {'device_id': instance.uuid,
                   'tenant_id': instance.project_id,
                   BINDING_HOST_ID: instance.host}
    # Now get the port details to process the ports
    # binding profile info.
    LOG.info("[BRUCE](A-3.1): setup_networks_on_host: list_ports(context, {device_id(%s), tenant_id(%s), BINDING_HOST_ID(%s)}", instance.uuid, instance.project_id, instance.host)

Looking at the state here:

RDO - OSA (2019-02-13 13:41:33.809)                                
------------------------------------------------------------       
{                                                                  
    device_id(8fe8b490-13aa-4fc7-87a0-39706200b904), <----- UUID   
    tenant_id(a4cac02597bc4013b69a9cd9373f181c), <----- project id 
    BINDING_HOST_ID(cc-compute10-kna1) <----- source               
}  

OSA - OSA (2019-02-13 14:33:36.333)                                
-------------------------------------------------------------------
{                                                                  
    device_id(90ca9b4f-ab25-4301-b1a5-9326cb8fd038), <----- UUID   
    tenant_id(a4cac02597bc4013b69a9cd9373f181c), <----- project id 
    BINDING_HOST_ID(cc-compute28-kna1) <----- source
}

At this point in the flow, BINDING_HOST_ID in search_opts is the source
hostname in both cases.

Continuing:

data = self.list_ports(context, **search_opts) 
ports = data['ports']
LOG.info("[BRUCE](A-3.2): setup_networks_on_host: ports = %s", ports)

and examining again:

RDO - OSA (2019-02-13 13:41:33.887)                                    
------------------------------------------------------------           
ports =                                                                
[   
....
u'binding:host_id': u'cc-compute10-kna1', 
....
}                                                                  
]     

OSA - OSA: (2019-02-13 14:33:36.422)                                   
-----------------------------------------------------------------------
ports =                                                                
[     
....
u'binding:host_id': u'cc-compute29-kna1',
....
}                                                                  
]   

Now we can see that in the RDO - OSA case, the binding:host_id returned
in the ports is the source hostname, whereas in the OSA - OSA case the
binding:host_id is the target hostname.

The actual fault is triggered here, in post_live_migration_at_destination
on the target:

self.network_api.setup_networks_on_host(context, instance,
                                        self.host)
migration = {'source_compute': instance.host,
             'dest_compute': self.host, }
self.network_api.migrate_instance_finish(context,
                                         instance,
                                         migration)

If we trace those calls in the RDO - OSA case (migrate_instance_finish
ends up in _update_port_binding_for_instance), we end up here:

# Avoid rolling back updates if we catch an error above.
# TODO(lbeliveau): Batch up the port updates in one neutron call.
for port_id, updates in port_updates:
    if updates:
        LOG.info("Updating port %(port)s with "
                 "attributes %(attributes)s",
                 {"port": port_id, "attributes": updates},
                 instance=instance)
        try:
            neutron.update_port(port_id, {'port': updates})
        except Exception:
            with excutils.save_and_reraise_exception():
                LOG.exception("Unable to update binding details "
                              "for port %s",
                              port_id, instance=instance)

In the OSA - OSA case these calls appear to have no effect; in the
RDO - OSA case they cause the internal error, as they are updating the
ports rather than activating/deleting the bindings as expected.
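
To make the failing call concrete: in the RDO - OSA case the
update_port in the loop above amounts to the following (a hypothetical
standalone sketch using python-neutronclient; the credentials are
placeholders, while the port UUID and attributes come from the first
log excerpt above):

from keystoneauth1 import loading, session
from neutronclient.v2_0 import client as neutron_client

# Placeholder credentials for an admin session.
loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(auth_url='http://controller:5000/v3',
                                username='admin', password='<secret>',
                                project_name='admin',
                                user_domain_name='Default',
                                project_domain_name='Default')
neutron = neutron_client.Client(session=session.Session(auth=auth))

# Same attributes as the "Updating port ..." log line at the top: nova
# pushes the target host into binding:host_id of the still source-bound
# port.  With the INACTIVE target binding already in ml2_port_bindings,
# the resulting ml2 UPDATE collides with the (port_id, host) primary key
# and neutron returns the internal server error seen in the nova log.
neutron.update_port(
    '419c18e1-05c0-44c3-9a97-8334d0c15cc1',
    {'port': {'binding:profile': {}, 'binding:host_id': 'cc-compute24-kna1'}})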

** Affects: nova
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1822884
