yahoo-eng-team team mailing list archive
Message #77809
[Bug 1822884] [NEW] live migration fails due to port binding duplicate key entry in post_live_migrate
Public bug reported:
We are converting a site from RDO to OSA; at this stage all control
nodes and net nodes are running OSA (Rocky), some compute nodes are running
RDO (Queens), some are RDO (Rocky), and the remaining are OSA (Rocky).
We are attempting to live migrate VMs from the RDO (Rocky) nodes to OSA
(Rocky) before reinstalling the RDO nodes as OSA (Rocky).
When live migrating between RDO nodes we see no issues, and similarly when
migrating between OSA nodes; however, live migrating RDO -> OSA fails
with the below error on the target.
2019-01-24 13:33:11.701 85926 INFO nova.network.neutronv2.api [req-3c4ceac0-7c12-428d-9f63-7c1d6879be4e a3bee416cf67420995855d602d2bccd3 a564613210ee43708b8a7fc6274ebd63 - default default] [instance: 8ecbfc14-2699-4276-a577-18ed6a662232] Updating port 419c18e1-05c0-44c3-9a97-8334d0c15cc1 with attributes {'binding:profile': {}, 'binding:host_id': 'cc-compute24-kna1'}
2019-01-24 13:33:59.357 85926 ERROR nova.network.neutronv2.api [req-3c4ceac0-7c12-428d-9f63-7c1d6879be4e a3bee416cf67420995855d602d2bccd3 a564613210ee43708b8a7fc6274ebd63 - default default] [instance: 8ecbfc14-2699-4276-a577-18ed6a662232] Unable to update binding details for port 419c18e1-05c0-44c3-9a97-8334d0c15cc1: InternalServerError: Request Failed: internal server error while processing your request.
Digging further into the logs reveals an issue with duplicate keys:
2019-02-01 09:48:10.268 11854 ERROR oslo_db.api [req-152bce20-8895-4238-910c-b26fde44913d e7bbce5e15994104bdf5e3af68a55b31 a894e8109af3430aa7ae03e0c49a0aa0 - default default] DB exceeded retry limit.: DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, u"Duplicate entry '5bedceef-1b65-4b19-a6d4-da3595adaf61-cc-compute24-kna1' for key 'PRIMARY'") [SQL: u'UPDATE ml2_port_bindings SET host=%(host)s, profile=%(profile)s, vif_type=%(vif_type)s, vif_details=%(vif_details)s WHERE ml2_port_bindings.port_id = %(ml2_port_bindings_port_id)s AND ml2_port_bindings.host = %(ml2_port_bindings_host)s'] [parameters: {'profile': '{}', 'vif_type': 'unbound', 'ml2_port_bindings_host': u'cc-compute06-kna1', 'vif_details': '', 'ml2_port_bindings_port_id': u'5bedceef-1b65-4b19-a6d4-da3595adaf61', 'host': u'cc-compute24-kna1'}] (Background on this error at: http://sqlalche.me/e/gkpj)
This is confirmed when reviewing the ml2_port_bindings table; its primary key is (port_id, host), so the UPDATE that tries to move the source row's host to the destination collides with the row that already exists for the destination host:
MariaDB [neutron]> select * from ml2_port_bindings where port_id = '5bedceef-1b65-4b19-a6d4-da3595adaf61';
+--------------------------------------+-------------------+----------+-----------+---------------------------------------------------------------------------+---------------------------------------+----------+
| port_id | host | vif_type | vnic_type | vif_details | profile | status |
+--------------------------------------+-------------------+----------+-----------+---------------------------------------------------------------------------+---------------------------------------+----------+
| 5bedceef-1b65-4b19-a6d4-da3595adaf61 | cc-compute06-kna1 | ovs | normal | {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true} | {"migrating_to": "cc-compute24-kna1"} | ACTIVE |
| 5bedceef-1b65-4b19-a6d4-da3595adaf61 | cc-compute24-kna1 | ovs | normal | {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true} | | INACTIVE |
+--------------------------------------+-------------------+----------+-----------+---------------------------------------------------------------------------+---------------------------------------+----------+
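The same state can also be inspected via the Neutron port bindings API rather than querying the database directly; a minimal sketch, assuming the binding-extended extension is available (the endpoint and token below are placeholders):

    import requests

    NEUTRON = 'http://neutron.example.com:9696/v2.0'  # placeholder endpoint
    HEADERS = {'X-Auth-Token': 'ADMIN_TOKEN'}         # placeholder admin token
    PORT_ID = '5bedceef-1b65-4b19-a6d4-da3595adaf61'

    # GET /v2.0/ports/{port_id}/bindings lists every binding for the port; a
    # healthy port has a single ACTIVE binding, while the stuck migration
    # above leaves an extra INACTIVE one behind on the destination host.
    resp = requests.get('%s/ports/%s/bindings' % (NEUTRON, PORT_ID),
                        headers=HEADERS)
    for binding in resp.json()['bindings']:
        print(binding['host'], binding['status'], binding.get('profile'))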
The exception is not caught and handled, and the VM is stuck in the
migrating state. According to OpenStack, the VM is still on the source
compute node, whilst libvirt/virsh believes it to be on the target.
Forcing the VM state to active keeps the VM available; however, rebooting
will result in an ERROR state (this is resolved by destroying the VM in
virsh on the target, forcing the state back to active, and rebooting).
Nor can the VM be migrated again, due to the error state in the DB (this
can be fixed by manually removing the inactive port binding and clearing
the profile from the active binding).
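For reference, a minimal sketch of that manual cleanup done through the Neutron API instead of the database, assuming the binding-extended extension is available; the endpoint and token are placeholders, and the port/host values are the ones from the table above:

    import requests

    NEUTRON = 'http://neutron.example.com:9696/v2.0'  # placeholder endpoint
    HEADERS = {'X-Auth-Token': 'ADMIN_TOKEN',         # placeholder admin token
               'Content-Type': 'application/json'}
    PORT_ID = '5bedceef-1b65-4b19-a6d4-da3595adaf61'
    DEST_HOST = 'cc-compute24-kna1'  # host of the leftover INACTIVE binding

    # 1. Remove the stale INACTIVE binding left behind on the destination host.
    requests.delete('%s/ports/%s/bindings/%s' % (NEUTRON, PORT_ID, DEST_HOST),
                    headers=HEADERS)

    # 2. Clear the 'migrating_to' hint from the ACTIVE binding's profile.
    requests.put('%s/ports/%s' % (NEUTRON, PORT_ID), headers=HEADERS,
                 json={'port': {'binding:profile': {}}})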
In discussions with both mnasser and sean-k-mooney, it is understood
that there are two distinct live migration flows with regard to port
binding (a sketch of the Neutron API calls behind the new flow follows
the list):
1) the "old" flow: the port is deactivated on the source before being activated on the target - meaning only one entry in the ml2_port_bindings table, at the expense of added network outage during live migration
2) the "new" flow: an inactive port binding is added on the target before the old binding is removed and the new binding activated
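Roughly, the new flow corresponds to the following Neutron port bindings API calls (a sketch, not nova's actual code; the endpoint, token and host names are placeholders):

    import requests

    NEUTRON = 'http://neutron.example.com:9696/v2.0'  # placeholder endpoint
    HEADERS = {'X-Auth-Token': 'ADMIN_TOKEN',         # placeholder admin token
               'Content-Type': 'application/json'}
    PORT_ID = '5bedceef-1b65-4b19-a6d4-da3595adaf61'
    SOURCE, DEST = 'cc-compute06-kna1', 'cc-compute24-kna1'

    # Pre-migration: create an INACTIVE binding for the port on the destination.
    requests.post('%s/ports/%s/bindings' % (NEUTRON, PORT_ID), headers=HEADERS,
                  json={'binding': {'host': DEST, 'vnic_type': 'normal'}})

    # Post-migration: activate the destination binding; the previously active
    # source binding becomes INACTIVE as part of the same operation.
    requests.put('%s/ports/%s/bindings/%s/activate' % (NEUTRON, PORT_ID, DEST),
                 headers=HEADERS)

    # Cleanup: delete the now-unused source binding.
    requests.delete('%s/ports/%s/bindings/%s' % (NEUTRON, PORT_ID, SOURCE),
                    headers=HEADERS)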
We can see by monitoring the ml2_port_bindings table during live
migrations that:
1. RDO -> RDO uses the old flow (only one entry in the ml2_port_bindings table at all times)
2. OSA -> OSA uses the new flow (two entries, which are cleaned up)
3. RDO -> OSA uses the new flow (two entries, which are not cleaned up)
This is unexpected, as even in the RDO to RDO case both nodes are Rocky
and so the new flow should be in use
(https://git.openstack.org/cgit/openstack/nova-specs/tree/specs/queens/approved/neutron-new-port-binding-api.rst?h=refs/changes/80/375580/16).
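One way to check which flow a deployment can use is to see whether Neutron advertises the port bindings extension (alias 'binding-extended'); per the linked spec, nova falls back to the old flow when it is missing or the computes are too old. A minimal sketch, again with placeholder endpoint and token:

    import requests

    NEUTRON = 'http://neutron.example.com:9696/v2.0'  # placeholder endpoint
    HEADERS = {'X-Auth-Token': 'ADMIN_TOKEN'}         # placeholder admin token

    # List the extensions Neutron exposes and look for the 'binding-extended'
    # alias, which the new live-migration port binding flow depends on.
    extensions = requests.get('%s/extensions' % NEUTRON,
                              headers=HEADERS).json()['extensions']
    aliases = {ext['alias'] for ext in extensions}
    print('binding-extended available:', 'binding-extended' in aliases)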
Adding more debug statements to nova/network/neutronv2/api.py, I can
isolate where the code paths diverge in the RDO - OSA and OSA - OSA
cases.
    if port_migrating or teardown:
        LOG.info("[BRUCE](A-3): setup_networks_on_host: "
                 "port_migrating(%s) or teardown(%s)",
                 port_migrating, teardown)
        search_opts = {'device_id': instance.uuid,
                       'tenant_id': instance.project_id,
                       BINDING_HOST_ID: instance.host}
        # Now get the port details to process the ports
        # binding profile info.
        LOG.info("[BRUCE](A-3.1): setup_networks_on_host: "
                 "list_ports(context, {device_id(%s), tenant_id(%s), "
                 "BINDING_HOST_ID(%s)}",
                 instance.uuid, instance.project_id, instance.host)
Looking at the state here:
RDO - OSA (2019-02-13 13:41:33.809)
------------------------------------------------------------
{
device_id(8fe8b490-13aa-4fc7-87a0-39706200b904), <----- UUID
tenant_id(a4cac02597bc4013b69a9cd9373f181c), <----- project id
BINDING_HOST_ID(cc-compute10-kna1) <----- source
}
OSA - OSA (2019-02-13 14:33:36.333)
-------------------------------------------------------------------
{
device_id(90ca9b4f-ab25-4301-b1a5-9326cb8fd038), <----- UUID
tenant_id(a4cac02597bc4013b69a9cd9373f181c), <----- project id
BINDING_HOST_ID(cc-compute28-kna1) <----- source
}
At this point in the flow, BINDING_HOST_ID is the source hostname.
Continuing:
        data = self.list_ports(context, **search_opts)
        ports = data['ports']
        LOG.info("[BRUCE](A-3.2): setup_networks_on_host: ports = %s", ports)
Examining the result again:
RDO - OSA (2019-02-13 13:41:33.887)
------------------------------------------------------------
ports =
[
....
u'binding:host_id': u'cc-compute10-kna1',
....
}
]
OSA - OSA: (2019-02-13 14:33:36.422)
-----------------------------------------------------------------------
ports =
[
....
u'binding:host_id': u'cc-compute29-kna1',
....
}
]
Now we can see that in the RDO - OSA case, the binding:host_id returned
in the ports is the source hostname, whereas in the OSA - OSA case the
binding:host_id is the target hostname.
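For clarity, the list_ports() call above is equivalent to the following filtered query (values taken from the RDO - OSA trace; endpoint and token are placeholders), and the divergence is in which binding:host_id comes back for the matched port:

    import requests

    NEUTRON = 'http://neutron.example.com:9696/v2.0'  # placeholder endpoint
    HEADERS = {'X-Auth-Token': 'ADMIN_TOKEN'}         # placeholder admin token

    # Equivalent of list_ports(**search_opts): filter by instance UUID,
    # project and binding host (the *source* host at this point in the flow).
    params = {'device_id': '8fe8b490-13aa-4fc7-87a0-39706200b904',
              'tenant_id': 'a4cac02597bc4013b69a9cd9373f181c',
              'binding:host_id': 'cc-compute10-kna1'}
    ports = requests.get('%s/ports' % NEUTRON, headers=HEADERS,
                         params=params).json()['ports']
    for port in ports:
        # RDO - OSA: still reports the source host; OSA - OSA: the active
        # binding has already moved to the target host.
        print(port['id'], port['binding:host_id'])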
The actual fault itself is in:
        self.network_api.setup_networks_on_host(context, instance,
                                                self.host)
        migration = {'source_compute': instance.host,
                     'dest_compute': self.host, }
        self.network_api.migrate_instance_finish(context,
                                                 instance,
                                                 migration)
If we trace those calls in the RDO - OSA case, then we will end up here:
        # Avoid rolling back updates if we catch an error above.
        # TODO(lbeliveau): Batch up the port updates in one neutron call.
        for port_id, updates in port_updates:
            if updates:
                LOG.info("Updating port %(port)s with "
                         "attributes %(attributes)s",
                         {"port": port_id, "attributes": updates},
                         instance=instance)
                try:
                    neutron.update_port(port_id, {'port': updates})
                except Exception:
                    with excutils.save_and_reraise_exception():
                        LOG.exception("Unable to update binding details "
                                      "for port %s",
                                      port_id, instance=instance)
In the OSA - OSA case these calls appear to have no effect; in the
RDO - OSA case they cause the internal error (as they update the port's
binding rather than activating/deleting bindings as expected).
** Affects: nova
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1822884