yahoo-eng-team team mailing list archive
Message #68863
[Bug 1727845] [NEW] Database and compute nodes can fall out of sync after live migration
Public bug reported:
Description
===========
When performing a rolling live-migrate-and-reboot of a series of compute hosts, we have seen some (rare) occurrences where the instance record in the database has not been updated with the new host. Another symptom is that the affected instance is left without network connectivity.
A workaround when this occurs is to:
1) Manually set the correct host for the instance in the database.
2) Perform a live-migration of the instance again.
After a successful live-migration, the instance regains network
connectivity.
Steps to reproduce
==================
This is a rare occurrence; in my experience it happens about 1 time in 100.
1) Disable hypervisor
2) Evacuate hypervisor by live migrating all instances from it
3) Reboot
4) Enable hypervisor
5) Repeat on next hypervisor
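For illustration, the loop above could be scripted roughly as follows. This is a minimal dry-run sketch, not the tooling used in this report: it only builds the command lines and never executes them. It assumes the `nova service-disable`, `nova host-evacuate-live` and `nova service-enable` subcommands of python-novaclient, and a hypothetical `ssh <host> sudo reboot` for the reboot step. The host names are placeholders.

```python
from typing import List


def rolling_reboot_plan(hosts: List[str]) -> List[List[str]]:
    """Build (but do not run) the commands for steps 1-5, one host at a time."""
    plan = []
    for host in hosts:
        # 1) Disable the hypervisor so no new instances are scheduled to it.
        plan.append(["nova", "service-disable", host, "nova-compute"])
        # 2) Evacuate it by live migrating all instances away.
        plan.append(["nova", "host-evacuate-live", host])
        # 3) Reboot (the access method here is an assumption).
        plan.append(["ssh", host, "sudo", "reboot"])
        # 4) Enable the hypervisor again.
        plan.append(["nova", "service-enable", host, "nova-compute"])
    # 5) The loop itself repeats the sequence on the next hypervisor.
    return plan


if __name__ == "__main__":
    for cmd in rolling_reboot_plan(["compute01", "compute02"]):
        print(" ".join(cmd))
```

A real run would also have to wait for all migrations to finish before rebooting, and for the host to come back before enabling it; the sketch leaves that out.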
Expected result
===============
No problems should have occurred.
Actual result
=============
A single instance remained without network connectivity, and the nova-compute log-file on two hypervisors started logging lines similar to this:
2017-10-26 12:38:43.568 10858 WARNING nova.compute.manager [req-6c999e79-afe7-4294-b537-fa6a13f2a791 - - - - -] While synchronizing instance power states, found 7 instances in the database and 6 instances on the hypervisor.
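A warning like this can be picked out of the nova-compute log mechanically. The following is a minimal sketch that extracts the two counts from the message text quoted above; the function name and the regular expression are illustrative, and the message wording is taken from this one log line only, so other nova versions may phrase it differently.

```python
import re
from typing import Optional, Tuple

# Matches the power-state sync warning quoted above; \s+ tolerates line wrapping.
SYNC_WARNING = re.compile(
    r"found\s+(\d+)\s+instances\s+in\s+the\s+database"
    r"\s+and\s+(\d+)\s+instances\s+on\s+the\s+hypervisor"
)


def parse_sync_warning(text: str) -> Optional[Tuple[int, int]]:
    """Return (db_count, hypervisor_count), or None if the warning is absent."""
    match = SYNC_WARNING.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = (
    "2017-10-26 12:38:43.568 10858 WARNING nova.compute.manager "
    "[req-6c999e79-afe7-4294-b537-fa6a13f2a791 - - - - -] While synchronizing\n"
    "instance power states, found 7 instances in the database and 6 instances\n"
    "on the hypervisor."
)
print(parse_sync_warning(sample))  # (7, 6)
```

Differing counts on a host are only a hint that something is out of sync; they do not say which instance is affected.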
We also have icinga probes set up to list instances that are running on
a host but not marked as such in the database, and vice versa. These
probes fired as well.
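The comparison such a probe performs can be sketched as a two-way set difference. This is not the actual icinga check, which is not shown in this report; it assumes the instance UUIDs for one host have already been collected from the database and from the hypervisor by some other means, and all names are hypothetical.

```python
from typing import Dict, Set


def find_mismatches(
    db_instances: Set[str], hypervisor_instances: Set[str]
) -> Dict[str, Set[str]]:
    """Compare the database's view of a host with what the hypervisor runs."""
    return {
        # Recorded on this host in the database, but not actually running here.
        "in_db_only": db_instances - hypervisor_instances,
        # Running here, but recorded on some other host in the database.
        "on_hypervisor_only": hypervisor_instances - db_instances,
    }


if __name__ == "__main__":
    db_view = {"uuid-a", "uuid-b", "uuid-c"}
    hypervisor_view = {"uuid-a", "uuid-b"}
    print(find_mismatches(db_view, hypervisor_view))
```

In the situation described here, the stale instance would show up as `in_db_only` on the source host and as `on_hypervisor_only` on the destination host.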
Environment
===========
1. The RDO packages, on Ocata:
openstack-nova-compute-15.0.7-1.el7.noarch
python2-novaclient-7.1.2-1.el7.noarch
openstack-nova-common-15.0.7-1.el7.noarch
python-nova-15.0.7-1.el7.noarch
2. Libvirt KVM on CentOS 7.4 (running a 7.3 kernel due to an unrelated bug with vxlan -- 3.10.0-514.10.2.el7.x86_64)
qemu-kvm-ev-2.9.0-16.el7_4.5.1.x86_64
libvirt-daemon-3.2.0-14.el7_4.3.x86_64
3. We used Cinder with the Dell EMC Unity driver on Fibre Channel.
The cinder-node runs the RDO packages, on Ocata:
openstack-cinder-10.0.5-1.el7.noarch
4. Neutron, vxlan/linux-bridge
I expect to follow up on this bug with more information when we can
reproduce the problem in a more controlled fashion.
** Affects: nova
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1727845
Title:
Database and compute nodes can fall out of sync after live migration
Status in OpenStack Compute (nova):
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1727845/+subscriptions