yahoo-eng-team team mailing list archive

[Bug 1727845] Re: Database and compute nodes can fall out of sync after live migration

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Launchpad Bug Tracker <1727845@xxxxxxxxxxxxxxxxxx>
Date: Sun, 22 Nov 2020 04:17:25 -0000
Reply-to: Bug 1727845 <1727845@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

[Expired for OpenStack Compute (nova) because there has been no activity
for 60 days.]

** Changed in: nova
       Status: Incomplete => Expired

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1727845

Title:
  Database and compute nodes can fall out of sync after live migration

Status in OpenStack Compute (nova):
  Expired

Bug description:
  Description
  ===========
  When performing a rolling live-migrate-and-reboot of a series of compute hosts, we have seen some (rare) occurrences where the instance record in the database has not been updated with the new host.  Another symptom is that the affected instance loses network connectivity.

  A workaround when this occurs is to:

  1) Manually set the correct host for the instance in the database.
  2) Perform a live-migration of the instance again.

  After a successful live-migration, the instance regains
  network connectivity.

  Steps to reproduce
  ==================
  This is a rare occurrence; in my experience it happens roughly 1 in 100 times.

  1) Disable hypervisor
  2) Evacuate hypervisor by live migrating all instances from it
  3) Reboot
  4) Enable hypervisor
  5) Repeat on next hypervisor
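
  The steps above can be sketched as a small driver loop. This is only an illustration under stated assumptions: the function name and the five callables (disable, list_instances, live_migrate, reboot, enable) are hypothetical placeholders for whatever tooling performs each step, and the check that the host is empty before rebooting is an added safeguard, not part of the original procedure.

```python
from typing import Callable, Iterable, List


def rolling_reboot(
    hosts: Iterable[str],
    disable: Callable[[str], None],
    list_instances: Callable[[str], List[str]],
    live_migrate: Callable[[str], None],
    reboot: Callable[[str], None],
    enable: Callable[[str], None],
) -> List[str]:
    """Run the disable / evacuate / reboot / enable cycle on each host.

    All callables are hypothetical hooks supplied by the caller.
    Returns a log of the actions taken, in order.
    """
    log: List[str] = []
    for host in hosts:
        # 1) Disable the hypervisor so no new instances land on it.
        disable(host)
        log.append(f"disable {host}")

        # 2) Evacuate by live migrating every instance away.
        for instance in list_instances(host):
            live_migrate(instance)
            log.append(f"migrate {instance}")

        # Added safeguard (assumption): refuse to reboot a non-empty host.
        leftover = list_instances(host)
        if leftover:
            raise RuntimeError(f"{host} still has instances: {leftover}")

        # 3) Reboot, then 4) enable again; 5) the loop repeats on the next host.
        reboot(host)
        log.append(f"reboot {host}")
        enable(host)
        log.append(f"enable {host}")
    return log
```

  Passing the operations in as callables keeps the loop independent of any particular client or CLI, so it can be exercised against an in-memory fake before being pointed at a real deployment.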

  
  Expected result
  ===============
  No problems should have occurred.

  Actual result
  =============
  A single instance remained without network connectivity, and the nova-compute log-file on two hypervisors started logging lines similar to this:

  2017-10-26 12:38:43.568 10858 WARNING nova.compute.manager [req-
  6c999e79-afe7-4294-b537-fa6a13f2a791 - - - - -] While synchronizing
  instance power states, found 7 instances in the database and 6
  instances on the hypervisor.
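
  As a rough illustration, the two counts in a warning like the one above could be extracted from a log file with a small parser. The function and pattern names are hypothetical, and the pattern assumes the message wording shown in this report; other nova versions may phrase it differently.

```python
import re
from typing import Optional, Tuple

# Pattern based on the warning text quoted in this report (assumed wording).
SYNC_WARNING = re.compile(
    r"found (\d+) instances in the database and (\d+) instances on the hypervisor"
)


def parse_sync_warning(line: str) -> Optional[Tuple[int, int]]:
    """Return (count_in_database, count_on_hypervisor), or None if no match."""
    # Collapse whitespace so a message wrapped across lines still matches.
    normalized = " ".join(line.split())
    match = SYNC_WARNING.search(normalized)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

  A result where the two numbers differ marks a host whose view of its instances disagrees with the database.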

  We also have icinga-probes set up to list instances that are running
  on the host, but not marked as such in the database, and the other way
  around - and they went off.
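
  The kind of check such a probe performs can be sketched as a set comparison per host. This is not the actual icinga probe, only a minimal sketch: the function name is hypothetical, and both inputs are assumed to be mappings from host name to a set of instance UUIDs, gathered by whatever means are available.

```python
from typing import Dict, Set, Tuple


def find_host_mismatches(
    db_instances: Dict[str, Set[str]],
    hv_instances: Dict[str, Set[str]],
) -> Dict[str, Tuple[Set[str], Set[str]]]:
    """Return {host: (only_in_database, only_on_hypervisor)} for hosts that disagree.

    db_instances: instances the database places on each host (assumed input).
    hv_instances: instances actually running on each host (assumed input).
    """
    mismatches: Dict[str, Tuple[Set[str], Set[str]]] = {}
    for host in sorted(set(db_instances) | set(hv_instances)):
        in_db = db_instances.get(host, set())
        on_hv = hv_instances.get(host, set())
        only_in_db = in_db - on_hv   # recorded here, but not running here
        only_on_hv = on_hv - in_db   # running here, but recorded elsewhere
        if only_in_db or only_on_hv:
            mismatches[host] = (only_in_db, only_on_hv)
    return mismatches
```

  An instance whose database record was not updated after migration shows up twice: as "only in database" on the source host and as "only on hypervisor" on the destination host, which is consistent with two hypervisors logging the warning.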

  
  Environment
  ===========
  1. The RDO packages, on Ocata:
  openstack-nova-compute-15.0.7-1.el7.noarch
  python2-novaclient-7.1.2-1.el7.noarch
  openstack-nova-common-15.0.7-1.el7.noarch
  python-nova-15.0.7-1.el7.noarch

  
  2. Libvirt KVM on CentOS 7.4 (running a 7.3 kernel due to an unrelated bug with vxlan -- 3.10.0-514.10.2.el7.x86_64)

  qemu-kvm-ev-2.9.0-16.el7_4.5.1.x86_64
  libvirt-daemon-3.2.0-14.el7_4.3.x86_64

  
  3. We used Cinder with the Dell EMC Unity driver on Fibre Channel.
  The cinder-node runs the RDO packages, on Ocata:
  openstack-cinder-10.0.5-1.el7.noarch

  
  4. Neutron, vxlan/linux-bridge


  
  I expect to follow up on this bug with more information when we can reproduce the problem in a more controlled fashion.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1727845/+subscriptions

References

[Bug 1727845] [NEW] Database and compute nodes can fall out of sync after live migration
From: Trygve Vea, 2017-10-26