yahoo-eng-team team mailing list archive

[Bug 1727845] [NEW] Database and compute nodes can fall out of sync after live migration

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Trygve Vea <trygve.vea@xxxxxxxxx>
Date: Thu, 26 Oct 2017 19:48:58 -0000
Reply-to: Bug 1727845 <1727845@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Public bug reported:

Description
===========
When performing a rolling live-migrate-and-reboot of a series of compute hosts, we have seen some (rare) occurrences where the instance record in the database has not been updated with the new host.  Another symptom is that the affected instance is left without network connectivity.

A workaround when this occurs is to:

1) Manually set the correct host for the instance in the database.
2) Perform a live-migration of the instance again.

After a successful live-migration, the instance regains network
connectivity.
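
A minimal sketch of what step 1 of the workaround amounts to, written
against an in-memory SQLite stand-in rather than a real nova database.
The table and column names (instances, uuid, host, node, deleted) and
the helper name set_instance_host are assumptions made for
illustration; check them against the actual schema before touching
production data. A real deployment would normally use MySQL/MariaDB,
where the placeholder style differs.

```python
import sqlite3


def set_instance_host(conn, instance_uuid, host, node):
    """Point one non-deleted instance record at the host it actually runs on.

    Returns the number of rows changed, so the caller can verify that
    exactly one record was touched.
    """
    cur = conn.execute(
        "UPDATE instances SET host = ?, node = ? "
        "WHERE uuid = ? AND deleted = 0",
        (host, node, instance_uuid),
    )
    conn.commit()
    return cur.rowcount


def demo():
    # Stand-in table with only the columns this sketch assumes.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE instances "
        "(uuid TEXT, host TEXT, node TEXT, deleted INTEGER)"
    )
    conn.execute(
        "INSERT INTO instances VALUES ('uuid-1', 'compute-old', 'compute-old', 0)"
    )
    changed = set_instance_host(conn, "uuid-1", "compute-new", "compute-new")
    row = conn.execute(
        "SELECT host, node FROM instances WHERE uuid = 'uuid-1'"
    ).fetchone()
    return changed, row
```

Checking the returned row count before moving on to step 2 guards
against updating zero records (wrong UUID) or more than one.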

Steps to reproduce
==================
This is a rare occurrence; in my experience it happens roughly 1 time in 100.

1) Disable hypervisor
2) Evacuate hypervisor by live migrating all instances from it
3) Reboot
4) Enable hypervisor
5) Repeat on next hypervisor
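
The steps above could be scripted roughly as follows. This is a sketch
under assumptions, not the exact tooling used here: it assumes the nova
CLI from python-novaclient (service-disable, host-evacuate-live,
service-enable) and a reboot over ssh, and it leaves out the waiting
that a real run needs (for all migrations to finish, and for the host
to come back up before it is enabled again). The function names are
hypothetical.

```python
import subprocess


def rolling_step_commands(host):
    """Return the commands for one hypervisor, in the order of steps 1-4."""
    return [
        ["nova", "service-disable", host, "nova-compute"],  # 1) disable
        ["nova", "host-evacuate-live", host],                # 2) live migrate all
        ["ssh", host, "sudo", "reboot"],                     # 3) reboot (assumed method)
        ["nova", "service-enable", host, "nova-compute"],    # 4) enable
    ]


def run_rolling(hosts, dry_run=True):
    """Walk the hosts one at a time (step 5) and return every command visited.

    With dry_run=True nothing is executed. With dry_run=False the commands
    run back to back with no waiting in between, which is NOT sufficient
    for a real maintenance run; waits must be added by the operator.
    """
    visited = []
    for host in hosts:
        for cmd in rolling_step_commands(host):
            visited.append(cmd)
            if not dry_run:
                # The ssh session usually drops during reboot, so a
                # non-zero exit status is tolerated here.
                subprocess.run(cmd, check=False)
    return visited
```

Calling run_rolling(["compute-01", "compute-02"]) in the default
dry-run mode only lists the eight commands, which is a cheap way to
review the sequence before running anything.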


Expected result
===============
No problems should have occurred.

Actual result
=============
A single instance remained without network connectivity, and the nova-compute log-file on two hypervisors started logging lines similar to this:

2017-10-26 12:38:43.568 10858 WARNING nova.compute.manager [req-6c999e79-afe7-4294-b537-fa6a13f2a791 - - - - -] While synchronizing instance power states, found 7 instances in the database and 6 instances on the hypervisor.

We also have Icinga probes set up to list instances that are running on
a host but not marked as such in the database, and the other way
around; these probes triggered as well.
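
The comparison such a probe performs can be sketched as a set
difference in both directions. This is an illustration only, not our
actual probe; the function name and the idea of feeding it two lists
of instance UUIDs (one from the database, one from the hypervisor) are
assumptions.

```python
def compare_instances(db_uuids, hypervisor_uuids):
    """Report instances that the database and the hypervisor disagree on.

    db_uuids:         UUIDs the database lists for this host
    hypervisor_uuids: UUIDs of the domains actually running on this host
    """
    db = set(db_uuids)
    hv = set(hypervisor_uuids)
    return {
        "only_in_db": sorted(db - hv),          # recorded here, not running here
        "only_on_hypervisor": sorted(hv - db),  # running here, not recorded here
    }
```

A non-empty only_in_db on one host together with a non-empty
only_on_hypervisor on another is the pattern that matches the
out-of-sync instance described in this report.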


Environment
===========
1. The RDO packages, on Ocata:
openstack-nova-compute-15.0.7-1.el7.noarch
python2-novaclient-7.1.2-1.el7.noarch
openstack-nova-common-15.0.7-1.el7.noarch
python-nova-15.0.7-1.el7.noarch


2. Libvirt KVM on CentOS 7.4 (running a 7.3 kernel due to an unrelated bug with vxlan -- 3.10.0-514.10.2.el7.x86_64)

qemu-kvm-ev-2.9.0-16.el7_4.5.1.x86_64
libvirt-daemon-3.2.0-14.el7_4.3.x86_64


3. We used Cinder with the Dell EMC Unity driver on Fibre Channel.
The cinder-node runs the RDO packages, on Ocata:
openstack-cinder-10.0.5-1.el7.noarch


4. Neutron, vxlan/linux-bridge


I expect to follow up on this bug with more information when we can
reproduce the problem in a more controlled fashion.

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1727845

Title:
  Database and compute nodes can fall out of sync after live migration

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1727845/+subscriptions
Follow ups

[Bug 1727845] Re: Database and compute nodes can fall out of sync after live migration
From: Launchpad Bug Tracker, 2020-11-22