yahoo-eng-team team mailing list archive

[Bug 1867380] [NEW] nova-live-migration fails due to n-cpu restarting slowly after being reconfigured for ceph

 

Public bug reported:

Description
===========

As per $subject: the current check, which uses grep to find active
n-cpu processes, isn't enough. We actually need to wait for the
services to report as UP before starting to run Tempest.

In the following we can see Tempest starting at 2020-03-13 13:01:19.528
while n-cpu within the instance isn't marked as UP for another ~20
seconds:

https://zuul.opendev.org/t/openstack/build/5c213f869f324b69a423a983034d4539/log/job-output.txt#6305

https://zuul.opendev.org/t/openstack/build/5c213f869f324b69a423a983034d4539/log/logs/screen-n-cpu.txt#3825

https://zuul.opendev.org/t/openstack/build/5c213f869f324b69a423a983034d4539/log/logs/subnode-2/screen-n-cpu.txt#3534

I've only seen this on stable/pike at present but it could potentially
hit all branches with slow enough CI nodes.


Steps to reproduce
==================
Run nova-live-migration on slow CI nodes.

Expected result
===============
nova/tests/live_migration/hooks/ceph.sh waits until hosts are marked as UP before running Tempest.

Actual result
=============
nova/tests/live_migration/hooks/ceph.sh checks for running n-cpu processes and then immediately starts Tempest.
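
A minimal sketch of the kind of wait described under "Expected result",
offered as an illustration rather than the actual change to ceph.sh. It
assumes the openstack CLI (python-openstackclient) is installed and
credentials are already loaded. The function names, the 120 second timeout
and the 5 second poll interval are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; these names do not come from ceph.sh.

# Print the State column of every nova-compute service, one per line.
# Assumes python-openstackclient is installed and credentials are sourced.
list_compute_states() {
    openstack compute service list --service nova-compute -f value -c State
}

# Poll until every nova-compute service reports "up" or the timeout expires.
# $1: timeout in seconds (default 120); $2: poll interval in seconds (default 5).
wait_for_computes_up() {
    local timeout=${1:-120}
    local interval=${2:-5}
    local deadline=$(( SECONDS + timeout ))
    local states

    while :; do
        # Treat a failing CLI call as "not ready yet" rather than aborting.
        states=$(list_compute_states 2>/dev/null) || states=""
        # Ready when at least one service is listed and no line differs from "up".
        if [[ -n "$states" ]] && ! grep -qv '^up$' <<< "$states"; then
            return 0
        fi
        if (( SECONDS >= deadline )); then
            echo "Timed out waiting for nova-compute services to report up" >&2
            return 1
        fi
        sleep "$interval"
    done
}
```

A hook could call `wait_for_computes_up 120 5 || exit 1` after restarting
n-cpu and before launching Tempest, so that a slow node fails with a clear
timeout message instead of failing inside the test run.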

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

   stable/pike, but it could be present on other branches with slow enough
   CI nodes.

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   Libvirt / KVM.

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   N/A

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   N/A

Logs & Configs
==============

Mar 13 13:01:39.170201 ubuntu-xenial-rax-iad-0015199005 nova-compute[30153]: DEBUG nova.compute.manager [None req-beafe617-34df-4bec-9ff6-4a0b7bebb15f None None] [instance: 74932102-3737-4f8f-9002-763b2d580c3a] Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=30153) _error_out_instances_whose_build_was_interrupted /opt/stack/new/nova/nova/compute/manager.py:1323}}
Mar 13 13:01:39.255008 ubuntu-xenial-rax-iad-0015199005 nova-compute[30153]: DEBUG nova.compute.manager [None req-beafe617-34df-4bec-9ff6-4a0b7bebb15f None None] [instance: 042afab0-fbef-4506-84e2-1f54cb9d67ca] Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=30153) _error_out_instances_whose_build_was_interrupted /opt/stack/new/nova/nova/compute/manager.py:1323}}
Mar 13 13:01:39.322508 ubuntu-xenial-rax-iad-0015199005 nova-compute[30153]: DEBUG nova.compute.manager [None req-beafe617-34df-4bec-9ff6-4a0b7bebb15f None None] [instance: cc293f53-7428-4e66-9841-20cce219e24f] Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=30153) _error_out_instances_whose_build_was_interrupted /opt/stack/new/nova/nova/compute/manager.py:1323}}

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1867380

Title:
  nova-live-migration fails due to n-cpu restarting slowly after being
  reconfigured for ceph

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1867380/+subscriptions