[Bug 1867380] Re: nova-live-migration and nova-grenade-multinode fail due to n-cpu restarting slowly after being reconfigured for ceph
Reviewed: https://review.opendev.org/713035
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e23c3c2c8df3843c5853c87ef684bd21c4af95d8
Submitter: Zuul
Branch: master
commit e23c3c2c8df3843c5853c87ef684bd21c4af95d8
Author: Lee Yarwood <lyarwood@xxxxxxxxxx>
Date: Fri Mar 13 16:51:01 2020 +0000
nova-live-migration: Wait for n-cpu services to come up after configuring Ceph
Previously the ceph.sh script used during the nova-live-migration job
would only grep for a `compute` process when checking if the services
had been restarted. This check was bogus and would always return 0 as it
would always match itself. For example:
2020-03-13 21:06:47.682073 | primary | 2020-03-13 21:06:47.681 | root 29529 0.0 0.0 4500 736 pts/0 S+ 21:06 0:00 /bin/sh -c ps aux | grep compute
2020-03-13 21:06:47.683964 | primary | 2020-03-13 21:06:47.683 | root 29531 0.0 0.0 14616 944 pts/0 S+ 21:06 0:00 grep compute
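(A minimal shell illustration of the problem, not the literal ceph.sh code; the `nova-compute` pattern below is used for clarity.)
    # Old check: the pipeline always "finds" a compute process because the
    # grep command itself appears in the ps output, so the exit status is 0
    # even when no nova-compute process is running at all.
    ps aux | grep compute
    # pgrep never matches its own process, so it only exits 0 when a real
    # nova-compute process exists.
    pgrep -f nova-compute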
Failures of this job were seen on the stable/pike branch, where slower CI
nodes appeared to struggle to allow Libvirt to report to n-cpu in time
before Tempest was started. This in turn caused instance build failures
and the overall failure of the job.
This change resolves this issue by switching to pgrep and ensuring
n-cpu services are reported as fully up after a cold restart before
starting the Tempest test run.
Closes-Bug: 1867380
Change-Id: Icd7ab2ca4ddbed92c7e883a63a23245920d961e7
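(For illustration only, not the exact ceph.sh change: a hedged sketch of waiting for the restarted nova-compute services to report as up before kicking off Tempest, assuming admin credentials are sourced and the openstack CLI is available.)
    # Hypothetical sketch: poll the compute service records until every
    # nova-compute service reports State == up, or give up after ~2 minutes.
    for _ in $(seq 1 24); do
        down=$(openstack compute service list --service nova-compute -f value -c State | grep -cv up)
        [ "$down" -eq 0 ] && break
        sleep 5
    done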
** Changed in: nova
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1867380
Title:
nova-live-migration and nova-grenade-multinode fail due to n-cpu
restarting slowly after being reconfigured for ceph
Status in OpenStack Compute (nova):
Fix Released
Bug description:
Description
===========
As per $subject, it appears the current check of using grep to find active
n-cpu processes isn't enough; we actually need to wait for the
services to report as UP before starting to run Tempest.
In the following we can see Tempest starting at 2020-03-13
13:01:19.528 while n-cpu within the instance isn't marked as UP for
another ~20 seconds:
https://zuul.opendev.org/t/openstack/build/5c213f869f324b69a423a983034d4539/log/job-output.txt#6305
https://zuul.opendev.org/t/openstack/build/5c213f869f324b69a423a983034d4539/log/logs/screen-n-cpu.txt#3825
https://zuul.opendev.org/t/openstack/build/5c213f869f324b69a423a983034d4539/log/logs/subnode-2/screen-n-cpu.txt#3534
I've only seen this on stable/pike at present but it could potentially
hit all branches with slow enough CI nodes.
Steps to reproduce
==================
Run nova-live-migration on slow CI nodes.
Expected result
===============
nova/tests/live_migration/hooks/ceph.sh waits until hosts are marked as UP before running Tempest.
Actual result
=============
nova/tests/live_migration/hooks/ceph.sh checks for running n-cpu processes and then immediately starts Tempest.
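(An illustrative shell sketch of the two behaviours, assuming the hook restarts the DevStack systemd unit devstack@n-cpu; the real hook code differs in its details.)
    # Actual: restart, run the self-matching process check, start Tempest right away.
    sudo systemctl restart devstack@n-cpu
    ps aux | grep compute    # exits 0 regardless, because it matches its own grep
    # Expected: additionally block until the restarted services re-register as
    # UP, e.g. with a polling loop like the one sketched earlier in this message.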
Environment
===========
1. Exact version of OpenStack you are running. See the following
   list for all releases: http://docs.openstack.org/releases/
   stable/pike, but it could be present on other branches with slow
   enough CI nodes.
2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?
   Libvirt / KVM.
3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?
   N/A
4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)
   N/A
Logs & Configs
==============
Mar 13 13:01:39.170201 ubuntu-xenial-rax-iad-0015199005 nova-compute[30153]: DEBUG nova.compute.manager [None req-beafe617-34df-4bec-9ff6-4a0b7bebb15f None None] [instance: 74932102-3737-4f8f-9002-763b2d580c3a] Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=30153) _error_out_instances_whose_build_was_interrupted /opt/stack/new/nova/nova/compute/manager.py:1323}}
Mar 13 13:01:39.255008 ubuntu-xenial-rax-iad-0015199005 nova-compute[30153]: DEBUG nova.compute.manager [None req-beafe617-34df-4bec-9ff6-4a0b7bebb15f None None] [instance: 042afab0-fbef-4506-84e2-1f54cb9d67ca] Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=30153) _error_out_instances_whose_build_was_interrupted /opt/stack/new/nova/nova/compute/manager.py:1323}}
Mar 13 13:01:39.322508 ubuntu-xenial-rax-iad-0015199005 nova-compute[30153]: DEBUG nova.compute.manager [None req-beafe617-34df-4bec-9ff6-4a0b7bebb15f None None] [instance: cc293f53-7428-4e66-9841-20cce219e24f] Instance spawn was interrupted before instance_claim, setting instance to ERROR state {{(pid=30153) _error_out_instances_whose_build_was_interrupted /opt/stack/new/nova/nova/compute/manager.py:1323}}
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1867380/+subscriptions