yahoo-eng-team team mailing list archive

[Bug 1800204] [NEW] n-cpu.service consuming 100% of CPU indeterminately

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Wallace Cardoso <1800204@xxxxxxxxxxxxxxxxxx>
Date: Fri, 26 Oct 2018 18:44:30 -0000
Reply-to: Bug 1800204 <1800204@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Public bug reported:

Description
==============
I used fault injection to assess the robustness of nova-conductor. Injecting a specific sequence of faults produced a failure that threatens the robustness of the system: the faults, applied at the nova-conductor interface, prevent nova-compute from provisioning new instances.

Steps to reproduce
=====================
I reproduced this bug in 10 out of 10 attempts, using devstack/queens.

The workload consists of the following steps:
1) Create a VM with the following flavor: 64 MB RAM, 1 VCPU, 0 disk, and a reference image, for instance 'cirros.0.3.4'; all other settings can be the admin account defaults;
2) Rebuild with an alternative image, for instance 'cirros 0.4.0';
3) Rebuild with the reference image again;
4) Shelve the instance;
5) Delete the instance.

Below I describe the faultload. Each time a fault is injected, the workload is executed from the beginning. The steps are:
1) Intercept the first RPC message (i.e., over AMQP) that calls 'schedule_and_build_instances';
2) Inject the 'fault' in 'schedule_and_build_instances.args.build_requests->'nova_object.data'.instance.'nova_object.data'.flavor.'nova_object.data'.vcpus'
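
The injection step above can be sketched roughly as follows. This is only an illustration, not the tool used for this report: it assumes the intercepted RPC payload has already been decoded into a plain Python dict, that 'build_requests' is a list of serialized objects, and that the nesting follows the path quoted in step 2. The name inject_fault and the sample message are hypothetical.

```python
import copy

# Keys below each build request's 'nova_object.data', taken from the path in step 2.
VCPUS_PATH = ("instance", "nova_object.data", "flavor", "nova_object.data", "vcpus")


def inject_fault(message, fault):
    """Return a copy of message with flavor.vcpus replaced by fault.

    Assumption: message is an already-decoded dict; a real AMQP payload
    would first need to be unwrapped and JSON-decoded.
    """
    mutated = copy.deepcopy(message)
    for request in mutated["args"]["build_requests"]:
        node = request["nova_object.data"]
        # Walk down to the dict that holds 'vcpus'.
        for key in VCPUS_PATH[:-1]:
            node = node[key]
        node[VCPUS_PATH[-1]] = fault
    return mutated


# Hypothetical, heavily simplified message used only to exercise the function.
sample = {
    "method": "schedule_and_build_instances",
    "args": {
        "build_requests": [
            {"nova_object.data": {"instance": {"nova_object.data": {
                "flavor": {"nova_object.data": {"vcpus": 1}}}}}}
        ]
    },
}

mutated = inject_fault(sample, "10000000000000000000000")
```

The original message is left untouched, so the same intercepted message can be reused for each fault value.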

The pseudo-algorithm:
1. execute workload
2. for each fault in ['2', '-10000000000000000000001', '10000000000000000000000']
2.1.   execute workload in parallel with faultload(fault)
3. observe the CPU usage of the n-cpu.service process in devstack
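
The pseudo-algorithm could be driven by something like the sketch below. The three callables (run_workload, run_faultload, check_cpu) are hypothetical placeholders for the workload, the fault injector and the CPU observation; they are not part of nova or devstack. The fault values are kept as strings, as listed above.

```python
import threading

FAULTS = ["2", "-10000000000000000000001", "10000000000000000000000"]


def run_campaign(run_workload, run_faultload, check_cpu):
    """Run the fault-free workload once, then once per fault, then check CPU."""
    run_workload()  # step 1: baseline run without faults
    for fault in FAULTS:
        # step 2.1: the injector runs in parallel with the workload
        injector = threading.Thread(target=run_faultload, args=(fault,))
        injector.start()
        run_workload()
        injector.join()
    # step 3: caller-supplied observation of the n-cpu.service process
    return check_cpu()
```

Each injector thread is joined before the next fault is tried, so the faults are applied one at a time and in the listed order.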

Expected result
==================
nova-compute handles the faults without affecting future requests.

Actual result
================
nova-compute consumes 100% of CPU and new instances are set to the 'error' state with no indication of the cause, so it is not possible to create new instances without restarting n-cpu.service.

Environment
==============
Devstack/Queens on a single machine with default settings.

Logs & Configs
=================
Logs attached.

** Affects: nova
     Importance: Undecided
         Status: New

** Attachment added: "Logs from before to after applying the tests"
   https://bugs.launchpad.net/bugs/1800204/+attachment/5205956/+files/sys-100p-now.logs

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1800204

Title:
  n-cpu.service consuming 100% of CPU indeterminately

Status in OpenStack Compute (nova):
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1800204/+subscriptions
Follow ups

[Bug 1800204] Re: n-cpu.service consuming 100% of CPU indeterminately
From: Artom Lifshitz, 2020-04-23