yahoo-eng-team team mailing list archive

[Bug 1800204] Re: n-cpu.service consuming 100% of CPU indeterminately

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Artom Lifshitz <notartom@xxxxxxxxx>
Date: Thu, 23 Apr 2020 18:51:18 -0000
Reply-to: Bug 1800204 <1800204@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

I'm going to say the same thing as bug 1801733 - this is super nifty and
interesting, but realistically is not a concern and will most likely
never get addressed.

** Changed in: nova
       Status: New => Won't Fix

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1800204

Title:
  n-cpu.service consuming 100% of CPU indeterminately

Status in OpenStack Compute (nova):
  Won't Fix

Bug description:
  Description
  ==============
  I used fault injection to assess the robustness of nova-conductor. Injecting a specific sequence of faults at the nova-conductor interface produced a failure that threatens the robustness of the system: afterwards, nova-compute can no longer provision new instances.

  Steps to reproduce
  =====================
  The bug reproduced in 10 out of 10 attempts. I used devstack/queens.

  The workload consists of the following steps:
  1) Create a VM with the following flavor: 64MB RAM, 1 VCPU, 0 DISK, and a reference image (for instance, 'cirros.0.3.4'); all other settings can be the defaults of the admin account;
  2) Rebuild with an alternative image: for instance, 'cirros 0.4.0';
  3) Rebuild with the reference image again;
  4) Shelve the instance;
  5) Delete the instance.

  Below, I describe the faultload. Each time a fault is injected, the workload is executed from the beginning. The steps are:
  1) Intercept the first RPC message (i.e. AMQP) that calls for 'schedule_and_build_instances';
  2) Inject the 'fault' in 'schedule_and_build_instances.args.build_requests->'nova_object.data'.instance.'nova_object.data'.flavor.'nova_object.data'.vcpus'
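
  The injection step above can be sketched in Python as follows. This is only an illustration under stated assumptions: the message layout is a heavily simplified, hypothetical stand-in for the serialized RPC payload (build_requests is modelled as a plain list), and inject_fault, sample_message and VCPUS_PATH are names invented for this sketch, not part of nova or of the tool used in this report.

```python
import copy


def inject_fault(message, path, value):
    """Return a deep copy of `message` with the field at `path` set to `value`.

    `path` is a list of dict keys and list indexes leading to the target field.
    The original message is left untouched.
    """
    mutated = copy.deepcopy(message)
    node = mutated
    for key in path[:-1]:
        node = node[key]
    node[path[-1]] = value
    return mutated


# Hypothetical, simplified payload; a real intercepted message carries many
# more fields and may nest build_requests differently.
sample_message = {
    "method": "schedule_and_build_instances",
    "args": {
        "build_requests": [
            {"nova_object.data": {
                "instance": {"nova_object.data": {
                    "flavor": {"nova_object.data": {"vcpus": 1}},
                }},
            }},
        ],
    },
}

# Path to the vcpus field, following the dotted path given in step 2.
VCPUS_PATH = [
    "args", "build_requests", 0,
    "nova_object.data", "instance",
    "nova_object.data", "flavor",
    "nova_object.data", "vcpus",
]

faulty = inject_fault(sample_message, VCPUS_PATH, "10000000000000000000000")
```

  Working on a deep copy keeps the intercepted message intact, so the same baseline can be reused for each fault value.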

  The pseudo-algorithm:
  1. execute workload
  2. for each fault in ['2', '-10000000000000000000001', '10000000000000000000000']
  2.1.   execute workload in parallel with faultload(fault)
  3. see the CPU activity for the process n-cpu.service of devstack
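
  As a rough sketch, the pseudo-algorithm above could be driven from Python like this. The callables run_workload, run_faultload and check_cpu are hypothetical placeholders that the caller must supply (for example, wrappers around the OpenStack CLI and a process monitor); only the fault values come from this report.

```python
from concurrent.futures import ThreadPoolExecutor

# Fault values taken from the pseudo-algorithm in this report.
FAULTS = ["2", "-10000000000000000000001", "10000000000000000000000"]


def run_campaign(run_workload, run_faultload, check_cpu, faults=FAULTS):
    """Run the fault injection campaign and return whatever check_cpu reports.

    run_workload():       executes the create/rebuild/shelve/delete workload.
    run_faultload(fault): intercepts the RPC message and injects `fault`.
    check_cpu():          inspects the CPU activity of n-cpu.service.
    All three are assumptions of this sketch, supplied by the caller.
    """
    # Step 1: one fault-free run of the workload.
    run_workload()

    # Step 2: for each fault, run the workload in parallel with the faultload.
    for fault in faults:
        with ThreadPoolExecutor(max_workers=2) as pool:
            workload_future = pool.submit(run_workload)
            faultload_future = pool.submit(run_faultload, fault)
            # result() re-raises any exception from the worker threads.
            workload_future.result()
            faultload_future.result()

    # Step 3: observe the CPU activity after all faults were injected.
    return check_cpu()
```

  Passing the three steps in as callables keeps the driver independent of how the workload and the interception are actually implemented.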

  Expected result
  ==================
  nova-compute handles the faults without impacting future requests.

  Actual result
  ================
  nova-compute consumes 100% of CPU and new instances are set to the 'error' state without any clue about the cause, so it is not possible to create new instances without restarting n-cpu.service.

  Environment
  ==============
  Devstack/Queens on a single machine with defaults.

  Logs & Configs
  =================
  Logs attached.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1800204/+subscriptions

References

[Bug 1800204] [NEW] n-cpu.service consuming 100% of CPU indeterminately
From: Wallace Cardoso, 2018-10-26