
yahoo-eng-team team mailing list archive

[Bug 1936720] [NEW] new instance gets stuck indefinitely at build state with task_state none

 

Public bug reported:

Description
===========
The nova-compute service reports as up but does not actually do any work.

New instances that get scheduled on that compute node get stuck in the BUILD state with task_state None,
and they never go to ERROR, even after the instance build timeout threshold is reached.

(openstack) server show 9299bee1-633d-4233-9f2b-9a7d1871d51b
+-------------------------------------+------------------------------------------------+
| Field                               | Value                                          |
+-------------------------------------+------------------------------------------------+
| OS-DCF:diskConfig                   | AUTO                                           |
| OS-EXT-AZ:availability_zone         | nova                                           |
| OS-EXT-SRV-ATTR:host                | None                                           |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None                                           |
| OS-EXT-SRV-ATTR:instance_name       | instance-0000bfb6                              |
| OS-EXT-STS:power_state              | NOSTATE                                        |
| OS-EXT-STS:task_state               | None                                           |
| OS-EXT-STS:vm_state                 | building                                       |
| OS-SRV-USG:launched_at              | None                                           |
| OS-SRV-USG:terminated_at            | None                                           |
| accessIPv4                          |                                                |
| accessIPv6                          |                                                |
| addresses                           |                                                |
| config_drive                        |                                                |
| created                             | 2021-07-17T11:49:35Z                           |
| flavor                              | i1.mini (75253a8f-eb7c-4473-9874-884a01a524a7) |
| hostId                              |                                                |
| id                                  | 9299bee1-633d-4233-9f2b-9a7d1871d51b           |
| image                               |                                                |
| key_name                            | Sia-KP                                         |
| name                                | qwerty-17                                      |
| progress                            | 0                                              |
| project_id                          | c4a93f6c1c194bf78bd98ee0f4d51978               |
| properties                          |                                                |
| status                              | BUILD                                          |
| updated                             | 2021-07-17T11:49:41Z                           |
| user_id                             | 042131e0784b46218521eee7963022bf               |
| volumes_attached                    |                                                |
+-------------------------------------+------------------------------------------------+
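To spot any other instances wedged in the same state, something like the following can be used (a sketch only; flags from the standard python-openstackclient):

(openstack) server list --all-projects --status BUILD --long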


I have two OpenStack setups (staging and production). This issue happens on both of them, but randomly on
different compute nodes. Both setups run stable/ussuri and are deployed with openstack-ansible.

There were no errors in the nova logs. After enabling debug on the nova services, I noticed that on the affected
compute node the logs simply stop some time before the problem occurs.

Compute service list while the issue is happening (SHN-CP-12 is the affected
compute node):

(openstack) compute service list
+-----+----------------+---------------------------------------+----------+---------+-------+----------------------------+
|  ID | Binary         | Host                                  | Zone     | Status  | State | Updated At                 |
+-----+----------------+---------------------------------------+----------+---------+-------+----------------------------+
|   7 | nova-conductor | SHN-CN-61-nova-api-container-b11ef08e | internal | enabled | up    | 2021-07-17T14:23:45.000000 |
|  34 | nova-scheduler | SHN-CN-61-nova-api-container-b11ef08e | internal | enabled | up    | 2021-07-17T14:23:43.000000 |
|  85 | nova-conductor | SHN-CN-63-nova-api-container-e4f37374 | internal | enabled | up    | 2021-07-17T14:23:41.000000 |
|  91 | nova-conductor | SHN-CN-62-nova-api-container-71ffd912 | internal | enabled | up    | 2021-07-17T14:23:45.000000 |
| 109 | nova-scheduler | SHN-CN-63-nova-api-container-e4f37374 | internal | enabled | up    | 2021-07-17T14:23:41.000000 |
| 157 | nova-scheduler | SHN-CN-62-nova-api-container-71ffd912 | internal | enabled | up    | 2021-07-17T14:23:45.000000 |
| 199 | nova-compute   | SHN-CP-72                             | nova     | enabled | up    | 2021-07-17T14:23:41.000000 |
.
.
.
| 232 | nova-compute   | SHN-CP-18                             | nova     | enabled | up    | 2021-07-17T14:23:41.000000 |
| 235 | nova-compute   | SHN-CP-12                             | nova     | enabled | up    | 2021-07-17T14:23:41.000000 |
| 238 | nova-compute   | SHN-CP-20                             | nova     | enabled | up    | 2021-07-17T14:23:41.000000 |
| 241 | nova-compute   | SHN-CP-22                             | nova     | enabled | up    | 2021-07-17T14:23:41.000000 |
+-----+----------------+---------------------------------------+----------+---------+-------+----------------------------+


Restarting nova-compute resolves the issue until it happens again.
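
Concretely, the workaround is roughly the following (a sketch; the unit name matches the journalctl output further below, adjust it if your deployment names the service differently):

# systemctl restart nova-compute.service
(openstack) compute service list --service nova-compute --host SHN-CP-12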

Steps to reproduce
==================
- The issue does not happen on every build; it occurs only sometimes.
- Create multiple instances to increase the probability of hitting it, for example with a multi-create command like the one sketched below.
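
A minimal sketch of such a multi-create (the image and network names are placeholders, not my actual values; the flavor and key name are the ones from the server output above):

(openstack) server create --flavor i1.mini --key-name Sia-KP --image <your-image> --network <your-network> --min 20 --max 20 qwerty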

Expected result
===============
Either the nova-compute service goes to the down state, or the instance goes to the ERROR state, or at least some warning or error appears in the nova logs.

Actual result
=============
Instances that get scheduled on the affected compute node (and which node gets affected appears to be random) get stuck
indefinitely in the BUILD state with task_state None.

Environment
===========
OSA deployment of stable/ussuri on Ubuntu, with install_method=source.
This problem started after I separated the RPC RabbitMQ cluster from the notification RabbitMQ cluster (not sure if this
is related, but that is when it started happening).
It is also worth mentioning that the issue happens on both of my setups.
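
For reference, the RPC/notification split boils down to the two transport_url settings in nova.conf looking roughly like this (hostnames, credentials and vhost are placeholders, not my actual values):

[DEFAULT]
transport_url = rabbit://nova:<password>@<rpc-rabbit-cluster>:5672/<vhost>

[oslo_messaging_notifications]
transport_url = rabbit://nova:<password>@<notify-rabbit-cluster>:5672/<vhost>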

Logs & Configs
==============
This is the log just before the nova-compute service stops logging:
https://paste.opendev.org/show/807547/

This is the nova-compute log when the instance gets scheduled on the node:

# journalctl -u nova-compute.service --since '2021-07-17 11:49:00' --until '2021-07-17 12:00:00' --no-pager
-- Logs begin at Mon 2021-05-31 04:36:00 UTC, end at Sat 2021-07-17 16:23:38 UTC. --
Jul 17 11:49:41 SHN-CP-12 nova-compute[3857910]: 2021-07-17 11:49:41.086 3857910 DEBUG oslo_concurrency.lockutils [req-05e8f6c5-ee92-4399-8bad-1184dc45214f 042131e0784b46218521eee7963022bf c4a93f6c1c194bf78bd98ee0f4d51978 - default default] Lock "6a148ea3-6793-4e26-acb2-9dd1214a666d" acquired by "nova.compute.manager.ComputeManager.build_and_run_instance.<locals>._locked_do_build_and_run_instance" :: waited 0.000s inner /openstack/venvs/nova-0.1.0.dev6887/lib/python3.8/site-packages/oslo_concurrency/lockutils.py:354
Jul 17 11:49:41 SHN-CP-12 nova-compute[3857910]: 2021-07-17 11:49:41.105 3857910 DEBUG nova.compute.manager [req-05e8f6c5-ee92-4399-8bad-1184dc45214f 042131e0784b46218521eee7963022bf c4a93f6c1c194bf78bd98ee0f4d51978 - default default] [instance: 6a148ea3-6793-4e26-acb2-9dd1214a666d] Starting instance... _do_build_and_run_instance /openstack/venvs/nova-0.1.0.dev6887/lib/python3.8/site-packages/nova/compute/manager.py:2173
Jul 17 11:49:41 SHN-CP-12 nova-compute[3857910]: 2021-07-17 11:49:41.711 3857910 DEBUG oslo_concurrency.lockutils [req-05e8f6c5-ee92-4399-8bad-1184dc45214f 042131e0784b46218521eee7963022bf c4a93f6c1c194bf78bd98ee0f4d51978 - default default] Lock "9299bee1-633d-4233-9f2b-9a7d1871d51b" acquired by "nova.compute.manager.ComputeManager.build_and_run_instance.<locals>._locked_do_build_and_run_instance" :: waited 0.000s inner /openstack/venvs/nova-0.1.0.dev6887/lib/python3.8/site-packages/oslo_concurrency/lockutils.py:354
Jul 17 11:49:41 SHN-CP-12 nova-compute[3857910]: 2021-07-17 11:49:41.731 3857910 DEBUG nova.compute.manager [req-05e8f6c5-ee92-4399-8bad-1184dc45214f 042131e0784b46218521eee7963022bf c4a93f6c1c194bf78bd98ee0f4d51978 - default default] [instance: 9299bee1-633d-4233-9f2b-9a7d1871d51b] Starting instance... _do_build_and_run_instance /openstack/venvs/nova-0.1.0.dev6887/lib/python3.8/site-packages/nova/compute/manager.py:2173

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1936720

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1936720/+subscriptions


