← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 2046332] Re: When the host is restarted, nova-compute startup gets stuck

 

[Expired for OpenStack Compute (nova) because there has been no activity
for 60 days.]

** Changed in: nova
       Status: Incomplete => Expired

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2046332

Title:
  When the host is restarted, nova-compute startup gets stuck

Status in OpenStack Compute (nova):
  Expired

Bug description:
  Description:
  After a host machine reboot, the nova-compute service fails to start up properly. Upon investigation, it was found that the nova-compute service gets stuck during the startup process, specifically during the createWithFlags call. This results in the nova-compute service remaining in a non-operational state.

  Environment:
  OpenStack version: Wallaby
  libvirt version: 6.0.0
  QEMU version: 4.2.0
  Backend storage: iSCSI
  Instance using iSCSI commercial storage

  Observations:
  The nova-compute.log shows that during the startup process, the createWithFlags call keeps blocking, preventing the nova-compute service from completing its startup routine.
  The libvirt.log indicates multiple warnings and errors related to the qemuDomainObjBeginJobInternal function, resulting in the inability to acquire state change locks.

  Expected Behavior:
  The nova-compute service should successfully start up after a host machine reboot, allowing it to manage instances without getting stuck during the startup process.

  Steps to Reproduce:
  Reboot the host machine.
  Constructing the createWithFlags interface has been blocking and not returning to the client because of communication exceptions with qemu.
  Monitor the nova-compute service startup process.

  Impact:
  The non-operational state of the nova-compute service affects the overall availability of instances and their management.

  Additional Information:
  The nova-compute service, during startup, initiates a series of operations including defineXML, createWithFlags, and lookupByUUIDString in order to bring up instances. However, the createWithFlags call remains stuck, leading to the service being unable to complete the startup process.

  Logs:
  nova-compute.log
  2023-12-04 14:13:43.582 +0800 ¦ node-12 ¦ nova-compute-2plv4 ¦ nova-compute ¦ 2023-12-04T14:13:43.582449438+08:00 stdout F 2023-12-04 14:13:43.582 32029 INFO os_brick.initiator.linuxscsi [req-717433ee-a291-45ca-81ee-05150953b496 - - - - -] Scanning host 9 c: -, t: -, l: 10)^[[00m
  2023-12-04 14:13:43.664 +0800 ¦ node-12 ¦ nova-compute-2plv4 ¦ nova-compute ¦ 2023-12-04T14:13:43.66419754+08:00 stdout F 2023-12-04 14:13:43.663 32029 INFO os_brick.initiator.linuxscsi [req-717433ee-a291-45ca-81ee-05150953b496 - - - - -] Searching for a device in session 2 and hctl ('3', '0', '0', 10) yield: sdn^[[00m
  2023-12-04 14:13:43.668 +0800 ¦ node-12 ¦ nova-compute-2plv4 ¦ nova-compute ¦ 2023-12-04T14:13:43.668415615+08:00 stdout F 2023-12-04 14:13:43.668 32029 INFO os_brick.initiator.linuxscsi [req-717433ee-a291-45ca-81ee-05150953b496 - - - - -] Searching for a device in session 3 and hctl ['4', '*', '*', 10] yield: sdo^[[00m
  2023-12-04 14:13:43.741 +0800 ¦ node-12 ¦ nova-compute-2plv4 ¦ nova-compute ¦ 2023-12-04T14:13:43.74138962+08:00 stdout F 2023-12-04 14:13:43.740 32029 INFO os_brick.initiator.linuxscsi [req-717433ee-a291-45ca-81ee-05150953b496 - - - - -] Searching for a device in session 1 and hctl ('2', '0', '0', 10) yield: sdp^[[00m
  2023-12-04 14:13:43.743 +0800 ¦ node-12 ¦ nova-compute-2plv4 ¦ nova-compute ¦ 2023-12-04T14:13:43.743081142+08:00 stdout F 2023-12-04 14:13:43.742 32029 INFO os_brick.initiator.linuxscsi [req-717433ee-a291-45ca-81ee-05150953b496 - - - - -] Searching for a device in session 6 and hctl ['7', '*', '*', 10] yield: sdq^[[00m
  2023-12-04 14:13:43.744 +0800 ¦ node-12 ¦ nova-compute-2plv4 ¦ nova-compute ¦ 2023-12-04T14:13:43.744370001+08:00 stdout F 2023-12-04 14:13:43.744 32029 INFO os_brick.initiator.linuxscsi [req-717433ee-a291-45ca-81ee-05150953b496 - - - - -] Searching for a device in session 8 and hctl ['9', '*', '*', 10] yield: sdr^[[00m
  2023-12-04 14:13:43.744 +0800 ¦ node-12 ¦ nova-compute-2plv4 ¦ nova-compute ¦ 2023-12-04T14:13:43.744783448+08:00 stdout F 2023-12-04 14:13:43.744 32029 INFO os_brick.initiator.linuxscsi [req-717433ee-a291-45ca-81ee-05150953b496 - - - - -] Searching for a device in session 5 and hctl ('6', '0', '0', 10) yield: sdt^[[00m
  2023-12-04 14:13:43.745 +0800 ¦ node-12 ¦ nova-compute-2plv4 ¦ nova-compute ¦ 2023-12-04T14:13:43.745437018+08:00 stdout F 2023-12-04 14:13:43.745 32029 INFO os_brick.initiator.linuxscsi [req-717433ee-a291-45ca-81ee-05150953b496 - - - - -] Searching for a device in session 7 and hctl ['8', '*', '*', 10] yield: sds^[[00m
  2023-12-04 14:13:44.226 +0800 ¦ node-12 ¦ nova-compute-2plv4 ¦ nova-compute ¦ 2023-12-04T14:13:44.226460766+08:00 stdout F 2023-12-04 14:13:44.225 32029 INFO os_brick.initiator.linuxscsi [req-717433ee-a291-45ca-81ee-05150953b496 - - - - -] Searching for a device in session 4 and hctl ['5', '*', '*', 10] yield: sdu^[[00m
  2023-12-04 14:13:44.482 +0800 ¦ node-12 ¦ nova-compute-2plv4 ¦ nova-compute ¦ 2023-12-04T14:13:44.482754517+08:00 stdout F 2023-12-04 14:13:44.482 32029 INFO os_brick.initiator.connectors.iscsi [req-717433ee-a291-45ca-81ee-05150953b496 - - - - -] All connection threads finished, giving 10 seconds for dm to appear.^[[00m
  2023-12-04 14:13:45.484 +0800 ¦ node-12 ¦ nova-compute-2plv4 ¦ nova-compute ¦ 2023-12-04T14:13:45.484445914+08:00 stdout F 2023-12-04 14:13:45.483 32029 INFO os_brick.initiator.connectors.iscsi [req-717433ee-a291-45ca-81ee-05150953b496 - - - - -] Result for connection is {'type': 'block', 'scsi_wwn': '36001405acbd43e79b732d343028329ed', 'path': '/dev/dm-3', 'multipath_id': '36001405acbd43e79b732d343028329ed'}.^[[00m
  2023-12-04 14:13:45.514 +0800 ¦ node-12 ¦ nova-compute-2plv4 ¦ nova-compute ¦ 2023-12-04T14:13:45.514589463+08:00 stdout F 2023-12-04 14:13:45.514 32029 INFO os_vif [req-717433ee-a291-45ca-81ee-05150953b496 - - - - -] Successfully plugged vif VIFOpenVSwitch(active=True,address=fa:16:3e:e2:3c:03,bridge_name='br-int',has_traffic_filtering=True,id=ff32db4b-0a4a-4f89-bc17-eb261d55ef67,network=Network(d8955418-a63d-4f22-86cc-037e0cd7a291),plugin='ovs',port_profile=VIFPortProfileOpenVSwitch,preserve_on_delete=False,vif_name='tapff32db4b-0a')^[[00m
  2023-12-04 15:09:28.512 +0800 ¦ node-12 ¦ nova-compute-g7pf8 ¦ nova-compute ¦ 2023-12-04T15:09:28.51286769+08:00 stderr F ++ hostname -s
  2023-12-04 15:09:28.513 +0800 ¦ node-12 ¦ nova-compute-g7pf8 ¦ nova-compute ¦ 2023-12-04T15:09:28.5139163+08:00 stderr F + host_k8s=node-12
  2023-12-04 15:09:28.514 +0800 ¦ node-12 ¦ nova-compute-g7pf8 ¦ nova-compute ¦ 2023-12-04T15:09:28.514304455+08:00 stderr F + grep hostha=enabled

  libvirt.log
  2023-12-04 14:13:35.707 +0800 ¦ node-12 ¦ libvirt-lbr6f ¦ libvirt ¦ 2023-12-04T14:13:35.707262365+08:00 stderr F 2023-12-04 06:13:35.707+0000: 14541: info : hostname: node-12.domain.tld
  2023-12-04 14:13:35.707 +0800 ¦ node-12 ¦ libvirt-lbr6f ¦ libvirt ¦ 2023-12-04T14:13:35.707265536+08:00 stderr F 2023-12-04 06:13:35.707+0000: 14541: warning : qemuDomainDefValidate:5966 : CPU topology doesn't match numa CPU count; partial NUMA mapping is obsoleted and will be removed in future
  2023-12-04 14:13:35.709 +0800 ¦ node-12 ¦ libvirt-lbr6f ¦ libvirt ¦ 2023-12-04T14:13:35.709217034+08:00 stderr F 2023-12-04 06:13:35.709+0000: 14540: warning : qemuDomainDefValidate:5966 : CPU topology doesn't match numa CPU count; partial NUMA mapping is obsoleted and will be removed in future
  2023-12-04 14:13:45.528 +0800 ¦ node-12 ¦ libvirt-lbr6f ¦ libvirt ¦ 2023-12-04T14:13:45.528828631+08:00 stderr F 2023-12-04 06:13:45.528+0000: 14543: warning : qemuDomainDefValidate:5966 : CPU topology doesn't match numa CPU count; partial NUMA mapping is obsoleted and will be removed in future
  2023-12-04 14:13:45.531 +0800 ¦ node-12 ¦ libvirt-lbr6f ¦ libvirt ¦ 2023-12-04T14:13:45.531482963+08:00 stderr F 2023-12-04 06:13:45.531+0000: 14544: warning : qemuDomainDefValidate:5966 : CPU topology doesn't match numa CPU count; partial NUMA mapping is obsoleted and will be removed in future
  2023-12-04 14:18:44.472 +0800 ¦ node-12 ¦ libvirt-lbr6f ¦ libvirt ¦ 2023-12-04T14:18:44.472148283+08:00 stderr F 2023-12-04 06:18:44.472+0000: 14543: warning : qemuDomainObjBeginJobInternal:9905 : Cannot start job (query, none, none) for domain instance-0000019c; current job is (async nested, none, start) owned by (14544 remoteDispatchDomainCreateWithFlags, 0 <null>, 14544 remoteDispatchDomainCreateWithFlags (flags=0x0)) for (298s, 0s, 298s)
  2023-12-04 14:18:44.472 +0800 ¦ node-12 ¦ libvirt-lbr6f ¦ libvirt ¦ 2023-12-04T14:18:44.472176398+08:00 stderr F 2023-12-04 06:18:44.472+0000: 14543: error : qemuDomainObjBeginJobInternal:9927 : Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainCreateWithFlags)
  2023-12-04 14:19:14.490 +0800 ¦ node-12 ¦ libvirt-lbr6f ¦ libvirt ¦ 2023-12-04T14:19:14.490144202+08:00 stderr F 2023-12-04 06:19:14.490+0000: 14540: warning : qemuDomainObjBeginJobInternal:9905 : Cannot start job (query, none, none) for domain instance-0000019c; current job is (async nested, none, start) owned by (14544 remoteDispatchDomainCreateWithFlags, 0 <null>, 14544 remoteDispatchDomainCreateWithFlags (flags=0x0)) for (328s, 0s, 328s)
  2023-12-04 14:19:14.490 +0800 ¦ node-12 ¦ libvirt-lbr6f ¦ libvirt ¦ 2023-12-04T14:19:14.490163367+08:00 stderr F 2023-12-04 06:19:14.490+0000: 14540: error : qemuDomainObjBeginJobInternal:9927 : Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainCreateWithFlags)
  2023-12-04 14:19:44.542 +0800 ¦ node-12 ¦ libvirt-lbr6f ¦ libvirt ¦ 2023-12-04T14:19:44.542135682+08:00 stderr F 2023-12-04 06:19:44.542+0000: 14542: warning : qemuDomainObjBeginJobInternal:9905 : Cannot start job (query, none, none) for domain instance-0000019c; current job is (async nested, none, start) owned by (14544 remoteDispatchDomainCreateWithFlags, 0 <null>, 14544 remoteDispatchDomainCreateWithFlags (flags=0x0)) for (358s, 0s, 359s)
  2023-12-04 14:19:44.542 +0800 ¦ node-12 ¦ libvirt-lbr6f ¦ libvirt ¦ 2023-12-04T14:19:44.542158518+08:00 stderr F 2023-12-04 06:19:44.542+0000: 14542: error : qemuDomainObjBeginJobInternal:9927 : Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainCreateWithFlags)
  2023-12-04 14:20:14.590 +0800 ¦ node-12 ¦ libvirt-lbr6f ¦ libvirt ¦ 2023-12-04T14:20:14.590167336+08:00 stderr F 2023-12-04 06:20:14.590+0000: 14541: warning : qemuDomainObjBeginJobInternal:9905 : Cannot start job (query, none, none) for domain instance-0000019c; current job is (async nested, none, start) owned by (14544 remoteDispatchDomainCreateWithFlags, 0 <null>, 14544 remoteDispatchDomainCreateWithFlags (flags=0x0)) for (388s, 0s, 389s)
  2023-12-04 14:20:14.590 +0800 ¦ node-12 ¦ libvirt-lbr6f ¦ libvirt ¦ 2023-12-04T14:20:14.590188481+08:00 stderr F 2023-12-04 06:20:14.590+0000: 14541: error : qemuDomainObjBeginJobInternal:9927 : Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainCreateWithFlags)
  2023-12-04 14:20:44.628 +0800 ¦ node-12 ¦ libvirt-lbr6f ¦ libvirt ¦ 2023-12-04T14:20:44.62812621+08:00 stderr F 2023-12-04 06:20:44.628+0000: 14543: warning : qemuDomainObjBeginJobInternal:9905 : Cannot start job (query, none, none) for domain instance-0000019c; current job is (async nested, none, start) owned by (14544 remoteDispatchDomainCreateWithFlags, 0 <null>, 14544 remoteDispatchDomainCreateWithFlags (flags=0x0)) for (419s, 0s, 419s)
  2023-12-04 14:20:44.628 +0800 ¦ node-12 ¦ libvirt-lbr6f ¦ libvirt ¦ 2023-12-04T14:20:44.628149564+08:00 stderr F 2023-12-04 06:20:44.628+0000: 14543: error : qemuDomainObjBeginJobInternal:9927 : Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainCreateWithFlags)
  2023-12-04 15:09:01.199 +0800 ¦ node-12 ¦ libvirt-lbr6f ¦ libvirt ¦ 2023-12-04T15:09:01.199162919+08:00 stderr F 2023-12-04 07:09:01.199+0000: 14543: warning : qemuDomainObjBeginJobInternal:9905 : Cannot start job (query, none, none) for domain instance-0000019c; current job is (async nested, none, start) owned by (14544 remoteDispatchDomainCreateWithFlags, 0 <null>, 14544 remoteDispatchDomainCreateWithFlags (flags=0x0)) for (3315s, 0s, 3315s)
  ......
  2023-12-04 15:09:01.199 +0800 ¦ node-12 ¦ libvirt-lbr6f ¦ libvirt ¦ 2023-12-04T15:09:01.199191359+08:00 stderr F 2023-12-04 07:09:01.199+0000: 14543: error : qemuDomainObjBeginJobInternal:9927 : Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainCreateWithFlags)

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2046332/+subscriptions



References