yahoo-eng-team team mailing list archive

[Bug 2048837] [NEW] Concurrent deletion of instances leads to residual multipath

 

Public bug reported:

Description
===========
A 100G **iSCSI shared** volume was attached to 3 instances scheduled on the same node (node-2). I then deleted these 3 instances concurrently. The 3 instances were deleted, but the output of 'multipath -ll' showed stale, faulty paths as follows.

[root@node-2 ~]# multipath -ll
Jan 10 10:25:42 | sdj: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdl: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdk: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdn: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdi: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdo: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdm: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdp: prio = const (setting: emergency fallback - alua failed)
mpathaj (36001405acb21c8bbf33e1449b295c517) dm-2 ESSTOR,IBLOCK
size=100G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=0 status=enabled
| |- 24:0:0:39 sdj 8:144 failed faulty running
| |- 17:0:0:39 sdl 8:176 failed faulty running
| |- 22:0:0:39 sdk 8:160 failed faulty running
| `- 19:0:0:39 sdn 8:208 failed faulty running
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 23:0:0:39 sdi 8:128 failed faulty running
  |- 18:0:0:39 sdo 8:224 failed faulty running
  |- 21:0:0:39 sdm 8:192 failed faulty running
  `- 20:0:0:39 sdp 8:240 failed faulty running


Steps to reproduce
==================
1. Boot 3 instances using RBD as the root disk; the protocol type of the root disk does not matter for this step.
2. Create an iSCSI shared (multiattach) volume to use as the data disk; any commercial or other storage backend that speaks the iSCSI protocol will do.
3. Attach the shared volume to each of the 3 instances.
4. Make sure the volume is attached to all 3 instances successfully, then delete the instances concurrently (a CLI sketch follows this list).
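
A rough CLI sketch of these steps (hedged: the image, flavor, network and volume-type names are placeholders, and pinning the instances to node-2 via the 'zone:host' form assumes admin credentials):

# create a multiattach-capable volume type and a 100G shared iSCSI volume
openstack volume type create multiattach-iscsi
openstack volume type set --property multiattach="<is> True" multiattach-iscsi
openstack volume create --size 100 --type multiattach-iscsi shared-data

# boot 3 instances on the same compute node (node-2)
for i in 1 2 3; do
    openstack server create --image <rbd-image> --flavor <flavor> --network <net> \
        --availability-zone nova:node-2 --wait vm-$i
done

# attach the shared volume to each instance
for i in 1 2 3; do openstack server add volume vm-$i shared-data; done

# delete the 3 instances concurrently
for i in 1 2 3; do openstack server delete vm-$i & done; wait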

Expected result
===============
The 3 instances are deleted completely, and 'multipath -ll' shows no residual multipath devices.

Actual result
=============
The 3 instances were deleted, but the node was left with residual multipath devices, as shown in the output in the description above.

Environment
===========
1. Exact version of OpenStack you are running. See the following
   Wallaby Nova & Cinder, connected to a commercial iSCSI storage backend.

2. Which hypervisor did you use?
   Libvirt 8.0.0 + qemu-kvm 6.2.0


3. Which storage type did you use?
   RBD as the root disk, plus 1 shared iSCSI volume attached as a data disk to 3 instances scheduled on the same node.

4. Which networking type did you use?
   Omitted.

Logs & Configs
==============
According to the deletion code path, Nova does not disconnect a shared volume from the host when the volume is still attached to other instances on the same node; instead it logs 'Detected multiple connections on this host for volume'. When the 3 instances are deleted concurrently, each deletion still sees the volume attached to the other, not-yet-deleted instances, so all 3 deletions skip the target disconnect and the host's multipath devices are never cleaned up. node-2 nova-compute output:

2024-01-10 11:05:29.904 +0800 ¦ node-2 ¦ nova-compute-d94f6 ¦ nova-compute ¦ 2024-01-10T11:05:29.904196604+08:00 stdout F 2024-01-10 11:05:29.903 59580 INFO nova.virt.libvirt.driver [req-c9082d4c-457a-4859-a0be-c2c23953a17c fa0faf20c0e84275a5505eb6cb2673a8 793aac4869d643b19e60248715c3735b - default default] Detected multiple connections on this host for volume: f31b8fd2-1651-4667-af05-7364ac501cf9, skipping target disconnect.
2024-01-10 11:05:30.143 +0800 ¦ node-2 ¦ nova-compute-d94f6 ¦ nova-compute ¦ 2024-01-10T11:05:30.143536178+08:00 stdout F 2024-01-10 11:05:30.143 59580 INFO nova.virt.libvirt.driver [req-065c2b2b-ae16-453f-abb7-a5756ed87f3a fa0faf20c0e84275a5505eb6cb2673a8 793aac4869d643b19e60248715c3735b - default default] Detected multiple connections on this host for volume: f31b8fd2-1651-4667-af05-7364ac501cf9, skipping target disconnect.
2024-01-10 11:05:30.334 +0800 ¦ node-2 ¦ nova-compute-d94f6 ¦ nova-compute ¦ 2024-01-10T11:05:30.334997487+08:00 stdout F 2024-01-10 11:05:30.334 59580 INFO nova.virt.libvirt.driver [req-41afd565-599f-4b35-b4cb-acf074332079 fa0faf20c0e84275a5505eb6cb2673a8 793aac4869d643b19e60248715c3735b - default default] Detected multiple connections on this host for volume: f31b8fd2-1651-4667-af05-7364ac501cf9, skipping target disconnect.
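
Since the target disconnect was skipped, the iSCSI sessions and SCSI devices for this LUN are likely still present on the node; this can be cross-checked as follows (a diagnostic sketch; the target IQN and portals reported by iscsiadm are backend-specific):

# sessions that were never logged out, and the SCSI disks attached through them
iscsiadm -m session -P 3 | grep -E 'Target:|Attached scsi disk'
# the member devices of the stale map (sdi..sdp in the output below)
lsblk -S | grep -E '^sd[i-p]'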

And the 'multipath -ll' output on the node:
[root@node-2 ~]# multipath -ll
Jan 10 10:25:42 | sdj: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdl: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdk: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdn: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdi: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdo: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdm: prio = const (setting: emergency fallback - alua failed)
Jan 10 10:25:42 | sdp: prio = const (setting: emergency fallback - alua failed)
mpathaj (36001405acb21c8bbf33e1449b295c517) dm-2 ESSTOR,IBLOCK
size=100G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=0 status=enabled
| |- 24:0:0:39 sdj 8:144 failed faulty running
| |- 17:0:0:39 sdl 8:176 failed faulty running
| |- 22:0:0:39 sdk 8:160 failed faulty running
| `- 19:0:0:39 sdn 8:208 failed faulty running
`-+- policy='round-robin 0' prio=0 status=enabled
  |- 23:0:0:39 sdi 8:128 failed faulty running
  |- 18:0:0:39 sdo 8:224 failed faulty running
  |- 21:0:0:39 sdm 8:192 failed faulty running
  `- 20:0:0:39 sdp 8:240 failed faulty running
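
For completeness, the stale state currently has to be cleaned up by hand, roughly as follows (a hedged sketch: the map and device names come from the output above, <target-iqn> and <portal> are placeholders, and the logout step should only be run if no other volume on the node uses that target):

# flush the stale multipath map
multipath -f mpathaj
# remove the leftover SCSI path devices
for d in sdi sdj sdk sdl sdm sdn sdo sdp; do
    echo 1 > /sys/block/$d/device/delete
done
# log out of the iSCSI target if nothing else uses it
iscsiadm -m node -T <target-iqn> -p <portal> --logout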

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2048837

Title:
  Concurrent deletion of instances leads to residual multipath

Status in OpenStack Compute (nova):
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2048837/+subscriptions