yahoo-eng-team team mailing list archive

[Bug 1981814] [NEW] swap_volume: Maybe happen IO error or lose user data if the task failed

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Simon Li <1981814@xxxxxxxxxxxxxxxxxx>
Date: Fri, 15 Jul 2022 10:57:20 -0000
Reply-to: Bug 1981814 <1981814@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx
Public bug reported:

Description
===========
swap_volume is a common and important operation for instances with in-use
volumes. In nova the whole process consists of three steps:
* first: connect the new volume to the libvirt guest (the instance is still using the old volume);
* second: copy or rebase the old volume's data to the new volume (the instance is still using the old volume);
* third: update the volume states in cinder and the block_device_mapping in nova
  (the instance is now using the new volume).
But the exception handler is too simple: a roll-back is executed if an
exception is raised in any step, and the volume the instance is actually using
is ignored. The roll-back disconnects the new volume and deletes the new attachment.

Clearly, if an exception is raised in the third step, we must not roll back; we
should continue and complete the task if the exception is not fatal. Otherwise,
Input/Output errors will occur when the user reads or writes the disk, and user
data may be lost if it was written to the new volume before the roll-back.
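
To illustrate the idea, here is a minimal Python sketch of a handler that
chooses its recovery action based on which volume the guest is using when the
exception is raised. It is not nova's code: SwapPhase, FatalSwapError,
swap_volume and the five callables are hypothetical names made up for this
example, and what counts as a "fatal" error is an assumption.

```python
# Hypothetical sketch only: none of these names are nova's real API.
from enum import Enum


class SwapPhase(Enum):
    CONNECT_NEW = 1     # step 1: guest still uses the old volume
    COPY_DATA = 2       # step 2: guest still uses the old volume
    UPDATE_RECORDS = 3  # step 3: guest already uses the new volume


class FatalSwapError(Exception):
    """Assumed marker for errors the task cannot recover from."""


def swap_volume(connect_new, copy_data, update_records, roll_back, roll_forward):
    """Run the three steps and recover according to the volume in use."""
    phase = SwapPhase.CONNECT_NEW
    try:
        connect_new()
        phase = SwapPhase.COPY_DATA
        copy_data()
        phase = SwapPhase.UPDATE_RECORDS
        update_records()
    except Exception as exc:
        if phase is not SwapPhase.UPDATE_RECORDS:
            # The guest is still on the old volume, so it is safe to
            # disconnect the new volume and delete the new attachment.
            roll_back()
            raise
        if isinstance(exc, FatalSwapError):
            # Cannot finish, but must not detach the volume in use either.
            raise
        # The guest is writing to the new volume: finish the bookkeeping
        # instead of disconnecting the disk from under the guest.
        roll_forward()
        return "rolled_forward"
    return "completed"
```

With this shape, a failure while disconnecting the old volume in the third step
leads to roll_forward() rather than roll_back(), so the guest keeps the disk it
is writing to.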


Steps to reproduce
==================
1. Create an instance and attach an available volume to it:
  $ openstack server create my-vm --flavor m1.medium --image <vm-image> --network <vm-network>
  $ openstack volume create my-vol --type <type-1> --size 100
  $ openstack server add volume my-vm my-vol
2. Log in to my-vm, create a file system on /dev/vdc, mount it, and run a read/write workload on it:
  $ mkfs.ext4 /dev/vdc
  $ mount /dev/vdc /mnt
  $ touch /mnt/test
  $ fio -rw=randrw -ioengine=libaio -bs=4K -size=20G -filename=/mnt/test ...
3. Retype the volume:
  $ openstack volume set my-vol --type <type-2> --retype-policy on-demand
4. Some failure causes nova to fail to disconnect the old volume in the third
  step, after the second step has finished successfully, and the task ultimately fails.
5. fio can no longer read or write the file /mnt/test.

Expected result
===============
After the exception in step 4, the disk should still be readable and writable.

Actual result
=============
As in step 5, the user cannot read or write the disk.

Environment
===========
1. nova version: 22.0.1

2. hypervisor: Libvirt+Qemu

3. storage: Ceph, FC-SAN, LVM

4. network: Neutron + OVS

Logs & Configs
==============

** Affects: nova
     Importance: Undecided
         Status: Confirmed

** Changed in: nova
       Status: New => Confirmed

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1981814

Title:
  swap_volume: Maybe happen IO error or lose user data if the task
  failed

Status in OpenStack Compute (nova):
  Confirmed

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1981814/+subscriptions