yahoo-eng-team team mailing list archive

[Bug 2020699] [NEW] Nova's rescue and unrescue assumes os-brick connect_volume is idempotent

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Gorka Eguileor <2020699@xxxxxxxxxxxxxxxxxx>
Date: Wed, 24 May 2023 18:25:34 -0000
Reply-to: Bug 2020699 <2020699@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx
Public bug reported:

The rescue and unrescue operations in Nova assume that calls to
`connect_volume` in os-brick are idempotent. That is currently true,
but it was not something we guaranteed in os-brick.

With the recent CVE [1][2] we realized that os-brick's `connect_volume`
cannot assume that, if a device (or devices) is already present for the
provided connection information, it is the right volume. Even if it is
the right volume, it cannot assume that the information in sysfs (such
as the volume size) is correct. So it needs to clean things up to the
best of its ability before actually connecting and, just in case,
confirm right before returning a path to the caller that the device it
is going to return is actually correct and consistent (as in, the
multipath device only has devices with the same size and SCSI ID).
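The final consistency check described above can be pictured with a
minimal Python sketch. It assumes a hypothetical `PathDevice` record
holding the size and SCSI ID read for each path of a multipath device;
`PathDevice` and `is_consistent` are illustrative names, not os-brick's
actual API, and reading the values from sysfs is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathDevice:
    """One path device backing a multipath device (hypothetical model)."""
    name: str     # e.g. "sdc"
    size: int     # size in bytes, as it would be read from sysfs
    scsi_id: str  # SCSI identifier reported for this path


def is_consistent(paths: list[PathDevice]) -> bool:
    """Return True only if every path reports the same size and SCSI ID.

    A mismatch suggests a stale or foreign device has been mixed into the
    multipath, so the device should not be handed back to the caller.
    """
    if not paths:
        return False
    sizes = {p.size for p in paths}
    ids = {p.scsi_id for p in paths}
    return len(sizes) == 1 and len(ids) == 1


good = [PathDevice("sdc", 1 << 30, "id-A"), PathDevice("sdd", 1 << 30, "id-A")]
bad = [PathDevice("sdc", 1 << 30, "id-A"), PathDevice("sdd", 2 << 30, "id-B")]
print(is_consistent(good))  # True
print(is_consistent(bad))   # False
```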

This means that os-brick's `connect_volume` will no longer be idempotent
by design once this patch [3] merges, in order to prevent data leaks in
any corner case.
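To see why cleaning up before connecting makes `connect_volume`
non-idempotent, here is a toy in-memory model. `ToyConnector` is
hypothetical and is not os-brick's implementation; it assumes every
fresh attach comes back under a new `/dev/sdX` name, which is one
possible outcome rather than a guarantee.

```python
import string


class ToyConnector:
    """Hypothetical connector that never trusts an existing device."""

    def __init__(self) -> None:
        # Hand out /dev/sdb, /dev/sdc, ... in order (assumption: no reuse).
        self._names = (f"/dev/sd{c}" for c in string.ascii_lowercase[1:])
        # Maps a target (standing in for connection info) to its device path.
        self.attached: dict[str, str] = {}

    def connect_volume(self, target: str) -> str:
        # Tear down whatever is already there for this target, then attach
        # afresh, so a second call does not return the first call's path.
        if target in self.attached:
            self.disconnect_volume(target)
        path = next(self._names)
        self.attached[target] = path
        return path

    def disconnect_volume(self, target: str) -> None:
        self.attached.pop(target, None)

    def device_exists(self, path: str) -> bool:
        return path in self.attached.values()


conn = ToyConnector()
first = conn.connect_volume("vol-1")
second = conn.connect_volume("vol-1")
print(first, second, conn.device_exists(first))  # /dev/sdb /dev/sdc False
```

Calling `connect_volume` twice for the same target yields two different
paths, and the first one no longer exists: exactly what a caller that
relies on idempotence does not expect.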

This will break Nova's rescue and unrescue operations: on rescue, Nova
stashes the original XML [4] and then unstashes it on unrescue [5], but
in between it calls `connect_volume` for the rescue instance,
effectively disconnecting the original device path.

This means that reusing the original path points either to a
non-existent device or to a volume of another instance.
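The two failure modes can be illustrated with a toy timeline in Python.
Here `devices` is a hypothetical map from each device path present on
the host to the volume currently behind it, and `classify_stashed_path`
is an illustrative helper, not part of Nova or os-brick. `/dev/sdd`
echoes the CI failure quoted in this report, but the sequence of device
names is invented.

```python
def classify_stashed_path(stashed_path: str, volume_id: str,
                          devices: dict[str, str]) -> str:
    """Report what a device path saved before rescue refers to now."""
    owner = devices.get(stashed_path)
    if owner is None:
        return "missing"       # libvirt would fail to open the path
    if owner == volume_id:
        return "ok"            # still the same volume
    return "other-volume"      # path now exposes another volume's data


# Before rescue: the instance's volume is at /dev/sdd, and that path is
# recorded in the stashed XML.
devices = {"/dev/sdd": "vol-a"}
stashed = "/dev/sdd"

# During rescue, connect_volume cleans up and reconnects vol-a, which in
# this toy timeline comes back under a different name.
del devices["/dev/sdd"]
devices["/dev/sde"] = "vol-a"
print(classify_stashed_path(stashed, "vol-a", devices))  # missing

# Later, another instance's volume happens to be given the freed name.
devices["/dev/sdd"] = "vol-b"
print(classify_stashed_path(stashed, "vol-a", devices))  # other-volume
```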

We can see an example of the non-existent device case in the failed CI
job [6], where the test
`tempest.api.compute.servers.test_server_rescue.ServerStableDeviceRescueTest.test_stable_device_rescue_disk_virtio_with_volume_attached`
fails with a nova-compute error [7]:

  libvirt.libvirtError: Cannot access storage file '/dev/sdd': No such file or directory
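One way to surface this kind of stale reference before libvirt fails is
to scan the stashed domain XML for block-device sources that no longer
exist. The Python sketch below is a hypothetical diagnostic, not Nova's
code or its fix; it assumes block-backed disks appear as
`<disk type='block'>` elements with a `<source dev='/dev/sdX'/>` child,
and the `exists` parameter is only there so the check can be exercised
without real devices.

```python
import os
import xml.etree.ElementTree as ET
from typing import Callable


def missing_block_sources(domain_xml: str,
                          exists: Callable[[str], bool] = os.path.exists,
                          ) -> list[str]:
    """Return block-device source paths in domain XML that do not exist."""
    root = ET.fromstring(domain_xml)
    missing = []
    # Only block-backed disks reference host device paths via source/@dev.
    for disk in root.findall("./devices/disk[@type='block']"):
        source = disk.find("source")
        if source is None:
            continue
        dev = source.get("dev")
        if dev and not exists(dev):
            missing.append(dev)
    return missing


stashed_xml = """
<domain type='kvm'>
  <devices>
    <disk type='block' device='disk'>
      <source dev='/dev/sdd'/>
      <target dev='vdb' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""
print(missing_block_sources(stashed_xml, exists=lambda p: False))  # ['/dev/sdd']
```

An existing path is not proof of safety, though: as described above,
the path may now belong to another instance's volume, so a presence
check cannot replace re-validating the connection.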


[1]: https://nvd.nist.gov/vuln/detail/CVE-2023-2088

[2]: https://bugs.launchpad.net/nova/+bug/2004555

[3]: https://review.opendev.org/c/openstack/os-brick/+/882841

[4]: https://github.com/openstack/nova/blob/71b105a4cfea054827e09b5b8df6be845909275a/nova/virt/libvirt/driver.py#L4229-L4232

[5]: https://github.com/openstack/nova/blob/71b105a4cfea054827e09b5b8df6be845909275a/nova/virt/libvirt/driver.py#L4323-L4328

[6]: https://a30336fa6a8fca5c6dba-fe779e5654b21fdff79727b204dfb7d6.ssl.cf1.rackcdn.com/882841/3/check/os-brick-src-tempest-lvm-lio-barbican/8ef7adf/testr_results.html

[7]: https://zuul.opendev.org/t/openstack/build/8ef7adf6a82248d8b9f94eb5b5bba73c/log/controller/logs/screen-n-cpu.txt?severity=4#77239

** Affects: nova
     Importance: High
         Status: Triaged


** Tags: cinder libvirt rescue volumes

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2020699

Title:
  Nova's rescue and unrescue assumes os-brick connect_volume is
  idempotent

Status in OpenStack Compute (nova):
  Triaged
