yahoo-eng-team team mailing list archive

[Bug 2051244] [NEW] Documentation of Ceph auth caps for RBD clients used by Cinder / Glance / Nova is missing or inconsistent

 

Public bug reported:

This bug originates from my post to the openstack-discuss ML - https://lists.openstack.org/archives/list/openstack-discuss@xxxxxxxxxxxxxxxxxxx/thread/E3VYY24HUGBNH7626ALOGZMJRVX5VOSZ/
which was discussed at a cinder-weekly (https://meetings.opendev.org/meetings/cinder/2024/cinder.2024-01-24-14.01.log.html#l-43).

In short: There seem to be inconsistencies in the documented Ceph auth (cephx) permissions that are correct and required for the RBD clients in Cinder, Glance and also Nova.
While it's nice to have the various deployment tools like openstack-ansible ([4]) or the charms ([5]) do it somewhat "properly",
first and foremost this needs to be properly documented in the source documentation of Glance, and also of Cinder and Nova for that matter.

And achieving this is what this bug report is intended to do.
The proposed steps are ...

 * determine and discuss the correct caps (least privileges, caps via profiles where possible, ...)
 * update the documentation / install guides and the devstack code. Those should all serve as references for the correct way of doing things.
 * add an upgrade note to the release notes for Caracal, so that operators check and align their caps
 * spread the word / open bugs for the deployment tools for them to update their config / code accordingly
 * send a PR to have Ceph update their docs


The long story about the various Ceph (RBD) clients and uses within
Glance, Cinder and Nova:


1) Glance

First there was a simple issue reported for Glance [3].

When Glance is asked to delete an image, it checks whether this image has dependent children, see https://opendev.org/openstack/glance_store/src/commit/6f5011d1f05c99894fb8b909d33ad23a20bf83a9/glance_store/_drivers/rbd.py#L459.
The children of Glance images are usually (Cinder) volumes, which therefore live in a different RBD pool, "volumes". If such children exist, the Glance API returns a 500 error.

Manually using the RBD client shows the same error:

> # rbd -n client.glance -k /etc/ceph/ceph.client.glance.keyring -p images children $IMAGE_ID
>
> 2023-12-13T16:51:48.131+0000 7f198cf4e640 -1 librbd::image::OpenRequest: failed to retrieve name: (1) Operation not permitted
> 2023-12-13T16:51:48.131+0000 7f198d74f640 -1 librbd::ImageState: 0x5639fdd5af60 failed to open image: (1) Operation not permitted
> rbd: listing children failed: (1) Operation not permitted
> 2023-12-13T16:51:48.131+0000 7f1990c474c0 -1 librbd::api::Image: list_descendants: failed to open descendant b7078ed7ace50d from pool instances:(1) Operation not permitted

So it's a permission error. Neither the Glance documentation [1] nor the Ceph documentation [2] on configuring the Ceph auth caps mentions granting Glance anything on the volumes pool.
So this is what I currently have configured:

> client.cinder
>         key: REDACTED
>         caps: [mgr] profile rbd pool=volumes, profile rbd-read-only pool=images
>         caps: [mon] profile rbd
>         caps: [osd] profile rbd pool=volumes, profile rbd-read-only pool=images
>
> client.glance
>         key: REDACTED
>         caps: [mgr] profile rbd pool=images
>         caps: [mon] profile rbd
>         caps: [osd] profile rbd pool=images
>
> client.nova
>         key: REDACTED
>         caps: [mgr] profile rbd pool=instances, profile rbd pool=images
>         caps: [mon] profile rbd
>         caps: [osd] profile rbd pool=instances, profile rbd pool=images
>
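
To make the gap easier to see, the listing above can be reduced to "which client may touch which pool". The following is only a minimal sketch: the helper name `osd_pools` and the parsing approach are assumptions made for illustration, and it only understands profile-style grants as they appear in this listing (keys are left out).

```python
import re

# Listing copied from the report above, with the key lines left out.
LISTING = """\
client.cinder
        caps: [mgr] profile rbd pool=volumes, profile rbd-read-only pool=images
        caps: [mon] profile rbd
        caps: [osd] profile rbd pool=volumes, profile rbd-read-only pool=images
client.glance
        caps: [mgr] profile rbd pool=images
        caps: [mon] profile rbd
        caps: [osd] profile rbd pool=images
client.nova
        caps: [mgr] profile rbd pool=instances, profile rbd pool=images
        caps: [mon] profile rbd
        caps: [osd] profile rbd pool=instances, profile rbd pool=images
"""

OSD_PREFIX = "caps: [osd]"
GRANT = re.compile(r"profile (\S+) pool=(\S+)")


def osd_pools(listing):
    """Return {client: {pool: profile}} taken from the [osd] caps lines."""
    result = {}
    client = None
    for raw in listing.splitlines():
        line = raw.strip()
        if line.startswith("client."):
            # A new client entry starts here.
            client = line
            result[client] = {}
        elif client is not None and line.startswith(OSD_PREFIX):
            # Grants are comma separated, e.g. "profile rbd pool=images".
            for grant in line[len(OSD_PREFIX):].split(","):
                match = GRANT.fullmatch(grant.strip())
                if match:
                    profile, pool = match.groups()
                    result[client][pool] = profile
    return result


for client, pools in osd_pools(LISTING).items():
    print(client, pools)
```

With this configuration, client.glance has a grant on the images pool only, which is consistent with the "Operation not permitted" error when librbd tries to open a child image in another pool.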

When the glance client is granted e.g. "rbd-read-only" on the volumes pool via:
>
> # ceph auth caps client.glance mon 'profile rbd' osd 'profile rbd pool=images, profile rbd-read-only pool=volumes' mgr 'profile rbd pool=images, profile rbd-read-only pool=volumes'
>
the error is gone.
This is the wrong approach though, as was established during the discussion on the ML:


a) Commit [10] introduced the method "_snapshot_has_external_reference" in the Yoga
release to fix [11]. The commit message also briefly states:
```
    NOTE: To check this dependency glance osd needs 'read' access to
    cinder and nova side RBD pool.
```

but there is zero mention of this requirement in the Glance release notes for
Yoga, only in those for glance_store [13]. Also, this (temporary, Yoga-only) requirement to grant Glance read-only rights on the volumes pool
was never revoked. So operators have likely missed this.

b) The mentioned method to check for snapshot references was removed again with [12]; this change was also backported to the 2023.1 release.
Again, the release notes did not tell operators about the change, so they were never told that they could now remove the read access to volumes from the Glance user again.

c) Neither change (a or b) came with an update to the actual
documentation on how to configure the Ceph caps of the glance user.

d) The "_snapshot_has_external_reference" method is currently just
dangling and unused [14].

e) I am still wondering what the caps that allow reading "rbd_children"-prefixed RADOS objects are or were used for. Especially with the managed profiles such as "rbd" or "rbd-read-only",
things should be pretty well covered.
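
For operators who added the read grant during Yoga, point b) implies resetting the glance caps to the images pool only. The sketch below merely assembles such a `ceph auth caps` call, mirroring the syntax of the command quoted earlier and the client.glance caps listed in this report. It does not run anything. The function name is hypothetical, and the resulting caps are what is configured in this report, not an agreed recommendation (determining that is the point of this bug).

```python
import shlex


def glance_caps_command(client="client.glance", pool="images"):
    """Build (not run) a `ceph auth caps` call limited to a single pool."""
    pool_cap = f"profile rbd pool={pool}"
    return [
        "ceph", "auth", "caps", client,
        "mon", "profile rbd",
        "osd", pool_cap,
        "mgr", pool_cap,
    ]


# Print the command for review instead of executing it.
print(shlex.join(glance_caps_command()))
```

Note that `ceph auth caps` replaces all caps of the entity, so the mon, osd and mgr caps always have to be given together.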


And finally: The Glance documentation at [18] is outdated.


2) DevStack

I also wondered why there are no unit tests that fail in CI because of this [3]?
Looking at what devstack does at [6] it appears that

a) it actually applies "allow class-read object_prefix rbd_children",
which is not what is currently documented in the setup guide(s) (see [7]
and [2])

b) it unnecessarily grants read permissions to NOVA_CEPH_POOL ("vms")
and CINDER_CEPH_POOL ("volumes") also for the Glance user

c) it does NOT use the managed capabilities called "profiles", such as "rbd"
or "rbd-read-only", but raw ACLs such as "rwx" instead, see [9].

The Cinder / Glance documentation is not consistent on this point either, and it makes a great
difference, because for the profiles "such privileges include the ability to blocklist other
client users.", which is required for locks of stale RBD clients to be removed from images, see
https://docs.ceph.com/en/latest/rbd/rbd-exclusive-locks/#rbd-exclusive-locks.
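
The difference between the two styles is visible in the cap strings themselves. As a rough sketch, a check like the following could flag clients that are still configured with raw ACLs. The function name is hypothetical, and the raw example string is assembled from the elements named above for illustration, not copied from DevStack.

```python
def uses_only_profiles(cap):
    """True if every comma-separated grant in a cap string is a profile."""
    grants = [grant.strip() for grant in cap.split(",") if grant.strip()]
    return bool(grants) and all(g.startswith("profile ") for g in grants)


# Profile-based OSD caps, as configured for client.glance in this report.
profile_caps = "profile rbd pool=images"

# Illustrative raw ACLs built from the elements named above;
# NOT copied verbatim from DevStack.
raw_caps = "allow class-read object_prefix rbd_children, allow rwx pool=images"

for caps in (profile_caps, raw_caps):
    print(uses_only_profiles(caps), caps)
```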


This might not matter for CI / DevStack environments in themselves. But since those are used for validation,
they should ideally use the default / documented settings wherever possible, so that these settings are validated as well.


3) Cinder

There seem to be no documented caps for the ceph-rbd volume
driver [19].


4) Cinder-Backup

If cinder-backup is used with the Ceph driver [17], it requires a keyring that allows creating snapshots of volumes (RBD images), which then serve as the source for backups.
Deleting those snapshots has to be allowed as well, as cinder-backup removes them once they are no longer needed. While full "profile rbd" access to the volumes pool works,
cinder-backup likely does not need to be allowed to modify or even delete volumes. There could also be user snapshots, which cinder-backup does not need to be able to delete either.
Then there are the caps needed to store and retrieve backups via rbd import / rbd import-diff in another pool (potentially on a different cluster).

The caps required for cinder-backup currently do not seem to be
documented anywhere, e.g. not in [17].


5) Nova

There are lots of RBD related options, e.g. for libvirt [8], and
several uses of RBD:

 * instance storage (if `images_type=rbd`)
 * volumes
 * interaction with Glance images ([glance] -> enable_rbd_download)


But there seems to be no list of the actually required capabilities, nor any recommendations for the various interactions with RBD.


6) OpenStack-Ansible

OpenStack-Ansible uses ceph-ansible, but actively overrides the keyrings and their caps.
Overriding managed code should really just be a temporary fix (it was done for Stein, if I read this correctly).
Once the proper caps are defined, those openstack_keys in [15] should be converted into a PR towards ceph-ansible [16] to fix things globally there as well.

Likely there are other deployment tools applying their home-grown set
of caps and Ceph users/keyrings, as there is no reference to rely on.



[1] https://docs.openstack.org/glance/latest/configuration/configuring.html#configuring-the-rbd-storage-backend
[2] https://docs.ceph.com/en/latest/rbd/rbd-openstack/#setup-ceph-client-authentication
[3] https://bugs.launchpad.net/glance/+bug/2045158
[4] OpenStack-Ansible: https://opendev.org/openstack/openstack-ansible/src/branch/master/inventory/group_vars/all/ceph.yml#L53-L60
[5] Charm: https://review.opendev.org/q/topic:%22bug/1696073%22 // https://bugs.launchpad.net/charm-glance/+bug/1696073
[6] https://opendev.org/openstack/devstack-plugin-ceph/src/commit/4c22c3d0905589d676bf4865ca5cf57994eb426d/devstack/lib/ceph#L712
[7] https://docs.openstack.org/glance/latest/configuration/configuring.html#configuring-the-rbd-storage-backend
[8] https://docs.openstack.org/nova/queens/configuration/config.html#libvirt.rbd_user
[9] https://docs.ceph.com/en/latest/rados/operations/user-management/#authorization-capabilities
[10] https://github.com/openstack/glance_store/commit/3d221ec529862d43ab303644e74ee9ad6ce8cd40
[11] https://bugs.launchpad.net/glance-store/+bug/1954883
[12] https://review.opendev.org/q/I34dcd90a09d43127ff2e8b477750c70f3cc01113
[13] https://docs.openstack.org/releasenotes/glance_store/yoga.html#relnotes-3-0-0-stable-yoga
[14] https://opendev.org/openstack/glance_store/src/commit/054bd5ddf5d4d255076bd5f44296f2521e899394/glance_store/_drivers/rbd.py#L455
[15] https://opendev.org/openstack/openstack-ansible/commit/0f92985608c0f6ff941ea0445ae25eab20e94fb4
[16] https://github.com/ceph/ceph-ansible/blob/b6102975549d8f870b0c20a01edda59d6ceac422/group_vars/all.yml.sample#L642
[17] https://docs.openstack.org/cinder/latest/configuration/block-storage/backup/ceph-backup-driver.html
[18] https://docs.openstack.org/glance/latest/configuration/configuring.html#configuring-the-rbd-storage-backend
[19] https://docs.openstack.org/cinder/latest/configuration/block-storage/drivers/ceph-rbd-volume-driver.html

** Affects: cinder
     Importance: Undecided
         Status: New

** Affects: glance
     Importance: Undecided
         Status: New

** Affects: glance-store
     Importance: Undecided
         Status: New

** Affects: nova
     Importance: Undecided
         Status: New

** Also affects: glance
   Importance: Undecided
       Status: New

** Also affects: glance-store
   Importance: Undecided
       Status: New

** Also affects: nova
   Importance: Undecided
       Status: New

** Summary changed:

- Documentation of caps for Ceph auth of RBD clients used by Cinder / Glance / Nova is missing or inconsistent
+ Documentation of  Ceph auth caps for RBD clients used by Cinder / Glance / Nova is missing or inconsistent

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to Glance.
https://bugs.launchpad.net/bugs/2051244

Title:
  Documentation of  Ceph auth caps for RBD clients used by Cinder /
  Glance / Nova is missing or inconsistent

Status in Cinder:
  New
Status in Glance:
  New
Status in glance_store:
  New
Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/cinder/+bug/2051244/+subscriptions


