yahoo-eng-team team mailing list archive

[Bug 2059768] [NEW] glance hangs when rbd pool in read only

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Guillaume Boutry <2059768@xxxxxxxxxxxxxxxxxx>
Date: Fri, 29 Mar 2024 10:25:42 -0000
Reply-to: Bug 2059768 <2059768@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx
Public bug reported:

When the Ceph pool backing Glance is full (and therefore read-only),
Glance I/O calls never return, and the worker handling the API call
hangs indefinitely; it is effectively a zombie.

If enough of these I/O requests are made (for example, 4 requests when
you have 4 workers), Glance can no longer respond to any kind of
request. You need to restart Glance to get responses again.
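
The worker exhaustion described above can be illustrated with a small,
self-contained simulation. This is only a sketch: it uses a plain Python
thread pool as a stand-in for the Glance API workers, and the function
name extra_request_served is hypothetical, not part of Glance.

```python
import concurrent.futures
import threading


def extra_request_served(workers, stuck_requests, wait_s=0.2):
    """Return True if one more request is served while others are stuck.

    The pool stands in for the API workers; each stuck request blocks
    on an event, the way a storage call blocks on a full pool.
    """
    release = threading.Event()
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(stuck_requests):
            pool.submit(release.wait)  # never returns until released
        probe = pool.submit(lambda: "ok")  # an unrelated, cheap request
        try:
            probe.result(timeout=wait_s)
            return True
        except concurrent.futures.TimeoutError:
            return False
        finally:
            release.set()  # unblock the stuck calls so the pool can exit
```

With 4 workers and 4 stuck requests the extra request is never picked
up; with only 3 stuck requests one worker is still free to serve it.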

ceph status:
  cluster:
    id:     ce9a32e4-9768-457a-b811-225b710aeb58
    health: HEALTH_ERR
            3 full osd(s)
            3 pool(s) full
            1 pool(s) have no replicas configured

  services:
    mon: 1 daemons, quorum bm0.lxd (age 2h)
    mgr: bm0.lxd(active, since 2h)
    osd: 3 osds: 3 up (since 2h), 3 in (since 2h)

  data:
    pools:   3 pools, 161 pgs
    objects: 6.92k objects, 47 GiB
    usage:   143 GiB used, 6.8 GiB / 150 GiB avail
    pgs:     161 active+clean

ceph osd dump | grep ratio:
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
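
As a rough cross-check of the numbers above, the aggregate usage can be
compared with the configured ratios. This is a simplified sketch: Ceph
evaluates these thresholds per OSD, while the calculation below uses
only the cluster-wide totals from the status output, and the helper
names are made up for illustration.

```python
def usage_ratio(used_gib, total_gib):
    """Fraction of raw capacity in use."""
    return used_gib / total_gib


def classify(ratio, nearfull=0.85, backfillfull=0.9, full=0.95):
    """Map a usage ratio onto the thresholds from 'ceph osd dump'."""
    if ratio >= full:
        return "full"
    if ratio >= backfillfull:
        return "backfillfull"
    if ratio >= nearfull:
        return "nearfull"
    return "ok"


# Totals from the status output: 143 GiB used of 150 GiB.
ratio = usage_ratio(143, 150)
state = classify(ratio)
```

143 / 150 is roughly 0.953, just above the full_ratio of 0.95, which is
consistent with the "3 full osd(s)" health message.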

Here is a response from the Apache 2 HTTP server proxying for Glance:

openstack image delete 0a582014-832a-4f2a-9944-4111812fe6b2
Failed to delete image with name or ID '0a582014-832a-4f2a-9944-4111812fe6b2': HttpException: 502: Server Error for url: http://10.206.54.243:80/openstack-glance/v2/images/0a582014-832a-4f2a-9944-4111812fe6b2, The proxy server could not handle the requestReason: Error reading from remote server: 502 Proxy Error: Proxy Error: Apache/2.4.52 (Ubuntu) Server at 10.206.54.243 Port 9292: The proxy server received an invalid: response from an upstream server.
Failed to delete 1 of 1 images.

The last log for these requests at debug level is:

DEBUG glance_store.location [None req-4cdf1de9-fbe2-49a8-92d4-db0902773af2 e7cc50bfcb1246479c5b9397048377fe d0c1adff192b40e9989460336bab7c8c - - e152fb5db324433ba53d8ead347c6802 e152fb5db324433ba53d8ead347c6802] Registering scheme rbd with {'ceph': {'store': <glance_store._drivers.rbd.Store object at 0x7ff3c5d1b820>, 'location_class': <class 'glance_store._drivers.rbd.StoreLocation'>, 'store_entry': 'rbd'}} register_scheme_backend_map /usr/lib/python3/dist-packages/glance_store/location.py:132

As a workaround, I adjusted the full_ratio to allow writes again and
then deleted images. However, Glance should have a mechanism to detect
this condition, or at least a timeout.
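
One possible shape for such a timeout, as a minimal sketch only: run the
blocking storage call in a helper thread and stop waiting after a
deadline. Nothing here is Glance or glance_store code; StoreCallTimeout,
call_with_timeout and blocked_delete are hypothetical names, and the
event stands in for a pool that does not accept writes. Note that the
helper thread is not cancelled, so the underlying call still hangs in
the background; this only lets the caller return an error.

```python
import concurrent.futures
import threading


class StoreCallTimeout(Exception):
    """Hypothetical error raised when a storage call exceeds its deadline."""


def call_with_timeout(func, timeout_s, *args, **kwargs):
    """Run func in a helper thread and give up after timeout_s seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise StoreCallTimeout(
            f"storage call did not finish within {timeout_s}s"
        ) from None
    finally:
        # Do not wait for the helper thread; it may still be blocked.
        executor.shutdown(wait=False)


# Stand-in for a storage call that blocks while the pool is full.
pool_writable = threading.Event()


def blocked_delete():
    pool_writable.wait()  # returns only once the pool accepts writes
    return "deleted"
```

Calling call_with_timeout(blocked_delete, 5) while pool_writable is
unset raises StoreCallTimeout instead of hanging the caller; once the
event is set, the same call returns normally.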

Versions:
glance 27.0.0
ceph 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)

** Affects: glance
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to Glance.
https://bugs.launchpad.net/bugs/2059768

Title:
  glance hangs when rbd pool in read only

Status in Glance:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/glance/+bug/2059768/+subscriptions