yahoo-eng-team team mailing list archive
Message #93787
[Bug 2059768] [NEW] glance hangs when rbd pool in read only
Public bug reported:
When the Ceph pool backing Glance is full (and has therefore gone
read-only), Glance I/O calls never return, and the worker handling the
API call hangs indefinitely, much like a zombie.
If enough I/O requests are made to occupy every worker (for example,
4 requests when you have 4 workers), Glance can no longer respond to
any kind of request. Glance has to be restarted before it responds again.
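The exhaustion described above can be modelled with a small thread pool. This is only an illustrative sketch, not Glance code: the names `hung_io`, `healthy_request` and `simulate` are hypothetical, and threads stand in for Glance workers, which are assumed here to behave like a fixed-size pool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def hung_io(release: threading.Event) -> str:
    # Stand-in for a storage call that never returns while the pool is full.
    release.wait()
    return "io done"


def healthy_request() -> str:
    # Stand-in for any request that would normally return immediately.
    return "ok"


def simulate(workers: int = 4, hung_calls: int = 4,
             probe_timeout: float = 0.2) -> bool:
    """Return True if a new request is served while `hung_calls` calls hang."""
    release = threading.Event()
    pool = ThreadPoolExecutor(max_workers=workers)
    try:
        for _ in range(hung_calls):
            pool.submit(hung_io, release)
        probe = pool.submit(healthy_request)
        try:
            probe.result(timeout=probe_timeout)
            return True
        except TimeoutError:
            # Every worker is stuck, so the probe never got to run.
            return False
    finally:
        release.set()            # unblock the stuck calls so the demo exits
        pool.shutdown(wait=True)
```

With 4 workers, `simulate(hung_calls=3)` still serves the probe request, while `simulate(hung_calls=4)` does not, which matches the "4 requests when you have 4 workers" case in the report.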
ceph status:
  cluster:
    id:     ce9a32e4-9768-457a-b811-225b710aeb58
    health: HEALTH_ERR
            3 full osd(s)
            3 pool(s) full
            1 pool(s) have no replicas configured
  services:
    mon: 1 daemons, quorum bm0.lxd (age 2h)
    mgr: bm0.lxd(active, since 2h)
    osd: 3 osds: 3 up (since 2h), 3 in (since 2h)
  data:
    pools:   3 pools, 161 pgs
    objects: 6.92k objects, 47 GiB
    usage:   143 GiB used, 6.8 GiB / 150 GiB avail
    pgs:     161 active+clean
ceph osd dump | grep ratio:
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
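As a rough cross-check of the numbers above, the reported usage can be compared against the three ratios. This is a simplified sketch: Ceph evaluates fullness per OSD, whereas the figure below is the cluster-wide usage from `ceph status`, and the helper names `used_fraction` and `classify` are hypothetical.

```python
# Ratios copied from the `ceph osd dump | grep ratio` output above.
FULL_RATIO = 0.95
BACKFILLFULL_RATIO = 0.9
NEARFULL_RATIO = 0.85


def used_fraction(used_gib: float, total_gib: float) -> float:
    """Fraction of raw capacity in use."""
    return used_gib / total_gib


def classify(fraction: float) -> str:
    """Map a usage fraction to the highest threshold it has reached.

    Assumption: a threshold counts as reached when usage is at or above it.
    """
    if fraction >= FULL_RATIO:
        return "full"
    if fraction >= BACKFILLFULL_RATIO:
        return "backfillfull"
    if fraction >= NEARFULL_RATIO:
        return "nearfull"
    return "ok"


# 143 GiB used out of 150 GiB, as reported by `ceph status`.
fraction = used_fraction(143, 150)
state = classify(fraction)
```

Here `fraction` is about 0.953, just above the `full_ratio` of 0.95, so `state` is "full", which is consistent with the HEALTH_ERR status shown.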
Here is the response from the Apache 2 HTTP server that proxies for Glance:
openstack image delete 0a582014-832a-4f2a-9944-4111812fe6b2
Failed to delete image with name or ID '0a582014-832a-4f2a-9944-4111812fe6b2': HttpException: 502: Server Error for url: http://10.206.54.243:80/openstack-glance/v2/images/0a582014-832a-4f2a-9944-4111812fe6b2, The proxy server could not handle the requestReason: Error reading from remote server: 502 Proxy Error: Proxy Error: Apache/2.4.52 (Ubuntu) Server at 10.206.54.243 Port 9292: The proxy server received an invalid: response from an upstream server.
Failed to delete 1 of 1 images.
The last log for these requests at debug level is:
DEBUG glance_store.location [None req-4cdf1de9-fbe2-49a8-92d4-db0902773af2 e7cc50bfcb1246479c5b9397048377fe d0c1adff192b40e9989460336bab7c8c - - e152fb5db324433ba53d8ead347c6802 e152fb5db324433ba53d8ead347c6802] Registering scheme rbd with {'ceph': {'store': <glance_store._drivers.rbd.Store object at 0x7ff3c5d1b820>, 'location_class': <class 'glance_store._drivers.rbd.StoreLocation'>, 'store_entry': 'rbd'}} register_scheme_backend_map /usr/lib/python3/dist-packages/glance_store/location.py:132
As a workaround, I raised the full_ratio to allow writes again and
then deleted images. However, Glance should have a mechanism to detect
this condition, or at least a timeout.
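To illustrate the kind of timeout being asked for, here is a generic sketch that bounds how long a caller waits on a blocking call. It is not how Glance or glance_store work today; `call_with_timeout` and `StorageTimeout` are hypothetical names, and the stuck call itself is not cancelled, only abandoned so that the caller can return an error.

```python
import threading


class StorageTimeout(Exception):
    """Raised when the wrapped call does not finish in time."""


def call_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn(*args, **kwargs) and give up waiting after timeout_s seconds."""
    outcome = {}

    def runner():
        try:
            outcome["value"] = fn(*args, **kwargs)
        except BaseException as exc:  # hand any failure back to the caller
            outcome["error"] = exc

    # A daemon thread is used so that a permanently stuck call cannot keep
    # the process alive at shutdown.
    thread = threading.Thread(target=runner, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        raise StorageTimeout(f"no response after {timeout_s} seconds")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

A caller could wrap a blocking storage operation in `call_with_timeout(...)` and turn `StorageTimeout` into an error response instead of hanging. The trade-off of this approach is that the abandoned thread stays blocked until the underlying call returns.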
Versions:
glance 27.0.0
ceph 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
** Affects: glance
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to Glance.
https://bugs.launchpad.net/bugs/2059768
To manage notifications about this bug go to:
https://bugs.launchpad.net/glance/+bug/2059768/+subscriptions