group.of.nepali.translators team mailing list archive
Message #35166
[Bug 1819437] Re: transient mon<->osd connectivity HEALTH_WARN events don't self clear in 13.2.4
** Changed in: ceph (Ubuntu Eoan)
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of नेपाली
भाषा समायोजकहरुको समूह, which is subscribed to Xenial.
Matching subscriptions: Ubuntu 16.04 Bugs
https://bugs.launchpad.net/bugs/1819437
Title:
transient mon<->osd connectivity HEALTH_WARN events don't self clear
in 13.2.4
Status in ceph package in Ubuntu:
Fix Released
Status in ceph source package in Xenial:
Invalid
Status in ceph source package in Bionic:
In Progress
Status in ceph source package in Eoan:
Fix Released
Status in ceph source package in Focal:
Fix Released
Bug description:
In a recently juju-deployed 13.2.4 ceph cluster (part of an OpenStack
Rocky deploy) we experienced a non-clearing HEALTH_WARN event. It
appeared to be associated with a short, planned network outage, but it
did not clear without human intervention:
health: HEALTH_WARN
6 slow ops, oldest one blocked for 112899 sec, daemons [mon.shinx,mon.sliggoo] have slow ops.
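For anyone chasing the same symptom, the individual health checks behind that one-line summary can be expanded on any client/mon node; a minimal sketch (this output was not captured at the time):
# Expand the HEALTH_WARN summary into its individual SLOW_OPS entries
# and the daemons they implicate (not captured for this incident).
sudo ceph health detail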
We can correlate this back to a known network event, but all OSDs are
up and the cluster otherwise looks healthy:
ubuntu@juju-df624b-4-lxd-14:~$ sudo ceph osd tree
ID  CLASS WEIGHT  TYPE NAME        STATUS REWEIGHT PRI-AFF
 -1       7.64076 root default
-13       0.90970     host happiny
  8   hdd 0.90970         osd.8        up  1.00000 1.00000
 -5       0.90970     host jynx
  9   hdd 0.90970         osd.9        up  1.00000 1.00000
 -3       1.63739     host piplup
  0   hdd 0.81870         osd.0        up  1.00000 1.00000
  3   hdd 0.81870         osd.3        up  1.00000 1.00000
 -9       1.63739     host raichu
  5   hdd 0.81870         osd.5        up  1.00000 1.00000
  6   hdd 0.81870         osd.6        up  1.00000 1.00000
-11       0.90919     host shinx
  7   hdd 0.90919         osd.7        up  1.00000 1.00000
 -7       1.63739     host sliggoo
  1   hdd 0.81870         osd.1        up  1.00000 1.00000
  4   hdd 0.81870         osd.4        up  1.00000 1.00000
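The overall state was checked the same way; roughly the following (a sketch of the checks, the exact output beyond the osd tree above was not preserved):
# Confirm overall cluster status and that the OSD up/in counts match
# the tree above (output not preserved for this incident).
sudo ceph -s
sudo ceph osd stat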
ubuntu@shinx:~$ sudo ceph daemon mon.shinx ops
{
    "ops": [
        {
            "description": "osd_failure(failed timeout osd.0 10.48.2.158:6804/211414 for 31sec e911 v911)",
            "initiated_at": "2019-03-07 00:40:43.282823",
            "age": 113953.696205,
            "duration": 113953.696225,
            "type_data": {
                "events": [
                    {
                        "time": "2019-03-07 00:40:43.282823",
                        "event": "initiated"
                    },
                    {
                        "time": "2019-03-07 00:40:43.282823",
                        "event": "header_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled"
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283360",
                        "event": "mon:_ms_dispatch"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283360",
                        "event": "mon:dispatch_op"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283360",
                        "event": "psvc:dispatch"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283370",
                        "event": "osdmap:preprocess_query"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283371",
                        "event": "osdmap:preprocess_failure"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283386",
                        "event": "osdmap:prepare_update"
                    },
                    {
                        "time": "2019-03-07 00:40:43.283386",
                        "event": "osdmap:prepare_failure"
                    }
                ],
                "info": {
                    "seq": 48576937,
                    "src_is_mon": false,
                    "source": "osd.8 10.48.2.206:6800/1226277",
                    "forwarded_to_leader": false
                }
            }
        },
        {
            "description": "osd_failure(failed timeout osd.3 10.48.2.158:6800/211410 for 31sec e911 v911)",
            "initiated_at": "2019-03-07 00:40:43.282997",
            "age": 113953.696032,
            "duration": 113953.696127,
            "type_data": {
                "events": [
                    {
                        "time": "2019-03-07 00:40:43.282997",
                        "event": "initiated"
                    },
                    {
                        "time": "2019-03-07 00:40:43.282997",
                        "event": "header_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled"
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284394",
                        "event": "mon:_ms_dispatch"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284395",
                        "event": "mon:dispatch_op"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284395",
                        "event": "psvc:dispatch"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284402",
                        "event": "osdmap:preprocess_query"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284403",
                        "event": "osdmap:preprocess_failure"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284416",
                        "event": "osdmap:prepare_update"
                    },
                    {
                        "time": "2019-03-07 00:40:43.284417",
                        "event": "osdmap:prepare_failure"
                    }
                ],
                "info": {
                    "seq": 48576958,
                    "src_is_mon": false,
                    "source": "osd.8 10.48.2.206:6800/1226277",
                    "forwarded_to_leader": false
                }
            }
        },
        {
            "description": "osd_failure(failed timeout osd.7 10.48.2.157:6800/650064 for 1sec e916 v916)",
            "initiated_at": "2019-03-07 00:41:08.839840",
            "age": 113928.139188,
            "duration": 113928.139359,
            "type_data": {
                "events": [
                    {
                        "time": "2019-03-07 00:41:08.839840",
                        "event": "initiated"
                    },
                    {
                        "time": "2019-03-07 00:41:08.839840",
                        "event": "header_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled"
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840040",
                        "event": "mon:_ms_dispatch"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840040",
                        "event": "mon:dispatch_op"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840040",
                        "event": "psvc:dispatch"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840058",
                        "event": "osdmap:preprocess_query"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840060",
                        "event": "osdmap:preprocess_failure"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840080",
                        "event": "osdmap:prepare_update"
                    },
                    {
                        "time": "2019-03-07 00:41:08.840081",
                        "event": "osdmap:prepare_failure"
                    }
                ],
                "info": {
                    "seq": 48578207,
                    "src_is_mon": false,
                    "source": "osd.6 10.48.2.161:6800/499396",
                    "forwarded_to_leader": false
                }
            }
        }
    ],
    "num_ops": 3
}
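For reference, the same stuck entries can be summarised straight from the admin socket JSON; a minimal sketch, assuming jq is available on the mon host (hostname taken from the output above):
# Count the blocked ops and list their descriptions on one mon.
sudo ceph daemon mon.shinx ops | jq '.num_ops'
sudo ceph daemon mon.shinx ops | jq -r '.ops[].description'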
This looks remarkably like:
https://tracker.ceph.com/issues/24531
I restarted the two affected mons in turn; health returned to
HEALTH_OK and the issue did not recur.
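The restarts were done one mon at a time, waiting for quorum in between; roughly the following, assuming the stock ceph-mon@<hostname> systemd units shipped with the Ubuntu packages:
# Restart the two mons flagged in the warning, one at a time,
# checking quorum/health in between (unit names assumed from the
# standard Ubuntu ceph packaging).
sudo systemctl restart ceph-mon@shinx
sudo ceph -s    # wait for the mon to rejoin quorum
sudo systemctl restart ceph-mon@sliggoo
sudo ceph -s    # health returned to HEALTH_OK after this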
Expected behaviour: ceph health should recover from a temporary
network event without user interaction.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1819437/+subscriptions