[Bug 1888895] [NEW] nova-ceph-multistore job fails often with 'No valid host was found. There are not enough hosts available.'

 

Public bug reported:

The nova-ceph-multistore job is a relatively new job in nova: a
variant of the devstack-plugin-ceph-tempest-py3 job with some tweaks to
make it run with multiple glance stores in ceph [1].

The job has recently started failing with 'No valid host was found.
There are not enough hosts available.' errors. We discussed this today
in the #openstack-nova channel [2] and found that we're getting
NoValidHost because nova-compute is reporting only 10G of space
available in ceph, even though our ceph volume was created with a size
of 24G. As a result, placement returns no allocation candidates.
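
For reference, the symptom can be confirmed directly against the
placement API with the same kind of query the scheduler makes. A rough
sketch using keystoneauth1 (the auth URL, credentials and resource
amounts below are illustrative, not taken from the job):

  from keystoneauth1 import loading, session

  # Illustrative devstack admin credentials; adjust for your cloud.
  loader = loading.get_plugin_loader('password')
  auth = loader.load_from_options(
      auth_url='http://127.0.0.1/identity',
      username='admin', password='secret', project_name='admin',
      user_domain_id='default', project_domain_id='default')
  sess = session.Session(auth=auth)

  # Ask placement for allocation candidates with a small disk request.
  # An empty 'allocation_requests' list here is what the scheduler
  # turns into NoValidHost.
  resp = sess.get(
      '/allocation_candidates',
      endpoint_filter={'service_type': 'placement'},
      headers={'OpenStack-API-Version': 'placement 1.17'},
      params={'resources': 'VCPU:1,MEMORY_MB:64,DISK_GB:1'})
  print(len(resp.json()['allocation_requests']))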

We traced the source of the 10G limit to the bluestore ceph backend.
When backed by a file, ceph creates the file for the OSD if it doesn't
already exist, and it creates that file with a default size. Example of
it resizing the block file to 10G [3] today:

  2020-07-24 03:51:44.470 7f36d4689f00  1
bluestore(/var/lib/ceph/osd/ceph-0) _setup_block_symlink_or_file resized
block file to 10 GiB
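
A quick way to double-check what an OSD actually ended up with is to
look at the size of the backing block file under the bluestore data dir
shown in that log line (the path below matches this job's single-OSD
layout; adjust the OSD id elsewhere):

  import os

  # The bluestore block file lives in the OSD data dir referenced
  # above; report its size in GiB.
  block_file = '/var/lib/ceph/osd/ceph-0/block'
  size_gib = os.path.getsize(block_file) / 1024 ** 3
  print(f'{block_file}: {size_gib:.1f} GiB')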

When the job first began running, we were pulling ceph version tag
14.2.10 [4]:

  2020-07-23 16:10:50.781 7f1132261c00  0 ceph version
14.2.10-138-g1dfef83eeb (1dfef83eeb53147a5da8484f54fbcf46693b748f)
nautilus (stable), process ceph-osd, pid 9309

which uses a default block file size of 100G [5].

However, today, we're pulling ceph version tag 14.2.2 [6]:

  2020-07-24 03:51:44.462 7f36d4689f00  0 ceph version 14.2.2
(4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable), process
ceph-osd, pid 9317

which uses a default block file size of 10G [7].

So with the reduced file size we're seeing a lot of NoValidHost failures
for lack of space.
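
To tell at a glance which default a given run picked up, the resize
message quoted earlier can be pulled out of the collected ceph-osd log.
A small sketch (the log path is an assumption; pass the job's collected
ceph-osd.0 log instead when checking a CI run):

  import re
  import sys

  # Find the bluestore message quoted above and report the size the
  # OSD resized its block file to, e.g. '10 GiB' or '100 GiB'.
  pattern = re.compile(
      r'_setup_block_symlink_or_file resized block file to (.+)$')

  def block_file_size(log_path):
      with open(log_path) as fh:
          for line in fh:
              match = pattern.search(line)
              if match:
                  return match.group(1).strip()
      return None

  print(block_file_size(
      sys.argv[1] if len(sys.argv) > 1
      else '/var/log/ceph/ceph-osd.0.log'))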

We don't yet know what caused the change in the ceph version tag we're
pulling in CI.

To address the issue, we're trying out a patch to devstack-plugin-ceph
to set the global bluestore_block_size config option to a more
reasonable value rather than relying on the default:

  https://review.opendev.org/742961
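
The override amounts to setting bluestore_block_size explicitly in
ceph.conf. The option takes a size in bytes, so matching the 24G volume
the job creates would look roughly like this (the exact value and
section the patch settles on may differ):

  [global]
  # 24 GiB in bytes; without this, bluestore falls back to its
  # built-in default (10G on v14.2.2, 100G on v14.2.10).
  bluestore_block_size = 25769803776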

Marking this bug Critical because the failure rate looks to be about
80% across the most recent job runs and this job is voting:

  https://zuul.openstack.org/builds?job_name=nova-ceph-multistore

[1] https://review.opendev.org/734184
[2] http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2020-07-24.log.html#t2020-07-24T14:30:12
[3] https://zuul.openstack.org/build/fad88249c7e548d3b946c6d5792fd8fe/log/controller/logs/ceph/ceph-osd.0_log.txt#10
[4] https://zuul.openstack.org/build/ee7cedae2c6e43908a79d89a68554649/log/controller/logs/ceph/ceph-osd.0_log.txt#2
[5] https://github.com/ceph/ceph/blob/v14.2.10/src/common/options.cc#L4445-L4448
[6] https://zuul.openstack.org/build/fad88249c7e548d3b946c6d5792fd8fe/log/controller/logs/ceph/ceph-osd.0_log.txt#2
[7] https://github.com/ceph/ceph/blob/v14.2.2/src/common/options.cc#L4338-L4341

** Affects: devstack-plugin-ceph
     Importance: Undecided
         Status: In Progress

** Affects: nova
     Importance: Critical
         Status: New


** Tags: ceph gate-failure

** Also affects: devstack-plugin-ceph
   Importance: Undecided
       Status: New


