group.of.nepali.translators team mailing list archive

[Bug 1628750] Re: Please backport fixes from 10.2.3 and tip for RadosGW

 

This bug was fixed in the package ceph - 10.2.3-0ubuntu0.16.04.2

---------------
ceph (10.2.3-0ubuntu0.16.04.2) xenial; urgency=medium

  * rgw: Fixes for creation times for buckets (LP: #1587261):
    - d/p/rgw_rados-creation_time.patch: Backport fix from upstream master.
      Fix logic error that leads to creation time being 0 instead of current
      time when creating buckets.

ceph (10.2.3-0ubuntu0.16.04.1) xenial; urgency=medium

  * New upstream stable release (LP: #1628809).
    - d/p/*: Refresh.
    - d/p/rocksdb-flags.patch: Dropped, accepted upstream.
    - d/p/32bit-ftbfs.patch: Cherry pick fix for 32bit arch compat.
    - d/ceph-{fs-common,fuse}.install: Fix install locations
      for mount{.fuse}.ceph.
  * Limit the amount of data per chunk in omap push operations to 64k,
    ensuring that OSD threads don't hit timeouts during recovery
    operations (LP: #1628750):
    - d/p/osd-limit-omap-data-in-push-op.patch: Cherry pick fix from
      upstream master branch.

 -- Frode Nordahl <frode.nordahl@xxxxxxxxxxxxx>  Fri, 28 Oct 2016 13:50:40 +0200

** Changed in: ceph (Ubuntu Xenial)
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of नेपाली
भाषा समायोजकहरुको समूह, which is subscribed to Xenial.
Matching subscriptions: Ubuntu 16.04 Bugs
https://bugs.launchpad.net/bugs/1628750

Title:
  Please backport fixes from 10.2.3 and tip for RadosGW

Status in Ubuntu Cloud Archive:
  Invalid
Status in Ubuntu Cloud Archive mitaka series:
  Fix Committed
Status in ceph package in Ubuntu:
  Fix Released
Status in ceph source package in Xenial:
  Fix Released
Status in ceph source package in Yakkety:
  Fix Released

Bug description:
  [Impact] 
  In ceph deployments with large numbers of objects (typically generated by use of radosgw for object storage), during recovery operations when servers or disks fail, it is quite possible for OSDs recovering data to hit their suicide timeout and shut down because of the number of objects each was trying to recover in a single chunk between heartbeats.  As a result, clusters go read-only due to loss of data availability.

  [Test Case]
  Non-trivial to reproduce - see original bug report.

  [Regression Potential] 
  Medium; the fix for this problem is to reduce the number of operations per chunk to 64000, limiting the chance that an OSD will fail to heartbeat and suicide itself as a result.  This is configurable, so it can be tuned on a per-environment basis.

  The patch has been accepted into the Ceph master branch, but is not
  currently targeted as a stable fix for Jewel.
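
To illustrate the idea behind the fix, here is a minimal sketch in Python of splitting a large set of omap entries into bounded chunks, with a heartbeat between chunks. It is not the actual Ceph patch (which is C++); the names `chunk_omap_entries`, `push_object`, `send` and `heartbeat` are hypothetical, and only the 64000 figure comes from the description above.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical constant mirroring the 64000 limit quoted in the bug
# description; this is not the name of the real Ceph option.
MAX_ENTRIES_PER_CHUNK = 64000

OmapEntry = Tuple[str, bytes]


def chunk_omap_entries(
    entries: Iterable[OmapEntry],
    max_entries: int = MAX_ENTRIES_PER_CHUNK,
) -> Iterator[List[OmapEntry]]:
    """Yield successive lists holding at most max_entries entries each."""
    if max_entries <= 0:
        raise ValueError("max_entries must be positive")
    iterator = iter(entries)
    while True:
        chunk = list(islice(iterator, max_entries))
        if not chunk:
            return
        yield chunk


def push_object(
    entries: Iterable[OmapEntry],
    max_entries: int,
    send: Callable[[List[OmapEntry]], None],
    heartbeat: Callable[[], None],
) -> int:
    """Send entries chunk by chunk, heartbeating after every chunk.

    Returns the number of chunks sent. Bounding the chunk size bounds
    the amount of work done between two heartbeats.
    """
    chunks_sent = 0
    for chunk in chunk_omap_entries(entries, max_entries):
        send(chunk)
        heartbeat()
        chunks_sent += 1
    return chunks_sent
```

With 150,000 entries and a limit of 64000, this sketch sends three chunks of 64000, 64000 and 22000 entries, with a heartbeat after each one, rather than a single push with no heartbeat in between.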

  >> Original Bug Report <<

  We've run into significant issues with RadosGW at scale; we have a
  customer who has ½ billion objects in ~20TB of data, and whenever they
  lost an OSD for whatever reason, even for a very short period of time,
  ceph was taking hours and hours to recover.  The whole time it was
  recovering, requests to RadosGW were hanging.

  I ended up cherry picking 3 patches; 2 from 10.2.3 and one from trunk:

    * d/p/fix-pg-temp.patch: cherry pick 56bbcb1aa11a2beb951de396b0de9e3373d91c57 from jewel.
    * d/p/only-update-up_thru-if-newer.patch: 6554d462059b68ab983c0c8355c465e98ca45440 from jewel.
    * d/p/limit-omap-data-in-push-op.patch: 38609de1ec5281602d925d20c392ba4094fdf9d3 from master.

  The 2 from 10.2.3 are because pg_temp was implicated in one of the
  longer outages we had. 

  The last one is what I think actually got us to a point where ceph was
  stable and I found it via the following URL chain:

  http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2016-June/010230.html
  -> http://tracker.ceph.com/issues/16128
  -> https://github.com/ceph/ceph/pull/9894
  -> https://github.com/ceph/ceph/commit/38609de1ec5281602d925d20c392ba4094fdf9d3

  With these 3 patches applied the customer has been stable for 4 days
  now, but I've yet to restart the entire cluster (only the stuck OSDs),
  so it's hard to be completely sure that all our issues are resolved,
  or to know which of the patches fixed things.

  I've attached the debdiff I used for reference.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1628750/+subscriptions