kernel-packages team mailing list archive

[Bug 1587686] Re: ZFS: Running ztest repeatedly for long periods of time eventually results in "zdb: can't open 'ztest': No such file or directory"


Thanks for looking into this, I'll test that build tonight but I assume
I'll see similar results.

In my previous tests with this commit applied I still occasionally ran
into the error (far less often, though: sometimes not at all, sometimes
rather quickly), along with some other traces regarding pthreads. From
what I understand, the current 0.6.5-release uses pthreads, ASSERTs, and
a few other things somewhat incorrectly, and there's a lot of upstream
work being done on this in master. lowlatency kernels seem to fail
faster too, which is a bit confusing. I still think there are a lot of
corner-case bugs in ZFS and ztest. Fully fixing ztest/ZFS/SPL for
0.6.5-release would likely be far too invasive to backport, and it looks
like band-aids such as this only postpone the inevitable failure.

After much cherry-picking, trial and error, and commenting on some
upstream commits, I don't believe ztest was intended for end users or as
a reliable long-term stress tool, nor does it get as much developer
attention for releases, since it's not a real-world test. One upstream
developer/maintainer even commented that ztest is intended for ZFS
developers (implying end users shouldn't be using it?), which makes me
question why it's even included in zfsutils-linux if it's fundamentally
broken on release versions. If it's this unreliable, it will create
many more false positives for others looking to test the stability of
Ubuntu's ZFS, resulting in people thinking ZFS or Ubuntu's ZFS
implementation is broken when in fact it may be perfectly fine under
real-world workloads.

ztest still works as a short-term test for ZFS functions, though, and
this commit probably did belong in release (they've marked it for
milestone), but as mentioned above there are many other outstanding
issues this tool brings to light (whether false positives or not).

On a side note, I'd be interested in seeing ZFS run under AFL
filesystem fuzzing, which recently uncovered many corner-case bugs in
existing kernel filesystems, with fixes incoming for backport to 4.4.13,
4.5.7, and 4.6.2. However, LinuxFoundation's Oracle AFL
event/presentation only included the most commonly used in-kernel
filesystems.

Sorry if this is noise, but hopefully this will bring more awareness to
this issue, which may not even be an issue: the correct fix may be to
move ztest to another (dev or debug?) package.

You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.

  ZFS: Running ztest repeatedly for long periods of time eventually
  results in "zdb: can't open 'ztest': No such file or directory"

Status in Native ZFS for Linux:
Status in linux package in Ubuntu:
Status in zfs-linux package in Ubuntu:
  In Progress

Bug description:
  Problem: Running ztest repeatedly for long periods of time eventually
  results in "zdb: can't open 'ztest': No such file or directory"

  This bug affects the xenial kernel built-in ZFS as well as the package
  zfs-dkms. I don't believe ZFS 0.6.3-stable or 0.6.4-release are
  affected; 0.6.5-release seems to have included the offending commit.
  Sorry for the excessive "Affects" tagging. I'm still new to this and
  unsure of the proper packages to report this against and/or how to
  properly add the upstream issues/commits.

  Upstream bug report: https://github.com/zfsonlinux/zfs/issues/4129
  "ztest can occasionally fail because zdb cannot locate the pool after several hours of run time. This appears to be caused be an empty cache file."

  How to reproduce: run ztest repeatedly, for example with a command like this, and it will eventually fail:
  ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z*
  (I have /tmp mounted on tmpfs with a 10G limit but I don't believe this is related in any way, and I've confirmed it's not running out of space)
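
  The chained command above can also be written as a loop. This is only a
  sketch under a few assumptions: run_ztest_loop and ZTEST_TMP are
  hypothetical names chosen for illustration (they are not part of ztest),
  and rm is given -f so that an empty glob does not stop the loop, which
  differs slightly from the original command.

```shell
#!/bin/sh
# Sketch: repeat "ztest -T <secs>" and clean up between runs, stopping at the
# first failure. run_ztest_loop and ZTEST_TMP are hypothetical names used for
# illustration; they are not part of ztest.
run_ztest_loop() {
    runs="$1"    # number of ztest runs to chain
    secs="$2"    # value passed to ztest -T (3600 in the command above)
    pause="$3"   # seconds to sleep between runs (3 in the command above)
    i=1
    while [ "$i" -le "$runs" ]; do
        echo "ztest run $i of $runs"
        if ! ztest -T "$secs"; then
            echo "ztest failed on run $i" >&2
            return 1
        fi
        # Remove the files ztest left behind, as "rm /tmp/z*" does above.
        # -f (an addition here) keeps an empty glob from aborting the loop.
        rm -f "${ZTEST_TMP:-/tmp}"/z*
        sleep "$pause"
        i=$((i + 1))
    done
}
```

  For example, "run_ztest_loop 12 3600 3" would attempt twelve one-hour
  runs and report which run failed first.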

  Upstream fix: https://github.com/zfsonlinux/zfs/commit/151f84e2c32f690b92c424d8c55d2dfccaa76e51
  Description: Fix ztest truncated cache file
  "Commit efc412b updated spa_config_write() for Linux 4.2 kernels to
  truncate and overwrite rather than rename the cache file.  This is
  the correct fix but it should have only been applied for the kernel
  build.  In user space rename(2) is needed because ztest depends on
  the cache file."
  Associated pull request for above commit: https://github.com/zfsonlinux/zfs/pull/4130
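
  To illustrate why the choice between truncate-and-overwrite and
  rename(2) matters to a reader of the cache file, here is a minimal
  sketch using ordinary files. It is not the ZFS code: write_truncate and
  write_rename are hypothetical helpers, and the file stands in for the
  pool cache file.

```shell
#!/bin/sh
# Illustrative only: plain files standing in for the pool cache file.

# Truncate-and-overwrite: between the truncation and the write, the target
# exists but is empty, so a concurrent reader can see a zero-length file.
write_truncate() {
    target="$1"; data="$2"
    : > "$target"                        # target is now zero length
    printf '%s\n' "$data" >> "$target"   # contents appear only after this
}

# Write-then-rename: the new contents are written to a temporary file in the
# same directory and then renamed over the target. On one filesystem mv uses
# rename(2), so a reader sees either the old file or the complete new one.
write_rename() {
    target="$1"; data="$2"
    tmp="$target.tmp.$$"
    printf '%s\n' "$data" > "$tmp"
    mv -f "$tmp" "$target"
}
```

  With the first approach, a process such as zdb that opens the file in
  the gap could find it empty, which matches the "empty cache file"
  symptom quoted from the upstream report.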

  I'm not sure why this wasn't backported to release, but it's in zfs
  master. I've reproduced this bug on xenial kernels 4.4.0-22-generic,
  4.4.0-23-generic, 4.4.0-22-lowlatency, and 4.4.0-23-lowlatency, as well
  as various xenial master-next builds. After applying the above commit
  to the kernel and building/installing the kernel manually, ztest runs
  fine. I've also separately tested the commit on the zfs-dkms package,
  where it also appears to fix the issue. Note, however, that there may
  still be some other outstanding ztest-related issues upstream,
  especially when preempt and hires timers are used. I'm currently
  testing more heavily against lowlatency builds and master-next.

  (I'm unsure how to associate this bug with multiple packages, but both
  zfs-dkms and linux-image-* packages are affected.)

  P.S. Also of note is
  "Fix inverted logic on none elevator comparison", which interestingly
  was signed-off-by Canonical but curiously not included in the xenial
  kernel or zfs-dkms packages. It was, however, backported to
  0.6.5-release upstream.

To manage notifications about this bug go to: