[Bug 1587686] Re: ZFS: Running ztest repeatedly for long periods of time eventually results in "zdb: can't open 'ztest': No such file or directory"


Appreciate you looking into this. I was only able to test your builds
for about 5 hours on the generic kernel version so far (doing some hardware
upgrades at the moment, but my current test system is torture-test

My test hardware was 2x Intel Xeon E5620s (Westmere, 2 NUMA nodes)
with 12GB (2GBx6) ECC RDIMMs on each CPU (24GB total) on ubuntu-server
16.04. ztest was run on the default /tmp; I had /tmp mounted on tmpfs
with a 10G limit, but from what I could tell it was not exceeding that

I believe this issue becomes more apparent in 4.4.11 and 4.4.12 (and
possibly 4.4.13 now) for some reason, since those were failing for me
within a few hours with this "fix" applied, whereas the latest stable I
compiled with the fix seemed okay. I think there are race conditions of
some sort with newer kernels, especially since I saw different results
on the lowlatency kernel a while back (on the same stable release).

I'll do some more testing if I have some time, and I want to test this
on some other distros as well but I think the fix might not work on
future kernel releases that integrate 4.4.11, 4.4.12, and 4.4.13 since
some of the patches may have changed some core functions which uncovered
ZFS bugs again.

It's still possible that this somehow affects only my hardware/OS, or
that I was compiling the kernel strangely. I was doing a git clone from
master-next, checking out the latest stable (detached head) and
applying/committing the patch. My 4.4.11 and 4.4.12 builds were
manually applied cleanly from upstream on top of xenial master-next
(neither was merged into master-next at the time), so that could also
have been an issue. There were a few redundant patches I skipped
that were already in master-next, though.

However, the bug still stands on stock stable xenial kernel - and this
patch seems to fix it (at least on generic, still unsure about

Compiling Debian/Ubuntu kernels from git is pretty complicated though, with conflicting documentation. I was using these commands after checking out and applying the patch:
fakeroot debian/rules clean
fakeroot debian/rules updateconfigs
fakeroot debian/rules binary-headers binary-generic binary-perarch
(or binary-lowlatency for lowlatency builds)
I'm not using cloud-tools packages.

Anyway, I guess you can close this and it can be reopened if I have time
to attempt to reproduce the bug. It's not a critical patch, but it's
queued for 0.6.5-release upstream, so there's probably no harm in including
it in the Ubuntu kernel.


  ZFS: Running ztest repeatedly for long periods of time eventually
  results in "zdb: can't open 'ztest': No such file or directory"

Status in Native ZFS for Linux:
Status in linux package in Ubuntu:
Status in zfs-linux package in Ubuntu:
  In Progress

Bug description:
  Problem: Running ztest repeatedly for long periods of time eventually
  results in "zdb: can't open 'ztest': No such file or directory"

  This bug affects the xenial kernel built-in ZFS as well as the package
  zfs-dkms. I don't believe ZFS 0.6.3-stable or 0.6.4-release are
  affected; 0.6.5-release seems to have included the offending commit.
  Sorry for excessive "Affects" tagging, I'm still new to this and
  unsure of the proper packages to report this against and/or how to
  properly add the upstream issues/commits.

  Upstream bug report: https://github.com/zfsonlinux/zfs/issues/4129
  "ztest can occasionally fail because zdb cannot locate the pool after several hours of run time. This appears to be caused be an empty cache file."

  How to reproduce: run ztest repeatedly, for example with a command like this, and it will eventually fail:
  ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z* && sleep 3 && ztest -T 3600 && rm /tmp/z*
  (I have /tmp mounted on tmpfs with a 10G limit but I don't believe this is related in any way, and I've confirmed it's not running out of space)
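
  The one-liner above can also be written as a loop. This is a minimal sketch, not part of any ZFS tooling: the run_repeatedly helper and its arguments are hypothetical names, and the only ztest option used is the -T 3600 from the command above. The real invocation is left commented out so the helper can be tried with any command first.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to COUNT times, pausing PAUSE
# seconds between runs, and stop at the first failure.
# Usage: run_repeatedly COUNT PAUSE COMMAND [ARGS...]
run_repeatedly() {
    count=$1
    pause=$2
    shift 2
    i=1
    while [ "$i" -le "$count" ]; do
        echo "run $i of $count"
        if ! "$@"; then
            echo "run $i failed" >&2
            return 1
        fi
        # Pause only between runs, not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$pause"
        fi
        i=$((i + 1))
    done
    return 0
}

# Twelve one-hour runs with a 3 second pause, as in the command above:
# run_repeatedly 12 3 sh -c 'ztest -T 3600 && rm /tmp/z*'
```

  Unlike the chained one-liner, the loop reports which run failed, which helps when the failure only shows up after several hours.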

  Upstream fix: https://github.com/zfsonlinux/zfs/commit/151f84e2c32f690b92c424d8c55d2dfccaa76e51
  Description: Fix ztest truncated cache file
  "Commit efc412b updated spa_config_write() for Linux 4.2 kernels to
  truncate and overwrite rather than rename the cache file.  This is
  the correct fix but it should have only been applied for the kernel
  build.  In user space rename(2) is needed because ztest depends on
  the cache file."
  Associated pull request for above commit: https://github.com/zfsonlinux/zfs/pull/4130
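
  The difference between the two update strategies named in the commit message can be illustrated on a scratch file. This is only a sketch under stated assumptions, not the actual spa_config_write() code: update_by_truncate, update_by_rename and show_size are hypothetical names, and show_size stands in for a reader such as zdb that looks at the file while it is being updated.

```shell
#!/bin/sh
# Truncate and overwrite: the file is briefly empty at its final path.
# The optional hook runs inside that window to make it observable.
update_by_truncate() {
    path=$1; data=$2; hook=${3:-true}
    : > "$path"                        # truncate in place
    $hook "$path"                      # a reader here sees 0 bytes
    printf '%s\n' "$data" >> "$path"
}

# Write a temporary file, then rename it over the target: a reader sees
# either the complete old contents or the complete new contents.
update_by_rename() {
    path=$1; data=$2; hook=${3:-true}
    printf '%s\n' "$data" > "$path.tmp"
    $hook "$path"                      # old contents are still intact here
    mv -f "$path.tmp" "$path"          # rename within the same directory
}

# Hypothetical reader: print the file size in bytes.
show_size() { wc -c < "$1" | tr -d ' '; }

f=$(mktemp)
echo old > "$f"
update_by_truncate "$f" new show_size   # prints 0
echo old > "$f"
update_by_rename "$f" new show_size     # prints 4
rm -f "$f"
```

  A reader that opens the file inside the truncate window finds it empty, which matches the "empty cache file" symptom quoted from the upstream report.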

  I'm not sure why this wasn't backported to release but it's in zfs
  master. I've reproduced this bug on xenial kernels 4.4.0-22-generic,
  4.4.0-23-generic, 4.4.0-22-lowlatency, and 4.4.0-23-lowlatency as well
  as various xenial master-next builds. After applying the above commit
  patch to kernel and building/installing kernel manually, ztest runs
  fine. I've also separately tested the commit patch on zfs-dkms package
  which also appears to fix the issue. Note however, there may still be
  some other outstanding ztest related issues upstream - especially when
  preempt and hires timers are used. I'm currently testing more heavily
  against lowlatency builds and master-next.

  (I'm unsure how to associate this bug with multiple packages, but the
  zfs-dkms and linux-image-* packages are both affected.)

  P.S. Also of note is
  "Fix inverted logic on none elevator comparison" - which interestingly
  was signed off by Canonical but curiously not included in the xenial
  kernel or zfs-dkms packages. It was however, backported to
  0.6.5-release upstream.

