
kernel-packages team mailing list archive

[Bug 1371591] Re: file not initialized to 0s under some conditions on VMWare

 

Hi Chris,

This is Arvind Kumar from VMware. The issue discussed in this bug was
recently brought to VMware's attention. We looked at the patch
(https://lkml.org/lkml/2014/9/23/509) that was made to address the
issue. Since the patch is in the mptsas driver, it addresses the issue
only on the lsilogic controller; if the user uses some other
controller, e.g. pvscsi or buslogic, the issue remains. Moreover, the
patch disables WRITE SAME completely on lsilogic, which implies that
VMware will never be able to support WRITE SAME on lsilogic. As I
understand from the bug, it was concluded that WRITE SAME is not
properly implemented by VMware. In fact, we do not support WRITE SAME
at all.

We investigated the issue internally, and as per our understanding it
is not VMware specific; rather, it appears to be in the kernel and
could very well happen on real hardware too if the disk does not
support the WRITE SAME command. Below are the details of the
investigation by Petr Vandrovec.

--

In blk-lib.c, line 294 checks whether the bdev supports write_same.
With LVM, the bdev here is dm-0. It reports that write_same is
supported, so write_same is invoked (note that the check is racy: the
device can lose its write_same capability between the test and the
moment the bio is issued):

    291  int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
    292                           sector_t nr_sects, gfp_t gfp_mask)
    293  {
    294          if (bdev_write_same(bdev)) {
    295                  unsigned char bdn[BDEVNAME_SIZE];
    296
    297                  if (!blkdev_issue_write_same(bdev, sector, nr_sects, gfp_mask,
    298                                               ZERO_PAGE(0)))
    299                          return 0;
    300
    301                  bdevname(bdev, bdn);
    302                  pr_err("%s: WRITE SAME failed. Manually zeroing.\n", bdn);
    303          }
    304
    305          return __blkdev_issue_zeroout(bdev, sector, nr_sects, gfp_mask);
    306  }
    307  EXPORT_SYMBOL(blkdev_issue_zeroout);
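
For reference, "supports write_same" is a per-device property:
bdev_write_same() just reads the write-same limit from that device's
own request queue, so dm-0 and sda each carry an independent copy of
it. The helper is roughly the following (paraphrased from the
3.13-era include/linux/blkdev.h; treat it as a sketch, not a verbatim
quote):

    static inline unsigned int bdev_write_same(struct block_device *bdev)
    {
            struct request_queue *q = bdev_get_queue(bdev);

            /* A non-zero max_write_same_sectors means "WRITE SAME
             * supported". The limit lives in each device's own queue,
             * which is why clearing it on sda does not clear it on
             * dm-0. */
            if (q)
                    return q->limits.max_write_same_sectors;

            return 0;
    }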

The request then gets to LVM, and LVM forwards it to sda. When it
fails, the kernel clears bdev_write_same() on sda and returns -121
(EREMOTEIO).

Now the next request comes. Nobody cleared bdev_write_same() on dm-0,
it got cleared only on sda, so the request gets to LVM, which forwards
it to sda, where it hits a snag in blk-core.c:

   1824          if (bio->bi_rw & REQ_WRITE_SAME && !bdev_write_same(bio->bi_bdev)) {
   1825                  err = -EOPNOTSUPP;
   1826                  goto end_io;
   1827          }

bi_bdev here is sda, so the I/O fails with EOPNOTSUPP without WRITE
SAME ever being issued. It then hits completion code that treats
EOPNOTSUPP as success:

     18  static void bio_batch_end_io(struct bio *bio, int err)
     19  {
     20          struct bio_batch *bb = bio->bi_private;
     21
     22          if (err && (err != -EOPNOTSUPP))
     23                  clear_bit(BIO_UPTODATE, &bb->flags);
     24          if (atomic_dec_and_test(&bb->done))
     25                  complete(bb->wait);
     26          bio_put(bio);
     27  }

So everybody outside of blkdev_issue_write_same() thinks the I/O
succeeded, while in reality the kernel never even issued the request!
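
To spell out the last step: because bio_batch_end_io() left
BIO_UPTODATE set, the caller's final check sees no error,
blkdev_issue_write_same() returns 0, and blkdev_issue_zeroout() never
falls back to __blkdev_issue_zeroout(). The tail of
blkdev_issue_write_same() is roughly this (paraphrased from the
3.13-era blk-lib.c, not a verbatim quote):

    /* Wait for the batched bios, then report failure only if some
     * completion cleared BIO_UPTODATE -- which bio_batch_end_io()
     * deliberately does not do for -EOPNOTSUPP. */
    if (!atomic_dec_and_test(&bb.done))
            wait_for_completion_io(&wait);

    if (!test_bit(BIO_UPTODATE, &bb.flags))
            ret = -ENOTSUPP;

    return ret;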

The fix should:

1. Use a different error code when a WRITE_SAME request is thrown
away, or remove the special EOPNOTSUPP handling from the end_io
callback. I assume the EOPNOTSUPP case is meant to ignore failures
from discarded commands, but then nothing else should be using
EOPNOTSUPP (a minimal sketch of the first option follows this list),
and

2. Propagate the WRITE_SAME failure from sda up to dm-0.
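
A minimal sketch of the first option, assuming the rejection stays in
the blk-core.c check quoted above (this is only an illustration, not
necessarily what the attached patch does):

    if (bio->bi_rw & REQ_WRITE_SAME && !bdev_write_same(bio->bi_bdev)) {
            /* Use an error the batch completion does not swallow, so
             * blkdev_issue_zeroout() can fall back to manual zeroing;
             * -EREMOTEIO is only an example, any code other than
             * -EOPNOTSUPP would do. */
            err = -EREMOTEIO;
            goto end_io;
    }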

--

Our understanding is that we should revert the fix in the mptsas
driver and make the proper fix as described above. I am attaching the
patch from Petr, who did the investigation, and CC'ing all the
involved people from VMware as well. Could you please evaluate the
patch and advise on further steps?

Thanks!
Arvind

** Patch added: "0001-Do-not-silently-discard-WRITE_SAME-requests.patch"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1371591/+attachment/4230953/+files/0001-Do-not-silently-discard-WRITE_SAME-requests.patch

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1371591

Title:
  file not initialized to 0s under some conditions on VMWare

Status in “linux” package in Ubuntu:
  Fix Released
Status in “linux” source package in Trusty:
  Fix Committed

Bug description:
  Under some conditions, after fallocate() the file is observed not to
  be completely initialized to 0s: some 4KB pages have left-over data
  from previous files that occupied those pages. Note that in addition
  to causing functional problems for applications expecting files to be
  initialized to 0s, this is a security issue because it allows data to
  "leak" from one file to another, bypassing file access controls.

  The problem has been seen running under the following VMWare-based virtual environments:
  Fusion 6.0.2
  ESXi 5.1.0

  And under the following versions of Ubuntu:
  Ubuntu 12.04, 3.11.0-26-generic
  Ubuntu 14.04.1, 3.13.0-32-generic
  Ubuntu 14.04.1, 3.13.0-35-generic

  But did not reproduce under the following version:
  Ubuntu 10.04, 2.6.32-38-server

  The problem reproduced under LVM, but did not reproduce without LVM.

  I reproduced the problem as follows under VMWare Fusion:
  * set up a custom VM with the default disk size (20 GB) and memory size (1 GB)
  * attach the Ubuntu 14.04.1 ISO to the CDROM, set it as the boot device, and boot up
  * select all defaults during installation _including_ LVM
  * install gcc
  * unpack the attached repro.tgz
  * run repro.sh

  what it does:
  * fills the disk with a file containing bytes of 0xcc then deletes it
  * repeatedly runs the repro program which creates two files and accesses them in a certain pattern
  * checks the file f0 with hexdump; it should contain all 0s, but if pages 0x1000-0x7000 contain 0xcc you have reproduced the problem

  If the problem does not appear to reproduce, please try waiting a bit
  and checking the f0 files with hexdump again. This behavior was
  observed by a customer reproducing the problem under ESXi. I have
  since added a sync after running the repro binary, which I think will
  fix that.

  If you still can't reproduce the problem, please let me know if
  there's anything I can do to help. For example, can we trace the disk
  accesses at the SCSI level to verify whether the appropriate SCSI
  commands are being sent? This may help determine whether the problem
  is in Linux or in VMWare.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1371591/+subscriptions
