kernel-packages team mailing list archive

[Bug 1387214] Re: file corruption on touch images in rw portions of the filesystem

To: kernel-packages@xxxxxxxxxxxxxxxxxxx
From: Colin Ian King <1387214@xxxxxxxxxxxxxxxxxx>
Date: Thu, 06 Nov 2014 13:16:36 -0000
Reply-to: Bug 1387214 <1387214@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
After a lot of deep digging into the bind mount, loop driver, and buffer
cache, and tracking the corrupt pages back down through the layers of the
stack, we've narrowed this down to the image.  The smoking gun was the
kernel message:

Nov  6 12:15:16 ubuntu-phablet kernel: [    3.940485] do_mount: /dev/loop0 -> /root [<null>]
Nov  6 12:15:16 ubuntu-phablet kernel: [    3.941095] EXT2-fs (loop0): warning: mounting unchecked fs, running e2fsck is recommended
Nov  6 12:15:16 ubuntu-phablet kernel: [    3.941431] do_mount return -> 0

(apologies for my extra debug).
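
For anyone wanting to check their own logs for the same warning, here is a
rough sketch of a filter.  The helper name unchecked_mounts is made up for
illustration, and the pattern is based only on the EXT2-fs line quoted
above; other file system drivers may word the warning differently.

```shell
#!/bin/sh
# Sketch: list devices the kernel reported as mounted without a prior fsck.
# unchecked_mounts is a hypothetical helper name.  It reads kernel log
# lines on stdin and prints one device name per matching line.
unchecked_mounts() {
    # Matches e.g. "EXT2-fs (loop0): warning: mounting unchecked fs, ..."
    # and prints the device name found between the parentheses.
    sed -n 's/.*EXT[0-9]-fs (\([^)]*\)): warning: mounting unchecked fs.*/\1/p'
}
```

Feeding it the kernel log (for example, dmesg | unchecked_mounts) should
print loop0 for the message shown above.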

So it appears that the file system on /dev/loop0 is already corrupt when
it is mounted.  I ran fsck on /userdata/system.img and /userdata/ubuntu.img
and found that system.img needed some fixing:


fsck /userdata/system.img 
fsck from util-linux 2.25
e2fsck 1.42.10 (18-May-2014)
/userdata/system.img was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 3A: Optimizing directories
Pass 4: Checking reference counts
Unattached inode 3225
Connect to /lost+found<y>? yes
Inode 3225 ref count is 2, should be 1.  Fix<y>? yes
Unattached inode 3709
Connect to /lost+found<y>? yes
Inode 3709 ref count is 2, should be 1.  Fix<y>? yes
Unattached inode 3808
Connect to /lost+found<y>? yes
Inode 3808 ref count is 2, should be 1.  Fix<y>? yes
Unattached inode 4427
Connect to /lost+found<y>? yes
Inode 4427 ref count is 2, should be 1.  Fix<y>? yes
Unattached inode 4485
Connect to /lost+found<y>? yes
Inode 4485 ref count is 2, should be 1.  Fix<y>? yes
Unattached inode 5889
Connect to /lost+found<y>? yes
Inode 5889 ref count is 2, should be 1.  Fix<y>? yes
Unattached inode 5943
Connect to /lost+found<y>? yes
Inode 5943 ref count is 2, should be 1.  Fix<y>? yes
Unattached inode 7853
Connect to /lost+found<y>? yes
Inode 7853 ref count is 2, should be 1.  Fix<y>? yes
Pass 5: Checking group summary information
Block bitmap differences:  -70903 -71144 -71201 -(71674--71675) -71727 -71852 -72689 -72757 -(74519--74520) -74869 -74961 +(92082--92087) +(92089--92092) -92102 +92104 +92114 +92119 +(92121--92131)
Fix<y>? yes
Free blocks count wrong for group #13 (8813, counted=8820).
Fix<y>? yes
Free blocks count wrong (133222, counted=133229).
Fix<y>? yes
Inode bitmap differences:  +(19989--20010) +(20013--20014) -(20545--20549) -(20551--20569)
Fix<y>? yes
Free inodes count wrong for group #13 (3225, counted=3232).
Fix<y>? yes
Directories count wrong for group #13 (761, counted=760).
Fix<y>? yes
Free inodes count wrong (81946, counted=81953).
Fix<y>? yes

/userdata/system.img: ***** FILE SYSTEM WAS MODIFIED *****
/userdata/system.img: ***** REBOOT LINUX *****


So, there are two big issues outstanding, most probably in the user space shutdown and initrd stages:

1. The file system is not being flushed and unmounted properly.
2. The file system is not being fsck'd before mounting - this is a cardinal sin IMHO

The end result is mounting a corrupt file system that is causing the
garbage in the apparmor files.
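
As a rough illustration of issue 2, a check-before-mount step could look
something like the sketch below.  This is not the actual initrd code; the
function names fsck_status_ok and check_then_mount are made up, and the
policy of mounting only on exit status 0 or 1 is my assumption of a
conservative default.  The exit status meanings are those documented in
e2fsck(8).

```shell
#!/bin/sh
# Sketch only: check an ext2/3/4 image before loop-mounting it.
# fsck_status_ok and check_then_mount are hypothetical names, not the
# real Ubuntu Touch initrd code.

# Decide from e2fsck's exit status whether mounting is safe.
# Per e2fsck(8) the status is a bitmask: 0 = no errors, 1 = errors
# corrected, 2 = corrected but reboot advised, 4 = errors left
# uncorrected, 8 and above = operational or usage failures.
fsck_status_ok() {
    case "$1" in
        0|1) return 0 ;;   # clean, or repaired in place
        *)   return 1 ;;   # anything else: do not mount
    esac
}

check_then_mount() {
    img="$1"; mnt="$2"
    rc=0
    e2fsck -p "$img" || rc=$?     # -p: automatically fix what is safe to fix
    if fsck_status_ok "$rc"; then
        mount -o loop "$img" "$mnt"
    else
        echo "refusing to mount $img: e2fsck exit status $rc" >&2
        return 1
    fi
}
```

With something like this in place, check_then_mount /userdata/system.img /root
would refuse to mount an image that e2fsck could not repair automatically,
instead of mounting it unchecked.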

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1387214

Title:
  file corruption on touch images in rw portions of the filesystem

Status in “linux” package in Ubuntu:
  Confirmed

Bug description:
  Symptoms are that cache files in /var/cache/apparmor and profiles in
  /var/lib/apparmor/profiles are sometimes corrupted after a reboot.
  We've already fixed several bugs in apparmor and click-apparmor,
  made both more robust in the face of corruption, and reduced the
  impact of a corrupted profile, but we've still not found the cause
  of the corruption. This corruption can still affect
  real-world devices: if a profile in /var/lib/apparmor/profiles is
  corrupted and the cache file is out of date, then the profile won't
  compile and that app/scope won't start.

  Workaround: remove the affected profile and then run
  'sudo aa-clickhook'. This obviously is not viable on an end-user device.

  The investigation is ongoing and this may not be a problem with the
  kernel at all, so this bug may be retargeted to another project.

  The security team and the kernel team have discussed this a lot and
  Colin is currently looking at this. This bug is just so it can be
  tracked. Here is an excerpt from my latest email to Colin:

  "I believe I have conclusively ruled out apparmor_parser and
  aa-clickhook by creating a new 'home/bug/test-with-true.sh'. Here is the
  test output:

  http://paste.ubuntu.com/8648109/

  Specifically, home/bug/test-with-true.sh changes the interesting parts
  of the algorithm to:

  1. wait for unity8 to start (this ensures the apparmor upstart job is finished)
  2. restore apparmor_parser and aa-clickhook, if needed
  3. if /home/bug/profiles... exists, perform a diff -Naur /home/bug/profiles...
     /var/lib/apparmor/profiles and fail if there are differences (note,
     apparmor_parser and aa-clickhook were /bin/true during boot so they could
     not have changed /var/lib/apparmor/profiles)
  4. verify the profiles, exit with error if they do not verify
  5. alternately upgrade/downgrade the packages
  6. verify the profiles, exit with error if they do not verify
  7. copy the known good profiles in the previous step to /home/bug/profiles...
  8. have apparmor_parser and aa-clickhook point to /bin/true
  9. reboot
  10. go to step 1

  In the paste you'll notice that in step 6 the profiles were
  successfully created by the installation of the packages, then
  verified, then copied aside, then apparmor_parser and aa-clickhook
  diverted, then rebooted, only to have the profiles in
  /var/lib/apparmor/profiles be different than what was copied aside. It
  would be nice to verify on your device as well (I reproduced several
  times here) and verify the reproducer algorithm. I think this suggests
  this is a kernel issue and not userspace.

  IMPORTANT: you will want to update the reproducer and follow all of these steps again (i.e., I updated the scripts, the debs, the sudoers file, etc.):
  $ wget http://people.canonical.com/~jamie/cking/aa-corruption.tar.gz
  $ tar -zxvf ./aa-corruption.tar.gz
  ...

  $ adb push ./aa-corruption.tar.gz /tmp
  $ adb shell
  phablet@ubuntu-phablet:~$ cd /tmp
  phablet@ubuntu-phablet:~$ tar -zxvf ./aa-corruption.tar.gz
  phablet@ubuntu-phablet:~$ sudo mount -o remount,rw /
  phablet@ubuntu-phablet:~$ sudo cp ./aa-corruption/etc/sudoers.d/phablet /etc/sudoers.d/
  phablet@ubuntu-phablet:~$ sudo mount -o remount,ro /
  phablet@ubuntu-phablet:~$ sudo cp -a ./aa-corruption/home/bug /home
  phablet@ubuntu-phablet:~$ exit
  $ cd ./aa-corruption
  $ ./test-from-host.sh
  ...

  The old script is still in place. Simply adjust ./test-from-host.sh to have:
  testscript=/home/bug/test.sh
  #testscript=/home/bug/test-with-true.sh"

  The kernel team has verified the above reproducer and symptoms.
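
  For readers without the tarball, the comparison in step 3 of the quoted
  algorithm might be sketched roughly as follows. This is not the code from
  test-with-true.sh; the function name profiles_unchanged is made up, and
  the directories are passed as arguments because the saved path is
  abbreviated in the description above.

```shell
#!/bin/sh
# Sketch of step 3: compare the profiles saved before reboot with the
# live ones.  profiles_unchanged is a hypothetical name; this is not the
# actual reproducer script.
profiles_unchanged() {
    saved="$1"; live="$2"
    # Step 3 only applies once a saved copy exists (i.e. not on the first pass).
    [ -d "$saved" ] || return 0
    # -N: treat absent files as empty, -a: compare everything as text,
    # -u: unified output, -r: recurse.  diff exits non-zero on any difference.
    diff -Naur "$saved" "$live" >&2
}
```

  A caller would pass the saved copy and /var/lib/apparmor/profiles, and
  treat a non-zero return as corruption, since nothing in userspace should
  have touched the profiles during boot.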

  Related bugs:
  * bug 1371771
  * bug 1371765
  * bug 1377338

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1387214/+subscriptions
References

[Bug 1387214] [NEW] file corruption on touch images in rw portions of the filesystem
From: Jamie Strandboge, 2014-10-29