
Re: [Bug 1202994] Re: EXT4 filesystem corruption with uninit_bg and error=continue

 

On 19/07/13 18:34, Joseph Salisbury wrote:
> The commit (b0dd6b7) you mention in the upstream bug report is in the 3.2 stable tree as commit 76f4fa4:
> * 76f4fa4 - ext4: fix the free blocks calculation for ext3 file systems w/ uninit_bg (1 year, 1 month ago) <Theodore Ts'o>
>
> It was available as of 3.2.20 as you say:
>   git describe --contains 76f4fa4
> v3.2.20~1
>
> This means that patch is in the 3.2.0-49 Ubuntu kernel, since it
> contains all the upstream 3.2.46 updates.
>
> The patch from Darrick J Wong that you mention is still being discussed on the linux-ext4 mailing list and is not yet available in the mainline kernel tree:
>   ext4: Prevent massive fs corruption if verifying the block bitmap fails
>
> Do you have a way to easily reproduce this bug?  If so, I can build a
> test kernel with Darrick's patch for you to test.

'Fraid not -- it's a one-off event (I hope!).

The filesystem in question (/export/share - mostly used for backups of 
other machines and ISO boot images) had originally been created on a 
logical volume of ~640Gb in a volume group of just under 1Tb on a single 
PV composed of a RAID10 array of two 1Tb partitions, one on each of two 
2Tb SATA disks. *At some later time* this LV was expanded to use the 
rest of the free space in that volume group, making it 800Gb, and *the 
filesystem was resized to match* -- this may have been a contributing 
factor.
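
For the record, the expansion will have been the usual online-growth 
sequence, something like this -- the VG/LV names here are illustrative, 
not a transcript:

    # lvextend --extents +100%FREE /dev/share_vg/share   # grow the LV into the free space
    # resize2fs /dev/share_vg/share                      # grow the ext4 FS to match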

This week, because the FS was getting quite full (~97%, with only *~30Gb 
left, i.e. within the last ~40Gb reserved for root - could this be part 
of the trigger?*), I decided to install two spare disks so that I could 
migrate this VG onto them. This involved a power cycle, reboot, and lots 
of playing around with mdadm -- but I don't think any of this was 
significant.

After reboot, I had all 4 disks accessible, with no errors. One of the 
new disks was virgin, and I had created a new RAID10 mirror using it:

    # mdadm --create /dev/md/scratch --bitmap=internal --level=10 \
        --parity=f2 --raid-devices=2 --name=new missing /dev/sdd1
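
To confirm that the new array came up degraded but healthy, the 
standard checks would be something like this (again, not a transcript):

    # cat /proc/mdstat                 # array active with one slot missing
    # mdadm --detail /dev/md/scratch   # State should read "clean, degraded"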


The other was recycled from another machine, and already had MD/LVM 
volumes on it, which were correctly recognised as "foreign" 
arrays/volumes. I mounted the one that still contained the system image 
from the other machine and copied it into a subdirectory of 
/export/share (specifically, Backups/Galaxy/suse-11.4/ -- see below) 
using rsync -- *about 15Gb of data, using up about half the remaining 
(reserved) space. This was the last write operation on the FS*. (I ran 
rsync again immediately afterwards, to verify that all files had been 
transferred with no errors, and all seemed OK. Nonetheless, *I think 
this is where the corruption occurred*.)
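
The copy and the verification pass were along these lines (the mount 
point and exact flags are from memory, so treat it as a sketch; the 
second run is a checksum-based dry run, which should report nothing 
left to transfer):

    # rsync -aHAX /mnt/foreign-root/ /export/share/Backups/Galaxy/suse-11.4/
    # rsync -aHAXcn -i /mnt/foreign-root/ /export/share/Backups/Galaxy/suse-11.4/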

Then I dismantled the foreign LV/MD stack, wiped that disk, and made it 
part of the new RAID10 array, triggering a resync. Then I added the new 
array to the existing VG and migrated the LVs in it to the new array 
using pvmove.
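
In outline (the VG name is a stand-in, since I don't have the exact 
history in front of me):

    # vgextend share_vg /dev/md/scratch    # add the new array as a second PV
    # pvmove /dev/md126 /dev/md/scratch    # migrate all extents off the old array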

The pvmove completed without errors, so I then removed the original 
array from the VG. (The raid remirroring completed without errors too, 
but I'm not sure when, probably later). Now that the VG was on a bigger 
disk, I decided to expand each of the LVs on it. Then when I tried to 
resize /export/share to use the expanded space, I was told I should run 
e2fsck first - which reported many errors, starting with:

    e2fsck 1.42 (29-Nov-2011)
    e2fsck: Group descriptors look bad... trying backup blocks...
    One or more block group descriptor checksums are invalid.  Fix<y>? yes

    Group descriptor 0 checksum is invalid.  FIXED.
    Group descriptor 1 checksum is invalid.  FIXED.
    Group descriptor 2 checksum is invalid.  FIXED.
    Group descriptor 3 checksum is invalid.  FIXED.
    ... etc etc ...
    Group descriptor 6397 checksum is invalid.  FIXED.
    Group descriptor 6398 checksum is invalid.  FIXED.
    Group descriptor 6399 checksum is invalid.  FIXED.
    Pass 1: Checking inodes, blocks, and sizes
    Group 2968's block bitmap at 97248129 conflicts with some other fs block.
    Relocate<y>? yes

    Relocating group 2968's block bitmap from 97248129 to 96998147...

    Running additional passes to resolve blocks claimed by more than one inode...
    Pass 1B: Rescanning for multiply-claimed blocks
    Multiply-claimed block(s) in inode 24248332: 97255511 97255512 97255513 97255514 97255515 97255516 97255517 97255518 97255519 97255520 97255521 97255522 97255523 97255524 97255525 97255526 97255527 97255528 97255529 97255530 97255531 97255532 97255533 97255534 97255535 97255536 97255537 97255538 97255539 97255540 97255541 97255542 97255543 97255544 97255545 97255546 97255547 97255548 97255549 97255550 97255551 97255552 97255553 97255554 97255555 97255556 97255557 97255558 97255559 97255560 97255561 97255562 97255563 97255564 97255565 97255566 97255567 97255568 97255569 97255570 97255571 97255572 97255573 97255574 97255575 97255576 97255577 97255578 97255579 97255580 97255581 97255582 97255583 97255584 97255585 97255586 97255587 97255588 97255589 97255590 97255591 97255592 97255593 97255594 97255595 97255596 97255597 97255598 97255599 97255600 97255601 97255602 97255603 97255604 97255605 97255606 97255607 97255608 97255609 97255610 97255611 97255612 97255613 97255614 97255615 97255616 97255617 97255618
    97255619 97255620 97255621 97255622 97255623 97255624 97255625 97255626 97255627 97255628 97255629 97255630 97255631 97255632 97255633 97255634 97255635 97255636 97255637 97255638 97255639 97255640 97255641 97255642 97255643 97255644 97255645 97255646
    ... etc etc ...
    Multiply-claimed block(s) in inode 24270904: 97263482 97263483
    Multiply-claimed block(s) in inode 24270909: 97263574 97263575
    Multiply-claimed block(s) in inode 24270931: 97263606 97263607
    Pass 1C: Scanning directories for inodes with multiply-claimed blocks
    Pass 1D: Reconciling multiply-claimed blocks
    (There are 1334 inodes containing multiply-claimed blocks.)

    File /Backups/Tesseract/DrivingLicenceReverse_300dpi.bmp (inode #24248332, mod time Thu Mar 25 01:34:37 2010)
       has 136 multiply-claimed block(s), shared with 7 file(s):
             /Backups/Galaxy/suse-11.4/bin/bash (inode #24269252, mod time Thu Jul 12 20:04:07 2012)
             /Backups/Galaxy/suse-11.4/bin/basename (inode #24269251, mod time Wed Sep 21 16:30:45 2011)
             /Backups/Galaxy/suse-11.4/bin/arch (inode #24269250, mod time Wed Sep 21 16:30:45 2011)
             /Backups/Galaxy/suse-11.4/.local/share/applications/defaults.list (inode #24269249, mod time Mon Sep 12 19:44:00 2011)
             /Backups/Galaxy/suse-11.4/.config/Trolltech.conf (inode #24269248, mod time Wed Oct 26 13:59:14 2011)
             /Backups/Galaxy/suse-11.4/profilerc (inode #24269247, mod time Mon Sep 12 19:44:00 2011)
             /Backups/Galaxy/suse-11.4/C:\nppdf32Log\debuglog.txt (inode #24269246, mod time Sun Sep  9 14:37:47 2012)
    Clone multiply-claimed blocks<y>? yes

    File /Backups/Tesseract/wla_user_guide.pdf (inode #24248352, mod time Thu Nov 13 12:18:26 2003)
       has 1310 multiply-claimed block(s), shared with 107 file(s):
             /Backups/Galaxy/suse-11.4/bin/tcsh (inode #24269354, mod time Sat Feb 19 02:49:24 2011)
             /Backups/Galaxy/suse-11.4/bin/tar (inode #24269353, mod time Tue Jan  3 00:33:47 2012)
             /Backups/Galaxy/suse-11.4/bin/sync (inode #24269352, mod time Wed Sep 21 16:30:49 2011)
             /Backups/Galaxy/suse-11.4/bin/su (inode #24269351, mod time Wed Sep 21 16:30:49 2011)
             /Backups/Galaxy/suse-11.4/bin/stty (inode #24269350, mod time Wed Sep 21 16:30:48 2011)
             /Backups/Galaxy/suse-11.4/bin/stat (inode #24269349, mod time Wed Sep 21 16:30:48 2011)
             /Backups/Galaxy/suse-11.4/bin/spawn_login (inode #24269348, mod time Sat Feb 19 02:46:10 2011)
             /Backups/Galaxy/suse-11.4/bin/spawn_console (inode #24269347, mod time Sat Feb 19 02:46:10 2011)
    ... etc etc ...

On examining the contents of these files, it became evident that in each 
case the newly copied files in Backups/Galaxy/suse-11.4/ were correct, 
while the named files in Backups/Tesseract/... were corrupted. Hence my 
conclusion that some of the blocks already allocated to the latter were 
erroneously taken to be free and used for the new files copied in by rsync.
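
The "examination" was nothing clever -- just byte-comparing the new 
copies against the rsync source, which was still mounted (paths here 
are illustrative):

    # cmp /mnt/foreign-root/bin/bash /export/share/Backups/Galaxy/suse-11.4/bin/bash

cmp was silent for the new copies, while the old Tesseract files no 
longer matched their expected contents.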

    ...
    File /Backups/Galaxy/suse-11.4/etc/gconf/gconf.xml.schemas/%gconf-tree-oc.xml (inode #24270909, mod time Sun Aug 14 21:50:15 2011)
       has 2 multiply-claimed block(s), shared with 2 file(s):
             <filesystem metadata>
             /Backups/Tesseract/Audio/Jack Ruston & Mark Edwards/The Man in the Picture, by Susan Hill (CD 1 of 3)/06__Chapter 5.ogg (inode #24248358, mod time Fri Feb  4 22:53:03 2011)
    Multiply-claimed blocks already reassigned or cloned.

    File /Backups/Galaxy/suse-11.4/etc/gconf/gconf.xml.schemas/%gconf-tree-wa.xml (inode #24270931, mod time Sun Aug 14 21:50:20 2011)
       has 2 multiply-claimed block(s), shared with 2 file(s):
             <filesystem metadata>
             /Backups/Tesseract/Audio/Jack Ruston & Mark Edwards/The Man in the Picture, by Susan Hill (CD 1 of 3)/06__Chapter 5.ogg (inode #24248358, mod time Fri Feb  4 22:53:03 2011)
    Multiply-claimed blocks already reassigned or cloned.

    Pass 2: Checking directory structure
    Pass 3: Checking directory connectivity
    Pass 4: Checking reference counts
    Pass 5: Checking group summary information
    Block bitmap differences:  +96998147
    Fix<y>? yes

    Free blocks count wrong for group #1133 (0, counted=156).
    Fix<y>? yes

    Free blocks count wrong for group #1134 (0, counted=943).
    Fix<y>? yes

    ... etc etc ...

    Free blocks count wrong for group #6019 (32768, counted=0).
    Fix<y>? yes

    Free blocks count wrong for group #6020 (32768, counted=0).
    Fix<y>? yes

    ...

    Directories count wrong for group #4465 (0, counted=29).
    Fix<y>? yes

    Free inodes count wrong (52421173, counted=51433277).
    Fix<y>? yes


    share: ***** FILE SYSTEM WAS MODIFIED *****

       995523 inodes used (1.90%)
         1231 non-contiguous files (0.1%)
          980 non-contiguous directories (0.1%)
              # of inodes with ind/dind/tind blocks: 0/0/0
              Extent depth histogram: 955338/210/3
    195882827 blocks used (93.40%)
            0 bad blocks
           38 large files

       859488 regular files
        90714 directories
           94 character device files
           64 block device files
           16 fifos
        79548 links
        44961 symbolic links (39613 fast symbolic links)
          177 sockets
    --------
      1075062 files

Because I suspected the FS might have been corrupted by pvmove shuffling 
its data between volumes (or even by the md remirroring process going on 
underneath that!), I put the old PV that I had recently removed from the 
VG into a new VG of its own, and used lvcreate/lvextend to resurrect the 
original copy of the FS:

    # lvcreate --verbose --name replay --extents 171751 --zero n test_vg /dev/md126:65536-
    # lvextend --verbose --extents 204800 /dev/test_vg/replay /dev/md126:30720-63768
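
The PV:PE-PE suffixes pin the new LV onto the same physical extents the 
original LV occupied on the old array; the resulting layout can be 
double-checked with:

    # lvdisplay --maps /dev/test_vg/replay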

Running

    # e2fsck -f -n /dev/test_vg/replay

showed exactly the same corruption. Thus it seems that the FS was 
already damaged before it was mirrored onto the new volume, which is why 
I suspect the problem lies in EXT4 rather than LVM or md.

Here's the output of dumpe2fs -h as it was after the corruption but 
before letting e2fsck fix it:

Filesystem volume name:   share
Last mounted on:          /export/share
Filesystem UUID:          80477518-0fea-447a-bece-f77fe26193bb
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean with errors
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              52428800
Block count:              209715200
Reserved block count:     10484660
Free blocks:              13897914
Free inodes:              51433277
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      974
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
RAID stride:              128
RAID stripe width:        256
Flex block group size:    16
Filesystem created:       Wed Feb  6 15:50:31 2013
Last mount time:          Mon Jul 15 17:51:37 2013
Last write time:          Mon Jul 15 18:01:03 2013
Mount count:              24
Maximum mount count:      -1
Last checked:             Thu Feb  7 18:33:49 2013
Check interval:           0 (<none>)
Lifetime writes:          480 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      5ff8295f-3988-40e0-b195-998d6e67aa31
Journal backup:           inode blocks
FS Error count:           1
First error time:         Mon Jul 15 18:01:03 2013
First error function:     ext4_mb_generate_buddy
First error line #:       739
First error inode #:      0
First error block #:      0
Last error time:          Mon Jul 15 18:01:03 2013
Last error function:      ext4_mb_generate_buddy
Last error line #:        739
Last error inode #:       0
Last error block #:       0
Journal features:         journal_incompat_revoke
Journal size:             128M
Journal length:           32768
Journal sequence:         0x0000645d
Journal start:            0

As it happens, only 13 existing files (containing a total of 65Mb of data between them) were damaged,
and they were mostly large but ancient and not very important content backed up from other machines.
So I've had something of a lucky escape; and I've subsequently changed all live volumes to use
errors=remount-ro rather than errors=continue, which I had never realised was the default!
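
(For anyone else doing the same, the behaviour can also be set in the 
superblock itself rather than in fstab -- device name illustrative:

    # tune2fs -e remount-ro /dev/share_vg/share

so that it applies even when the FS is mounted without explicit options.)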

I can provide any information you'd like about the corrupted FS, as I've preserved it in that state since
(modulo anything that might have been changed by mounting it read-only). But I don't have any way of finding
out what the internal state was when it was last mounted or immediately before the corruption occurred.

Hope this helps -- and let me know if there's anything you'd like me to
extract from the corrupted FS.

Ciao,
Dave

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1202994

Title:
  EXT4 filesystem corruption with uninit_bg and error=continue

Status in “linux” package in Ubuntu:
  Confirmed

Bug description:
  There was a long and complicated sequence of activities involving
  mdadm, lvm, and specifically pvmove leading up to the point where the
  corruption was discovered, but I suspect most were irrelevant. AFAICT,
  the bug was triggered by the following simple operations:

  * the FS was unmounted & remounted -- thus, the journal was fresh and hadn't wrapped (which other reports appear to indicate would have prevented the bug showing up)
  * the FS options include uninit_bg AND errors=continue (a quick check for this combination is sketched below)
  * a bunch of files were then copied onto the FS -- this was the last write operation on the FS.
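
  A quick way to check whether a given FS carries this combination (the
  device name here is a placeholder):

      # dumpe2fs -h /dev/share_vg/share | grep -E 'features|Errors behavior'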

  Later, e2fsck indicated a bunch of problems, including corrupted group
  descriptors. Specifically, it found that many blocks were now claimed
  by two files; in each case, one was an old file and one was one of
  those newly copied, and the contents matched the expected data for the
  latter.

  So I think this starts with an instance of the miscalculation of
  checksums in uninit_bg blocks (fixed by Ted Ts'o last June), followed
  by the (invalid or uninitialised) bitmap being used anyway (because
  errors=continue) and the blocks it appeared to show as free then being
  allocated to new files.

  Jul 15 18:01:03 redshift kernel: [ 9332.021245] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:739: group 2968, 8105 clusters in bitmap, 0 in gd
  ...
  Jul 16 18:05:14 redshift kernel: [95982.560034] EXT4-fs (dm-1): error count: 1
  Jul 16 18:05:14 redshift kernel: [95982.560044] EXT4-fs (dm-1): initial error at 1373907663: ext4_mb_generate_buddy:739
  Jul 16 18:05:14 redshift kernel: [95982.560053] EXT4-fs (dm-1): last error at 1373907663: ext4_mb_generate_buddy:739
  ...
  Jul 16 20:53:19 redshift kernel: [106068.077526] EXT4-fs (dm-1): ext4_check_descriptors: Checksum for group 0 failed (47831!=4825)
  Jul 16 20:53:19 redshift kernel: [106068.077540] EXT4-fs (dm-1): ext4_check_descriptors: Checksum for group 1 failed (14670!=8882)

  I see that in an astonishing display of synchronicity, Darrick J Wong
  filed a patch on 17 Jul 2013 at 04:02 -- the very next day, or maybe
  even the same day, depending on timezone -- to prevent the knock-on
  effects (see "[PATCH] ext4: Prevent massive fs corruption if verifying
  the block bitmap fails" at
  http://permalink.gmane.org/gmane.comp.file-systems.ext4/39535 ).

  But what puzzles me is that the initial triggering bug is apparently
  still in this kernel (vmlinuz-3.2.0-49-generic), even though according
  to this conversation https://bugzilla.kernel.org/show_bug.cgi?id=42723#c8
  the fix was backported to 3.2.20. Is it possible that there is another
  way of getting the "ext4_mb_generate_buddy:739" error?
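
  (One way to double-check what the Ubuntu tree actually contains --
  the repo path is quoted from memory, so it may need adjusting:

      $ git clone git://kernel.ubuntu.com/ubuntu/ubuntu-precise.git
      $ cd ubuntu-precise
      $ git log --oneline --grep='uninit_bg' -- fs/ext4/balloc.c

  which should list the backport of b0dd6b7 if it's really there.)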

  I have kept an e2image dump of the corrupted FS in case it's of any
  use to EXT4 developers, but it's not attached, as even in QCOW2 format
  it's ~1Gb.
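
  (For reference, the dump was made with e2image's qcow2 mode, roughly:

      # e2image -Q /dev/test_vg/replay share-metadata.qcow2

  -- the output filename is illustrative.)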

  ProblemType: Bug
  DistroRelease: Ubuntu 12.04
  Package: linux-image-3.2.0-49-generic 3.2.0-49.75
  ProcVersionSignature: Ubuntu 3.2.0-49.75-generic 3.2.46
  Uname: Linux 3.2.0-49-generic x86_64
  AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.24.
  ApportVersion: 2.0.1-0ubuntu17.3
  Architecture: amd64
  AudioDevicesInUse:
   USER        PID ACCESS COMMAND
   /dev/snd/controlC1:  dsg        7005 F.... pulseaudio
   /dev/snd/controlC0:  dsg        7005 F.... pulseaudio
  CRDA:
   country AW:
   	(2402 - 2482 @ 40), (N/A, 20)
   	(5170 - 5250 @ 40), (N/A, 20)
   	(5250 - 5330 @ 40), (N/A, 20), DFS
   	(5490 - 5710 @ 40), (N/A, 27), DFS
  Card0.Amixer.info:
   Card hw:0 'SB'/'HDA ATI SB at 0xfe024000 irq 16'
     Mixer name	: 'Realtek ALC892'
     Components	: 'HDA:10ec0892,1458a102,00100302'
     Controls      : 46
     Simple ctrls  : 21
  Card1.Amixer.info:
   Card hw:1 'HDMI'/'HDA ATI HDMI at 0xfdefc000 irq 19'
     Mixer name	: 'ATI RS690/780 HDMI'
     Components	: 'HDA:1002791a,00791a00,00100000'
     Controls      : 4
     Simple ctrls  : 1
  Card1.Amixer.values:
   Simple mixer control 'IEC958',0
     Capabilities: pswitch pswitch-joined penum
     Playback channels: Mono
     Mono: Playback [on]
  Date: Thu Jul 18 19:04:57 2013
  HibernationDevice: RESUME=UUID=2ab26064-3b90-475d-b3c2-51a70c2d990a
  InstallationMedia: Kubuntu 12.04.1 LTS "Precise Pangolin" - Release amd64 (20120822.2)
  MachineType: Gigabyte Technology Co., Ltd. GA-890GPA-UD3H
  MarkForUpload: True
  ProcEnviron:
   LANGUAGE=en_GB
   TERM=xterm
   PATH=(custom, no user)
   LANG=en_GB.UTF-8
   SHELL=/bin/bash
  ProcFB: 0 radeondrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-3.2.0-49-generic root=/dev/mapper/system-kubuntu ro quiet splash vt.handoff=7
  RelatedPackageVersions:
   linux-restricted-modules-3.2.0-49-generic N/A
   linux-backports-modules-3.2.0-49-generic  N/A
   linux-firmware                            1.79.4
  RfKill:
   0: phy0: Wireless LAN
   	Soft blocked: yes
   	Hard blocked: no
  SourcePackage: linux
  UpgradeStatus: No upgrade log present (probably fresh install)
  dmi.bios.date: 07/23/2010
  dmi.bios.vendor: Award Software International, Inc.
  dmi.bios.version: FD
  dmi.board.name: GA-890GPA-UD3H
  dmi.board.vendor: Gigabyte Technology Co., Ltd.
  dmi.board.version: x.x
  dmi.chassis.type: 3
  dmi.chassis.vendor: Gigabyte Technology Co., Ltd.
  dmi.modalias: dmi:bvnAwardSoftwareInternational,Inc.:bvrFD:bd07/23/2010:svnGigabyteTechnologyCo.,Ltd.:pnGA-890GPA-UD3H:pvr:rvnGigabyteTechnologyCo.,Ltd.:rnGA-890GPA-UD3H:rvrx.x:cvnGigabyteTechnologyCo.,Ltd.:ct3:cvr:
  dmi.product.name: GA-890GPA-UD3H
  dmi.sys.vendor: Gigabyte Technology Co., Ltd.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1202994/+subscriptions

