kernel-packages team mailing list archive
Message #09504
[Bug 1216397] Re: It should be possible to ignore (skip probing) a known bad disk partition at boot
** Attachment added: "syslog_pc_better"
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1216397/+attachment/3786323/+files/syslog_pc_better
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1216397
Title:
It should be possible to ignore (skip probing) a known bad disk
partition at boot
Status in “linux” package in Ubuntu:
Incomplete
Bug description:
Hi all,
I guess this isn't exactly a bug - maybe more of a wishlist; but I
thought it'd be good to log it while I'm experiencing the problem.
Basically, I'd like a consistent kernel interface to mark faulty
partitions (or drives) at boot time, regardless of the kernel
subsystem (IDE, ATA) they are attributed to - but first, here is a
(somewhat lengthy) exposition of my problem.
I have a desktop PC which, some months ago, experienced a hard disk failure - I heard a loud scratching noise during disk operation, after which the bootable hard disk partition was unusable. At the time, I had Ubuntu Lucid installed on that bootable partition. Actually handling the broken disk at this time was/is not an option for me, as it would open up a whole lotta other (unrelated) work which I have to postpone to a future date. So, I've been using this PC in the past months primarily through live bootable media - booting from a live USB flash thumbdrive, to be exact.
The problem here is that, at boot, the kernel starts probing all disks
indiscriminately. Certain distributions on the live USB I've tried,
like PartedMagic or SliTaz, start probing and encounter the bad
partitions; then 4-5 loud clicks can be heard from the drive, followed
by access error messages in the log - that takes about 20-30 seconds,
and then the respective kernels give up, and the boot procedure
completes successfully. Note that these distributions will recognize
both the bad and the healthy partitions on this drive, and I have been
mounting and using the healthy partitions from these live USB distros
without a problem.
However, when I try an Ubuntu-based live USB: when the kernel
encounters the bad partition, it starts looping and accessing the
drive, and loud clicks (followed by access errors in logs) can be
heard way more often; and this loop can last up to 5-10 minutes - in
all, a rather unhealthy experience. The latest I have tried is a
Precise 12.04 based custom distro - based on Ubuntu Mini Remix with
few extra packages, built using `ubuntu-builder`, with the `casper`
files transferred to USB stick, previously made bootable manually via
`syslinux`. With this distro's boot, I waited out the 272.417 seconds
(some 4.5 mins) where this error loop occurred, and the system finally
booted; so I could obtain the logs (/var/log/syslog &
/var/log/boot.log), that I am attaching to this post (syslog_pc_bad_hd
and boot_pc_bad_hd.log). Thus, we can now see the messages spit out by
the kernel on encountering the problem:
[ 247.251272] ata5.00: failed command: READ DMA
[ 247.254158] ata5.00: cmd c8/00:08:18:01:00/00:00:00:00:00/e0 tag 0 dma 4096 in
[ 247.254160] res 51/40:00:18:01:00/00:00:00:00:00/e0 Emask 0x9 (media error)
[ 247.259985] ata5.00: status: { DRDY ERR }
[ 247.262905] ata5.00: error: { UNC }
[ 247.288574] ata5.00: configured for UDMA/33
[ 247.291390] ata5: EH complete
[ 248.614902] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 248.617781] ata5.00: BMDMA stat 0x24
[ 248.620647] ata5.00: failed command: READ DMA
[ 248.623515] ata5.00: cmd c8/00:08:18:01:00/00:00:00:00:00/e0 tag 0 dma 4096 in
[ 248.623517] res 51/40:00:18:01:00/00:00:00:00:00/e0 Emask 0x9 (media error)
[ 248.629356] ata5.00: status: { DRDY ERR }
[ 248.632265] ata5.00: error: { UNC }
[ 248.656576] ata5.00: configured for UDMA/33
[ 248.659393] ata5: EH complete
...
[ 254.136571] ata5.00: configured for UDMA/33
[ 254.139458] sd 5:0:0:0: [sdb] Unhandled sense code
[ 254.142395] sd 5:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 254.145393] sd 5:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor]
[ 254.148418] Descriptor sense data with sense descriptors (in hex):
[ 254.151422] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[ 254.154438] 00 00 01 18
[ 254.157448] sd 5:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed
[ 254.160556] sd 5:0:0:0: [sdb] CDB: Read(10): 28 00 00 00 01 18 00 00 08 00
[ 254.163673] end_request: I/O error, dev sdb, sector 280
[ 254.166742] Buffer I/O error on device sdb, logical block 280
[ 254.169819] Buffer I/O error on device sdb, logical block 281
[ 254.172915] Buffer I/O error on device sdb, logical block 282
[ 254.175978] Buffer I/O error on device sdb, logical block 283
[ 254.178991] Buffer I/O error on device sdb, logical block 284
[ 254.181951] Buffer I/O error on device sdb, logical block 285
[ 254.184878] Buffer I/O error on device sdb, logical block 286
[ 254.187688] Buffer I/O error on device sdb, logical block 287
[ 254.190516] ata5: EH complete
... and as it can be seen from the log, an access is made each second
or so.
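To make the "each second or so" observation checkable, here is a minimal Python sketch that extracts the timestamps of the `failed command` lines and computes the gap between consecutive failures. The `LOG` excerpt is copied from the syslog lines quoted above; the function names are illustrative only and not part of any existing tool.

```python
import re

# Excerpt copied from the syslog lines quoted in this report.
LOG = """\
[ 247.251272] ata5.00: failed command: READ DMA
[ 247.291390] ata5: EH complete
[ 248.620647] ata5.00: failed command: READ DMA
[ 248.659393] ata5: EH complete
"""

# Kernel log prefix: "[ seconds.microseconds] message"
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

def failed_command_times(log_text):
    """Return the timestamps (in seconds) of every 'failed command' line."""
    times = []
    for line in log_text.splitlines():
        m = STAMP.match(line)
        if m and "failed command" in m.group(2):
            times.append(float(m.group(1)))
    return times

def intervals(times):
    """Return the gaps between consecutive timestamps, rounded to microseconds."""
    return [round(b - a, 6) for a, b in zip(times, times[1:])]

times = failed_command_times(LOG)
gaps = intervals(times)
print(gaps)
```

Run against the full attached syslog instead of this excerpt, the same functions would show how regular the retry interval is over the whole 4.5 minute loop.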
Now, I cannot really tell whether, on all the previous occasions when
I've seen this error, the kernel would have booted had I waited the
faulty access loop out - however, the sounds made are just so
horrible and unhealthy, I never _dared_ to wait them out previously
(since I eventually intend to try to clone/salvage the broken
partition when I get the time for it).
From the log, it's obvious that an ATA-related subsystem is the one
handling the faulty partition; unfortunately, I couldn't really see
what driver is at fault - until I waited out the problematic loop, and
the OS booted, so I could issue `lsmod | grep -i ata`. When this boot
run gave me opportunity to do so, the lsmod command revealed that the
only matching driver is `pata_atiixp`. Note that after this error is
waited out and the OS boots, the healthy partitions of the disk are
visible (and probably mountable) in the OS through `sudo fdisk -l`.
So, at this point - what I would have best liked, would be a way to
"mark" the unhealthy partition at boot time - possibly with a boot
parameter to the `kernel` entry in `syslinux`; such that when the
driver encounters the bad drive, it would simply "skip" it, after the
first faulty access - and then proceed with the boot process. I don't
know enough of the technical details to tell whether this is
possible/realistic, though: the log messages say the error is
encountered on the _drive_ (sdb), not on the partition (I think it was
sdb5, but I'm not sure).
At this point, I found resources like:
http://askubuntu.com/questions/230396/can-i-prevent-an-identify-packet-device-command-to-a-specific-device-at-boot
or:
http://serverfault.com/questions/112147/tell-ubuntu-to-ignore-dead-hard-drive-during-booting
> I know my secondary hard drive is bad, but I cannot take it out.
> Every Reboot takes forever because Ubuntu tries to read from it for
> a long time and reports errors:
>
> [ 228.984480] sd 0:0:1:0: [sdb] Add. Sense:
> Unrecovered read error - auto reallocate failed
> ...
> Looks like there may be a way to tell udev to ignore it, though I
> don't have access to a system right now on which to test this.
>
> As root, open up /etc/udev/rules.d/60-persistent-storage.rules [...]
> Add "sdb*" to that second line, so it looks like this:
>
> KERNEL=="ram*|loop*|fd*|nbd*|gnbd*|dm-*|md*|sdb*", GOTO="persistent_storage_end"
>
> Save the file and then reboot.
So, I tried this with the `60-persistent-storage.rules` file that
ended up on the USB flash image - I tried these changes:
# commenting these lines:
##ACTION=="add", SUBSYSTEM=="module", KERNEL=="block", ATTR{parameters/events_dfl_poll_msecs}=="0", ATTR{parameters/events_dfl_poll_msecs}="2000"
##ACTION=="add", ATTR{removable}=="1", ATTR{events_poll_msecs}=="-1", ATTR{events_poll_msecs}="2000"
# adding sdb here:
KERNEL=="fd*|mtd*|nbd*|gnbd*|btibm*|dm-*|md*|sdb*", GOTO="persistent_storage_end"
# commenting these lines:
##ENV{ID_PART_ENTRY_SCHEME}=="gpt", ENV{ID_PART_ENTRY_UUID}=="?*", SYMLINK+="disk/by-partuuid/$env{ID_PART_ENTRY_UUID}"
##ENV{ID_PART_ENTRY_SCHEME}=="gpt", ENV{ID_PART_ENTRY_NAME}=="?*", SYMLINK+="disk/by-partlabel/$env{ID_PART_ENTRY_NAME}"
... and they made absolutely no difference - the same troublesome
faulty-access loop still occurs without any changes.
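As a sanity check that the edited `KERNEL==` line really does match the device names in question, here is a small Python sketch. udev match keys accept shell-style globs with `|` separating alternatives; Python's `fnmatch` is used here only as a stand-in for udev's own matcher (an assumption, the two may differ in edge cases), and `udev_key_matches` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase

def udev_key_matches(pattern, value):
    """Approximate a udev '==' match: any '|'-separated glob may match."""
    return any(fnmatchcase(value, alt) for alt in pattern.split("|"))

# Pattern copied from the modified rule quoted above.
PATTERN = "fd*|mtd*|nbd*|gnbd*|btibm*|dm-*|md*|sdb*"

for name in ("sdb", "sdb5", "sda1"):
    print(name, udev_key_matches(PATTERN, name))
```

If the pattern matches as expected, the rule itself is probably not the problem. A plausible reading (not verified here) is that this rules file only governs what udev does with a block device after the kernel has already detected and read it, which would be consistent with the change making no difference to the error loop.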
Then I found
http://askubuntu.com/questions/145965/how-do-i-target-a-specific-driver-for-libata-kernel-parameter-modding
> At http://www.kernel.org/doc/Documentation/kernel-parameters.txt, it states,
>
> libata.force= [LIBATA] Force configurations. The format is comma
> separated list of "[ID:]VAL" where ID is
> PORT[.DEVICE]. PORT and DEVICE are decimal numbers
> matching port, link or device. [...]
>
> So, in my setup, it seems I need to do
>
> libata.force=1:1.5G,2:1.5G,3:1.5G
So I tried appending these to the `syslinux` boot entry:
libata.force=5.00:ign
libata.force=5.00:norst
... however, they make absolutely no difference either.
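For reference, the `[ID:]VAL` format quoted above (with ID being `PORT[.DEVICE]`) can be read as in the following Python sketch. It only illustrates how the two attempted values decompose; it is not the kernel's parser, and `parse_libata_force` is a hypothetical name.

```python
def parse_libata_force(value):
    """Split a libata.force string into (port, device, val) entries.

    Follows the documented format: comma-separated "[ID:]VAL",
    where ID is PORT[.DEVICE] and both are decimal numbers.
    """
    entries = []
    for item in value.split(","):
        ident, sep, val = item.partition(":")
        if not sep:
            # No "ID:" prefix, so the value is not tied to a port.
            entries.append({"port": None, "device": None, "val": item})
            continue
        port, _, device = ident.partition(".")
        entries.append({
            "port": int(port),
            "device": int(device) if device else None,
            "val": val,
        })
    return entries

print(parse_libata_force("5.00:ign"))
print(parse_libata_force("1:1.5G,2:1.5G,3:1.5G"))
```

Read this way, `5.00:ign` and `5.00:norst` both address port 5, device 0, which matches the `ata5.00` prefix in the log; whether either value is meant to stop the retries is a separate question that this sketch does not answer.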
Looking further through Documentation/kernel-parameters.txt, I saw:
> ide-core.nodma= [HW] (E)IDE subsystem
> Format: =0.0 to prevent dma on hda, =0.1 hdb =1.0 hdc
> .vlb_clock .pci_clock .noflush .nohpa .noprobe
... and so I tried to add these to the `syslinux` boot entry:
sdb=noprobe
acpi=off noapic sdb=none
... and those didn't work either - apparently since this drive is not
handled as IDE, and it's not recognized as `hd` - but as `sd` device.
However, note that when, say, SliTaz boots, when it encounters the
faulty drive, it spits out something like:
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=280, sector=280
hda: possibly failed opcode: 0xc8
end_request: I/O error, dev hda, sector 280
Buffer I/O error on device hda, logical block 280
... so I gather that, basically:
* the SliTaz Linux kernel (not sure with which driver) sees the drive as `hda` - while
* the Ubuntu Linux kernel (with `pata_atiixp` driver, apparently) sees the same drive as `sdb`.
Because of this difference, I compared the
`60-persistent-storage.rules` from Ubuntu and SliTaz - and those
differed in only four lines, the ones that say "commenting these
lines" in my snippet above; needless to say, that wasn't enough to get
a change in response.
Seeing how this could be driver related, as a last resort, I thought of suppressing the `pata_atiixp` driver from loading at boot time. For this, I found the following page:
https://help.ubuntu.com/lts/installation-guide/i386/boot-parms.html
> You can blacklist a module using the following syntax: module_name.blacklist=yes.
> This will cause the module to be blacklisted in /etc/modprobe.d/blacklist.local
> both during the installation and for the installed system.
So I decided to add `pata_atiixp.blacklist=yes` to the `syslinux` boot
line - the whole entry being:
label ubuntu1204mini
kernel /ubu1204/casper/vmlinuz
append boot=casper initrd=/ubu1204/casper/initrd.lz live-media-path=/ubu1204/casper/ ignore_uuid noplymouth pata_atiixp.blacklist=yes --
Now, this apparently _did_ cause something - because I couldn't hear
many faulty accesses; the boot process completed superfast in
comparison - and the faulty disk (including its healthy partitions)
is no longer listed by `sudo fdisk -l` after the OS boots from the
USB thumbdrive. The logs from this will be posted as
`syslog_pc_better` and `boot_pc_better.log`; from the syslog, it is
notable that we see:
[ 7.559862] pata_atiixp: Unknown parameter `blacklist'
... so apparently the syntax I used (listed in the Ubuntu installation
guide) is no longer valid; I'm not sure if it is this that prevented
`pata_atiixp` from loading - but for sure, `pata_atiixp` doesn't get
listed by `lsmod` anymore, once the OS boots. So, what I conclude - with
this Ubuntu 12.04-based kernel:
* I can leave the boot process as is, and use `pata_atiixp`; encounter a long loop with unhealthy sounding faulty drive accesses during boot; and have access to the healthy partitions when OS boots
* I can change the boot process, so `pata_atiixp` is not loaded; experience a very fast boot in comparison; but not have access to the healthy partitions anymore when OS boots.
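The `Unknown parameter` message quoted earlier is consistent with the kernel treating a dotted `module.param=value` token on the command line as a parameter for that module, rather than as a blacklist request. The Python sketch below shows how the `append` line from the `syslinux` entry would decompose under that reading. It is only an illustration of the naming convention, not the kernel's actual parsing code, and `module_params` is a hypothetical helper.

```python
def module_params(cmdline):
    """Collect tokens of the form module.param=value from a kernel command line."""
    found = []
    for token in cmdline.split():
        if token == "--":
            break  # stop at the "--" separator that ends the append line
        key, sep, value = token.partition("=")
        module, dot, param = key.partition(".")
        # Only keys with a dot before the "=" name a module parameter.
        if sep and dot and module and param:
            found.append((module, param, value))
    return found

# Copied from the syslinux entry quoted above.
APPEND = ("boot=casper initrd=/ubu1204/casper/initrd.lz "
          "live-media-path=/ubu1204/casper/ ignore_uuid noplymouth "
          "pata_atiixp.blacklist=yes --")

print(module_params(APPEND))
```

Under this interpretation, `pata_atiixp` would have been offered a parameter named `blacklist` that it does not define, which may be what kept it from loading; that is a guess from the log line, not something confirmed here.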
And here is the crux of my wish: when one partition of a drive is broken, but the other partitions of the drive are healthy and can be used, it would be great if the user could ignore/blacklist/mark the bad partition as faulty using a boot parameter, and have the kernel ignore it during boot probing: either ignore it completely (as in, not attempt to probe it at all) - or, once an error is encountered during the probe, stop probing it and move on, instead of looping and repeating the access error.
Now, there are obviously kernel boot parameters for that purpose for
IDE - and maybe there are some more specific parameters I could have
used in this ATA case of mine, but I didn't have luck in finding them
in time. And since hard disk subsystems and drivers are an extremely
complex area to understand, it is very difficult for me (the end user)
to figure out what would be the appropriate boot switches (if any) per
driver/subsystem. Thus, I would have ideally wished for a **single**
kernel parameter, say `ignore_part`, which could be appended to kernel
boot line, e.g. in `syslinux`:
label ubuntu1204mini
kernel /ubu1204/casper/vmlinuz
append boot=casper initrd=/ubu1204/casper/initrd.lz ignore_part=UUID:"a4ae3a96..." --
... which would then propagate through the boot process, and reach
any/all drivers which might handle hard disks (block devices?),
regardless of what subsystem (IDE, ATA, SCSI) they may belong to.
Then, if a particular driver encounters errors during probe of this
particular device, this parameter would instruct it to stop further
probing, and continue with the rest of the boot process - instead of
senselessly looping in faulty accesses (which are likely to worsen the
condition of the drive), until the rest of the kernel has a chance to
wrestle control out of the fault, and proceed with the boot some
minutes after.
Now, I don't know enough about drives/block devices to know to what
extent the above is possible per-partition. I know that at boot start,
labels like 'sdb' may not even be attributed to devices yet - but I
speculate that, if the master partition record of a disk is healthy,
then the driver, once it encounters an error on "logical block 280"
of "sdb", _might_ be able to calculate that this block 280 is in the
scope of partition "sdb5" or UUID:"a4ae3a96..." - and as such, could
determine that repeating the access, in case of error for that
location, is not worth it (and thus avoid the loop). Obviously, the
"sdb" label may change - but given that, per OS, it seems to be
consistent, it could simply be remembered as a parameter string during
the boot process, to which a driver could compare the name of its
current device once it obtains it (UUIDs would be better here, I
guess). The user would typically boot once, note the errors and the
name under which the fault is encountered - and then reboot, adding
that name as a known bad partition. In any case, if it is not possible
per-partition, I guess it should be possible to do per-drive faulty
marking (although that would make the healthy partitions ultimately
unavailable).
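The block-to-partition lookup speculated about above is, in principle, a simple range check once the partition table has been read. The Python sketch below illustrates the idea only: the partition names, start sectors and sizes are invented for the example (they are not taken from the reporter's disk), `ignore_part` is the hypothetical parameter proposed in this report, and nothing here reflects how any kernel driver actually decides on retries.

```python
from typing import NamedTuple, Optional

class Partition(NamedTuple):
    name: str
    start: int    # first sector of the partition
    sectors: int  # length of the partition in sectors

def partition_for_sector(table, sector) -> Optional[str]:
    """Return the name of the partition containing `sector`, if any."""
    for p in table:
        if p.start <= sector < p.start + p.sectors:
            return p.name
    return None

def should_retry(table, sector, ignored) -> bool:
    """Retry a failed read unless it falls inside a partition marked as ignored."""
    return partition_for_sector(table, sector) not in ignored

# Hypothetical layout, invented purely for illustration.
TABLE = [Partition("p1", 63, 1000), Partition("p2", 1063, 5000)]
IGNORED = {"p1"}  # what a hypothetical ignore_part= setting might select

print(partition_for_sector(TABLE, 280))   # which partition holds sector 280 here
print(should_retry(TABLE, 280, IGNORED))  # inside an ignored partition
print(should_retry(TABLE, 2000, IGNORED)) # inside a partition that is not ignored
```

The sketch assumes the partition table itself is readable; if the failing sector lies outside every partition (or in the table area), `partition_for_sector` returns `None` and the per-partition approach cannot apply, which is the case where per-drive marking would be the only fallback.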
Well, I believe that exhausts most of my thoughts about this - it would be nice to know to what extent something like this is possible to implement in the current Linux architecture...
Thanks for the attention,
Cheers!
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1216397/+subscriptions