kernel-packages team mailing list archive

[Bug 1561830] Re: Hard disk writes fail in 16.04 daily on nForce 430

To: kernel-packages@xxxxxxxxxxxxxxxxxxx
From: Stephen Worthington <1561830@xxxxxxxxxxxxxxxxxx>
Date: Fri, 01 Apr 2016 09:34:40 -0000
Reply-to: Bug 1561830 <1561830@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

After quite a bit of messing around, I found a way to test kernels on
16.04 beta properly. I installed my HighPoint Rocket 622A eSATA card
and plugged the Samsung HD103UJ drive into it using a long eSATA to SATA
cable. That allowed me to boot from my 16.04 daily DVD and do an
install to the HD103UJ without any problems. I did an apt-get upgrade
and got the latest kernel (4.4.0-16-generic), and also installed the
mainline kernels 4.4.0-040400-generic and 4.5.0-040500-generic. Then I
checked the other drive installed on the box (Seagate ST31000528AS, used
for testing Windows 10) and found that its 200 GiB NTFS data
partition was almost empty, so I resized it and created a tiny EXT2
partition for Grub2, a 50 GiB EXT4 partition for installing to, and a
10 GiB swap partition, all at the end of that drive. Then I rebooted
to the 16.04 beta DVD and installed to the Seagate drive. Then I
rebooted to the install on the Samsung drive and ran update-grub, to get
the install on the Seagate drive bootable from the Grub on the Samsung
drive. Then I booted using the Samsung drive and selected the new 16.04
beta install on the Seagate drive to boot. It did, to my surprise, as
the Seagate drive was on the motherboard nForce 430 SATA controller. So
I then mounted the Samsung drive from the booted Seagate install, and
tried the test dd commands, and they also all worked with no errors.

So the first conclusion I have come to is that the bug seems to only be
triggered by the Samsung HD103UJ drive when it is on a motherboard
nForce 430 SATA port. It does not happen when that drive is on the
Rocket 622A's Marvell SATA port. And the bug also does not happen when
using the Seagate ST31000528AS drive on a motherboard nForce 430 SATA
port. It seems to require that particular drive on that particular SATA
controller, and using the standard 16.04 beta kernels, for the bug to
occur.

To prevent problems with the swapper using the swap partition on the
Samsung HD103UJ drive, I edited fstab on both 16.04 installs to use the
new swap partition on the Seagate ST31000528AS drive only.
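
A check along the following lines can confirm which swap entries are still active in each install's fstab after such an edit. This is only a sketch: the function name is made up for illustration, and it assumes the usual whitespace-separated fstab layout where the third field is the filesystem type.

```shell
# List the active (uncommented) swap entries in an fstab-format file.
# Field 3 of each fstab line is the filesystem type, so "swap" there
# marks a swap entry; lines starting with "#" are skipped.
active_swap_entries() {
    awk '$1 !~ /^#/ && $3 == "swap" { print $1 }' "$1"
}
```

Running `active_swap_entries /etc/fstab` on each install should then print only the device or UUID of the swap partition that is meant to be used.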

The next test was to shut down and move the Samsung HD103UJ to its
motherboard nForce 430 SATA port, then reboot using the Grub on that
drive to run the install on the Seagate ST31000528AS drive. Again, the
boot worked, which I expected as there should be little or no writing to
the Samsung drive during that boot process. I mounted the Samsung drive
16.04 install partition from the Seagate install, and ran the test dd
commands. I was again surprised that they worked without errors - I
would have expected that a boot of the 16.04 beta standard kernels from
that drive would work the same as a boot of the 16.04 standard kernel
from my install DVD, and would fail when writing to the Samsung HD103UJ
drive when it is on the motherboard nForce 430 SATA port.

The next test was to reboot to the 16.04 beta partition on the Samsung
HD103UJ drive. As expected, that boot failed badly, and I had to use
the PC's reset button to restart it, after which I rebooted to the 16.04
beta install on the Seagate ST31000528AS drive again and used that
install to run fsck to repair the 16.04 beta install partition on the
Samsung HD103UJ drive. The fsck check showed two errors that needed
fixing, where the number of blocks and number of inodes were both wrong.
Once fsck had fixed the partition, I mounted it and looked at the
kern.log file from the bad boot. It looked normal up to a certain
point, after which it was corrupt - I think it had a block full of
zeroes. So it looks like as soon as the bug hits, no more successful
log writes occur, which makes it difficult to debug.
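
To pin down where a log like this stops being valid, the zero bytes can be counted and the offset of the first one located. The following is a minimal sketch, assuming GNU coreutils (for `od -w1`); the function names are illustrative only.

```shell
# Count NUL (zero) bytes in a file: total size minus the size with
# the NUL bytes stripped out.
count_nul_bytes() {
    local total stripped
    total=$(wc -c < "$1")
    stripped=$(tr -d '\000' < "$1" | wc -c)
    echo $((total - stripped))
}

# Print the 0-based byte offset of the first NUL byte, or nothing if
# the file contains none. "od -w1" prints one byte per line, so the
# line number of the first "00" gives its position.
first_nul_offset() {
    local line
    line=$(od -An -v -tx1 -w1 "$1" | grep -n -m1 '00' | cut -d: -f1)
    if [ -n "$line" ]; then
        echo $((line - 1))
    fi
}
```

With the offset in hand, something like `head -c "$(first_nul_offset kern.log)" kern.log | tail -n 20` would show the last messages that were written successfully before the corruption.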

I do have a serial port on this motherboard, so I looked to see if I
could use that to get debug information during a bad boot, but it turned
out that I do not have the necessary serial cross-over cable to plug the
motherboard's serial port into any of my other PCs' serial ports. Last
time I needed a cross-over cable, I must have borrowed one from work,
and unfortunately that is no longer possible.
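
For reference, capturing kernel messages over a serial port normally means adding a `console=` parameter to the kernel command line, for example through /etc/default/grub. The fragment below is a sketch only: it assumes the motherboard port appears as ttyS0 and that 115200 baud is acceptable, neither of which has been checked on this machine.

```shell
# /etc/default/grub fragment (assumed port and speed; run update-grub
# afterwards). "quiet splash" is left out so kernel messages are not
# suppressed. Messages go to both the screen and the serial port.
GRUB_CMDLINE_LINUX_DEFAULT="console=tty0 console=ttyS0,115200n8"
```

The other PC would then need a null-modem (cross-over) cable and a terminal program set to the same speed to record the output.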

The next test I ran was to boot the Samsung HD103UJ install on the
nForce 430 port, but using Grub to select the mainline
4.4.0-040400-generic kernel. That also failed badly in exactly the same
manner, so I rebooted and repaired the partition again, ready for the
final test.

For the last test, I rebooted to the Samsung HD103UJ install on the
nForce 430 port using the mainline 4.5.0-040500-generic kernel, and it
booted without errors.

So it looks like whatever bug is causing this problem has already been
fixed in the upstream 4.5.0 kernels. However, if 16.04 is going to be
released using 4.4.0 kernels, I hope the fix for this bug can be
backported before 16.04 is released. Are there any more tests I should
do to help with this? Is there any more information I can provide?

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1561830

Title:
Hard disk writes fail in 16.04 daily on nForce 430

Status in linux package in Ubuntu:
Confirmed

Bug description:
I have an old PC I use for testing new operating systems. It has
previously had Ubuntu 15.10 installed and working. The motherboard is
an Asus M2NPV-VM, with Nvidia nForce 430 chipset and Nvidia GeForce
6150 GPU. I have installed an Nvidia GT220 card to use for more
modern video support.

When I attempted to install Ubuntu 16.04 beta (daily
xenial-desktop-amd64.iso file downloaded 24/03/2016 18:17), it started
to write to the hard disk (Samsung HD103UJ), and after a short time
lots of disk write errors appeared in kern.log. After the errors, the
disk could not be read either, with "fdisk -l /dev/sda" failing to read
a sector, where it had worked before starting the install. Unplugging
the SATA cable to the drive and plugging it in again made the drive
work again (on /dev/sdc), but another attempt to install failed with
the same write errors.

I noticed that the log had swap write errors also, so I rebooted the
install DVD again, and this time did a "swapoff -a" command before
attempting to install, but got the same errors again. So I found my
Ubuntu 15.10 install DVD and tried a new install from that, which
worked just fine.

On rebooting with my 16.04 daily DVD, I again did "swapoff -a" so that
the DVD based system would run normally, then tried mounting the EXT4
system partition I had just installed using the 15.10 install DVD.
That worked, so I tried dd commands to do test writes to that
partition. The following commands worked:

dd if=/dev/zero of=/mnt/sda8/tmp/output bs=8k count=10k
dd if=/dev/zero of=/mnt/sda8/tmp/output bs=8k count=100k

but when I did this command:

dd if=/dev/zero of=/mnt/sda8/tmp/output bs=8k count=1000k

after a while errors started appearing in kern.log, just as with the
attempts to install 16.04.

It appears that with sustained write activity, the errors will start
and then the drive will become unusable until it is unplugged and
plugged in again.
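
To put numbers on "sustained", the amount of data each of the three dd commands above asks for can be worked out from bs and count. The sketch below does the arithmetic, treating dd's "k" suffix as 1024; the helper names are illustrative.

```shell
# Convert a dd-style number with an optional "k" suffix (k = 1024)
# to a plain integer.
dd_number() {
    case "$1" in
        *k) echo $(( ${1%k} * 1024 )) ;;
        *)  echo "$1" ;;
    esac
}

# Total MiB requested by: dd bs=<bs> count=<count>
dd_total_mib() {
    echo $(( $(dd_number "$1") * $(dd_number "$2") / 1048576 ))
}

dd_total_mib 8k 10k      # prints 80
dd_total_mib 8k 100k     # prints 800
dd_total_mib 8k 1000k    # prints 8000
```

So the two writes that worked were 80 MiB and 800 MiB, and the one that failed was asked to write 8000 MiB (about 7.8 GiB). The report does not say how far that last write got before the errors began.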

I have attached the kern.log and syslog files from the 15.10 install
that worked, and the 16.04 install attempt that failed. The first
error message appears to be this:

ata3: EH in SWNCQ mode,QC:qc_active 0x1FFF sactive 0x1FFF
ata3: SWNCQ:qc_active 0x1 defer_bits 0x1FFE last_issue_tag 0x0
dhfis 0x1 dmafis 0x0 sdbfis 0x0

which leads me to suspect a problem with the handling of the SATA controller's interrupts.
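
A saved kern.log can be checked for this signature with a couple of small grep wrappers. This is a sketch; the function names are illustrative and the search string is taken from the message quoted above.

```shell
# Count occurrences of the SWNCQ error-handler message in a saved
# kernel log. Prints 0 if there are none.
count_swncq_errors() {
    grep -c 'EH in SWNCQ mode' "$1"
}

# Print the first such message with its line number, to see where in
# the log the trouble starts.
first_swncq_error() {
    grep -n -m1 'EH in SWNCQ mode' "$1"
}
```

Comparing the output for the attached 15.10 and 16.04 logs would show whether the message is unique to the failing case.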
---
ApportVersion: 2.20-0ubuntu3
Architecture: amd64
AudioDevicesInUse:
USER PID ACCESS COMMAND
/dev/snd/controlC0: ubuntu 2233 F.... pulseaudio
/dev/snd/controlC1: ubuntu 2233 F.... pulseaudio
CasperVersion: 1.368
DistroRelease: Ubuntu 16.04
IwConfig:
enp0s20 no wireless extensions.

lo no wireless extensions.

enp2s9 no wireless extensions.
LiveMediaBuild: Ubuntu 16.04 LTS "Xenial Xerus" - Beta amd64 (20160323)
Lsusb:
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 002: ID 0458:0118 KYE Systems Corp. (Mouse Systems)
Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: System manufacturer System Product Name
Package: linux (not installed)
ProcEnviron:
TERM=xterm-256color
PATH=(custom, no user)
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcFB: 0 nouveaufb
ProcKernelCmdLine: file=/cdrom/preseed/hostname.seed boot=casper initrd=/casper/initrd.lz quiet splash ---
ProcVersionSignature: Ubuntu 4.4.0-15.31-generic 4.4.6
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
linux-restricted-modules-4.4.0-15-generic N/A
linux-backports-modules-4.4.0-15-generic N/A
linux-firmware 1.157
RfKill:

Tags: xenial
Uname: Linux 4.4.0-15-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 08/07/2008
dmi.bios.vendor: Phoenix Technologies, LTD
dmi.bios.version: ASUS M2NPV-VM ACPI BIOS Revision 1401
dmi.board.name: M2NPV-VM
dmi.board.vendor: ASUSTek Computer INC.
dmi.board.version: 1.xx
dmi.chassis.asset.tag: 123456789000
dmi.chassis.type: 3
dmi.chassis.vendor: Chassis Manufacture
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnPhoenixTechnologies,LTD:bvrASUSM2NPV-VMACPIBIOSRevision1401:bd08/07/2008:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTekComputerINC.:rnM2NPV-VM:rvr1.xx:cvnChassisManufacture:ct3:cvrChassisVersion:
dmi.product.name: System Product Name
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1561830/+subscriptions