kernel-packages team mailing list archive

[Bug 1328984] Re: [Dell PowerEdge R510] Regression: Kernel 3.2.0-64 fails to boot with USB3 controller card

 

I have tested kernels 3.16.0-031600rc1-generic and 3.2.60-030260-generic. On the former, the problem does not appear; on the latter, the bug is reproduced with symptoms similar to those on 3.2.0-64. I used a flash drive with a vanilla Ubuntu 12.04 desktop install for all tests. To summarize the kernels tested so far:
Good kernels: 3.2.0-63, 3.16.0-031600rc1
Bad kernels: 3.2.0-64, 3.2.60-030260
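
For anyone checking their own machines against this list, the summary can be expressed as a small lookup. This is only an illustrative sketch: the function name classify_kernel and the prefix-matching rule are assumptions, and a kernel outside the four tested above is reported as "untested" rather than guessed at.

```python
import platform

# Results reported in this comment, as prefixes of `uname -r` output.
GOOD_KERNELS = ("3.2.0-63", "3.16.0-031600rc1")
BAD_KERNELS = ("3.2.0-64", "3.2.60-030260")


def classify_kernel(release: str) -> str:
    """Return 'good', 'bad', or 'untested' for a kernel release string."""

    def matches(prefixes):
        # Match "3.2.0-64" against "3.2.0-64-generic" but not "3.2.0-640".
        return any(release == p or release.startswith(p + "-") for p in prefixes)

    if matches(BAD_KERNELS):
        return "bad"
    if matches(GOOD_KERNELS):
        return "good"
    return "untested"


if __name__ == "__main__":
    running = platform.release()
    print(running, "->", classify_kernel(running))
```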

I also tested this issue on three additional machines, and the results
were the same. I now have five different hardware configurations
(including one from bug 1330530) that are affected by this problem and
show very similar symptoms. In fact, I was not able to find a computer
that would not reproduce this regression. If we also take into account
Bard Hemmer's hardware, we can reasonably conclude that the issue is not
related to the motherboard, chipset, CPU, or BIOS. It is, however,
related to the HighPoint RocketU 1144C add-in adapter that I used in all
my tests.

I would like to note that symptoms are similar on various hardware, but
not identical. The errors are generally similar (xhci, udev, modprobe),
but it appears that timing differences cause the issue to occur at
different parts of the boot process, depending on the hardware. So far I
have seen:

1. Dropping to the initramfs shell in the middle of the boot ("Gave up
waiting for root device." ... "ALERT! [boot drive] does not exist!
Dropping to shell!")

2. An error loop preventing the system from booting (as described in
this report). In this case I am not sure whether this is an infinite
loop, or whether the system would boot after a long delay.

3. Boot is delayed by 18 minutes, during which numerous errors are
thrown. After 18 minutes, the OS boots fine.

4. The system boots to a text console rather than the graphical login
screen. It is possible to log on to the console. Within seconds, xhci
and/or udev errors start appearing in the syslog. After two minutes, the
screen goes blank, and the console seems unresponsive for another 16
minutes. Following that, the graphical login screen appears, and from
this point the system behaves fine.

5. As in 4, but after two minutes in the text console, an incomplete
graphical login screen appears. The password box is missing and the
background is not fully loaded. After another 16 minutes, the login
screen loads the missing parts, and the system behaves OK. In this case
it is possible to switch between the text and graphical consoles during
these 16 minutes, but the graphical console becomes an empty purple
screen after the switch.

It is also worth noting that symptoms are highly dependent on the
external device(s) attached to RocketU's ports. Here is a summary:

1. No device connected to RocketU adapter - no problems during boot
2. USB3 flash drives (tested two models) - no problems during boot
3. Areca ARC-5040 enclosure - bug is triggered
4. WD MyPassport 2TB USB 3.0 drive - bug is triggered
5. Transcend USB 3.0 SD card reader (TS-RDF5K) - bug is triggered with different symptoms: only a small delay (~15 seconds) and small number of xhci errors occur during boot, but the device does not work when OS is fully booted.

All the above devices work fine with "good" kernels. Note that I tested
three RocketU controllers and five Areca enclosures, to rule out the
possibility of a hardware problem on these devices.

With a variety of hardware reliably triggering the bug on "bad" kernels,
while working fine with "good" kernels, I think it is fully justified to
consider this regression hardware-independent (apart from the RocketU
controller). I am changing tags as Christopher requested in comment #13,
but I would like to ask that this bug be marked as a duplicate of bug
1330530. That would allow me to debug the issue on my test machines,
which would be substantially easier than doing it on production servers.
I would prefer not to touch these servers until the fix is released and
verified on test computers.

** Tags added: kernel-fixed-upstream kernel-fixed-upstream-3.16

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1328984

Title:
  [Dell PowerEdge R510] Regression: Kernel 3.2.0-64 fails to boot with
  USB3 controller card

Status in “linux” package in Ubuntu:
  Incomplete

Bug description:
  A routine system update of Ubuntu 12.04 LTS to kernel 3.2.0-64
  resulted in an unbootable system on two machines. Further testing
  revealed that the kernel fails while initializing the HighPoint
  RocketU 1144C USB 3.0 controller. This is a PCIe x4 add-in card that
  contains four USB 3.0 ports, each equipped with its own controller.
  The card did and does work without any problems with kernel 3.2.0-63
  and earlier. Prior to installing kernel 3.2.0-64 there were neither
  hardware nor software problems with either of the machines.

  Steps to reproduce:
  apt-get dist-upgrade
  sync
  reboot
  Result: system fails to boot.

  The workaround is to revert to kernel 3.2.0-63 or to remove the
  RocketU card.

  Hardware description (same on both machines):
  Dell PowerEdge R510
  PERC6/i RAID controller
  64GB RAM DDR3 ECC registered
  Dual CPU: Intel Xeon X5660 2.80GHz
  HighPoint RocketU 1144C 4-Port USB 3.0 PCIe 2.0 x4 HBA

  Operating system (identical on both machines):
  Ubuntu 12.04.4 LTS
  Linux 3.2.0-64-generic x86_64

  Drives:
  sda - logical drive on PERC6/i, OS
  sdb - logical drive on PERC6/i, data
  sdc - Areca 5040 external RAID connected by USB3 to RocketU card
  sdd - Areca 5040 external RAID connected by USB3 to RocketU card
  sde - Areca 5040 external RAID connected by USB3 to RocketU card

  Symptoms:
  The system boots normally until initialization of the Areca drives connected to the RocketU card. The following messages are displayed on screen when booting without quiet and with debug options. These are the last messages of a "typical" part of the boot sequence. They are followed by a ~2 minute lag during which no messages are displayed.

  [Please note that no trace of the boot progress gets recorded in
  system logs, and messages on screen scroll very fast. I had to record
  the boot progress with a high framerate camera, and even so some
  messages scrolled too fast and were not recorded. The following is a
  manual transcript of fragments of these videos; please forgive
  inevitable typos.]

  [5.621523] scsi 5:0:0:0: Direct-Access Areca Areca5  PQ: 0 ANSI: 5
  [5.622896] sd 5:0:0:0: Attached scsi generic sg4 type 0
  [5.623230] sd 5:0:0:0: [sdc] Very big device. Trying to use READ CAPACITY(16).
  [5.623668] sd 5:0:0:0: [sdc] 41015622144 512-byte logical blocks: (20.9 TB/19.0 TiB)
  [5.741152] scsi 6:0:0:0: Direct-Access Areca Areca3  PQ: 0 ANSI: 5
  [5.744003] sd 6:0:0:0: Attached scsi generic sg5 type 0
  [5.744545] sd 6:0:0:0: [sdd] Very big device. Trying to use READ CAPACITY(16).
  [5.744980] sd 6:0:0:0: [sdd] 41015622144 512-byte logical blocks: (20.9 TB/19.0 TiB)
  [6.004526] scsi 7:0:0:0: Direct-Access Areca Areca7  PQ: 0 ANSI: 5
  [6.006121] sd 7:0:0:0: Attached scsi generic sg6 type 0
  [6.006488] sd 7:0:0:0: [sde] Very big device. Trying to use READ CAPACITY(16).
  [6.006834] sd 7:0:0:0: [sde] 35156217552 512-byte logical blocks: (17.9 TB/16.3 TiB)
  [7.133091] Adding 46874620k swap on /dev/sda3. Priority: -1 extents:1 across 46874620k
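
  The capacities in the transcript above can be cross-checked from the
  block counts. The following is a minimal sketch, not part of the
  original report; the helper names are hypothetical. It multiplies the
  reported 512-byte block counts out to bytes and converts them to
  decimal TB and binary TiB. The figures in the log agree with these
  values when truncated (not rounded) to one decimal place.

```python
import math

# Block counts as transcribed from the boot messages above.
DRIVES = {
    "sdc": 41015622144,
    "sdd": 41015622144,
    "sde": 35156217552,
}


def capacity(blocks: int, block_size: int = 512):
    """Return (bytes, decimal terabytes, binary tebibytes) for a block count."""
    total = blocks * block_size
    return total, total / 10**12, total / 2**40


def truncate1(value: float) -> float:
    """Truncate (not round) a positive value to one decimal place."""
    return math.floor(value * 10) / 10


if __name__ == "__main__":
    for name, blocks in DRIVES.items():
        total, tb, tib = capacity(blocks)
        print(f"{name}: {total} bytes, {truncate1(tb)} TB, {truncate1(tib)} TiB")
```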

  After a two-minute delay, the following messages appear in an infinite
  loop.  Please note that these messages appear in a somewhat random
  sequence, and not all messages appear on every boot. The only thing
  that works at this point is Ctrl-Alt-Delete.

  udevd[632]: timeout: killing '/sbin/modprobe -bv acpi:ACPI000D:PMP0C01:' [774]
  udevd[703]: timeout: killing '/sbin/modprobe -bv acpi:PMP0C014:' [776]
  udevd[529]: timeout: killing '/sbin/modprobe -bv input:b0003v0557p2261e0110-e0,1,2,3,4,k110,111,112,r8,a0,1,m4,lsfw' [1642]
  udevd[630]: timeout: killing '/sbin/modprobe -bv serio:ty06pr00id00ex00' [655]
  udevd[508]: timeout: killing '/sbin/modprobe -bv pci:v0000808640000342Esv00000000sd00000000bc00sc00i00' [512]
  udevd[494]: timeout: killing '/sbin/modprobe -bv input:b0019v0000p0001e0000-r0,1,k74,ramlsfw' [771]
  udevd[699]: timeout: killing '/sbin/modprobe -bv dmi:bvnDellInc.:bvr1.12.0:bd07/26/2013:svnDellInc.:pnPowerEdgeR510:pvr:rvnDellInc.:rm00HDP0:rvr002:cvnDellInc.:ct23:cvr:' [708]
  udevd[529]: timeout: killing '/sbin/modprobe -bv input:b0003v0557p2261e0110-e0,1,2,3,4,k71,72,73,74,77,80,82,83,85,86,87,88,89,8A,8B,8C,8E,8F,90,96,98,9B,9C,9E,9F,A1,A3,A4,A5,A6,A7,A8,A9,AB,AC,AD,AE,B1,B2,B5,CE,CF,D0,D1,D2,D4,D8,D9,DB,E4,EA,EB,F1,100,161,162,166,16A,16E,172,174,176,178,179,17A,17B,17C,17D,17F,180,182,182,185,188,189,18C,18D,18E,18F,190,191,192,193,195,198,199,19A,1A9,1A1,1A2,1A3,1A4,1A5,1A6,1A7,1A8,1A9,1AA,1AB,1AC,1AD,1AE,1B0,1B1,1B7,1BA,r6,a20,m4,lsfw' [1678]
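
  When many such lines have to be transcribed or compared between boots,
  it can help to reduce each one to the bus prefix of its modalias. The
  sketch below is illustrative only and assumes the line format shown
  above; the names TIMEOUT_RE and parse_timeouts are hypothetical. In
  the transcript, the timeouts span several unrelated buses (acpi,
  input, serio, pci, dmi) rather than a single device class.

```python
import re
from collections import Counter

# Matches lines of the form:
#   udevd[632]: timeout: killing '/sbin/modprobe -bv <modalias>' [774]
TIMEOUT_RE = re.compile(
    r"udevd\[(?P<udevd_pid>\d+)\]: timeout: killing "
    r"'/sbin/modprobe -bv (?P<modalias>[^']*)' \[(?P<child_pid>\d+)\]"
)


def parse_timeouts(lines):
    """Yield (udevd_pid, child_pid, bus) for each udevd timeout line."""
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            # The bus is the modalias prefix before the first colon.
            bus = m.group("modalias").split(":", 1)[0]
            yield int(m.group("udevd_pid")), int(m.group("child_pid")), bus


# Three lines copied from the transcript above.
sample = [
    "udevd[632]: timeout: killing '/sbin/modprobe -bv acpi:ACPI000D:PMP0C01:' [774]",
    "udevd[630]: timeout: killing '/sbin/modprobe -bv serio:ty06pr00id00ex00' [655]",
    "udevd[494]: timeout: killing '/sbin/modprobe -bv input:b0019v0000p0001e0000-r0,1,k74,ramlsfw' [771]",
]

results = list(parse_timeouts(sample))
by_bus = Counter(bus for _, _, bus in results)

if __name__ == "__main__":
    for bus, count in sorted(by_bus.items()):
        print(bus, count)
```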

  After pressing Ctrl-Alt-Delete, the above messages continue to appear
  for a few seconds, and after that the following messages are
  displayed:

  An error occurred while mounting /mnt/sdb.
  mountall: mount /mnt/sdb [1785] killed by KILL signal
  mountall: Filesystem could not be mounted: /mnt/sdb
   * Killing all remaining processes...  [Press
  S to skip mounting or M for manual recovery
  fail]
  rpcbind: rpcbind terminating on signal. Restart with "rpcbind -w"
   * Deconfiguring network interfaces  [ OK ]
   * Deactivating swap...  [ OK ]
   * Unmounting local filesystems...  [ OK ]
   * Will now restart
  [184.341144] hub 4-0:1.0: hub_port_status failed (err = -110)
  [184.341222] hub 4-0:1.0: hub_port_status failed (err = -110)
  [201.324536] usb 16-1: device not accepting address 2, error -62
  [201.380907] sd 7:0:0:0: [sde] Asking for cache data failed
  [201.380980] sd 7:0:0:0: [sde] Assuming drive cache: write through
  [201.381767] sd 7:0:0:0: [sde] Asking for cache data failed
  [201.381840] sd 7:0:0:0: [sde] Assuming drive cache: write through
  [201.382457] sd 7:0:0:0: [sde] Asking for cache data failed
  [201.382530] sd 7:0:0:0: [sde] Assuming drive cache: write through
  [211.880194] usb 12-1: device not accepting address 2, error -62
  [211.936396] sd 6:0:0:0: [sdd] Asking for cache data failed
  [211.936466] sd 6:0:0:0: [sdd] Assuming drive cache: write through
  [222.435967] usb 10-1: device not accepting address 2, error -62
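
  The negative numbers in the hub and usb messages above are Linux errno
  values. A minimal sketch for decoding them follows; it is not part of
  the original report, the names are hypothetical, and the two codes are
  hardcoded from the Linux asm-generic/errno.h table so that the result
  does not depend on the platform running the script. Both codes seen
  here are timeout errors.

```python
import re

# Linux errno values (asm-generic/errno.h), hardcoded on purpose because
# errno numbering differs between operating systems.
LINUX_ERRNO = {
    62: ("ETIME", "Timer expired"),
    110: ("ETIMEDOUT", "Connection timed out"),
}

# Matches both "(err = -110)" and "error -62" as seen in the messages above.
ERR_RE = re.compile(r"(?:err = |error )-(\d+)")


def decode(line: str):
    """Return (name, description) for a known errno in the line, else None."""
    m = ERR_RE.search(line)
    if m is None:
        return None
    return LINUX_ERRNO.get(int(m.group(1)))


if __name__ == "__main__":
    for line in (
        "[184.341144] hub 4-0:1.0: hub_port_status failed (err = -110)",
        "[201.324536] usb 16-1: device not accepting address 2, error -62",
    ):
        print(line, "->", decode(line))
```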

  After the last message, the screen goes blank and the machine reboots.

  Additional note:
  Not sure if this is related, but while looking for existing bug reports, I have found several posts about kernel 3.2.0-64 regressing in USB 3.0 support:
  https://bugs.launchpad.net/software-center/+bug/1328883
  http://www.linuxquestions.org/questions/linux-software-2/sudden-loss-of-usb-3-0-on-ubuntu-12-04-64-bit-kernel-3-2-0-64-generic-4175507335/

  Note about attachments:
  Due to kernel 3.2.0-64 not being able to boot, the attached command output was obtained using kernel 3.2.0-63.
