kernel-packages team mailing list archive

[Bug 1009312] Re: 10de:0426 GPU loads unreliably, possible kernel timeout

 

It's been a while, but I've found the time to dig much deeper into this
and familiarize myself with the kernel code. I now feel comfortable
contacting the appropriate mailing list directly, so this is more to
keep the record up-to-date than a request for more triage.

Anyways, after walking through the kernel code, I realized that the
first sign of the bug (the 30ms gap) occurs somewhere within the
function pci_scan_child_bus (in drivers/pci/probe.c), between its call
to pci_scan_slot (also in drivers/pci/probe.c) and its call to
pcibios_fixup_bus (in my case, under arch/x86/pci/common.c).

From there, I began adding dev_info statements around the function calls
executed in between, then checked which two messages the gap fell
between to narrow down the problem further. After a few rounds of this,
I found the delay consistently appearing within the function
pcie_aspm_configure_common_clock (in drivers/pci/pcie/aspm.c). After a
little research into the PCIe common clock, I found that this actually
explains several aspects of this bug.
Booting the computer from battery power would influence the power state
of the device, which is what ASPM is all about. And it turns out the
discrepancy of 24ms between a good boot and a bad boot is precisely the
length of time the PCIe standard defines as a timeout for link training.
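
The narrowing step described above (finding which two consecutive
messages the gap falls between) can be sketched roughly as follows. This
is not the script that was actually used, just a minimal illustration.
It assumes the usual dmesg "[seconds.microseconds]" prefix; the function
names and the sample lines are made up for the example and are not taken
from the attached logs.

```python
import re

# Matches the "[  seconds.micros]" prefix that dmesg puts on each line.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def parse(lines):
    """Return (timestamp_seconds, message) for every line with a timestamp."""
    entries = []
    for line in lines:
        match = STAMP.match(line)
        if match:
            entries.append((float(match.group(1)), match.group(2)))
    return entries


def largest_gap(lines):
    """Return (gap_seconds, message_before, message_after) for the widest gap.

    Returns None when fewer than two timestamped lines are present.
    """
    entries = parse(lines)
    best = None
    for (t0, m0), (t1, m1) in zip(entries, entries[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, m0, m1)
    return best


# Hypothetical sample with made-up debug messages and timings.
sample = [
    "[    0.500000] pci 0000:01:00.0: debug: before pci_scan_slot",
    "[    0.502000] pci 0000:01:00.0: debug: after pci_scan_slot",
    "[    0.532000] pci 0000:01:00.0: debug: before pcibios_fixup_bus",
    "[    0.533000] pci 0000:01:00.0: debug: after pcibios_fixup_bus",
]
gap, before, after = largest_gap(sample)
print(f"{gap * 1000:.1f} ms between {before!r} and {after!r}")
```

Running this on a log from a bad boot after each round of added dev_info
statements shows which pair of messages to instrument between next.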

Unfortunately, I don't know how, or even if, the two commits I found
earlier tie directly into this. It seems there's a really weird race
condition or resource fight going on. I'm also not sure how to fix the
problem cleanly, because just adding the overhead of dev_info
statements to the function makes the bug go away (so I can technically
"fix" the bug, but that's a total hack). The one other little clue I
found was that the delay went away completely when I put dev_info
statements in every possible branch of the function's logic. When I
added dev_info only to the ifs corresponding to a problem, though, a
slight delay appeared (bumping the total time in the function to around
10ms), but still not enough for link training to time out (so my GPU
always loaded).

I plan on mailing the list for the PCI subsystem of the kernel soon, but
I'm stumped about how exactly to proceed, so if you have any debugging
suggestions, I'd be happy to hear them. Thanks again.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1009312

Title:
  10de:0426 GPU loads unreliably, possible kernel timeout

Status in “linux” package in Ubuntu:
  Triaged

Bug description:
  Reverse upstream kernel commit bisecting revealed a fix via commit
  d34883d4e35c0a994e91dd847a82b4c9e0c31d83 by Xiao Guangrong.

  WORKAROUND: If I boot my computer from battery power alone without AC,
  my GPU & the Ubuntu splash screen load on startup.

  I've been running Ubuntu 12.04 for a few weeks now and I really like
  it, but from the beginning I've had an issue where the proprietary
  nvidia driver installs but fails to load (confirmed from the command
  line, jockey, and the nvidia-dashboard). Over time, I've noticed that
  sometimes when I power on, the driver does load and I can enter a full
  unity session without problems, but other times I fall back onto the
  VESA driver and a unity 2d session. On a whim, I finally copied logs
  from both successful and unsuccessful boots, cut out the times, ran a
  diff on them, and noticed a pattern in the kernel messages.
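
  A rough sketch of that kind of comparison (strip the timestamps, then
  diff the remaining messages) is below. It is an illustration only, not
  the commands that were actually run. The function names and the sample
  lines are hypothetical; only the "Boot video device" wording comes
  from the logs described in this report.

```python
import difflib
import re

# The "[  seconds.micros]" prefix differs on every boot, so drop it first.
STAMP = re.compile(r"^\[\s*\d+\.\d+\]\s?")


def strip_times(lines):
    """Remove the leading dmesg timestamp (and trailing newline) from each line."""
    return [STAMP.sub("", line.rstrip("\n")) for line in lines]


def diff_boots(good, bad):
    """Unified diff of two dmesg logs with their timestamps removed."""
    return list(
        difflib.unified_diff(
            strip_times(good),
            strip_times(bad),
            fromfile="good",
            tofile="bad",
            lineterm="",
        )
    )


# Hypothetical two-line logs; the real logs are attached to the bug.
good_boot = [
    "[    0.210000] pci 0000:01:00.0: example message",
    "[    0.480000] pci 0000:01:00.0: Boot video device",
]
bad_boot = [
    "[    0.234000] pci 0000:01:00.0: example message",
]
for line in diff_boots(good_boot, bad_boot):
    print(line)
```

  With the timestamps gone, lines common to both boots drop out of the
  diff and only messages that are missing or extra in one boot remain.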

  I'm filing this bug after a successful boot, so I've also attached
  copies of dmesg, Xorg, & jockey logs from an unsuccessful boot. The
  first thing I saw in the logs was a timing discrepancy between the two
  boots, most of which is due to GPE storms. I've checked other logs and
  there's no clear relation: I've had successful boots with them and
  unsuccessful ones without them. I do still wonder if they may be
  involved, because it seems I'm a little luckier if I turn off and
  unplug any peripherals before booting.

  But around line 325 in my dmesg logs, at the last step that mentions
  my GPU (pci device 0000:01:00.0), there is consistently at most a 6 ms
  delay for successful boots, but a 30 ms one for unsuccessful ones.
  Also, on all dmesg logs from successful boots, around line 610, the
  message "Boot video device" is recorded for the PCI number of my GPU,
  but for every fallback, the message never appears. That's why I'm
  thinking it's a kernel issue because the earliest mention of a
  specific driver module doesn't occur until later in the log.

  I'm currently using fully updated versions of nvidia driver 295.49.

  ProblemType: Bug
  DistroRelease: Ubuntu 12.04
  Package: linux-image-3.2.0-24-generic-pae 3.2.0-24.39
  ProcVersionSignature: Ubuntu 3.2.0-24.39-generic-pae 3.2.16
  Uname: Linux 3.2.0-24-generic-pae i686
  NonfreeKernelModules: nvidia
  AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.24.
  ApportVersion: 2.0.1-0ubuntu8
  Architecture: i386
  ArecordDevices:
   **** List of CAPTURE Hardware Devices ****
   card 0: Intel [HDA Intel], device 0: STAC92xx Analog [STAC92xx Analog]
     Subdevices: 1/1
     Subdevice #0: subdevice #0
  AudioDevicesInUse:
   USER        PID ACCESS COMMAND
   /dev/snd/controlC0:  kyle       1790 F.... pulseaudio
  Card0.Amixer.info:
   Card hw:0 'Intel'/'HDA Intel at 0xfc400000 irq 48'
     Mixer name	: 'SigmaTel STAC9872AK'
     Components	: 'HDA:83847662,104d1c00,00100201 HDA:14f12c06,104d1700,00100000'
     Controls      : 18
     Simple ctrls  : 9
  Date: Tue Jun  5 22:44:22 2012
  EcryptfsInUse: Yes
  HibernationDevice: RESUME=UUID=1b676222-44c7-453c-a522-06b6fd5d66f4
  InstallationMedia: Ubuntu 12.04 LTS "Precise Pangolin" - Release i386 (20120423)
  MachineType: Sony Corporation VGN-FZ260E
  PccardctlIdent:
   Socket 0:
     no product info available
  PccardctlStatus:
   Socket 0:
     no card
  ProcEnviron:
   PATH=(custom, no user)
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  ProcFB: 0 VESA VGA
  ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-3.2.0-24-generic-pae root=UUID=e330e46a-b426-439f-8037-c1069cc693ce ro quiet splash vt.handoff=7
  RelatedPackageVersions:
   linux-restricted-modules-3.2.0-24-generic-pae N/A
   linux-backports-modules-3.2.0-24-generic-pae  N/A
   linux-firmware                                1.79
  RfKill:
   0: phy0: Wireless LAN
    Soft blocked: no
    Hard blocked: no
  SourcePackage: linux
  UpgradeStatus: No upgrade log present (probably fresh install)
  dmi.bios.date: 07/04/2007
  dmi.bios.vendor: Phoenix Technologies LTD
  dmi.bios.version: R1120J7
  dmi.board.asset.tag: N/A
  dmi.board.name: VAIO
  dmi.board.vendor: Sony Corporation
  dmi.board.version: N/A
  dmi.chassis.asset.tag: N/A
  dmi.chassis.type: 10
  dmi.chassis.vendor: Sony Corporation
  dmi.chassis.version: N/A
  dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvrR1120J7:bd07/04/2007:svnSonyCorporation:pnVGN-FZ260E:pvrFC000001:rvnSonyCorporation:rnVAIO:rvrN/A:cvnSonyCorporation:ct10:cvrN/A:
  dmi.product.name: VGN-FZ260E
  dmi.product.version: FC000001
  dmi.sys.vendor: Sony Corporation
  ---
  AcpiTables: Error: command ['pkexec', '/usr/share/apport/dump_acpi_tables.py'] failed with exit code 127: Error executing /usr/share/apport/dump_acpi_tables.py: Permission denied
  ApportVersion: 2.5.1-0ubuntu4
  Architecture: i386
  AudioDevicesInUse:
   USER        PID ACCESS COMMAND
   /dev/snd/controlC0:  ubuntu     3344 F.... pulseaudio
  CasperVersion: 1.321
  DistroRelease: Ubuntu 12.10
  LiveMediaBuild: Ubuntu 12.10 "Quantal Quetzal" - Alpha i386 (20120831)
  MachineType: Sony Corporation VGN-FZ260E
  Package: linux (not installed)
  PccardctlIdent:
   Socket 0:
     no product info available
  PccardctlStatus:
   Socket 0:
     no card
  ProcEnviron:
   TERM=xterm
   PATH=(custom, no user)
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  ProcFB:

  ProcKernelCmdLine: noprompt cdrom-detect/try-usb=true file=/cdrom/preseed/username.seed boot=casper initrd=/casper/initrd.lz quiet splash -- maybe-ubiquity
  ProcVersionSignature: Ubuntu 3.5.0-13.14-generic 3.5.3
  RelatedPackageVersions:
   linux-restricted-modules-3.5.0-13-generic N/A
   linux-backports-modules-3.5.0-13-generic  N/A
   linux-firmware                            1.91
  RfKill:
   0: phy0: Wireless LAN
    Soft blocked: no
    Hard blocked: yes
  Tags:  quantal running-unity
  Uname: Linux 3.5.0-13-generic i686
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups: adm cdrom dip lpadmin plugdev sambashare sudo
  dmi.bios.date: 07/04/2007
  dmi.bios.vendor: Phoenix Technologies LTD
  dmi.bios.version: R1120J7
  dmi.board.asset.tag: N/A
  dmi.board.name: VAIO
  dmi.board.vendor: Sony Corporation
  dmi.board.version: N/A
  dmi.chassis.asset.tag: N/A
  dmi.chassis.type: 10
  dmi.chassis.vendor: Sony Corporation
  dmi.chassis.version: N/A
  dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvrR1120J7:bd07/04/2007:svnSonyCorporation:pnVGN-FZ260E:pvrFC000001:rvnSonyCorporation:rnVAIO:rvrN/A:cvnSonyCorporation:ct10:cvrN/A:
  dmi.product.name: VGN-FZ260E
  dmi.product.version: FC000001
  dmi.sys.vendor: Sony Corporation
  ---
  ApportVersion: 2.10.2-0ubuntu1
  Architecture: i386
  AudioDevicesInUse:
   USER        PID ACCESS COMMAND
   /dev/snd/controlC0:  ubuntu     4176 F.... pulseaudio
                        ubuntu     6045 F.... pulseaudio
  CasperVersion: 1.333
  DistroRelease: Ubuntu 13.10
  LiveMediaBuild: Ubuntu 13.10 "Saucy Salamander" - Alpha i386 (20130529)
  MachineType: Sony Corporation VGN-FZ260E
  MarkForUpload: True
  Package: linux (not installed)
  PccardctlIdent:
   Socket 0:
     no product info available
  PccardctlStatus:
   Socket 0:
     no card
  ProcEnviron:
   LANGUAGE=en_US
   TERM=xterm
   PATH=(custom, no user)
   LANG=en_US.UTF-8
   SHELL=/bin/bash
  ProcFB:

  ProcKernelCmdLine: noprompt cdrom-detect/try-usb=true persistent file=/cdrom/preseed/hostname.seed boot=casper initrd=/casper/initrd.lz quiet splash -- maybe-ubiquity
  ProcVersionSignature: Ubuntu 3.9.0-3.8-generic 3.9.4
  PulseList:
   Error: command ['pacmd', 'list'] failed with exit code 1: Home directory not accessible: Permission denied
   No PulseAudio daemon running, or not running as session daemon.
  RelatedPackageVersions:
   linux-restricted-modules-3.9.0-3-generic N/A
   linux-backports-modules-3.9.0-3-generic  N/A
   linux-firmware                           1.109
  RfKill:
   0: phy0: Wireless LAN
    Soft blocked: no
    Hard blocked: no
  Tags:  saucy
  Uname: Linux 3.9.0-3-generic i686
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups:

  dmi.bios.date: 07/04/2007
  dmi.bios.vendor: Phoenix Technologies LTD
  dmi.bios.version: R1120J7
  dmi.board.asset.tag: N/A
  dmi.board.name: VAIO
  dmi.board.vendor: Sony Corporation
  dmi.board.version: N/A
  dmi.chassis.asset.tag: N/A
  dmi.chassis.type: 10
  dmi.chassis.vendor: Sony Corporation
  dmi.chassis.version: N/A
  dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvrR1120J7:bd07/04/2007:svnSonyCorporation:pnVGN-FZ260E:pvrFC000001:rvnSonyCorporation:rnVAIO:rvrN/A:cvnSonyCorporation:ct10:cvrN/A:
  dmi.product.name: VGN-FZ260E
  dmi.product.version: FC000001
  dmi.sys.vendor: Sony Corporation

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1009312/+subscriptions