
ubuntu-x-swat team mailing list archive

[Bug 398968] [NEW] [i945GM] X server gets stuck in loop waiting on DRI, reads mouse events but ignores all socket I/O hence appears frozen

 

Public bug reported:

Binary package hint: xserver-xorg-video-intel

This Intel freeze is one where the mouse kept working, and I was able to
log in over SSH and run some diagnostics, which point to DRI and might
help identify the cause.  I have a 945GM chipset, on a laptop.

Kernel DRI is implicated: it's stuck in the kernel, but interruptible
with a signal (in this type of freeze).  But after turning off DRI, the
Intel hardware is still in a stuck state, implying it's not necessarily
a bug in kernel DRI, just getting stuck due to an unexpected hardware
state, which X server initialisation does not clear.  I ran GDB on an X
server with DRI disabled in this state, and found which memory location
it is polling.

All my X lockup problems started with the upgrade from Intrepid to
Jaunty, very soon after Jaunty was released.  I recall one lockup where
the mouse still moved soon after the upgrade, but most have been complete
freezes without even network access.  This was a rare one with the mouse
working, so I took a closer look.

I am not using Compiz, just a standard Ubuntu 9.04 GNOME desktop, and
mostly just Firefox, Emacs and Gnome-Terminal.  Ubuntu's System >
Preferences > Appearance > Visual Effects is set to None, and I had not
run any 3D applications.


Freeze in ioctl(/dev/dri/card0, DRM_I915_GEM_THROTTLE), mouse still works
=========================================================================

Since updating to Jaunty, I've seen freezes every few days or weeks on
my laptop while using X. Usually there is nothing I can do except power
cycle - no network, no keyboard response, Alt-SysRq not working.
Occasionally though, the mouse keeps working.

Kernel version is: 2.6.28-13-generic
CPU is 32-bit x86, a Core Duo 2.0GHz with 2.5GB RAM.
Graphics is Intel 945GM according to Xorg.0.log.
Kernel package:
   linux-image-2.6.28-13-generic   2.6.28-13.45
X packages:
   xserver-xorg-video-intel     2:2.6.3-0ubuntu9.3
   xserver-common               2:1.6.0-0ubuntu14
   xserver-xorg                 1:7.4~5ubuntu18
   xserver-xorg-core            2:1.6.0-0ubuntu14
   xserver-xorg-dev             2:1.6.0-0ubuntu14
   xorg                         1:7.4~5ubuntu18

Like many people I've been having apparently random freezes with Jaunty
on Intel graphics h/w, and more than one kind of freeze.

This time, mouse movement worked: specifically, the pointer moved around
in response.  However, there was no visible reaction from clicking or
key presses or focus changes, and no other screen updates.  Just pointer
movement.

I straced the X server, and saw it blocked on ioctl(0x6458) on
/dev/dri/card0; I think that's DRM_I915_GEM_THROTTLE for this chip.
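As a cross-check, the request number can be split into the fields of the
Linux ioctl encoding.  This is only a minimal sketch: it assumes the
usual x86 layout (8-bit number, 8-bit type, 14-bit size, 2-bit
direction), and it assumes DRM_COMMAND_BASE is 0x40 and
DRM_I915_GEM_THROTTLE is 0x18, as in the drm.h and i915_drm.h headers.
The helper name decode_ioctl is made up for illustration.

```python
# Split a Linux ioctl request number into its encoded fields.
# Field widths assume the common x86 layout from asm-generic/ioctl.h.
DRM_COMMAND_BASE = 0x40        # assumption: value from drm.h
DRM_I915_GEM_THROTTLE = 0x18   # assumption: value from i915_drm.h


def decode_ioctl(request):
    """Return the nr/type/size/dir fields of an ioctl request number."""
    return {
        "nr": request & 0xFF,                # command number
        "type": chr((request >> 8) & 0xFF),  # "magic" character
        "size": (request >> 16) & 0x3FFF,    # argument size in bytes
        "dir": (request >> 30) & 0x3,        # 0 means no data transfer
    }


fields = decode_ioctl(0x6458)
print(fields)
# True if the number matches the i915 throttle command.
print(fields["nr"] - DRM_COMMAND_BASE == DRM_I915_GEM_THROTTLE)
```

Under those assumptions the type is 'd' (the DRM magic), the size and
direction are both 0, and the number 0x58 is DRM_COMMAND_BASE + 0x18,
which is consistent with the guess that this is DRM_I915_GEM_THROTTLE.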

Mouse movements interrupted the ioctl() with a SIGIO signal.  It handled
SIGIO, reading some events, then it went back to calling the DRI
ioctl().  This must be why mouse movement worked.

See [X_DRI_freeze_strace.txt] for the strace, and [X_DRI_freeze_fds.txt]
for the file descriptors.

During SIGIO handling, it called select() only on the /dev/input/eventN
descriptors.  It didn't select on any of the socket descriptors.

Because it didn't check any sockets, all attempts to communicate with
the X server, including starting a new app from the text console, had no
effect on the strace; it remained blocked in the DRI ioctl().  So it's
not surprising there was no screen update from apps.

Clearly the problem was being stuck in ioctl(11, 0x6458, 0) on
/dev/dri/card0, and clearly that call is not _intended_ to block for
long.

Inside the ioctl(), the Xorg process was waiting in kernel function
i915_wait_request(), according to "ps -opid,comm,wchan 3971".  The
numeric WCHAN is 0x6080e if that's helpful.  See
[X_DRI_freeze_wchan.txt].

It looks like the kernel is broken or the chip is stuck in some
"impossible" way.  Judging by the way the X server does not try to
handle sockets while this call is blocked, I guess it is not supposed to
block for long.

As a last resort, I killed the server: "kill 3971" didn't kill it.
"kill -HUP 3971" didn't kill it.  "kill -QUIT 3971" did.

The Xorg.0.log from this X server is [X_DRI_freeze_log.txt].  You can
see from the end that there's nothing interesting at the time it gets
stuck.


Once DRI is stuck, it stays stuck
=================================

After killing it, gdm automatically started another X server.  The new X
server started well, judging by its log.  But it only produced a black
screen (not even the X stipple background).  strace showed it stuck
waiting for the same ioctl() as before, on /dev/dri/card0...  gdm
eventually gave up waiting for it.

The Xorg.log from the second X server is [X2_DRI_frozen_startup.txt].


Disabling DRI with [Option "DRI" "no"] broke the X server
=========================================================

So I edited /etc/X11/xorg.conf, adding 'Option "DRI" "no"', and killed
the second X server with "kill -QUIT" to start a third.

The third X server failed to start.  It didn't even get stuck, it just
output some error messages and gave up.  Perhaps you can't disable DRI
with that option alone (see next section); is that another bug to
report?

The Xorg.log from this X server is [X3_no_DRI_broken_hwstate.txt].

=> This log may be particularly useful because it says a few things
about the stuck hardware state.

  (WW) intel(0): ESR is 0x00000001, instruction error
  (WW) intel(0): PRB0_CTL (0x0001f001) indicates ring buffer enabled
  (WW) intel(0): PRB0_HEAD (0x00000000) and PRB0_TAIL (0x00000050) indicate ring buffer not flushed
  (WW) intel(0): Existing errors found in hardware state.
  Error in I830WaitLpRing(), timeout for 2 seconds
  pgetbl_ctl: 0x9ffc0001 getbl_err: 0x00000000
  ipeir: 0x00000000 iphdr: 0x54300004
  LP ring tail: 0x00000050 head: 0x00000000 len: 0x0001f001 start 0x007bf000
  eir: 0x0000 esr: 0x0001 emr: 0xffff
  instdone: 0x4081 instpm: 0x0000
  memmode: 0x00000301 instps: 0x800f0044
  hwstam: 0xffff ier: 0x0000 imr: 0xffff iir: 0x0000
  Ring at virtual 0xa7794000 head 0x0 tail 0x50 count 20 acthd 0x9c764c
  Ring at virtual 0xa7794000 head 0x0 tail 0x50 count 20
(... above line repeated exactly 64 times ...)
          0001fffc: 00000000      MI_NOOP                                  1
  Ring end
  space: 130984 wanted 131064
  
  Fatal server error:
  lockup
  
  
  Please consult the The X.Org Foundation support 
           at http://wiki.x.org
   for help. 
  Please also check the log file at "/var/log/Xorg.0.log" for additional information.
  
  Error in I830WaitLpRing(), timeout for 2 seconds
  pgetbl_ctl: 0x9ffc0001 getbl_err: 0x00000000
  ipeir: 0x00000000 iphdr: 0x54300004
  LP ring tail: 0x00000050 head: 0x00000000 len: 0x0001f001 start 0x007bf000
  eir: 0x0000 esr: 0x0001 emr: 0xffff
  instdone: 0x4081 instpm: 0x0000
  memmode: 0x00000301 instps: 0x800f0044
  hwstam: 0xffff ier: 0x0000 imr: 0xffff iir: 0x0000
  Ring at virtual 0xa7794000 head 0x0 tail 0x50 count 20 acthd 0x9c764c
  Ring at virtual 0xa7794000 head 0x0 tail 0x50 count 20
(... above line repeated exactly 64 times ...)
          0001fffc: 00000000      MI_NOOP                                  1
  Ring end
  space: 130984 wanted 131064
  
  FatalError re-entered, aborting
  lockup
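The ring-buffer numbers in the log above can be re-derived from the
register values it prints.  This is a rough sketch, not the driver's
code: it assumes that bits 12-20 of PRB0_CTL hold the ring length in
4 KiB pages minus one, that bit 0 is the enable bit, and that free space
is computed as head - (tail + 8), wrapped by the ring size.  The function
names are hypothetical.

```python
# Re-derive the ring-buffer numbers printed in the log above.
# Assumptions: PRB0_CTL bits 12-20 hold (length in 4 KiB pages - 1),
# bit 0 is the enable bit, and free space is head - (tail + 8),
# wrapped around by the ring size.
PAGE_SIZE = 4096


def ring_size(ctl):
    """Ring length in bytes, decoded from the control register."""
    pages = ((ctl >> 12) & 0x1FF) + 1
    return pages * PAGE_SIZE


def ring_space(head, tail, size):
    """Free bytes in the ring, keeping an 8-byte gap before head."""
    space = head - (tail + 8)
    if space < 0:
        space += size
    return space


# Values taken from the log: PRB0_CTL, PRB0_HEAD, PRB0_TAIL.
ctl, head, tail = 0x0001F001, 0x00000000, 0x00000050
size = ring_size(ctl)
print("enabled:", bool(ctl & 1))
print("size:", size)
print("space:", ring_space(head, tail, size))
print("queued dwords:", (tail - head) // 4)
```

With those assumptions the ring is 131072 bytes, the free space comes
out as 130984 (the "space" value in the log), and 20 dwords are queued
(the "count 20" in the log).  The "wanted 131064" figure would then be
the ring size minus 8, i.e. the server waiting for the ring to drain
completely, which cannot happen while head stays at 0.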


Disabling DRI even harder
=========================

'Option "DRI" "no"' didn't work, so I looked for anything else
DRI-related in the xorg.conf, in case that helped disable it properly.

That resulted in commenting out these lines, which came with it by
default:

(in section "Module")
#	Load	"dri"

(at the end)
#Section "DRI"
#	Mode	0666
#EndSection

And then X was able to start without exiting quickly.  Does that mean
the 'Option' didn't disable DRI?  Perhaps I should report this as a bug
too?

This one started, but still didn't actually work.  This time it got
stuck using 100% CPU in userspace.

The log from this one is [X4_no_DRI_stuck_in_loop.txt].

Because it was stuck in userspace, I looked at it with GDB.

GDB said X was stuck inside intel_drv.so, in a short loop like this:

  0xb78b1830:     mov    (%edx),%eax
  0xb78b1832:     test   %eax,%eax
  0xb78b1834:     jns    0xb78b1830

  %edx = 0xb785f200, and %eax = 0x0000000a.

It's polling the word at address 0xb785f200, waiting for bit 31 to
become set.
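The exit condition of that loop can be modelled in a few lines.  This is
just an illustration of what "test %eax,%eax" followed by "jns" checks
on a 32-bit value; the function name is made up.

```python
# Model of the three-instruction loop: "test %eax,%eax" sets the sign
# flag from bit 31, and "jns" jumps back while that flag is clear.
def loop_would_continue(value):
    """True while bit 31 of the 32-bit value read is still clear."""
    return (value & 0x80000000) == 0


print(loop_would_continue(0x0000000A))  # the value seen in GDB
print(loop_would_continue(0x8000000A))  # same value with bit 31 set
```

The first call returns True and the second False, so with the register
reading 0x0000000a on every pass the loop never exits.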

So I looked up that address in /proc/26213/maps, while the process was
still being debugged:

  b77fe000-b787e000 rw-s dc100000 00:00 7477 /sys/devices/pci0000:00/0000:00:02.0/resource0

The offset is 0xb785f200-0xb77fe000 == 0x61200.
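The same lookup can be scripted.  A small sketch, assuming the maps line
has the usual "start-end perms offset dev inode path" layout and that
the mapping begins at the start of the PCI region; the function name
offset_in_mapping is hypothetical.

```python
# Find where an address falls inside one line of /proc/<pid>/maps.
# Assumes the usual "start-end perms offset dev inode path" layout and
# that the mapping starts at the beginning of the mapped resource.
def offset_in_mapping(maps_line, address):
    """Return (offset from mapping start, mapped path)."""
    fields = maps_line.split()
    start_s, end_s = fields[0].split("-")
    start, end = int(start_s, 16), int(end_s, 16)
    if not (start <= address < end):
        raise ValueError("address is outside this mapping")
    return address - start, fields[-1]


line = ("b77fe000-b787e000 rw-s dc100000 00:00 7477 "
        "/sys/devices/pci0000:00/0000:00:02.0/resource0")
off, path = offset_in_mapping(line, 0xB785F200)
print(hex(off), path)
```

For the address in %edx this gives 0x61200 inside resource0, matching
the subtraction done by hand.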

The corresponding PCI device is:

  00:02.0 VGA compatible controller: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller (rev 03)
          Subsystem: Fujitsu Siemens Computers Device 10ad
          Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
          Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
          Latency: 0
          Interrupt: pin A routed to IRQ 16
          Region 0: Memory at dc100000 (32-bit, non-prefetchable) [size=512K]
          Region 1: I/O ports at 1800 [size=8]
          Region 2: Memory at c0000000 (32-bit, prefetchable) [size=256M]
          Region 3: Memory at dc200000 (32-bit, non-prefetchable) [size=256K]
          Capabilities: [90] Message Signalled Interrupts: Mask- 64bit- Queue=0/0 Enable-
                  Address: 00000000  Data: 0000
          Capabilities: [d0] Power Management version 2
                  Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                  Status: D0 PME-Enable- DSel=0 DScale=0 PME-
          Kernel modules: intelfb

So we conclude that with all of DRI disabled, intel_drv.so gets stuck
polling the word at offset 0x61200 from PCI region 0 of a 945GM, waiting
for bit 31 to become set.  Each time I stopped it in GDB, the value it
had read was 0x0a.

That register is called PP_STATUS in i810_reg.h, and bit 31 is PP_ON.
That file doesn't say what 0x0a means.

I don't know if it's a different register on 945GM.  I only found an
obvious definition in i810_reg.h.

Since this is with all of DRI disabled in this (fourth) run of X, it
might be unrelated to the original lockup.  It might be stuck due to
some other part of the video card in an improper state due to previous X
servers being killed.  In particular, PP_STATUS is related to "panel
power" as far as I can tell, and nothing to do with DRI - unless DRI
gets stuck waiting for vsync or something like that.

But it's quite suspicious.


Afterwards
==========

After rebooting, X starts with the same settings (DRI completely
disabled) and works fine.  I've been using it like this ever since, and
haven't had any lockups.  That was 6 days ago.

The lockups were quite variable before, happening anywhere from every
few days to about every 2 weeks, and I haven't used the computer a lot
since the bug was noticed.

If I don't get any lockups for a month with the new settings, we can be
more confident that this one needs DRI enabled to trigger it.

Hope the debugging is useful, or even better if it's already fixed
upstream :-)

** Affects: xserver-xorg-video-intel (Ubuntu)
     Importance: Undecided
         Status: New

-- 
[i945GM] X server gets stuck in loop waiting on DRI, reads mouse events but ignores all socket I/O hence appears frozen
https://bugs.launchpad.net/bugs/398968
You received this bug notification because you are a member of Ubuntu-X,
which is subscribed to xserver-xorg-video-intel in ubuntu.


