sts-sponsors team mailing list archive
Message #05244
[Bug 2007746] Re: [SRU] xserver crashes when hyperv_drm kernel module is loaded on azure NV series instances w/ nvidia grid driver
** Description changed:
[ Impact ]
- * Microsoft Azure NV-series instances with NVidia GRID drivers started
+ * Microsoft Azure NV-series instances with NVidia GRID drivers started
to experience xserver crashes while following Microsoft's official guide
to installing Nvidia drivers [1].
- * Root cause analysis showed that it was due to having a device with
+ * Root cause analysis showed that it was due to having a device with
BusID "PCI:0@<domain_id>:0:0", where domain id is >= 32767 while the
hyperv_drm kernel module is loaded.
- * Removing either the BusID specification or unloading the hyperv_drm
+ * Removing either the BusID specification or unloading the hyperv_drm
kernel module seems to fix the crash.
- * The crash is happening while X.server is trying to enumerate PCI
+ * The crash is happening while X.server is trying to enumerate PCI
devices. X.server dereferences a NULL pointer while trying to access to
the PCI device info.
- * The reason why it only happens while the hyperv_drm kernel module is
+ * The reason why it only happens while the hyperv_drm kernel module is
loaded is that the hyperv_drm module does not expose PCI hardware
information since it's a virtual device.
- * The upstream patch [2] addresses the issue and it's confirmed that
+ * The upstream patch [2] addresses the issue and it's confirmed that
the xserver with the patch does not experience the crash.
- * Ubuntu Focal `xorg-server` package does not include the patch [2] at
+ * Ubuntu Focal `xorg-server` package does not include the patch [2] at
the moment (xserver-xorg-core=2:1.20.13-1ubuntu1~20.04.6).
- [1]: https://learn.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup#install-grid-drivers-on-nv-or-nvv3-series-vms
- [2]: https://github.com/freedesktop/xorg-xserver/commit/0d93bbfa2cfacbb73741f8bed0e32fa1a656b928
+ [1]: https://learn.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup#install-grid-drivers-on-nv-or-nvv3-series-vms
+ [2]: https://github.com/freedesktop/xorg-xserver/commit/0d93bbfa2cfacbb73741f8bed0e32fa1a656b928
[ Test Plan ]
Part (a) is quoted from Microsoft's official guide [1].
Part (a):
- * Spawn a Microsoft Azure NV-series instance with an NVidia GRID-supported GPU
- - e.g. `NV36adms A10`
- * Install updates, required tooling, and the desktop environment:
- - sudo apt-get update
- - sudo apt-get upgrade -y
- - sudo apt-get dist-upgrade -y
- - sudo apt-get install build-essential ubuntu-desktop -y
- - sudo apt-get install linux-azure -y
- * Disable nouveau kernel driver:
- # Create a blacklist file /etc/modprobe.d/nouveau.conf with following contents:
- blacklist nouveau
- blacklist lbm-nouveau
- * Reboot the VM, re-connect, and then stop X server:
- - sudo reboot
- # wait for the reboot, reconnect, and continue:
- - sudo systemctl stop lightdm.service
- * Download and install the NVidia GRID driver:
- - wget -O NVIDIA-Linux-x86_64-grid.run https://go.microsoft.com/fwlink/?linkid=874272
- - chmod +x NVIDIA-Linux-x86_64-grid.run
- - sudo ./NVIDIA-Linux-x86_64-grid.run
- - # When the setup asks whether you want to run the nvidia-xconfig utility to update your X configuration file, select Yes.
- * Copy /etc/nvidia/gridd.conf.template to /etc/nvidia/gridd.conf
- - sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
- * Edit /etc/nvidia/grid.conf
- - sudo nano /etc/nvidia/grid.conf
- # Append the following lines:
- IgnoreSP=FALSE
- EnableUI=FALSE
- # Remove this line if present:
- FeatureType=0
- # And save.
- * Reboot the VM
+ * Spawn a Microsoft Azure NV-series instance with an NVidia GRID-supported GPU
+ - e.g. `NV36adms A10`
+ * Install updates, required tooling, and the desktop environment:
+ - sudo apt-get update
+ - sudo apt-get upgrade -y
+ - sudo apt-get dist-upgrade -y
+ - sudo apt-get install build-essential ubuntu-desktop -y
+ - sudo apt-get install linux-azure -y
+ * Disable nouveau kernel driver:
+ # Create a blacklist file /etc/modprobe.d/nouveau.conf with following contents:
+ blacklist nouveau
+ blacklist lbm-nouveau
+ * Reboot the VM, re-connect, and then stop X server:
+ - sudo reboot
+ # wait for the reboot, reconnect, and continue:
+ - sudo systemctl stop lightdm.service
+ * Download and install the NVidia GRID driver:
+ - wget -O NVIDIA-Linux-x86_64-grid.run https://go.microsoft.com/fwlink/?linkid=874272
+ - chmod +x NVIDIA-Linux-x86_64-grid.run
+ - sudo ./NVIDIA-Linux-x86_64-grid.run
+ - # When the setup asks whether you want to run the nvidia-xconfig utility to update your X configuration file, select Yes.
+ * Copy /etc/nvidia/gridd.conf.template to /etc/nvidia/gridd.conf
+ - sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
+ * Edit /etc/nvidia/grid.conf
+ - sudo nano /etc/nvidia/grid.conf
+ # Append the following lines:
+ IgnoreSP=FALSE
+ EnableUI=FALSE
+ # Remove this line if present:
+ FeatureType=0
+ # And save.
+ * Reboot the VM
- Part (b):
+ Part (b):
- * Ensure that the hyperv_drm kernel module is loaded:
- - sudo modprobe hyperv_drm
- * Use the attached xorg.conf file to override /etc/X11/xorg.conf file
- * try to start the `xserver`:
- - sudo startx
- * `xserver` should crash with a similar output to the following:
- X.Org X Server 1.20.13
- X Protocol Version 11, Revision 0
- Build Operating System: linux Ubuntu
- Current Operating System: Linux a10test 5.15.0-1031-azure #38~20.04.1-Ubuntu SMP Mon Jan 9 18:23:48 UTC 2023 x86_64
- Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-1031-azure root=PARTUUID=4cac852b-afba-447b-b3e7-c002155c1305 ro console=tty1 console=ttyS0 earlyprintk=ttyS0 panic=-1
- Build Date: 07 February 2023 12:48:13PM
- xorg-server 2:1.20.13-1ubuntu1~20.04.6 (For technical support please see http://www.ubuntu.com/support)
- Current version of pixman: 0.38.4
- Before reporting problems, check http://wiki.x.org
- to make sure that you have the latest version.
- Markers: (--) probed, (**) from config file, (==) default setting,
- (++) from command line, (!!) notice, (II) informational,
- (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
- (==) Log file: "/var/log/Xorg.1.log", Time: Sat Feb 18 10:54:26 2023
- (==) Using config file: "/etc/X11/xorg.conf"
- (==) Using system config directory "/usr/share/X11/xorg.conf.d"
- (EE)
- (EE) Backtrace:
- (EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x13c) [0x55e7787c5ecc]
- (EE) 1: /lib/x86_64-linux-gnu/libpthread.so.0 (funlockfile+0x60) [0x7f9576cac420]
- (EE) 2: /usr/lib/xorg/Xorg (xf86PlatformDeviceCheckBusID+0xa7) [0x55e7786c4db7]
- (EE) 3: /usr/lib/xorg/Xorg (xf86PlatformMatchDriver+0x700) [0x55e7786bf1b0]
- (EE) 4: /usr/lib/xorg/Xorg (xf86CallDriverProbe+0x5c) [0x55e7786971dc]
- (EE) 5: /usr/lib/xorg/Xorg (xf86BusConfig+0x43) [0x55e778697b23]
- (EE) 6: /usr/lib/xorg/Xorg (InitOutput+0x90b) [0x55e7786a59eb]
- (EE) 7: /usr/lib/xorg/Xorg (InitFonts+0x1d4) [0x55e778667ea4]
- (EE) 8: /lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main+0xf3) [0x7f9576ac8083]
- (EE) 9: /usr/lib/xorg/Xorg (_start+0x2e) [0x55e778651ace]
- (EE)
- (EE) Segmentation fault at address 0x124
- (EE)
- Fatal server error:
- (EE) Caught signal 11 (Segmentation fault). Server aborting
- (EE)
- (EE)
- Please consult the The X.Org Foundation support
- at http://wiki.x.org
- for help.
- (EE) Please also check the log file at "/var/log/Xorg.1.log" for additional information.
- (EE)
- (EE) Server terminated with error (1). Closing log file.
- ^Cxinit: giving up
- xinit: unable to connect to X server: Connection refused
- xinit: unexpected signal 2
+ * Ensure that the hyperv_drm kernel module is loaded:
+ - sudo modprobe hyperv_drm
+ * Use the attached xorg.conf file to override /etc/X11/xorg.conf file
+ * try to start the `xserver`:
+ - sudo startx
+ * `xserver` should crash with a similar output to the following:
+ X.Org X Server 1.20.13
+ X Protocol Version 11, Revision 0
+ Build Operating System: linux Ubuntu
+ Current Operating System: Linux a10test 5.15.0-1031-azure #38~20.04.1-Ubuntu SMP Mon Jan 9 18:23:48 UTC 2023 x86_64
+ Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-1031-azure root=PARTUUID=4cac852b-afba-447b-b3e7-c002155c1305 ro console=tty1 console=ttyS0 earlyprintk=ttyS0 panic=-1
+ Build Date: 07 February 2023 12:48:13PM
+ xorg-server 2:1.20.13-1ubuntu1~20.04.6 (For technical support please see http://www.ubuntu.com/support)
+ Current version of pixman: 0.38.4
+ Before reporting problems, check http://wiki.x.org
+ to make sure that you have the latest version.
+ Markers: (--) probed, (**) from config file, (==) default setting,
+ (++) from command line, (!!) notice, (II) informational,
+ (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
+ (==) Log file: "/var/log/Xorg.1.log", Time: Sat Feb 18 10:54:26 2023
+ (==) Using config file: "/etc/X11/xorg.conf"
+ (==) Using system config directory "/usr/share/X11/xorg.conf.d"
+ (EE)
+ (EE) Backtrace:
+ (EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x13c) [0x55e7787c5ecc]
+ (EE) 1: /lib/x86_64-linux-gnu/libpthread.so.0 (funlockfile+0x60) [0x7f9576cac420]
+ (EE) 2: /usr/lib/xorg/Xorg (xf86PlatformDeviceCheckBusID+0xa7) [0x55e7786c4db7]
+ (EE) 3: /usr/lib/xorg/Xorg (xf86PlatformMatchDriver+0x700) [0x55e7786bf1b0]
+ (EE) 4: /usr/lib/xorg/Xorg (xf86CallDriverProbe+0x5c) [0x55e7786971dc]
+ (EE) 5: /usr/lib/xorg/Xorg (xf86BusConfig+0x43) [0x55e778697b23]
+ (EE) 6: /usr/lib/xorg/Xorg (InitOutput+0x90b) [0x55e7786a59eb]
+ (EE) 7: /usr/lib/xorg/Xorg (InitFonts+0x1d4) [0x55e778667ea4]
+ (EE) 8: /lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main+0xf3) [0x7f9576ac8083]
+ (EE) 9: /usr/lib/xorg/Xorg (_start+0x2e) [0x55e778651ace]
+ (EE)
+ (EE) Segmentation fault at address 0x124
+ (EE)
+ Fatal server error:
+ (EE) Caught signal 11 (Segmentation fault). Server aborting
+ (EE)
+ (EE)
+ Please consult the The X.Org Foundation support
+ at http://wiki.x.org
+ for help.
+ (EE) Please also check the log file at "/var/log/Xorg.1.log" for additional information.
+ (EE)
+ (EE) Server terminated with error (1). Closing log file.
+ ^Cxinit: giving up
+ xinit: unable to connect to X server: Connection refused
+ xinit: unexpected signal 2
+
+ # To verify patch fixes the issue:
+ * Enable the following PPA that includes the fix:
+ - sudo add-apt-repository ppa:mustafakemalgilor/lp2007746
+ - sudo apt update
+ * Install the package
+ - sudo apt install xserver-xorg-core=2:1.20.13-1ubuntu1~20.04.6ubuntu1
+ * Try to start xserver:
+ - sudo startx
+ * xserver should not crash.
+
[ Where problems could occur ]
- * The regression risk is low, given that the patch is well-isolated and
+ * The regression risk is low, given that the patch is well-isolated and
basically adds a null check that is already assumed to be there in the
first place.
[ Other Info ]
- * workaround #1: unload hyperv_drm kernel module:
- - sudo modprobe -r hyperv_drm
- * workaround #2: Comment out BusID line in /etc/X11/xorg.conf [Device] section:
- Section "Device"
- Identifier "Device0"
- Driver "nvidia"
- VendorName "NVIDIA Corporation"
- # BusID "PCI:0@32828:0:0"
- Option "HardDPMS" "false"
- Option "CustomEDID" "DFP-0:/etc/X11/vdisplay.edid"
- EndSection
+ * workaround #1: unload hyperv_drm kernel module:
+ - sudo modprobe -r hyperv_drm
+ * workaround #2: Comment out BusID line in /etc/X11/xorg.conf [Device] section:
+ Section "Device"
+ Identifier "Device0"
+ Driver "nvidia"
+ VendorName "NVIDIA Corporation"
+ # BusID "PCI:0@32828:0:0"
+ Option "HardDPMS" "false"
+ Option "CustomEDID" "DFP-0:/etc/X11/vdisplay.edid"
+ EndSection
--
You received this bug notification because you are a member of SE SRU
("STS") Sponsors, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/2007746
Title:
[SRU] xserver crashes when hyperv_drm kernel module is loaded on azure
NV series instances w/ nvidia grid driver
Status in xorg-server package in Ubuntu:
New
Status in xorg-server source package in Focal:
In Progress
Bug description:
[ Impact ]
* Microsoft Azure NV-series instances with NVIDIA GRID drivers
started to experience X server crashes after being set up according to
Microsoft's official guide to installing NVIDIA drivers [1].
* Root cause analysis showed that the crash is triggered by a device with
BusID "PCI:0@<domain_id>:0:0", where the domain ID is >= 32767, while the
hyperv_drm kernel module is loaded.
* Either removing the BusID specification or unloading the hyperv_drm
kernel module seems to avoid the crash.
* The crash happens while the X server is enumerating PCI devices: it
dereferences a NULL pointer while trying to access the PCI device
info.
* It only happens while the hyperv_drm kernel module is loaded because
that module does not expose PCI hardware information, since the device
is virtual.
* The upstream patch [2] addresses the issue, and it has been confirmed
that an xserver with the patch does not crash.
* The Ubuntu Focal `xorg-server` package does not include the patch [2]
at the moment (xserver-xorg-core=2:1.20.13-1ubuntu1~20.04.6).
[1]: https://learn.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup#install-grid-drivers-on-nv-or-nvv3-series-vms
[2]: https://github.com/freedesktop/xorg-xserver/commit/0d93bbfa2cfacbb73741f8bed0e32fa1a656b928
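To tell quickly whether a given configuration falls into the affected range, the domain can be pulled out of the BusID string and compared against the threshold named above. This is only a sketch: the function names `busid_domain` and `busid_is_affected` are made up for illustration, and the `PCI:<bus>@<domain>:<device>:<function>` layout is assumed from the example BusID in this report.

```shell
#!/bin/sh
# Hypothetical helpers, not part of xorg-server or any packaged tool.

# Print the domain part of a BusID such as "PCI:0@32828:0:0".
# Prints nothing and returns 1 if the string has no numeric "@<domain>" part.
busid_domain() {
    case "$1" in
        *@*) ;;
        *) return 1 ;;
    esac
    rest=${1#*@}          # e.g. "32828:0:0"
    domain=${rest%%:*}    # e.g. "32828"
    case "$domain" in
        ''|*[!0-9]*) return 1 ;;
    esac
    printf '%s\n' "$domain"
}

# Succeeds when the domain is in the range this report associates with the crash.
busid_is_affected() {
    d=$(busid_domain "$1") || return 1
    [ "$d" -ge 32767 ]
}
```

For example, `busid_is_affected "PCI:0@32828:0:0" && echo "domain in affected range"` would flag the BusID used in this report. A BusID without an explicit domain is treated as not affected.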
[ Test Plan ]
Part (a) is quoted from Microsoft's official guide [1].
Part (a):
* Spawn a Microsoft Azure NV-series instance with an NVidia GRID-supported GPU
- e.g. `NV36adms A10`
* Install updates, required tooling, and the desktop environment:
- sudo apt-get update
- sudo apt-get upgrade -y
- sudo apt-get dist-upgrade -y
- sudo apt-get install build-essential ubuntu-desktop -y
- sudo apt-get install linux-azure -y
* Disable nouveau kernel driver:
# Create a blacklist file /etc/modprobe.d/nouveau.conf with the following contents:
blacklist nouveau
blacklist lbm-nouveau
* Reboot the VM, re-connect, and then stop X server:
- sudo reboot
# wait for the reboot, reconnect, and continue:
- sudo systemctl stop lightdm.service
* Download and install the NVidia GRID driver:
- wget -O NVIDIA-Linux-x86_64-grid.run https://go.microsoft.com/fwlink/?linkid=874272
- chmod +x NVIDIA-Linux-x86_64-grid.run
- sudo ./NVIDIA-Linux-x86_64-grid.run
- # When the setup asks whether you want to run the nvidia-xconfig utility to update your X configuration file, select Yes.
* Copy /etc/nvidia/gridd.conf.template to /etc/nvidia/gridd.conf
- sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
* Edit /etc/nvidia/gridd.conf
- sudo nano /etc/nvidia/gridd.conf
# Append the following lines:
IgnoreSP=FALSE
EnableUI=FALSE
# Remove this line if present:
FeatureType=0
# And save.
* Reboot the VM
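The two file edits in Part (a) can also be done without an interactive editor. The following is a minimal sketch under the assumption that the files look as described above. The function names are hypothetical, and the target path is passed as an argument so that the helpers can be tried on scratch copies first.

```shell
#!/bin/sh
# Hypothetical helpers; nothing here is provided by the NVIDIA installer.

# Write the nouveau blacklist described above to the given file.
write_nouveau_blacklist() {
    printf 'blacklist nouveau\nblacklist lbm-nouveau\n' > "$1"
}

# Apply the configuration edits described above to the given file:
# drop any "FeatureType=0" line, then append the two settings.
# Note: not idempotent, running it twice appends the settings twice.
update_gridd_conf() {
    conf=$1
    tmp=$(mktemp) || return 1
    sed '/^FeatureType=0[[:space:]]*$/d' "$conf" > "$tmp" || return 1
    printf 'IgnoreSP=FALSE\nEnableUI=FALSE\n' >> "$tmp"
    # Overwrite in place so the original file's ownership and mode are kept.
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}
```

On the VM these would need to run as root, for example `write_nouveau_blacklist /etc/modprobe.d/nouveau.conf` and `update_gridd_conf` on the file that was copied from gridd.conf.template.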
Part (b):
* Ensure that the hyperv_drm kernel module is loaded:
- sudo modprobe hyperv_drm
* Replace /etc/X11/xorg.conf with the attached xorg.conf file
* Try to start the `xserver`:
- sudo startx
* `xserver` should crash with output similar to the following:
X.Org X Server 1.20.13
X Protocol Version 11, Revision 0
Build Operating System: linux Ubuntu
Current Operating System: Linux a10test 5.15.0-1031-azure #38~20.04.1-Ubuntu SMP Mon Jan 9 18:23:48 UTC 2023 x86_64
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-1031-azure root=PARTUUID=4cac852b-afba-447b-b3e7-c002155c1305 ro console=tty1 console=ttyS0 earlyprintk=ttyS0 panic=-1
Build Date: 07 February 2023 12:48:13PM
xorg-server 2:1.20.13-1ubuntu1~20.04.6 (For technical support please see http://www.ubuntu.com/support)
Current version of pixman: 0.38.4
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.1.log", Time: Sat Feb 18 10:54:26 2023
(==) Using config file: "/etc/X11/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
(EE)
(EE) Backtrace:
(EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x13c) [0x55e7787c5ecc]
(EE) 1: /lib/x86_64-linux-gnu/libpthread.so.0 (funlockfile+0x60) [0x7f9576cac420]
(EE) 2: /usr/lib/xorg/Xorg (xf86PlatformDeviceCheckBusID+0xa7) [0x55e7786c4db7]
(EE) 3: /usr/lib/xorg/Xorg (xf86PlatformMatchDriver+0x700) [0x55e7786bf1b0]
(EE) 4: /usr/lib/xorg/Xorg (xf86CallDriverProbe+0x5c) [0x55e7786971dc]
(EE) 5: /usr/lib/xorg/Xorg (xf86BusConfig+0x43) [0x55e778697b23]
(EE) 6: /usr/lib/xorg/Xorg (InitOutput+0x90b) [0x55e7786a59eb]
(EE) 7: /usr/lib/xorg/Xorg (InitFonts+0x1d4) [0x55e778667ea4]
(EE) 8: /lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main+0xf3) [0x7f9576ac8083]
(EE) 9: /usr/lib/xorg/Xorg (_start+0x2e) [0x55e778651ace]
(EE)
(EE) Segmentation fault at address 0x124
(EE)
Fatal server error:
(EE) Caught signal 11 (Segmentation fault). Server aborting
(EE)
(EE)
Please consult the The X.Org Foundation support
at http://wiki.x.org
for help.
(EE) Please also check the log file at "/var/log/Xorg.1.log" for additional information.
(EE)
(EE) Server terminated with error (1). Closing log file.
^Cxinit: giving up
xinit: unable to connect to X server: Connection refused
xinit: unexpected signal 2
# To verify that the patch fixes the issue:
* Enable the following PPA that includes the fix:
- sudo add-apt-repository ppa:mustafakemalgilor/lp2007746
- sudo apt update
* Install the package:
- sudo apt install xserver-xorg-core=2:1.20.13-1ubuntu1~20.04.6ubuntu1
* Try to start xserver:
- sudo startx
* xserver should not crash.
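Whether a run reproduced the crash can be checked from the shell rather than by reading the console output. The sketch below assumes the standard /proc/modules listing format and the log line shown in the crash output above. The function names are hypothetical, and the log path (/var/log/Xorg.1.log in the run above) may differ between runs.

```shell
#!/bin/sh
# Hypothetical helpers for checking the outcome of the test plan.

# Succeeds if the named kernel module appears in a /proc/modules-style listing.
# The listing path defaults to /proc/modules; built-in drivers are not listed there.
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}

# Succeeds if the given Xorg log contains the segmentation fault marker
# that appears in the crash output above.
xorg_log_has_segfault() {
    grep -q 'Caught signal 11 (Segmentation fault)' "$1"
}
```

After `sudo startx`, something like `module_loaded hyperv_drm && xorg_log_has_segfault /var/log/Xorg.1.log && echo "crash reproduced"` would be expected to print on the unpatched package and stay silent on the patched one.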
[ Where problems could occur ]
* The regression risk is low, given that the patch is well-isolated
and essentially adds a NULL check that the surrounding code already
assumed was in place.
[ Other Info ]
* workaround #1: unload hyperv_drm kernel module:
- sudo modprobe -r hyperv_drm
* workaround #2: comment out the BusID line in the "Device" section of /etc/X11/xorg.conf:
Section "Device"
Identifier "Device0"
Driver "nvidia"
VendorName "NVIDIA Corporation"
# BusID "PCI:0@32828:0:0"
Option "HardDPMS" "false"
Option "CustomEDID" "DFP-0:/etc/X11/vdisplay.edid"
EndSection
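Workaround #2 can be applied non-interactively with `sed`. This is a sketch only: the function name is hypothetical, it assumes the keyword is spelled exactly `BusID` as in the section above, and it comments out every active BusID line in the file rather than only the one in a particular section.

```shell
#!/bin/sh
# Hypothetical helper: comment out each active BusID line in an xorg.conf-style file.
# Lines that are already commented out are left untouched, so it is safe to re-run.
comment_out_busid() {
    conf=$1
    tmp=$(mktemp) || return 1
    # Keep the leading indentation and put "# " in front of the keyword.
    sed 's/^\([[:space:]]*\)\(BusID[[:space:]]\)/\1# \2/' "$conf" > "$tmp" || return 1
    # Overwrite in place so the original file's ownership and mode are kept.
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}
```

Run as root against /etc/X11/xorg.conf, ideally after keeping a backup copy, then restart the X server.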