
sts-sponsors team mailing list archive

[Bug 2007746] Re: [SRU] xserver crashes when hyperv_drm kernel module is loaded on azure NV series instances w/ nvidia grid driver

 

** Description changed:

  [ Impact ]
  
-  * Microsoft Azure NV-series instances with NVidia GRID drivers started
+  * Microsoft Azure NV-series instances with NVidia GRID drivers started
  to experience xserver crashes while following Microsoft's official guide
  to installing Nvidia drivers [1].
  
-  * Root cause analysis showed that it was due to having a device with
+  * Root cause analysis showed that it was due to having a device with
  BusID "PCI:0@<domain_id>:0:0", where domain id is >= 32767 while the
  hyperv_drm kernel module is loaded.
  
-  * Removing either the BusID specification or unloading the hyperv_drm
+  * Removing either the BusID specification or unloading the hyperv_drm
  kernel module seems to fix the crash.
  
-  * The crash is happening while X.server is trying to enumerate PCI
+  * The crash is happening while X.server is trying to enumerate PCI
  devices. X.server dereferences a NULL pointer while trying to access to
  the PCI device info.
  
-  * The reason why it only happens while the hyperv_drm kernel module is
+  * The reason why it only happens while the hyperv_drm kernel module is
  loaded is that the hyperv_drm module does not expose PCI hardware
  information since it's a virtual device.
  
-  * The upstream patch [2] addresses the issue and it's confirmed that
+  * The upstream patch [2] addresses the issue and it's confirmed that
  the xserver with the patch does not experience the crash.
  
-  * Ubuntu Focal `xorg-server` package does not include the patch [2] at
+  * Ubuntu Focal `xorg-server` package does not include the patch [2] at
  the moment (xserver-xorg-core=2:1.20.13-1ubuntu1~20.04.6).
  
-  [1]: https://learn.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup#install-grid-drivers-on-nv-or-nvv3-series-vms
-  [2]: https://github.com/freedesktop/xorg-xserver/commit/0d93bbfa2cfacbb73741f8bed0e32fa1a656b928
+  [1]: https://learn.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup#install-grid-drivers-on-nv-or-nvv3-series-vms
+  [2]: https://github.com/freedesktop/xorg-xserver/commit/0d93bbfa2cfacbb73741f8bed0e32fa1a656b928
  
  [ Test Plan ]
  
  Part (a) is quoted from Microsoft's official guide [1].
  
  Part (a):
  
-  * Spawn a Microsoft Azure NV-series instance with an NVidia GRID-supported GPU
-    - e.g. `NV36adms A10`
-  * Install updates, required tooling, and the desktop environment:
-    - sudo apt-get update
-    - sudo apt-get upgrade -y
-    - sudo apt-get dist-upgrade -y
-    - sudo apt-get install build-essential ubuntu-desktop -y
-    - sudo apt-get install linux-azure -y
-  * Disable nouveau kernel driver:
-    # Create a blacklist file /etc/modprobe.d/nouveau.conf with following contents:
-    blacklist nouveau
-    blacklist lbm-nouveau 
-  * Reboot the VM, re-connect, and then stop X server:
-    - sudo reboot
-    # wait for the reboot, reconnect, and continue:
-    - sudo systemctl stop lightdm.service
-  * Download and install the NVidia GRID driver:
-    - wget -O NVIDIA-Linux-x86_64-grid.run https://go.microsoft.com/fwlink/?linkid=874272 
-    - chmod +x NVIDIA-Linux-x86_64-grid.run
-    - sudo ./NVIDIA-Linux-x86_64-grid.run
-    - # When the setup asks whether you want to run the nvidia-xconfig utility to update your X configuration file, select Yes.
-  * Copy /etc/nvidia/gridd.conf.template to /etc/nvidia/gridd.conf
-    - sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
-  * Edit /etc/nvidia/grid.conf
-    - sudo nano /etc/nvidia/grid.conf
-    # Append the following lines:
-    IgnoreSP=FALSE
-    EnableUI=FALSE
-    # Remove this line if present:
-    FeatureType=0
-    # And save.
-  * Reboot the VM
+  * Spawn a Microsoft Azure NV-series instance with an NVidia GRID-supported GPU
+    - e.g. `NV36adms A10`
+  * Install updates, required tooling, and the desktop environment:
+    - sudo apt-get update
+    - sudo apt-get upgrade -y
+    - sudo apt-get dist-upgrade -y
+    - sudo apt-get install build-essential ubuntu-desktop -y
+    - sudo apt-get install linux-azure -y
+  * Disable nouveau kernel driver:
+    # Create a blacklist file /etc/modprobe.d/nouveau.conf with following contents:
+    blacklist nouveau
+    blacklist lbm-nouveau
+  * Reboot the VM, re-connect, and then stop X server:
+    - sudo reboot
+    # wait for the reboot, reconnect, and continue:
+    - sudo systemctl stop lightdm.service
+  * Download and install the NVidia GRID driver:
+    - wget -O NVIDIA-Linux-x86_64-grid.run https://go.microsoft.com/fwlink/?linkid=874272
+    - chmod +x NVIDIA-Linux-x86_64-grid.run
+    - sudo ./NVIDIA-Linux-x86_64-grid.run
+    - # When the setup asks whether you want to run the nvidia-xconfig utility to update your X configuration file, select Yes.
+  * Copy /etc/nvidia/gridd.conf.template to /etc/nvidia/gridd.conf
+    - sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
+  * Edit /etc/nvidia/grid.conf
+    - sudo nano /etc/nvidia/grid.conf
+    # Append the following lines:
+    IgnoreSP=FALSE
+    EnableUI=FALSE
+    # Remove this line if present:
+    FeatureType=0
+    # And save.
+  * Reboot the VM
  
-  Part (b):
+  Part (b):
  
-   * Ensure that the hyperv_drm kernel module is loaded:
-     - sudo modprobe hyperv_drm 
-   * Use the attached xorg.conf file to override /etc/X11/xorg.conf file
-   * try to start the `xserver`:
-     - sudo startx
-   * `xserver` should crash with a similar output to the following:
-   X.Org X Server 1.20.13
-   X Protocol Version 11, Revision 0
-   Build Operating System: linux Ubuntu
-   Current Operating System: Linux a10test 5.15.0-1031-azure #38~20.04.1-Ubuntu SMP Mon Jan 9 18:23:48 UTC 2023 x86_64
-   Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-1031-azure root=PARTUUID=4cac852b-afba-447b-b3e7-c002155c1305 ro console=tty1 console=ttyS0 earlyprintk=ttyS0 panic=-1
-   Build Date: 07 February 2023  12:48:13PM
-   xorg-server 2:1.20.13-1ubuntu1~20.04.6 (For technical support please see http://www.ubuntu.com/support) 
-   Current version of pixman: 0.38.4
-     Before reporting problems, check http://wiki.x.org
-     to make sure that you have the latest version.
-   Markers: (--) probed, (**) from config file, (==) default setting,
-     (++) from command line, (!!) notice, (II) informational,
-     (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
-   (==) Log file: "/var/log/Xorg.1.log", Time: Sat Feb 18 10:54:26 2023
-   (==) Using config file: "/etc/X11/xorg.conf"
-   (==) Using system config directory "/usr/share/X11/xorg.conf.d"
-   (EE) 
-   (EE) Backtrace:
-   (EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x13c) [0x55e7787c5ecc]
-   (EE) 1: /lib/x86_64-linux-gnu/libpthread.so.0 (funlockfile+0x60) [0x7f9576cac420]
-   (EE) 2: /usr/lib/xorg/Xorg (xf86PlatformDeviceCheckBusID+0xa7) [0x55e7786c4db7]
-   (EE) 3: /usr/lib/xorg/Xorg (xf86PlatformMatchDriver+0x700) [0x55e7786bf1b0]
-   (EE) 4: /usr/lib/xorg/Xorg (xf86CallDriverProbe+0x5c) [0x55e7786971dc]
-   (EE) 5: /usr/lib/xorg/Xorg (xf86BusConfig+0x43) [0x55e778697b23]
-   (EE) 6: /usr/lib/xorg/Xorg (InitOutput+0x90b) [0x55e7786a59eb]
-   (EE) 7: /usr/lib/xorg/Xorg (InitFonts+0x1d4) [0x55e778667ea4]
-   (EE) 8: /lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main+0xf3) [0x7f9576ac8083]
-   (EE) 9: /usr/lib/xorg/Xorg (_start+0x2e) [0x55e778651ace]
-   (EE) 
-   (EE) Segmentation fault at address 0x124
-   (EE) 
-   Fatal server error:
-   (EE) Caught signal 11 (Segmentation fault). Server aborting
-   (EE) 
-   (EE) 
-   Please consult the The X.Org Foundation support 
-      at http://wiki.x.org
-    for help. 
-   (EE) Please also check the log file at "/var/log/Xorg.1.log" for additional information.
-   (EE) 
-   (EE) Server terminated with error (1). Closing log file.
-   ^Cxinit: giving up
-   xinit: unable to connect to X server: Connection refused
-   xinit: unexpected signal 2
+   * Ensure that the hyperv_drm kernel module is loaded:
+     - sudo modprobe hyperv_drm
+   * Use the attached xorg.conf file to override /etc/X11/xorg.conf file
+   * try to start the `xserver`:
+     - sudo startx
+   * `xserver` should crash with a similar output to the following:
+   X.Org X Server 1.20.13
+   X Protocol Version 11, Revision 0
+   Build Operating System: linux Ubuntu
+   Current Operating System: Linux a10test 5.15.0-1031-azure #38~20.04.1-Ubuntu SMP Mon Jan 9 18:23:48 UTC 2023 x86_64
+   Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-1031-azure root=PARTUUID=4cac852b-afba-447b-b3e7-c002155c1305 ro console=tty1 console=ttyS0 earlyprintk=ttyS0 panic=-1
+   Build Date: 07 February 2023  12:48:13PM
+   xorg-server 2:1.20.13-1ubuntu1~20.04.6 (For technical support please see http://www.ubuntu.com/support)
+   Current version of pixman: 0.38.4
+     Before reporting problems, check http://wiki.x.org
+     to make sure that you have the latest version.
+   Markers: (--) probed, (**) from config file, (==) default setting,
+     (++) from command line, (!!) notice, (II) informational,
+     (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
+   (==) Log file: "/var/log/Xorg.1.log", Time: Sat Feb 18 10:54:26 2023
+   (==) Using config file: "/etc/X11/xorg.conf"
+   (==) Using system config directory "/usr/share/X11/xorg.conf.d"
+   (EE)
+   (EE) Backtrace:
+   (EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x13c) [0x55e7787c5ecc]
+   (EE) 1: /lib/x86_64-linux-gnu/libpthread.so.0 (funlockfile+0x60) [0x7f9576cac420]
+   (EE) 2: /usr/lib/xorg/Xorg (xf86PlatformDeviceCheckBusID+0xa7) [0x55e7786c4db7]
+   (EE) 3: /usr/lib/xorg/Xorg (xf86PlatformMatchDriver+0x700) [0x55e7786bf1b0]
+   (EE) 4: /usr/lib/xorg/Xorg (xf86CallDriverProbe+0x5c) [0x55e7786971dc]
+   (EE) 5: /usr/lib/xorg/Xorg (xf86BusConfig+0x43) [0x55e778697b23]
+   (EE) 6: /usr/lib/xorg/Xorg (InitOutput+0x90b) [0x55e7786a59eb]
+   (EE) 7: /usr/lib/xorg/Xorg (InitFonts+0x1d4) [0x55e778667ea4]
+   (EE) 8: /lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main+0xf3) [0x7f9576ac8083]
+   (EE) 9: /usr/lib/xorg/Xorg (_start+0x2e) [0x55e778651ace]
+   (EE)
+   (EE) Segmentation fault at address 0x124
+   (EE)
+   Fatal server error:
+   (EE) Caught signal 11 (Segmentation fault). Server aborting
+   (EE)
+   (EE)
+   Please consult the The X.Org Foundation support
+      at http://wiki.x.org
+    for help.
+   (EE) Please also check the log file at "/var/log/Xorg.1.log" for additional information.
+   (EE)
+   (EE) Server terminated with error (1). Closing log file.
+   ^Cxinit: giving up
+   xinit: unable to connect to X server: Connection refused
+   xinit: unexpected signal 2
+ 
+ # To verify patch fixes the issue:
+ * Enable the following PPA that includes the fix: 
+   - sudo add-apt-repository ppa:mustafakemalgilor/lp2007746
+   - sudo apt update
+ * Install the package
+   - sudo apt install xserver-xorg-core=2:1.20.13-1ubuntu1~20.04.6ubuntu1
+ * Try to start xserver:
+   - sudo startx
+ * xserver should not crash.
+    
  
  [ Where problems could occur ]
  
-  * The regression risk is low, given that the patch is well-isolated and
+  * The regression risk is low, given that the patch is well-isolated and
  basically adds a null check that is already assumed to be there in the
  first place.
  
  [ Other Info ]
  
-  * workaround #1: unload hyperv_drm kernel module:
-    - sudo modprobe -r hyperv_drm
-  * workaround #2: Comment out BusID line in /etc/X11/xorg.conf [Device] section:
-    Section "Device"
-       Identifier     "Device0"
-       Driver         "nvidia"
-       VendorName     "NVIDIA Corporation"
-       # BusID          "PCI:0@32828:0:0"
-       Option         "HardDPMS" "false"
-       Option         "CustomEDID" "DFP-0:/etc/X11/vdisplay.edid"
-    EndSection
+  * workaround #1: unload hyperv_drm kernel module:
+    - sudo modprobe -r hyperv_drm
+  * workaround #2: Comment out BusID line in /etc/X11/xorg.conf [Device] section:
+    Section "Device"
+       Identifier     "Device0"
+       Driver         "nvidia"
+       VendorName     "NVIDIA Corporation"
+       # BusID          "PCI:0@32828:0:0"
+       Option         "HardDPMS" "false"
+       Option         "CustomEDID" "DFP-0:/etc/X11/vdisplay.edid"
+    EndSection

-- 
You received this bug notification because you are a member of SE SRU
("STS") Sponsors, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/2007746

Title:
  [SRU] xserver crashes when hyperv_drm kernel module is loaded on azure
  NV series instances w/ nvidia grid driver

Status in xorg-server package in Ubuntu:
  New
Status in xorg-server source package in Focal:
  In Progress

Bug description:
  [ Impact ]

   * Microsoft Azure NV-series instances with NVidia GRID drivers
  started to experience xserver crashes while following Microsoft's
  official guide to installing Nvidia drivers [1].

   * Root cause analysis showed that it was due to having a device with
  BusID "PCI:0@<domain_id>:0:0", where domain id is >= 32767 while the
  hyperv_drm kernel module is loaded.

   * Either removing the BusID specification or unloading the hyperv_drm
  kernel module seems to avoid the crash.

   * The crash happens while the X server is enumerating PCI devices:
  the X server dereferences a NULL pointer while trying to access the
  PCI device info.

   * The crash only happens while the hyperv_drm kernel module is loaded
  because that module drives a virtual device and therefore does not
  expose PCI hardware information.

   * The upstream patch [2] addresses the issue and it's confirmed that
  the xserver with the patch does not experience the crash.

   * The Ubuntu Focal `xorg-server` package does not include the patch
  [2] at the moment (xserver-xorg-core=2:1.20.13-1ubuntu1~20.04.6).

   [1]: https://learn.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup#install-grid-drivers-on-nv-or-nvv3-series-vms
   [2]: https://github.com/freedesktop/xorg-xserver/commit/0d93bbfa2cfacbb73741f8bed0e32fa1a656b928
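
  The two preconditions above (a BusID whose domain id is >= 32767, and a
  loaded hyperv_drm module) can be checked with a short shell sketch. The
  function names are hypothetical and only illustrate the conditions
  described in this report; they are not part of any package.

```shell
#!/bin/sh
# Sketch: helpers for checking the crash preconditions described above.
# All function names are hypothetical; the 32767 threshold and the
# "PCI:0@<domain_id>:0:0" format are taken from the bug description.

# Print the PCI domain of a BusID such as "PCI:0@32828:0:0".
# Prints nothing if the BusID has no "@<domain>" part.
busid_domain() {
    printf '%s\n' "$1" | sed -n 's/^PCI:[0-9][0-9]*@\([0-9][0-9]*\):.*$/\1/p'
}

# Return 0 if the BusID carries a domain id at or above 32767.
busid_is_affected() {
    domain=$(busid_domain "$1")
    [ -n "$domain" ] && [ "$domain" -ge 32767 ]
}

# Return 0 if the hyperv_drm kernel module is currently loaded.
hyperv_drm_loaded() {
    grep -q '^hyperv_drm ' /proc/modules 2>/dev/null
}

# Example usage:
#   if busid_is_affected "PCI:0@32828:0:0" && hyperv_drm_loaded; then
#       echo "this configuration matches the crash preconditions"
#   fi
```

  With the BusID from the workaround section ("PCI:0@32828:0:0"), the
  extracted domain is 32828, which is above the threshold.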

  [ Test Plan ]

  Part (a) is quoted from Microsoft's official guide [1].

  Part (a):

   * Spawn a Microsoft Azure NV-series instance with an NVidia GRID-supported GPU
     - e.g. `NV36adms A10`
   * Install updates, required tooling, and the desktop environment:
     - sudo apt-get update
     - sudo apt-get upgrade -y
     - sudo apt-get dist-upgrade -y
     - sudo apt-get install build-essential ubuntu-desktop -y
     - sudo apt-get install linux-azure -y
   * Disable the nouveau kernel driver:
     # Create a blacklist file /etc/modprobe.d/nouveau.conf with the following contents:
     blacklist nouveau
     blacklist lbm-nouveau
   * Reboot the VM, re-connect, and then stop X server:
     - sudo reboot
     # wait for the reboot, reconnect, and continue:
     - sudo systemctl stop lightdm.service
   * Download and install the NVidia GRID driver:
     - wget -O NVIDIA-Linux-x86_64-grid.run https://go.microsoft.com/fwlink/?linkid=874272
     - chmod +x NVIDIA-Linux-x86_64-grid.run
     - sudo ./NVIDIA-Linux-x86_64-grid.run
     # When the setup asks whether you want to run the nvidia-xconfig utility to update your X configuration file, select Yes.
   * Copy /etc/nvidia/gridd.conf.template to /etc/nvidia/gridd.conf
     - sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
   * Edit /etc/nvidia/gridd.conf
     - sudo nano /etc/nvidia/gridd.conf
     # Append the following lines:
     IgnoreSP=FALSE
     EnableUI=FALSE
     # Remove this line if present:
     FeatureType=0
     # And save.
   * Reboot the VM
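
  The two file edits in Part (a) can also be scripted instead of done by
  hand. The following is a minimal sketch, assuming the file to edit is
  the /etc/nvidia/gridd.conf copy created in the previous step; the
  function names are hypothetical. Both functions take the target path as
  an argument so they can be tried on a scratch file first.

```shell
#!/bin/sh
# Sketch: script the manual file edits from Part (a).
# Function names are hypothetical; file paths and settings are the ones
# listed in the test plan.

# Write the nouveau blacklist to the given file.
write_nouveau_blacklist() {
    target=${1:-/etc/modprobe.d/nouveau.conf}
    printf 'blacklist nouveau\nblacklist lbm-nouveau\n' > "$target"
}

# Remove the FeatureType=0 line if present, then append the two settings.
configure_gridd() {
    conf=${1:-/etc/nvidia/gridd.conf}
    sed '/^FeatureType=0$/d' "$conf" > "$conf.tmp" && mv "$conf.tmp" "$conf"
    printf 'IgnoreSP=FALSE\nEnableUI=FALSE\n' >> "$conf"
}

# Example usage (as root, on the test VM):
#   write_nouveau_blacklist
#   cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
#   configure_gridd
```

  Running configure_gridd twice appends the two settings twice, so it is
  meant to be run once on a freshly copied template.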

   Part (b):

    * Ensure that the hyperv_drm kernel module is loaded:
      - sudo modprobe hyperv_drm
    * Replace /etc/X11/xorg.conf with the attached xorg.conf file
    * Try to start the `xserver`:
      - sudo startx
    * `xserver` should crash with output similar to the following:
    X.Org X Server 1.20.13
    X Protocol Version 11, Revision 0
    Build Operating System: linux Ubuntu
    Current Operating System: Linux a10test 5.15.0-1031-azure #38~20.04.1-Ubuntu SMP Mon Jan 9 18:23:48 UTC 2023 x86_64
    Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-1031-azure root=PARTUUID=4cac852b-afba-447b-b3e7-c002155c1305 ro console=tty1 console=ttyS0 earlyprintk=ttyS0 panic=-1
    Build Date: 07 February 2023  12:48:13PM
    xorg-server 2:1.20.13-1ubuntu1~20.04.6 (For technical support please see http://www.ubuntu.com/support)
    Current version of pixman: 0.38.4
      Before reporting problems, check http://wiki.x.org
      to make sure that you have the latest version.
    Markers: (--) probed, (**) from config file, (==) default setting,
      (++) from command line, (!!) notice, (II) informational,
      (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
    (==) Log file: "/var/log/Xorg.1.log", Time: Sat Feb 18 10:54:26 2023
    (==) Using config file: "/etc/X11/xorg.conf"
    (==) Using system config directory "/usr/share/X11/xorg.conf.d"
    (EE)
    (EE) Backtrace:
    (EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x13c) [0x55e7787c5ecc]
    (EE) 1: /lib/x86_64-linux-gnu/libpthread.so.0 (funlockfile+0x60) [0x7f9576cac420]
    (EE) 2: /usr/lib/xorg/Xorg (xf86PlatformDeviceCheckBusID+0xa7) [0x55e7786c4db7]
    (EE) 3: /usr/lib/xorg/Xorg (xf86PlatformMatchDriver+0x700) [0x55e7786bf1b0]
    (EE) 4: /usr/lib/xorg/Xorg (xf86CallDriverProbe+0x5c) [0x55e7786971dc]
    (EE) 5: /usr/lib/xorg/Xorg (xf86BusConfig+0x43) [0x55e778697b23]
    (EE) 6: /usr/lib/xorg/Xorg (InitOutput+0x90b) [0x55e7786a59eb]
    (EE) 7: /usr/lib/xorg/Xorg (InitFonts+0x1d4) [0x55e778667ea4]
    (EE) 8: /lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main+0xf3) [0x7f9576ac8083]
    (EE) 9: /usr/lib/xorg/Xorg (_start+0x2e) [0x55e778651ace]
    (EE)
    (EE) Segmentation fault at address 0x124
    (EE)
    Fatal server error:
    (EE) Caught signal 11 (Segmentation fault). Server aborting
    (EE)
    (EE)
    Please consult the The X.Org Foundation support
       at http://wiki.x.org
     for help.
    (EE) Please also check the log file at "/var/log/Xorg.1.log" for additional information.
    (EE)
    (EE) Server terminated with error (1). Closing log file.
    ^Cxinit: giving up
    xinit: unable to connect to X server: Connection refused
    xinit: unexpected signal 2

  To verify that the patch fixes the issue:
  * Enable the following PPA that includes the fix:
    - sudo add-apt-repository ppa:mustafakemalgilor/lp2007746
    - sudo apt update
  * Install the package
    - sudo apt install xserver-xorg-core=2:1.20.13-1ubuntu1~20.04.6ubuntu1
  * Try to start xserver:
    - sudo startx
  * xserver should not crash.
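
  Besides watching the console, the outcome of `sudo startx` can be read
  from the Xorg log. The following is a minimal sketch, assuming the log
  path printed in the crash output (/var/log/Xorg.1.log; the number may
  differ on another run); the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: detect the crash signature from the test plan in an Xorg log.
# The function name is hypothetical; the matched string is the
# "Caught signal 11" line from the crash output shown above.

# Return 0 if the given log file records the segmentation fault.
xorg_log_has_segfault() {
    grep -q 'Caught signal 11 (Segmentation fault)' "$1"
}

# Example usage (on the test VM, after running `sudo startx`):
#   dpkg-query -W -f='${Version}\n' xserver-xorg-core
#   if xorg_log_has_segfault /var/log/Xorg.1.log; then
#       echo "xserver still crashes"
#   else
#       echo "no segmentation fault recorded"
#   fi
```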
     

  [ Where problems could occur ]

   * The regression risk is low, given that the patch is well-isolated
  and only adds a null check that was already assumed to be there in
  the first place.

  [ Other Info ]

   * Workaround #1: unload the hyperv_drm kernel module:
     - sudo modprobe -r hyperv_drm
   * Workaround #2: comment out the BusID line in the "Device" section of /etc/X11/xorg.conf:
     Section "Device"
        Identifier     "Device0"
        Driver         "nvidia"
        VendorName     "NVIDIA Corporation"
        # BusID          "PCI:0@32828:0:0"
        Option         "HardDPMS" "false"
        Option         "CustomEDID" "DFP-0:/etc/X11/vdisplay.edid"
     EndSection
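
  Workaround #2 can be applied without an editor. The following is a
  minimal sketch, assuming a BusID keyword written exactly as in the
  section above; the function name and the .bak backup suffix are
  hypothetical. Lines that are already commented out are left untouched.

```shell
#!/bin/sh
# Sketch: comment out active BusID lines in an xorg.conf file.
# The function name and the ".bak" backup suffix are hypothetical.
# Only the spelling "BusID" is matched, as used in the section above.

comment_out_busid() {
    conf=${1:-/etc/X11/xorg.conf}
    # Keep a backup of the original file next to it.
    cp "$conf" "$conf.bak"
    # Prefix "# " to lines whose first non-blank token is BusID.
    sed 's/^\([[:space:]]*\)\(BusID[[:space:]]\)/\1# \2/' "$conf.bak" > "$conf"
}

# Example usage (as root, on the test VM):
#   comment_out_busid /etc/X11/xorg.conf
```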

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/xorg-server/+bug/2007746/+subscriptions