
kernel-packages team mailing list archive

[Bug 1491797] Re: Shuts down when supposed to suspend as a reaction to self-caused overheat, session lost

 

Let's take this one point at a time:
* fan not running at full speed in disengaged mode in a thermal emergency
   - as mentioned earlier, the default fan mode on this machine is firmware control, in which case the fan runs in engaged mode under a feedback loop controller, so it never exceeds a top speed of 3500 RPM. This matches the manufacturer's original thermal design. So either the manufacturer made a mistake and all machines like yours overheat (in which case we would see many owners of this model reporting this bug), or this issue is specific to your machine.

* CPU not throttling in a thermal emergency (unless the frequency readings are wrong)
  - that needs investigation as thermald should be doing that (but as I mentioned earlier, I will examine the thermald issues later)

* shutting down when supposed to suspend as a reaction to overheat, unnecessarily destroying session
  - when a critical thermal event occurs, there is only a very short window in which to react. The silicon may be permanently damaged, so the kernel chooses to power down rather than try to suspend (suspend can get stuck and exacerbate the problem). If the kernel did not handle this thermal event, the next step would be for the hardware to cut power itself, which is outside any operating system control. Either way, the machine is desperately trying to save itself from breaking.

* destroying session in a shut down/restart cycle (I heard rumours this may be fixed later in Snappy with containers)
  - again, when rushing to save your silicon from irreparable damage, shutdown is the fastest mechanism. Snappy containers will not help.
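
If you want to check the first three points yourself, one rough approach is to take snapshots of the standard kernel interfaces while the machine is under load. The sketch below is not part of thermald. It assumes the thinkpad_acpi driver exposes the fan at /proc/acpi/ibm/fan, that temperatures in /sys/class/thermal are in millidegrees Celsius, and that scaling_cur_freq is in kHz. The optional first argument of each function is a hypothetical path prefix used only for testing; leave it empty on a real machine.

```shell
#!/bin/sh
# Sketch: one-shot thermal snapshot from standard Linux interfaces.
# The optional first argument of each function is a hypothetical path
# prefix (for testing against copied files); leave it empty on a real machine.

# Convert millidegrees Celsius (the unit of thermal sysfs files) to whole degrees.
milli_to_c() {
    case $1 in
        ''|-|*[!0-9-]*) echo '?' ;;   # unreadable or non-numeric value
        *) echo $(( $1 / 1000 )) ;;
    esac
}

# Print each thermal zone's current temperature and its trip points
# (active, passive, hot, critical), e.g. the 100 C critical trip from kern.log.
thermal_zones() {
    local root="${1:-}" zone type temp trip
    for zone in "$root"/sys/class/thermal/thermal_zone*; do
        [ -r "$zone/temp" ] || continue       # glob matched nothing
        type=$(cat "$zone/type" 2>/dev/null || echo unknown)
        temp=$(cat "$zone/temp" 2>/dev/null)
        printf '%s (%s): %s C\n' "${zone##*/}" "$type" "$(milli_to_c "$temp")"
        for trip in "$zone"/trip_point_*_type; do
            [ -r "$trip" ] || continue
            # trip_point_N_type names the trip; trip_point_N_temp holds its threshold.
            printf '  %s trip at %s C\n' "$(cat "$trip")" \
                "$(milli_to_c "$(cat "${trip%_type}_temp" 2>/dev/null)")"
        done
    done
}

# Print the current frequency of each CPU; scaling_cur_freq is in kHz.
cpu_freqs() {
    local root="${1:-}" f cpu
    for f in "$root"/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq; do
        [ -r "$f" ] || continue
        cpu=${f%/cpufreq/*}
        printf '%s: %s MHz\n' "${cpu##*/}" "$(( $(cat "$f") / 1000 ))"
    done
}

# Print status, speed (RPM) and level from the thinkpad_acpi fan interface.
thinkpad_fan() {
    local f="${1:-}/proc/acpi/ibm/fan"
    if [ -r "$f" ]; then
        grep -E '^(status|speed|level):' "$f"
    else
        echo "no thinkpad_acpi fan interface at $f"
    fi
}
```

Source the file and run `thermal_zones; cpu_freqs; thinkpad_fan` a few times while the machine heats up. If the frequencies stay at their maximum as a zone approaches its critical trip point, that would support the concern that the CPU is not being throttled.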

I'd recommend reading https://en.wikipedia.org/wiki/Thermal_design_power; there is a paragraph that states:
"Most modern processors will cause a therm-trip only upon a catastrophic cooling failure, such as a no longer operational fan or an incorrectly mounted heatsink."

So, the next step is to find out what thermald is doing.

1. Stop thermald so we can re-run it with full debugging enabled:

sudo systemctl stop thermald (if you are using systemd)

or

sudo service thermald stop (if you are using upstart)

2. Run thermald from the command line for a while and capture its debug output:

sudo thermald --no-daemon --dbus-enable --loglevel=debug | tee thermald.log

Run this for, say, 5-10 minutes while using your machine, then attach
thermald.log to the bug report.

3. Restart thermald:

sudo systemctl start thermald (if you are using systemd)

or

sudo service thermald start (if you are using upstart)
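
If you would rather do steps 1-3 in one go, the rough script below wraps them. It is only a sketch: it assumes that the existence of /run/systemd/system means systemd is running (the same check sd_booted(3) uses), and it uses timeout(1) to end the debug run after a fixed time. DURATION and LOG are hypothetical variable names introduced here, not thermald options.

```shell
#!/bin/sh
# Sketch of steps 1-3 in one script. DURATION and LOG are hypothetical
# settings introduced here; they are not thermald options.
DURATION=${DURATION:-600}      # debug run length in seconds (600 s = 10 min)
LOG=${LOG:-thermald.log}       # file the debug output is written to

# Stop or start the thermald service with whichever init system is running.
# /run/systemd/system exists only when systemd is PID 1 (see sd_booted(3)).
thermald_service() {
    if [ -d /run/systemd/system ]; then
        sudo systemctl "$1" thermald
    else
        sudo service thermald "$1"     # upstart
    fi
}

capture_thermald_debug() {
    thermald_service stop                                  # step 1
    # Step 2: run in the foreground with debug logging; timeout(1) sends
    # SIGTERM after DURATION seconds so step 3 runs without intervention.
    sudo timeout "$DURATION" thermald --no-daemon --dbus-enable \
        --loglevel=debug | tee "$LOG"
    thermald_service start                                 # step 3
}
```

Source the script in a terminal, call `capture_thermald_debug`, and use the machine normally until it finishes; the resulting log file is the one to attach to the bug report.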

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1491797

Title:
  Shuts down when supposed to suspend as a reaction to self-caused
  overheat, session lost

Status in linux package in Ubuntu:
  In Progress

Bug description:
  Error:
  Kernel foolishly shuts down the computer when it overheats.
  /var/log/kern.log
  W500 kernel: [1448.648529] thermal thermal_zone1: critical temperature reached (100 C), shutting down

  Consequence:
  Shutting down destroys session in Ubuntu, Gnome, and all applications that can't remember their latest conscious state (most applications).

  Attempted repair, failed:
  Laptop has suspending ability, but I can't find the setting for the kernel to make the computer suspend instead of shutting down.

  Repair suggestions:
  1. Persistence of session, so that everything would reappear after the restart. (this would also make updating less disruptive)
  2. Do not heat the machine like crazy; speed up fans or slow down processes. (problematic Lenovo Thinkpad W500 fan on low speed right up to the fiery end)
  3. Put the computer to suspend when it's too hot.

  (The problem has remained the same from at least Ubuntu 11.10 through
  14.04)

  ProblemType: Bug
  DistroRelease: Ubuntu 14.04
  Package: linux-image-3.13.0-62-generic 3.13.0-62.102
  ProcVersionSignature: Ubuntu 3.13.0-62.102-generic 3.13.11-ckt24
  Uname: Linux 3.13.0-62-generic x86_64
  ApportVersion: 2.14.1-0ubuntu3.12
  Architecture: amd64
  AudioDevicesInUse:
   USER        PID ACCESS COMMAND
   /dev/snd/controlC0:  user       2171 F.... pulseaudio
  CurrentDesktop: Unity
  Date: Thu Sep  3 13:42:28 2015
  HibernationDevice: RESUME=UUID=991e1383-ff5b-46c1-84c4-c904e1d81256
  InstallationDate: Installed on 2013-12-29 (612 days ago)
  InstallationMedia: Ubuntu 13.10 "Saucy Salamander" - Release amd64 (20131016.1)
  MachineType: LENOVO 4063B22
  PccardctlIdent:
   Socket 0:
     no product info available
  PccardctlStatus:
   Socket 0:
     no card
  ProcFB: 0 radeondrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-62-generic root=UUID=bd426989-b545-41b3-97b8-de9410f27aa6 ro persistent quiet splash vt.handoff=7
  RelatedPackageVersions:
   linux-restricted-modules-3.13.0-62-generic N/A
   linux-backports-modules-3.13.0-62-generic  N/A
   linux-firmware                             1.127.15
  SourcePackage: linux
  UpgradeStatus: Upgraded to trusty on 2014-04-27 (494 days ago)
  dmi.bios.date: 12/14/2011
  dmi.bios.vendor: LENOVO
  dmi.bios.version: 6FET92WW (3.22 )
  dmi.board.name: 4063B22
  dmi.board.vendor: LENOVO
  dmi.board.version: Not Available
  dmi.chassis.asset.tag: No Asset Information
  dmi.chassis.type: 10
  dmi.chassis.vendor: LENOVO
  dmi.chassis.version: Not Available
  dmi.modalias: dmi:bvnLENOVO:bvr6FET92WW(3.22):bd12/14/2011:svnLENOVO:pn4063B22:pvrThinkPadW500:rvnLENOVO:rn4063B22:rvrNotAvailable:cvnLENOVO:ct10:cvrNotAvailable:
  dmi.product.name: 4063B22
  dmi.product.version: ThinkPad W500
  dmi.sys.vendor: LENOVO
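
To see how often this has happened, the kernel logs can be searched for the message quoted in the bug description above. This is a rough sketch that assumes the usual Ubuntu log rotation names (kern.log, kern.log.1, kern.log.2.gz, ...); thermal_shutdowns is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list past thermal shutdowns recorded in the kernel log, including
# rotated copies. Files are read in glob order, so output is not strictly
# chronological.
thermal_shutdowns() {
    local dir="${1:-/var/log}" f
    for f in "$dir"/kern.log*; do
        [ -r "$f" ] || continue            # glob matched nothing
        case $f in
            *.gz) zcat "$f" ;;             # older, compressed rotations
            *)    cat "$f" ;;
        esac
    done | grep 'critical temperature reached'
}
```

Running `thermal_shutdowns` with no argument searches /var/log.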

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1491797/+subscriptions

