
kernel-packages team mailing list archive

Re: [Bug 1469214] Re: HP ProLiant m400 Server crashes with unhandled level 3 translation fault

 

Hi Colin,

On Sat, Jul 4, 2015 at 12:43 AM, Colin Ian King
<1469214@xxxxxxxxxxxxxxxxxx> wrote:
> I was able to hit the following translation fault running
> sudo ./stress-ng --seq 0 -t 60 --syslog --metrics --times -v

I suggest not running stress-ng as root. If the problem only shows up
when stress-ng runs as root, it is less serious, because:

  - the root user can easily do damage, and it is quite easy for root
to kill any process
  - in practice, most workloads run as non-root

If some system processes (irqbalance, systemd-*) are killed only
because stress-ng is running as root, it can be a low-priority issue.
Otherwise, we need to pay close attention to it.

I always run stress-ng as the ubuntu user without sudo, which may be
why it is so difficult for me to reproduce the problem.

Even with the two new approaches, it is still not easy for me to
reproduce the fault. With your first approach (./stress-ng --seq 0 ...)
I saw the translation fault only once in 6 hours, and I could not
trigger it at all with your second approach (the bash script).

The log I triggered is below [1], and I think it is very likely a
userspace issue. Using the irqbalance-dbgsym package, we can easily
see that 'PC is at 0x406078' is an address in the text section, and it
should be inside the function 'place_irq_in_node', because the
executable isn't built as position-independent (relocatable) code. One
thing I still can't understand is why the fault address is
'0x00000040' in this context.
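Mapping a PC to a function with debug symbols, as described above, amounts to finding the nearest text symbol at or below the address. A minimal POSIX shell sketch follows; it reads `nm -n`-style lines (address, type, name). The symbol table in it is hypothetical, not irqbalance's real one, and `lookup_pc` is an illustrative name. Because the executable is not position-independent, the runtime PC can be compared with link-time addresses directly; a PIE binary would first need its load base subtracted. In practice, `addr2line -f -e <binary> 0x406078` performs a similar lookup.

```shell
#!/bin/sh
# Sketch: map a user-space PC to "function+offset" using nm -n style lines
# ("<hex address> <type> <name>"). The symbol table below is HYPOTHETICAL
# and only illustrates the lookup; it is not irqbalance's real symbol table.
# ASSUMPTION: non-PIE executable, so no load-base adjustment is applied.

lookup_pc() {
    pc=$(( $1 ))                 # accepts a 0x-prefixed hex address
    best_addr=-1
    best_name=
    while read -r addr kind name; do
        case $kind in
            T|t) ;;              # keep only text (code) symbols
            *) continue ;;
        esac
        a=$(( 0x$addr ))
        # remember the highest symbol address that is still <= pc
        if [ "$a" -le "$pc" ] && [ "$a" -gt "$best_addr" ]; then
            best_addr=$a
            best_name=$name
        fi
    done
    if [ -z "$best_name" ]; then
        echo unknown
    else
        printf '%s+0x%x\n' "$best_name" $(( pc - best_addr ))
    fi
}

# Hypothetical symbol table in nm -n format
symbols='0000000000404000 T hypothetical_func_a
0000000000406000 T hypothetical_func_b
0000000000406200 T hypothetical_func_c'

printf '%s\n' "$symbols" | lookup_pc 0x406078   # prints: hypothetical_func_b+0x78
```

Picking the nearest preceding symbol ignores symbol sizes, so an address past the end of the last function would still be attributed to it; `nm -S` reports sizes if that distinction matters.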


[1]
[ 3616.333392] Bits 55-60 of /proc/PID/pagemap entries are about to stop being page-shift some time soon. See the linux/Documentation/vm/pagemap.txt for details.
[ 3616.333393] Bits 55-60 of /proc/PID/pagemap entries are about to stop being page-shift some time soon. See the linux/Documentation/vm/pagemap.txt for details.
[ 5316.367265] irqbalance[1457]: unhandled level 2 translation fault (11) at 0x00000040, esr 0x92000006
[ 5316.476937] pgd = ffffffcfb5478000
[ 5316.520692] [00000040] *pgd=0000004fb4a3c003, *pud=0000004fb4a3c003, *pmd=0000000000000000
[ 5316.620270]
[ 5316.638140] CPU: 7 PID: 1457 Comm: irqbalance Not tain-21-generic #21-Ubuntu
[ 5316.733212] Hardware name: HP ProLiant m400 Server Cartridge (DT)
[ 5316.806382] task: ffffffcfb55e6e40 ti: ffffffcfa72b0000 task.ti: ffffffcfa72b0000
[ 5316.896258] PC is at 0x406078
[ 5316.931865] LR is at 0x404100
[ 5316.967457] pc : [<0000000000406078>] lr : [<0000000000404100>] pstate: 20000000
[ 5317.056268] sp : 0000007fc07ff2d0
[ 5317.096038] x29: 0000007fc07ff2d0 x28: 00000000004095a0
[ 5317.160023] x27: 0000000000409548 x26: 000000000041a000
[ 5317.223897] x25: 0000000000405000 x24: 000000000041acf8
[ 5317.287868] x23: 000000000041a000 x22: 000000000041a000
[ 5317.351841] x21: 000000002e0d6050 x20: 000000000041a000
[ 5317.415744] x19: 000000002e0e9020 x18: 0000000000000000
[ 5317.479620] x17: 0000007fb5ac287c x16: 000000000041a188
[ 5317.543490] x15: 003bdd2370f74a1c x14: 2030203020302030
[ 5317.607373] x13: 2030203020302030 x12: 2030203020302030
[ 5317.671263] x11: 2030203020302030 x10: 2030203020302030
[ 5317.735137] x9 : 00000000000000a0 x8 : 0000000000000001
[ 5317.799113] x7 : 0000000000000033 x6 : 000000002e0d6e08
[ 5317.862983] x5 : 0000000000000040 x4 : 0000000000000000
[ 5317.926867] x3 : 000000002e0d7008 x2 : 0000000000000000
[ 5317.990840] x1 : 000000000000002c x0 : 0000000000000003
[ 5318.054713]
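The `esr` value printed with each fault can be decoded by hand. The sketch below is a minimal POSIX shell function, based on the ARMv8-A ESR_ELx layout, that extracts only the fields relevant here: EC (bits 31:26), IL (bit 25), WnR (bit 6) and DFSC (bits 5:0). For both 0x92000006 (this log) and 0x92000007 (the bug description), EC is 0x24, a data abort taken from a lower exception level (user space), and WnR=0 marks a read; DFSC values 0x06 and 0x07 are translation faults at level 2 and level 3, matching the kernel messages. The name `decode_esr` is hypothetical.

```shell
#!/bin/sh
# Sketch: decode the ESR_ELx fields that matter for these data aborts.
# Only EC, IL, WnR and DFSC are decoded; all other ISS bits are ignored.

decode_esr() {
    esr=$(( $1 ))
    ec=$(( (esr >> 26) & 0x3f ))    # Exception Class, bits [31:26]
    il=$(( (esr >> 25) & 0x1 ))     # Instruction Length, bit [25]
    wnr=$(( (esr >> 6) & 0x1 ))     # Write-not-Read, ISS bit [6]
    dfsc=$(( esr & 0x3f ))          # Data Fault Status Code, ISS bits [5:0]

    # DFSC 0b0001LL is a translation fault at lookup level LL
    case $dfsc in
        4|5|6|7) desc="translation fault, level $(( dfsc & 0x3 ))" ;;
        *)       desc="other fault status" ;;
    esac

    printf 'EC=0x%02x IL=%d WnR=%d DFSC=0x%02x (%s)\n' \
        "$ec" "$il" "$wnr" "$dfsc" "$desc"
}

decode_esr 0x92000006   # prints: EC=0x24 IL=1 WnR=0 DFSC=0x06 (translation fault, level 2)
decode_esr 0x92000007   # prints: EC=0x24 IL=1 WnR=0 DFSC=0x07 (translation fault, level 3)
```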

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1469214

Title:
  HP ProLiant m400 Server crashes with unhandled level 3 translation
  fault

Status in linux package in Ubuntu:
  Triaged

Bug description:
  Running stress-ng on an HP ProLiant m400 server can cause unhandled
  level 3 translation faults:

  Use stress-ng from git://kernel.ubuntu.com/cking/stress-ng

  ./stress-ng --seq 0 -t 60 -v

  and after some time this trips the following:

  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922560] systemd-timesyn[481]: unhandled level 3 translation fault (7) at 0x7fa8ea6008, esr 0x92000007
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922561] pgd = ffffffcfb563f000
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922563] [7fa8ea6008] *pgd=0000004fb4f28003, *pud=0000004fb4f28003, *pmd=0000004fb4f38003, *pte=000000001d151c00
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922566]
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922569] CPU: 6 PID: 481 Comm: systemd-timesyn Not tainted 3.19.0-21-generic #21-Ubuntu
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922571] Hardware name: HP ProLiant m400 Server Cartridge (DT)
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922573] task: ffffffcfb4e3b100 ti: ffffffcfb4d2c000 task.ti: ffffffcfb4d2c000
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922588] PC is at 0x7fa8d81824
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922589] LR is at 0x7fa8e3b3e4
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922590] pc : [<0000007fa8d81824>] lr : [<0000007fa8e3b3e4>] pstate: 80000000
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922591] sp : 0000007ff120d660
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922592] x29: 0000007ff120d660 x28: 0000007fa8f1c000
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922594] x27: 0000007fa8f32084 x26: 0000007fa8f32000
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922595] x25: 0000007fa8f1d788 x24: 0000007fa8f1d888
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922597] x23: 0000000000000001 x22: 0000007fa8f1faa0
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922599] x21: 0000007ff120d7f0 x20: 0000007ff120d7d0
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922600] x19: 0000007fa8f31000 x18: 0000007fa8f1e000
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922602] x17: 0000007fa8e3b3b8 x16: 0000007fa8ea6000
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922603] x15: 003b9aca00000000 x14: 00219bbdd0000000
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922605] x13: ffffffffaa751223 x12: 0000000000000000
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922607] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922609] x9 : 37333c43484f5e46 x8 : 0000007ff120d818
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922610] x7 : 0000007ff120d8f0 x6 : 0000007ff120d828
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922612] x5 : ffffff80ffffffd0 x4 : 0000007ff120d8c0
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922613] x3 : 0000007ff120d7d0 x2 : 0000007fa8f1faa0
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922615] x1 : 0000000000000001 x0 : 0000000000000064
  Jun 26 14:01:54 ms10-34-proliant kernel: [150297.922616]
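The level in each fault message can be cross-checked against the page-table entries the kernel prints after it: the walk fails at the first descriptor whose valid bit (bit 0) is clear. The sketch below assumes 4 KB pages with three translation levels, so the pud is folded into the pgd (consistent with *pgd and *pud printing the same value in both logs) and pgd, pmd and pte are lookup levels 1, 2 and 3. The helper name `first_invalid_level` is hypothetical.

```shell
#!/bin/sh
# Sketch: report the lookup level of the first invalid descriptor.
# ASSUMPTION: 4 KB pages, 3-level tables (pud folded into pgd), so the
# arguments are the pgd, pmd and (if printed) pte values, levels 1..3.
# A descriptor is valid only if bit 0 is set.

first_invalid_level() {
    level=1
    for desc in "$@"; do
        if [ $(( $desc & 1 )) -eq 0 ]; then
            echo "level $level"
            return 0
        fi
        level=$(( level + 1 ))
    done
    echo "all valid"
}

# irqbalance log: *pmd is zero, so the walk stops at level 2
first_invalid_level 0x0000004fb4a3c003 0x0000000000000000                      # prints: level 2
# systemd-timesyncd log: *pte has bit 0 clear, so the walk stops at level 3
first_invalid_level 0x0000004fb4f28003 0x0000004fb4f38003 0x000000001d151c00   # prints: level 3
```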
