
kernel-packages team mailing list archive

[Bug 1398497] Re: HP Proliant Servers - DL360 and DL380 Gen8 - Precise Kernel Panic - General Protection Fault

 

Regarding the following stack trace (for every task that caused the CPU
exception)...

(gdb) list *(__kmalloc+0x7b)
0xffffffff8116611b is in __kmalloc (/build/buildd/linux-3.2.0/mm/slub.c:2325).
2320 * 3. If they were not changed replace tid and freelist
2321 *
2322 * Since this is without lock semantics the protection is only against
2323 * code executing on this cpu *not* from access by other cpus.
2324 */
2325 if (unlikely(!irqsafe_cpu_cmpxchg_double(
2326 s->cpu_slab->freelist, s->cpu_slab->tid,
2327 object, tid,
2328 get_freepointer_safe(s, object), next_tid(tid)))) {

RIP = 0xffffffff8116611b == mov 0x0(%r13,%rax,1),%rbx

(gdb) list *(kmem_cache_alloc_trace+0x5e)
0xffffffff8116657e is in kmem_cache_alloc_trace (/build/buildd/linux-3.2.0/mm/slub.c:2325).
2320 * 3. If they were not changed replace tid and freelist
2321 *
2322 * Since this is without lock semantics the protection is only against
2323 * code executing on this cpu *not* from access by other cpus.
2324 */
2325 if (unlikely(!irqsafe_cpu_cmpxchg_double(
2326 s->cpu_slab->freelist, s->cpu_slab->tid,
2327 object, tid,
2328 get_freepointer_safe(s, object), next_tid(tid)))) {
2329

RIP = 0xffffffff8116657e == mov 0x0(%r13,%rax,1),%rbx
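
The faulting instruction reads from the address r13 + rax*1 + 0 (AT&T operand form disp(base,index,scale)). Once register values are available from a core dump, one check that could be made is whether that effective address is canonical, since on x86-64 a memory access through a non-canonical address raises a general protection fault rather than a page fault. The sketch below is only an illustration: the helper names are hypothetical, and it assumes 48-bit virtual addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Effective address of an AT&T memory operand disp(base,index,scale). */
static uint64_t effective_address(uint64_t base, uint64_t index,
                                  unsigned scale, int64_t disp)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale + (uint64_t)disp;
}

/*
 * Assumption: 48-bit virtual addressing, so bits 63..47 of a
 * canonical address are either all zero or all one.
 */
static bool is_canonical(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* the upper 17 bits */
    return top == 0 || top == 0x1FFFF;
}
```

With the r13 and rax values from a dump, is_canonical(effective_address(r13, rax, 1, 0)) would tell whether the operand address alone could explain the exception.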

The following assembly comes from objdump of a kernel compiled from the same version (it would be awesome to get a core dump to confirm this):

if (unlikely(!irqsafe_cpu_cmpxchg_double(
========

irqsafe_cpu_cmpxchg_double:

0xffffffff81166576 <+86>: mov (%r12),%rsi
0xffffffff8116657e <+94>: mov 0x0(%r13,%rax,1),%rbx ### R13 = UPPER HALF OF BASE POINTER
0xffffffff81166583 <+99>: mov %r13,%rax
0xffffffff81166586 <+102>: callq 0xffffffff8131cb20 ### CALL

(gdb) x/30i 0xffffffff8131cb20 ### CALLED
0xffffffff8131cb20: pushfq ### PUSH RFLAGS into stack
0xffffffff8131cb21: cli ### **** CLEAR INTERRUPT FLAG ****
0xffffffff8131cb22: cmp %gs:(%rsi),%rax

Since the execution path here is:

irqsafe_cpu_cmpxchg_double (#define) ->
irqsafe_generic_cpu_cmpxchg_double (#define) ->
local_irq_save (#define) -> ...

I'm inclined to say that CPU 35 raised the general protection fault while
trying to execute:

local_irq_save()

which saves the current state of local interrupt delivery (the RFLAGS
register, including the interrupt flag, for that particular CPU) and then
disables interrupt delivery on that CPU using the "CLI" instruction
(CLEAR INTERRUPT FLAG).
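
The save/disable/restore pattern described above can be modelled in user space. This is a toy sketch, not kernel code: the per-CPU interrupt flag is a plain array, and every name in it (MODEL_NR_CPUS, model_if, model_irq_save, model_irq_restore) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

/* Toy per-CPU interrupt flag: true means interrupts are enabled. */
static bool model_if[MODEL_NR_CPUS] = { true, true, true, true };

/* Save the current flag of one CPU, then clear it (pushfq; cli). */
static bool model_irq_save(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    bool flags = model_if[cpu];
    model_if[cpu] = false;
    return flags;
}

/* Put back whatever state was saved, enabled or not (popfq). */
static void model_irq_restore(int cpu, bool flags)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    model_if[cpu] = flags;
}
```

Restoring the saved value rather than unconditionally enabling is what lets the pattern nest, and only the flag of the CPU passed in is ever touched.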

"""
CLI is commonly used as a synchronization mechanism in uniprocessor systems. For example, a CLI is used in operating systems to disable interrupts so kernel code (typically a driver) can avoid race conditions with an interrupt handler. Note that CLI only affects the interrupt flag for the processor on which it is executed; in multiprocessor systems, executing a CLI instruction does not disable interrupts on other processors.
"""

And that is the case:

/*
* The cmpxchg will only match if there was no additional
* operation and if we are on the right processor.
*
* The cmpxchg does the following atomically (without lock semantics!)
* 1. Relocate first pointer to the current per cpu area.
* 2. Verify that tid and freelist have not been changed
* 3. If they were not changed replace tid and freelist
*
* Since this is without lock semantics the protection is only against
* code executing on this cpu *not* from access by other cpus.
*/
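
Steps 2 and 3 of that comment can be sketched as a plain C function. This is a simplified user-space model with hypothetical names (model_cpu_slab, model_cmpxchg_double); it is not atomic by itself and it leaves out step 1, the relocation to the current per-cpu area. In the kernel, the guarantee against code on the same CPU comes from the instruction used or from disabling local interrupts around the sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two per-cpu words that are compared. */
struct model_cpu_slab {
    void *freelist;
    unsigned long tid;
};

/*
 * Replace both words only if both still hold the expected values.
 * Returns true when the exchange happened, false otherwise.
 */
static bool model_cmpxchg_double(struct model_cpu_slab *c,
                                 void *old_freelist, unsigned long old_tid,
                                 void *new_freelist, unsigned long new_tid)
{
    assert(c != NULL);
    if (c->freelist != old_freelist || c->tid != old_tid)
        return false;               /* step 2 failed: something changed */
    c->freelist = new_freelist;     /* step 3: replace freelist ...     */
    c->tid = new_tid;               /* ... and tid together             */
    return true;
}
```

A caller that gets false back is expected to reload both values and retry, which is what the unlikely() branch in the listing at slub.c:2325 handles.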

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1398497

Title:
  HP Proliant Servers - DL360 and DL380 Gen8 - Precise Kernel Panic -
  General Protection Fault

Status in linux package in Ubuntu:
  Incomplete
Status in linux source package in Precise:
  Incomplete

Bug description:
  The following situation was brought to my attention:

  """
  We massively upgraded our Ubuntu 12.04 servers (most of them are HP
  DL360p Gen8 or DL380 Gen8) to the 3.2.0-67 kernel, and in the last 2-3
  days we have already had to reboot 5 of them because they hung completely.

  Some of them had the following messages under syslog :
  kernel: [384707.675479] general protection fault: 0000 [#5666] SMP

  others had :
  kernel: [950725.612724] BUG: unable to handle kernel paging request

  All of them have this also :
  your BIOS is broken and requested that x2apic be disabled
  """

  Comments below

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1398497/+subscriptions

