
[Bug 1534413] Re: Precise: lockup during fadvise syscall with POSIX_FADV_DONTNEED

 

Please disregard my previous analysis. After carefully disassembling
the java process dump I could see that the instruction pointer was
at:

    if (iter.task)
        put_task_struct(iter.task);

This is a few instructions before the loop that I was previously
pointing to as a potential deadlock candidate:

   0xffffffff811df242 <+34>:    je     0xffffffff811df24f <next_tgid+47>
   0xffffffff811df244 <+36>:    lock decl 0x10(%rdx)
   0xffffffff811df248 <+40>:    sete   %al
   0xffffffff811df24b <+43>:    test   %al,%al

The combination of those three instructions (lock decl, sete & test)
shows me that this is actually the implementation of
atomic_dec_and_test(). It serves to atomically decrement a reference
counter in the task_struct of the given task and test whether it
reached zero.
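
As a hedged sketch (from memory of a 3.2-era tree, so the exact lines
may differ slightly from the Precise source), the pieces fit together
like this:

    /* include/linux/sched.h (3.2 era, simplified sketch) */
    static inline void put_task_struct(struct task_struct *t)
    {
        /* drop one reference; free the task when the count hits zero */
        if (atomic_dec_and_test(&t->usage))
            __put_task_struct(t);
    }

    /* arch/x86/include/asm/atomic.h (3.2 era): the "lock decl; sete"
     * pair seen in the disassembly above */
    static inline int atomic_dec_and_test(atomic_t *v)
    {
        unsigned char c;

        asm volatile(LOCK_PREFIX "decl %0; sete %1"
                     : "+m" (v->counter), "=qm" (c)
                     : : "memory");
        return c != 0;
    }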

Deadlock is probably NOT there.

I then went to the khungtaskd stack to check which task it was
complaining about being blocked and could not find it there (the task
id was probably held in registers rather than on the stack), BUT I
could find an output message telling me which task was blocked:

[15956.371298] INFO: task jsvc:57263 blocked for more than 120 seconds.

This message comes straight from the khungtaskd code. It means that
PID 57263 was NOT scheduled for more than 120 seconds.
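
For context, this is a hedged, heavily simplified sketch of that check
as it looks in kernel/hung_task.c of a 3.2-era tree:

    static void check_hung_task(struct task_struct *t, unsigned long timeout)
    {
        unsigned long switch_count = t->nvcsw + t->nivcsw;

        /* the task has been scheduled since the last scan: not hung */
        if (switch_count != t->last_switch_count) {
            t->last_switch_count = switch_count;
            return;
        }

        /* still blocked and not scheduled for <timeout> seconds */
        printk(KERN_ERR "INFO: task %s:%d blocked for more than "
                        "%ld seconds.\n", t->comm, t->pid, timeout);

        /* ... dump the task's stack and, with hung_task_panic set,
         * panic like we see at the end of the log ... */
    }

Checking the task in the crash dump: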

    PID: 57263
COMMAND: "jsvc"
   TASK: ffff883e0c4e2e00  [THREAD_INFO: ffff883e849c2000]
    CPU: 1
  STATE: TASK_UNINTERRUPTIBLE|TASK_TRACED|EXIT_DEAD

I can see that it is in the UNINTERRUPTIBLE state (it cannot receive
signals, for example). Checking its stack:

crash> bt
PID: 57263  TASK: ffff883e0c4e2e00  CPU: 1   COMMAND: "jsvc"
 #0 [ffff883e849c3c88] __schedule at ffffffff8166256a
 #1 [ffff883e849c3d10] schedule at ffffffff81662c0f
 #2 [ffff883e849c3d20] schedule_timeout at ffffffff8166324d
 #3 [ffff883e849c3dd0] wait_for_common at ffffffff81662a4f
 #4 [ffff883e849c3e70] wait_for_completion at ffffffff81662bcd
 #5 [ffff883e849c3e80] flush_work at ffffffff8108757e
 #6 [ffff883e849c3ed0] schedule_on_each_cpu at ffffffff81087813
 #7 [ffff883e849c3f10] lru_add_drain_all at ffffffff81127365
 #8 [ffff883e849c3f20] sys_fadvise64_64 at ffffffff8111e189
 #9 [ffff883e849c3f70] sys_fadvise64 at ffffffff8111e27e
#10 [ffff883e849c3f80] system_call_fastpath at ffffffff8166d2c2

## sys_fadvise64_64

The task entered kernel mode through the fadvise syscall. fadvise()
basically tells the kernel how a file is going to be accessed, so that
it can adjust its caching behaviour accordingly (expect page references
in random order, expect page references in sequential order, expect the
file contents not to be accessed in the near future, etc.).

It turns out that the jsvc process basically told the kernel that it
would NOT need the opened file's data, for a given offset/length range,
in the near future (flag POSIX_FADV_DONTNEED). With that, the kernel
calculated the start/end page indexes for the memory range of the given
file (offset, offset+len) and tried to invalidate all unlocked
page-cache pages for that area. Dirty, locked, under-writeback or
mapped pages are NOT invalidated.
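
For illustration only, here is a minimal userspace sketch of the kind
of call jsvc issued; the file name, offset and length are made up:

    /* build with: gcc -o fadvise-demo fadvise-demo.c (illustrative only) */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/tmp/example.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* ... read/write some of the file ... */

        /*
         * Tell the kernel we will not need the first 1 MiB of this file
         * in the near future, so its page-cache pages can be dropped.
         * This lands in sys_fadvise64_64() in the kernel.
         */
        int ret = posix_fadvise(fd, 0, 1024 * 1024, POSIX_FADV_DONTNEED);
        if (ret != 0)
            fprintf(stderr, "posix_fadvise: %s\n", strerror(ret));

        close(fd);
        return 0;
    }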

After this first attempt at invalidating the page-cache pages of that
particular file range, the kernel found out that fewer pages than
expected (the expectation being the offset/len delta handed to fadvise)
had been invalidated; in other words, it could not invalidate all pages
in the range. It therefore fell back to the LRU DRAIN logic.
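
The POSIX_FADV_DONTNEED handling in mm/fadvise.c, with the
cherry-picked commit quoted further below applied, reads roughly like
this (a sketch from memory; the exact Precise code may differ in
detail):

    /* inside sys_fadvise64_64(): mapping, offset, endbyte and the two
     * index variables come from the surrounding function */
    case POSIX_FADV_DONTNEED:
        if (!bdi_write_congested(mapping->backing_dev_info))
            __filemap_fdatawrite_range(mapping, offset, endbyte,
                                       WB_SYNC_NONE);

        /* first and last FULL page of the given range */
        start_index = (offset + (PAGE_CACHE_SIZE - 1)) >> PAGE_CACHE_SHIFT;
        end_index = (endbyte >> PAGE_CACHE_SHIFT);

        if (end_index >= start_index) {
            unsigned long count = invalidate_mapping_pages(mapping,
                                        start_index, end_index);

            /*
             * If fewer pages than expected were invalidated, some of
             * them may still sit on a per-cpu pagevec of a remote CPU:
             * drain all pagevecs and try again.
             */
            if (count < (end_index - start_index + 1)) {
                lru_add_drain_all();
                invalidate_mapping_pages(mapping, start_index,
                                         end_index);
            }
        }
        break;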

## lru_add_drain_all

LRU stands for the Least Recently Used algorithm and is used both for
user mode address space and for the page cache (this case). It defines
two lists of memory pages: active and inactive. The LRU DRAIN logic
schedules a drain function on every CPU in the system in order to flush
the per-cpu page vectors (pagevecs) that currently exist.

Pagevecs are small, fixed-size, per-cpu data structures containing page
descriptor pointers. Instead of touching the LRU lists
(active/inactive) directly for every page, the kernel batches pages
into these per-cpu pagevecs until they are full, avoiding having to
take the LRU lock more often than necessary.
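
A hedged sketch of that drain path as it looks in a 3.2-era mm/swap.c
(simplified):

    /* declared here only for the sketch: moves the pages pending in
     * this CPU's pagevecs onto the LRU lists */
    static void drain_cpu_pagevecs(int cpu);

    void lru_add_drain(void)
    {
        drain_cpu_pagevecs(get_cpu());
        put_cpu();
    }

    /* the work item that schedule_on_each_cpu() runs on every CPU */
    static void lru_add_drain_per_cpu(struct work_struct *dummy)
    {
        lru_add_drain();
    }

    /* frame #7 of the stack above */
    int lru_add_drain_all(void)
    {
        return schedule_on_each_cpu(lru_add_drain_per_cpu);
    }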

## schedule_on_each_cpu

Task ffff883e0c4e2e00 continued the LRU DRAIN logic by scheduling the
drain callback on every CPU using schedule_on_each_cpu():

schedule_on_each_cpu() queued the drain callback (which ends up in
drain_cpu_pagevecs) on all CPUs and kept waiting - for more than 120
seconds - for every CPU to execute the drain.

While it was waiting, task ffff883e0c4e2e00 was put in the
UNINTERRUPTIBLE state and remained stuck inside the flush_work() loop
that waits for the drain to finish on all CPUs (it never did).
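
For reference, schedule_on_each_cpu() in a 3.2-era kernel/workqueue.c
looks roughly like this (simplified sketch); the flush_work() calls are
where the task sleeps:

    int schedule_on_each_cpu(work_func_t func)
    {
        int cpu;
        struct work_struct __percpu *works;

        works = alloc_percpu(struct work_struct);
        if (!works)
            return -ENOMEM;

        get_online_cpus();

        /* queue one work item per online CPU */
        for_each_online_cpu(cpu) {
            struct work_struct *work = per_cpu_ptr(works, cpu);

            INIT_WORK(work, func);
            schedule_work_on(cpu, work);
        }

        /* frames #5/#4/#3 of the stack: wait, uninterruptibly, for
         * every CPU to have executed its work item */
        for_each_online_cpu(cpu)
            flush_work(per_cpu_ptr(works, cpu));

        put_online_cpus();
        free_percpu(works);
        return 0;
    }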

## Precise commit (cherry-picked from upstream) adding that logic

commit a68692ea42825466b0ade8a94a98afc924d2564e
Author: Mel Gorman <mgorman@xxxxxxx>
Date:   Fri Feb 22 16:35:59 2013 -0800

    mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to
    discard all pages

    BugLink: http://bugs.launchpad.net/bugs/1150557

    commit 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732 upstream.

It was added to Precise (3.2.0-40.64).

I'm removing that logic and providing a kernel without it to see if it
mitigates the issue.

Per the original commit description, this was supposed to fix erratic
behaviour seen on the s390x architecture for the sys_fadvise64_64
system call. I'm afraid it introduced a race condition that might have
since disappeared upstream because of code refactoring in later
commits. If not, I'll check whether the logic is still plausible for
upstream and discuss it there appropriately.

I'll provide a hotfixed kernel and would like to get feedback on it.

Thank you

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1534413

Title:
  Precise: lockup during fadvise syscall with POSIX_FADV_DONTNEED

Status in linux package in Ubuntu:
  In Progress

Bug description:
  It was brought to my knowledge a kernel dump (3.2.0-79) with the
  following stack trace:

  """
  [14277.952072] usb 3-1: USB disconnect, device number 3
  [15388.602790] NOHZ: local_softirq_pending 02
  [15404.593795] NOHZ: local_softirq_pending 02
  [15436.575787] NOHZ: local_softirq_pending 02
  [15452.566802] NOHZ: local_softirq_pending 02
  [15456.564528] NOHZ: local_softirq_pending 02
  [15564.503842] NOHZ: local_softirq_pending 02
  [15584.492538] NOHZ: local_softirq_pending 02
  [15588.490302] NOHZ: local_softirq_pending 02
  [15632.465563] NOHZ: local_softirq_pending 02
  [15659.014629] NOHZ: local_softirq_pending 02
  [15956.371298] INFO: task jsvc:57263 blocked for more than 120 seconds.
  [15956.375347] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [15956.383487] jsvc            D ffffffff81806200     0 57263   9495 0x00000000
  [15956.383493]  ffff883e849c3d08 0000000000000082 ffff883e849c3ca8 000000008104f624
  [15956.383502]  ffff883e849c3fd8 ffff883e849c3fd8 ffff883e849c3fd8 0000000000012800
  [15956.383509]  ffff881f72db9700 ffff883e0c4e2e00 ffff883e849c3cf8 7fffffffffffffff
  [15956.383518] Call Trace:
  [15956.383534]  [<ffffffff81662c0f>] schedule+0x3f/0x60
  [15956.383539]  [<ffffffff8166324d>] schedule_timeout+0x29d/0x310
  [15956.383547]  [<ffffffff810388be>] ? physflat_send_IPI_mask+0xe/0x10
  [15956.383554]  [<ffffffff81032068>] ? native_smp_send_reschedule+0x48/0x60
  [15956.383560]  [<ffffffff8103ec29>] ? default_spin_lock_flags+0x9/0x10
  [15956.383564]  [<ffffffff81662a4f>] wait_for_common+0xdf/0x180
  [15956.383572]  [<ffffffff81060ac0>] ? try_to_wake_up+0x200/0x200
  [15956.383576]  [<ffffffff81662bcd>] wait_for_completion+0x1d/0x20
  [15956.383585]  [<ffffffff8108757e>] flush_work+0x2e/0x40
  [15956.383589]  [<ffffffff810838b0>] ? wake_up_worker+0x30/0x30
  [15956.383593]  [<ffffffff81087813>] schedule_on_each_cpu+0xc3/0x110
  [15956.383602]  [<ffffffff81127365>] lru_add_drain_all+0x15/0x20
  [15956.383607]  [<ffffffff8111e189>] sys_fadvise64_64+0x189/0x270
  [15956.383610]  [<ffffffff8111e27e>] sys_fadvise64+0xe/0x10
  [15956.383619]  [<ffffffff8166d2c2>] system_call_fastpath+0x16/0x1b
  [15956.383622] Kernel panic - not syncing: hung_task: blocked tasks
  [15956.388083] Pid: 178, comm: khungtaskd Tainted: G        W    3.2.0-79-generic #115-Ubuntu
  [15956.397273] Call Trace:
  [15956.401783]  [<ffffffff8164c005>] panic+0x91/0x1a4
  [15956.406527]  [<ffffffff810d97c2>] check_hung_task+0xb2/0xc0
  [15956.411393]  [<ffffffff810d98eb>] check_hung_uninterruptible_tasks+0x11b/0x140
  [15956.421117]  [<ffffffff810d9910>] ? check_hung_uninterruptible_tasks+0x140/0x140
  [15956.431847]  [<ffffffff810d995f>] watchdog+0x4f/0x60
  [15956.437524]  [<ffffffff8108b99c>] kthread+0x8c/0xa0
  [15956.443145]  [<ffffffff8166f434>] kernel_thread_helper+0x4/0x10
  [15956.448830]  [<ffffffff8108b910>] ? flush_kthread_worker+0xa0/0xa0
  [15956.454700]  [<ffffffff8166f430>] ? gs_change+0x13/0x13
  """

  Analysis is being carried out in the comments...

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1534413/+subscriptions

