
[Bug 1534413] Re: Precise: lockup during fadvise syscall with POSIX_FADV_DONTNEED

 

Please disregard my previous analysis. After carefully disassembling the java
process dump I could see that the instruction pointer was at:

    if (iter.task)
        put_task_struct(iter.task);

This is a few instructions before the loop I was previously pointing to as a potential
deadlock candidate.

   0xffffffff811df242 <+34>:    je     0xffffffff811df24f <next_tgid+47>
   0xffffffff811df244 <+36>:    lock decl 0x10(%rdx)
   0xffffffff811df248 <+40>:    sete   %al
   0xffffffff811df24b <+43>:    test   %al,%al

The combination of those 3 instructions (decl, sete & test) shows me that this is
actually the implementation of the atomic_dec_and_test function. It serves to atomically
(coordinated across CPUs) decrement a reference counter in the task_struct of the given task.
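
For reference, here is a minimal stand-alone sketch (x86-64, GCC/Clang inline asm) of the
same "lock decl + sete + test" pattern seen in the disassembly above. This is NOT the
kernel source, just an illustration of the technique; the function and variable names are
mine:

    #include <stdio.h>

    static inline int atomic_dec_and_test(int *counter)
    {
            unsigned char zero;

            /* lock decl: atomically decrement the counter;
             * sete: set 'zero' to 1 iff the result was 0. */
            __asm__ __volatile__("lock; decl %0; sete %1"
                                 : "+m" (*counter), "=q" (zero)
                                 : : "memory", "cc");
            return zero;
    }

    int main(void)
    {
            int refcount = 2;

            printf("%d\n", atomic_dec_and_test(&refcount)); /* prints 0, refcount now 1 */
            printf("%d\n", atomic_dec_and_test(&refcount)); /* prints 1, refcount hit 0  */
            return 0;
    }

In the kernel, put_task_struct() does this kind of atomic_dec_and_test() on the task's
usage counter, so only the last reference holder actually frees the task_struct.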

Deadlock is probably NOT there.

I went for the khungtaskd stack to check which task it was complaining about being blocked
and I could not find it (probably the task id was in registers and not on the stack) BUT
I could find an output message telling me which task was blocked:

[15956.371298] INFO: task jsvc:57263 blocked for more than 120 seconds.

And this comes exactly from the khungtaskd code. It means that PID 57263 was NOT
scheduled for more than 120 seconds. Checking the task:

    PID: 57263
COMMAND: "jsvc"
   TASK: ffff883e0c4e2e00  [THREAD_INFO: ffff883e849c2000]
    CPU: 1
  STATE: TASK_UNINTERRUPTIBLE|TASK_TRACED|EXIT_DEAD

I can see that it is in UNINTERRUPTIBLE state (it can't receive signals, for example).
Checking its stack:

crash> bt
PID: 57263  TASK: ffff883e0c4e2e00  CPU: 1   COMMAND: "jsvc"
 #0 [ffff883e849c3c88] __schedule at ffffffff8166256a
 #1 [ffff883e849c3d10] schedule at ffffffff81662c0f
 #2 [ffff883e849c3d20] schedule_timeout at ffffffff8166324d
 #3 [ffff883e849c3dd0] wait_for_common at ffffffff81662a4f
 #4 [ffff883e849c3e70] wait_for_completion at ffffffff81662bcd
 #5 [ffff883e849c3e80] flush_work at ffffffff8108757e
 #6 [ffff883e849c3ed0] schedule_on_each_cpu at ffffffff81087813
 #7 [ffff883e849c3f10] lru_add_drain_all at ffffffff81127365
 #8 [ffff883e849c3f20] sys_fadvise64_64 at ffffffff8111e189
 #9 [ffff883e849c3f70] sys_fadvise64 at ffffffff8111e27e
#10 [ffff883e849c3f80] system_call_fastpath at ffffffff8166d2c2

## sys_fadvise64_64

The task entered kernel mode through the fadvise syscall. This syscall basically tells the
kernel how a file is going to be accessed so the kernel can adapt its caching behaviour
accordingly (expect page references in random order, expect page references in sequential
order, expect a delay in accessing file contents, etc).
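
For illustration, this is how a user-space process (like jsvc here) typically ends up in
sys_fadvise64_64 - a minimal sketch, where the file path and the 1 MiB range are
placeholders, not values taken from the dump:

    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/tmp/somefile", O_RDONLY);   /* placeholder path */
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            /* Tell the kernel we will not need the first 1 MiB of this file
             * any time soon; glibc routes this to the fadvise64 syscall. */
            int err = posix_fadvise(fd, 0, 1 << 20, POSIX_FADV_DONTNEED);
            if (err != 0)
                    fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

            close(fd);
            return 0;
    }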

It turns out that the JSVC process basically told the kernel that it would NOT need the
given range of the opened file (flag POSIX_FADV_DONTNEED) for the time being. With that,
the kernel calculated the start/end page indexes for that range of the file (offset,
offset+len) and tried to invalidate all unlocked pages (page cache) in that area.
Dirty, locked, under-writeback or mapped pages are NOT invalidated.

After the first attempt to invalidate page-cache pages of that particular file, the
kernel found out that fewer pages than expected (compared to the range given to fadvise
as the offset/len delta) were invalidated - in other words, it could not invalidate all
pages in the range - so it resorted to calling the LRU DRAIN logic.
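
The logic looks roughly like this - a simplified sketch of the POSIX_FADV_DONTNEED branch
in mm/fadvise.c as I read it, not a verbatim quote of the Precise source (variable names
may differ slightly):

    case POSIX_FADV_DONTNEED:
            /* First and last FULL pages of the requested range. */
            start_index = (offset + (PAGE_CACHE_SIZE - 1)) >> PAGE_CACHE_SHIFT;
            end_index   = endbyte >> PAGE_CACHE_SHIFT;

            if (end_index >= start_index) {
                    unsigned long count;

                    /* Drop clean, unlocked, unmapped page-cache pages. */
                    count = invalidate_mapping_pages(mapping, start_index,
                                                     end_index);

                    /*
                     * This is the part added by the commit referenced below:
                     * if fewer pages were invalidated than expected, some of
                     * them may still sit on per-cpu pagevecs, so drain every
                     * CPU and retry the invalidation.
                     */
                    if (count < (end_index - start_index + 1)) {
                            lru_add_drain_all();
                            invalidate_mapping_pages(mapping, start_index,
                                                     end_index);
                    }
            }
            break;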

## lru_add_drain_all

LRU stands for Least-Recently-Used and the algorithm is used both for the user mode
address space and for the page cache (this case). It maintains 2 lists of memory pages:
active and inactive. The LRU DRAIN logic schedules a drain function on every CPU in the
system in order to drain the per-cpu pending page vectors (pagevecs).

Pagevecs are small, fixed-size, temporary data structures containing page descriptor
pointers. Instead of changing the LRU lists (active/inactive) directly, the LRU code
batches pages in per-cpu pagevecs until they are full, avoiding having to lock/unlock
the LRU lists as often.
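
Conceptually it looks like this - a simplified sketch, with the real definitions living in
include/linux/pagevec.h and mm/swap.c (field names and the exact PAGEVEC_SIZE value may
differ slightly on this kernel):

    /* Sketch of a pagevec; PAGEVEC_SIZE is a small constant (14 in this
     * era), so the structure stays cheap enough to keep one instance per
     * CPU per LRU list. */
    #define PAGEVEC_SIZE 14

    struct pagevec {
            unsigned long nr;                   /* how many pages are batched */
            struct page *pages[PAGEVEC_SIZE];   /* page descriptor pointers   */
    };

    /*
     * Pages destined for an LRU list are first parked in a per-cpu
     * pagevec; only when it fills up (or when lru_add_drain() /
     * lru_add_drain_all() is called) are they moved onto the real LRU
     * lists under the appropriate lock.
     */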

## schedule_on_each_cpu

Task ffff883e0c4e2e00 continued the LRU DRAIN logic by scheduling its callback function
via schedule_on_each_cpu:

schedule_on_each_cpu(drain_cpu_pagevecs) scheduled the drain function on all cpus and
kept waiting - for more than 120 seconds - for all of them to execute the drain.

While it was waiting, task ffff883e0c4e2e00 was made UNINTERRUPTIBLE and was stuck in
the loop where it waits for drain_cpu_pagevecs to finish on every cpu (it never did).
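
That matches frames #4-#6 of the backtrace: schedule_on_each_cpu() queues one work item
per CPU and then flush_work()s each of them, which is where the wait_for_completion()
comes from. A simplified sketch of kernel/workqueue.c's schedule_on_each_cpu()
(paraphrased, not a verbatim quote of the Precise source):

    int schedule_on_each_cpu(work_func_t func)
    {
            int cpu;
            struct work_struct __percpu *works;

            works = alloc_percpu(struct work_struct);
            if (!works)
                    return -ENOMEM;

            get_online_cpus();

            /* Queue the callback on every online CPU's workqueue. */
            for_each_online_cpu(cpu) {
                    struct work_struct *work = per_cpu_ptr(works, cpu);

                    INIT_WORK(work, func);
                    schedule_work_on(cpu, work);
            }

            /*
             * Wait for every CPU to run its work item.  flush_work() ->
             * wait_for_completion() is where our task sat, uninterruptible,
             * for more than 120 seconds: at least one CPU never executed
             * its drain.
             */
            for_each_online_cpu(cpu)
                    flush_work(per_cpu_ptr(works, cpu));

            put_online_cpus();
            free_percpu(works);
            return 0;
    }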

## precise commit (cherry-picked from upstream) adding that logic:

commit a68692ea42825466b0ade8a94a98afc924d2564e
Author: Mel Gorman <mgorman@xxxxxxx>
Date:   Fri Feb 22 16:35:59 2013 -0800

    mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages

    BugLink: http://bugs.launchpad.net/bugs/1150557

    commit 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732 upstream.

It was added to Precise (3.2.0-40.64).

I'm removing that logic and providing a kernel without it to see if it
mitigates the issue.

Per the original commit description, this was supposed to fix erratic behaviour seen on
the s390x architecture for the sys_fadvise64_64 system call. I'm afraid it introduced a
race condition, one that may have since disappeared upstream because of code refactoring
in later commits. If not, I'll check whether the logic is still plausible for upstream
and discuss it appropriately.

I'll provide a hotfixed kernel which I would like to get feedback about.

Thank you

Rafael

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1534413

Title:
  Precise: lockup during fadvise syscall with POSIX_FADV_DONTNEED

Status in linux package in Ubuntu:
  In Progress

Bug description:
  A kernel dump (3.2.0-79) with the following stack trace was brought to my
  attention:

  """
  [14277.952072] usb 3-1: USB disconnect, device number 3
  [15388.602790] NOHZ: local_softirq_pending 02
  [15404.593795] NOHZ: local_softirq_pending 02
  [15436.575787] NOHZ: local_softirq_pending 02
  [15452.566802] NOHZ: local_softirq_pending 02
  [15456.564528] NOHZ: local_softirq_pending 02
  [15564.503842] NOHZ: local_softirq_pending 02
  [15584.492538] NOHZ: local_softirq_pending 02
  [15588.490302] NOHZ: local_softirq_pending 02
  [15632.465563] NOHZ: local_softirq_pending 02
  [15659.014629] NOHZ: local_softirq_pending 02
  [15956.371298] INFO: task jsvc:57263 blocked for more than 120 seconds.
  [15956.375347] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [15956.383487] jsvc            D ffffffff81806200     0 57263   9495 0x00000000
  [15956.383493]  ffff883e849c3d08 0000000000000082 ffff883e849c3ca8 000000008104f624
  [15956.383502]  ffff883e849c3fd8 ffff883e849c3fd8 ffff883e849c3fd8 0000000000012800
  [15956.383509]  ffff881f72db9700 ffff883e0c4e2e00 ffff883e849c3cf8 7fffffffffffffff
  [15956.383518] Call Trace:
  [15956.383534]  [<ffffffff81662c0f>] schedule+0x3f/0x60
  [15956.383539]  [<ffffffff8166324d>] schedule_timeout+0x29d/0x310
  [15956.383547]  [<ffffffff810388be>] ? physflat_send_IPI_mask+0xe/0x10
  [15956.383554]  [<ffffffff81032068>] ? native_smp_send_reschedule+0x48/0x60
  [15956.383560]  [<ffffffff8103ec29>] ? default_spin_lock_flags+0x9/0x10
  [15956.383564]  [<ffffffff81662a4f>] wait_for_common+0xdf/0x180
  [15956.383572]  [<ffffffff81060ac0>] ? try_to_wake_up+0x200/0x200
  [15956.383576]  [<ffffffff81662bcd>] wait_for_completion+0x1d/0x20
  [15956.383585]  [<ffffffff8108757e>] flush_work+0x2e/0x40
  [15956.383589]  [<ffffffff810838b0>] ? wake_up_worker+0x30/0x30
  [15956.383593]  [<ffffffff81087813>] schedule_on_each_cpu+0xc3/0x110
  [15956.383602]  [<ffffffff81127365>] lru_add_drain_all+0x15/0x20
  [15956.383607]  [<ffffffff8111e189>] sys_fadvise64_64+0x189/0x270
  [15956.383610]  [<ffffffff8111e27e>] sys_fadvise64+0xe/0x10
  [15956.383619]  [<ffffffff8166d2c2>] system_call_fastpath+0x16/0x1b
  [15956.383622] Kernel panic - not syncing: hung_task: blocked tasks
  [15956.388083] Pid: 178, comm: khungtaskd Tainted: G        W    3.2.0-79-generic #115-Ubuntu
  [15956.397273] Call Trace:
  [15956.401783]  [<ffffffff8164c005>] panic+0x91/0x1a4
  [15956.406527]  [<ffffffff810d97c2>] check_hung_task+0xb2/0xc0
  [15956.411393]  [<ffffffff810d98eb>] check_hung_uninterruptible_tasks+0x11b/0x140
  [15956.421117]  [<ffffffff810d9910>] ? check_hung_uninterruptible_tasks+0x140/0x140
  [15956.431847]  [<ffffffff810d995f>] watchdog+0x4f/0x60
  [15956.437524]  [<ffffffff8108b99c>] kthread+0x8c/0xa0
  [15956.443145]  [<ffffffff8166f434>] kernel_thread_helper+0x4/0x10
  [15956.448830]  [<ffffffff8108b910>] ? flush_kthread_worker+0xa0/0xa0
  [15956.454700]  [<ffffffff8166f430>] ? gs_change+0x13/0x13
  """

  Analysis being made on the comments...

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1534413/+subscriptions

