[Bug 2085493] Re: futex_wait_bitset01, memfd_create0[1-4], copy_file_range0[1-3], statx[01-12], membarrier01, io_uring0[1-2], perf_event_open03 from ubuntu_ltp_syscalls failed on B-aws-5.4
** Tags added: focal
--
You received this bug notification because you are a member of Canonical
Platform QA Team, which is subscribed to ubuntu-kernel-tests.
https://bugs.launchpad.net/bugs/2085493
Title:
futex_wait_bitset01, memfd_create0[1-4], copy_file_range0[1-3],
statx[01-12], membarrier01, io_uring0[1-2], perf_event_open03 from
ubuntu_ltp_syscalls failed on B-aws-5.4
Status in ubuntu-kernel-tests:
New
Bug description:
Occasionally, multiple test cases from ubuntu_ltp_syscalls fail on
b:aws-5.4 with the same pattern. The following is the futex_wait_bitset01
failure log as an example.
17:15:00 ERROR| Exception escaping from test:
Traceback (most recent call last):
File "/home/ubuntu/autotest/client/shared/test.py", line 411, in _exec
_call_test_function(self.execute, *p_args, **p_dargs)
File "/home/ubuntu/autotest/client/shared/test.py", line 823, in _call_test_function
return func(*args, **dargs)
File "/home/ubuntu/autotest/client/shared/test.py", line 291, in execute
postprocess_profiled_run, args, dargs)
File "/home/ubuntu/autotest/client/shared/test.py", line 212, in _call_run_once
self.run_once(*args, **dargs)
File "/home/ubuntu/autotest/client/tests/ubuntu_ltp_syscalls/ubuntu_ltp_syscalls.py", line 155, in run_once
print(utils.system_output(cmd, verbose=False))
File "/home/ubuntu/autotest/client/shared/utils.py", line 1271, in system_output
verbose=verbose, args=args).stdout
File "/home/ubuntu/autotest/client/shared/utils.py", line 918, in run
"Command returned non-zero exit status")
CmdError: Command </opt/ltp/runltp -f /tmp/target -q -C /dev/null -l /dev/null -T /dev/null> failed, rc=-7, Command returned non-zero exit status
* Command:
/opt/ltp/runltp -f /tmp/target -q -C /dev/null -l /dev/null -T /dev/null
Exit status: -7
Duration: 0.00109004974365
17:15:00 INFO | post-test sysinfo error:
17:15:00 INFO | Traceback (most recent call last):
17:15:00 INFO | File "/home/ubuntu/autotest/client/shared/log.py", line 74, in decorated_func
17:15:00 INFO | fn(*args, **dargs)
17:15:00 INFO | File "/home/ubuntu/autotest/client/base_sysinfo.py", line 304, in log_after_each_test
17:15:00 INFO | keyval = self.log_test_keyvals(test_sysinfodir)
17:15:00 INFO | File "/home/ubuntu/autotest/client/base_sysinfo.py", line 413, in log_test_keyvals
17:15:00 INFO | keyval["sysinfo-phys-mbytes"] = utils_memory.rounded_memtotal() // 1024
17:15:00 INFO | File "/home/ubuntu/autotest/client/shared/utils_memory.py", line 25, in rounded_memtotal
17:15:00 INFO | usable_kbytes = memtotal()
17:15:00 INFO | File "/home/ubuntu/autotest/client/shared/utils_memory.py", line 16, in memtotal
17:15:00 INFO | return read_from_meminfo('MemTotal')
17:15:00 INFO | File "/home/ubuntu/autotest/client/shared/utils_memory.py", line 10, in read_from_meminfo
17:15:00 INFO | cmd_result = utils.run('grep %s /proc/meminfo' % key, verbose=False)
17:15:00 INFO | File "/home/ubuntu/autotest/client/shared/utils.py", line 918, in run
17:15:00 INFO | "Command returned non-zero exit status")
17:15:00 INFO | CmdError: Command <grep MemTotal /proc/meminfo> failed, rc=-7, Command returned non-zero exit status
17:15:00 INFO | * Command:
17:15:00 INFO | grep MemTotal /proc/meminfo
17:15:00 INFO | Exit status: -7
17:15:00 INFO | Duration: 0.00239610671997
17:15:00 INFO |
17:15:00 ERROR| child process failed
17:15:00 DEBUG| Traceback (most recent call last):
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/parallel.py", line 25, in fork_start
17:15:00 DEBUG| l()
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/job.py", line 505, in <lambda>
17:15:00 DEBUG| l = lambda: test.runtest(self, url, tag, args, dargs)
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/test.py", line 125, in runtest
17:15:00 DEBUG| job.sysinfo.log_after_each_iteration)
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/shared/test.py", line 913, in runtest
17:15:00 DEBUG| mytest._exec(args, dargs)
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/shared/test.py", line 411, in _exec
17:15:00 DEBUG| _call_test_function(self.execute, *p_args, **p_dargs)
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/shared/test.py", line 823, in _call_test_function
17:15:00 DEBUG| return func(*args, **dargs)
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/shared/test.py", line 291, in execute
17:15:00 DEBUG| postprocess_profiled_run, args, dargs)
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/shared/test.py", line 212, in _call_run_once
17:15:00 DEBUG| self.run_once(*args, **dargs)
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/tests/ubuntu_ltp_syscalls/ubuntu_ltp_syscalls.py", line 155, in run_once
17:15:00 DEBUG| print(utils.system_output(cmd, verbose=False))
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/shared/utils.py", line 1271, in system_output
17:15:00 DEBUG| verbose=verbose, args=args).stdout
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/shared/utils.py", line 918, in run
17:15:00 DEBUG| "Command returned non-zero exit status")
17:15:00 DEBUG| CmdError: Command </opt/ltp/runltp -f /tmp/target -q -C /dev/null -l /dev/null -T /dev/null> failed, rc=-7, Command returned non-zero exit status
17:15:00 DEBUG| * Command:
17:15:00 DEBUG| /opt/ltp/runltp -f /tmp/target -q -C /dev/null -l /dev/null -T /dev/null
17:15:00 DEBUG| Exit status: -7
17:15:00 DEBUG| Duration: 0.00109004974365
17:15:00 DEBUG|
17:15:02 INFO | ERROR ubuntu_ltp_syscalls.futex_wait_bitset01 ubuntu_ltp_syscalls.futex_wait_bitset01 timestamp=1729617302 localtime=Oct 22 17:15:02 Command </opt/ltp/runltp -f /tmp/target -q -C /dev/null -l /dev/null -T /dev/null> failed, rc=-7, Command returned non-zero exit status
* Command:
/opt/ltp/runltp -f /tmp/target -q -C /dev/null -l /dev/null -T /dev/null
Exit status: -7
Duration: 0.00109004974365
-------------------------------------
[Summary]
- The problem can potentially occur when madvise01/07/11 misbehaves.
- Most likely it happens when madvise11 fails to unpoison due to a /dev/kmsg overrun.
- For b:aws-5.4, CONFIG_LOG_BUF_SHIFT has been kept at 14 for some reason.
- This is neither a regression nor a critical issue.
[Problem description]
Sometimes all the test cases after futex_wake04 fail (flaky).
- futex_wait_bitset01
- memfd_create01
- memfd_create02
- memfd_create03 [*]
- memfd_create04
- copy_file_range01
- copy_file_range02
- copy_file_range03
- statx01
- statx02
- statx03
- statx04
- statx05
- statx06
- statx07
- statx08
- statx09
- statx10
- statx11
- statx12
- membarrier01
- io_uring01
- io_uring02
- perf_event_open03
[*] Sometimes only memfd_create03 fails.
During futex_wake04, I observed that a bash process was killed with
SIGBUS.
* dmesg:
[...]
Oct 22 11:36:20 ip-172-31-21-201 kernel: BTRFS info (device loop7): creating UUID tree
Oct 22 11:38:56 ip-172-31-21-201 kernel: futex_wake04 (21315): drop_caches: 3
> Oct 22 11:38:56 ip-172-31-21-201 AutotestCrashHandler[21322]: Application bash, PID 21321 crashed
> Oct 22 11:38:56 ip-172-31-21-201 AutotestCrashHandler[21322]: Writing core files to \
['/home/ubuntu/autotest/client/results/default/ubuntu_ltp_syscalls.futex_wake04/debug/crash.bash.21321']
Oct 22 11:39:47 ip-172-31-21-201 sudo[2141]: pam_unix(sudo:session): session closed for user root
[...]
* where it crashed:
ubuntu@ip-172-31-21-201:~/_bin_bash.0.crash$ gdb -q /bin/bash -ex bt ./CoreDump
[...]
Core was generated by `/bin/bash -c grep MemTotal /proc/meminfo'.
Program terminated with signal SIGBUS, Bus error.
[...]
#0 0x0000ffff996921d4 in elf_machine_rela_relative (reloc_addr_arg=0xaaaae4c443c8 <static_shell_builtins>, \
reloc=0xaaaae4b60708, l_addr=187650958139392) at ../sysdeps/aarch64/dl-machine.h:376
#1 elf_dynamic_do_Rela (skip_ifunc=<optimized out>, lazy=0, nrelative=<optimized out>, relsize=<optimized out>, reladdr=<optimized out>, map=0xffff996b61b0) at do-rel.h:112
#2 _dl_relocate_object (scope=<optimized out>, reloc_mode=<optimized out>, consider_profiling=<optimized out>, consider_profiling@entry=0) at dl-reloc.c:258
#3 0x0000ffff9968ac90 in dl_main (phdr=<optimized out>, phnum=<optimized out>, user_entry=<optimized out>, auxv=<optimized out>) at rtld.c:2190
#4 0x0000ffff9969bd70 in _dl_sysdep_start (start_argptr=start_argptr@entry=0xffffff7b4870, dl_main=dl_main@entry=0xffff996890a0 <dl_main>) at ../elf/dl-sysdep.c:253
#5 0x0000ffff996888c4 in _dl_start_final (arg=0xffffff7b4870, arg@entry=0xffff996b4e68, info=info@entry=0xffffff7b4400) at rtld.c:414
#6 0x0000ffff99688b48 in _dl_start (arg=0xffff996b4e68) at rtld.c:523
#7 0x0000ffff996881c8 in _start () from /lib/ld-linux-aarch64.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
- source view
```
[glibc:import/2.27-3ubuntu1.6:sysdeps/aarch64/dl-machine.h]
369 inline void
370 __attribute__ ((always_inline))
371 elf_machine_rela_relative (ElfW(Addr) l_addr,
372 const ElfW(Rela) *reloc,
373 void *const reloc_addr_arg)
374 {
375 ElfW(Addr) *const reloc_addr = reloc_addr_arg;
> 376 *reloc_addr = l_addr + reloc->r_addend;
377 }
```
- asm view
```
> 0xffff996921d4 <_dl_relocate_object+2412> str x0, [x1, x22]
```
- reg view
```
(excerpt)
x0 0xaaaae4c28b40 # l_addr + reloc->r_addend
x1 0x10a3c8
x22 0xaaaae4b3a000 # l_addr
```
Apparently it is unusual for the dynamic loader to crash on such a normal code path / instruction.
From this point onward, the system gradually becomes unstable:
new ssh logins fail, journalctl aborts so logs cannot be viewed, etc.
[Cause]
After digging into this issue, I found that the madvise11 test case fails to unpoison soft-offlined pages.
While neither madvise01 nor madvise07 unpoisons either, the number of poisoned pages left
by madvise11 is relatively large, so it is more likely to result in unexpected process crashes later.
* madvise01 : 10 pfns hard-offlined (via MADV_HWPOISON)
* madvise07 : 1 pfn hard-offlined (via MADV_HWPOISON)
* madvise11 : 800 pfns soft-offlined
The reason madvise11 fails to unpoison is a /dev/kmsg overrun. Before it starts poisoning,
it echoes a marker to /dev/kmsg; after that, the kernel emits one line per soft-offlining,
800 lines in total.
[...]
kernel: Soft offlining pfn 0xb58bc4 at process virtual address 0xffff8dd4f000
[...]
Then /dev/kmsg overruns and the marker is no longer visible. madvise11 relies on the marker
in its cleanup stage to locate the pfns to unpoison. Because it fails to find the marker,
it does not unpoison any poisoned pfn at all. You can see the following line in the console:
16:33:28 INFO | madvise11.c:363: TINFO: Restore 0 Soft-offlined pages
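For illustration, here is a minimal stand-alone C sketch of this marker-in-/dev/kmsg pattern
(the marker text and flow are made up for the example, not taken from madvise11): inject a marker,
then scan the ring buffer for it. Once enough ~90-byte "Soft offlining pfn ..." lines have been
emitted, the buffer wraps, the marker is gone, and any cleanup that depends on it restores nothing.
```
/* Illustrative only: write a marker to /dev/kmsg, then scan the ring
 * buffer for it.  If the buffer overran in between, the marker is lost. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MARKER "MADVISE11_MARKER_EXAMPLE\n"   /* hypothetical marker text */

int main(void)
{
	char rec[8192];
	ssize_t n;
	int found = 0;

	/* Step 1: inject the marker (needs appropriate privileges). */
	int wfd = open("/dev/kmsg", O_WRONLY);
	if (wfd < 0 || write(wfd, MARKER, strlen(MARKER)) < 0)
		perror("write marker");
	if (wfd >= 0)
		close(wfd);

	/* ... the test would soft-offline pages here; each one adds a
	 * "Soft offlining pfn ..." line (~90 bytes) to the ring buffer ... */

	/* Step 2: scan the ring buffer record by record for the marker. */
	int rfd = open("/dev/kmsg", O_RDONLY | O_NONBLOCK);
	if (rfd < 0) {
		perror("open /dev/kmsg");
		return 1;
	}
	while ((n = read(rfd, rec, sizeof(rec) - 1)) > 0 ||
	       (n < 0 && errno == EPIPE)) {   /* EPIPE: we fell behind, keep reading */
		if (n > 0) {
			rec[n] = '\0';
			if (strstr(rec, "MADVISE11_MARKER_EXAMPLE"))
				found = 1;
		}
	}
	close(rfd);

	printf("marker %s\n", found ? "still present" :
	       "lost: ring buffer overran, cleanup would restore 0 pages");
	return 0;
}
```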
So, there seem to be two problems:
(A). madvise11 fails to unpoison soft-offlined pages.
(B). Neither of madvise01 nor madvise07 unpoisons hard-offlined pages.
Here, (A) is more likely to lead to random process crashes with an unexpected SIGBUS.
Roughly speaking, about 98% (= 800/(10+1+800)) of the occurrences of this issue
should have been caused by (A).
[Solution]
Possible solutions are described below.
Regarding (A),
- (a0). Propagate CONFIG_LOG_BUF_SHIFT=18 to arm64 b:aws-5.4
- (a1). Set the kernel parameter `log_buf_len=` to a sufficiently large value
- (a2). Reduce the number of pages to offline (800 -> 400) with a patch against LTP
- (a3). Modify the madvise11 code so that it does not rely on /dev/kmsg
(i.e., use pagemap and record the poisoned pfns internally for the cleanup phase; see the sketch after this list)
Regarding (B),
- (b1). Add cleanup (i.e., unpoison) code for hard-offlining (via MADV_HWPOISON)
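As an illustration of (a3), below is a minimal sketch, not the actual LTP patch: it records the pfn of
each target page from /proc/self/pagemap before soft-offlining it, then unpoisons the recorded pfns in
cleanup, so nothing depends on /dev/kmsg. It assumes the /sys/kernel/debug/hwpoison/unpoison-pfn
debugfs interface (CONFIG_HWPOISON_INJECT) is available and that the process has CAP_SYS_ADMIN; the
helper names are hypothetical.
```
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_SOFT_OFFLINE
# define MADV_SOFT_OFFLINE 101          /* from asm-generic/mman-common.h */
#endif

/* Read the pfn backing 'addr' from /proc/self/pagemap
 * (needs CAP_SYS_ADMIN to see non-zero pfns). */
static uint64_t vaddr_to_pfn(void *addr)
{
	uint64_t entry = 0;
	long pagesize = sysconf(_SC_PAGESIZE);
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return 0;
	if (pread(fd, &entry, sizeof(entry),
		  ((uintptr_t)addr / pagesize) * sizeof(entry)) != sizeof(entry))
		entry = 0;
	close(fd);
	/* Bit 63 = page present, bits 0-54 = pfn. */
	return (entry & (1ULL << 63)) ? (entry & ((1ULL << 55) - 1)) : 0;
}

/* Unpoison one pfn through the debugfs hwpoison interface. */
static void unpoison_pfn(uint64_t pfn)
{
	FILE *f = fopen("/sys/kernel/debug/hwpoison/unpoison-pfn", "w");

	if (f) {
		fprintf(f, "%llu\n", (unsigned long long)pfn);
		fclose(f);
	}
}

int main(void)
{
	enum { NPAGES = 8 };                /* 800 in the real test */
	long pagesize = sysconf(_SC_PAGESIZE);
	uint64_t pfns[NPAGES];
	char *mem = mmap(NULL, NPAGES * pagesize, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (mem == MAP_FAILED)
		return 1;

	for (int i = 0; i < NPAGES; i++) {
		char *page = mem + (long)i * pagesize;

		page[0] = 1;                    /* fault the page in */
		pfns[i] = vaddr_to_pfn(page);   /* record before offlining */
		if (madvise(page, pagesize, MADV_SOFT_OFFLINE))
			perror("MADV_SOFT_OFFLINE");
	}

	/* Cleanup: restore every recorded pfn, no /dev/kmsg parsing involved. */
	for (int i = 0; i < NPAGES; i++)
		if (pfns[i])
			unpoison_pfn(pfns[i]);

	munmap(mem, NPAGES * pagesize);
	return 0;
}
```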
[Experiment]
I ran the whole ubuntu_ltp_syscalls suite four times each, with and
without the patch for (a2).
with this patch: 4 pass, 0 fail
without this patch: 2 pass [*], 2 fail
[*] Even though all the tests but splice07 passed, the system became unstable and
some programs crashed randomly (e.g., journalctl).
Of course just setting log_buf_len to a sufficiently large value
would give us the same result.
[Why it did not occur previously]
For a1.medium, it seems that the issue simply happened not to occur.
There was a line in the log for s2024.09.02 a1.medium:
01:52:44 INFO | madvise11.c:363: TINFO: Restore 0 Soft-offlined pages
The issue could potentially occur on a1.medium, c7g.xlarge, m8g.2xlarge.
arm64 instances | vCPU | log_buf_len | # soft-offlining | Required buf size (*2) | affected |
--------------------+------+-------------+------------------+------------------------+----------+
a1.medium | 1 | 16KiB | 320 | 90*320=28800 (>16KiB!) | o |
c6g.8xlarge | 32 | 256KiB (*1) | 800 | 90*800=72000 (<256KiB) | |
c7g.xlarge [NEW] | 4 | 16KiB | 640 | 90*640=57600 (>16KiB!) | o |
m8g.2xlarge [NEW] | 8 | 64KiB (*1) | 800 | 90*800=72000 (>64KiB!) | o |
(*1): CONFIG_LOG_BUF_SHIFT=14, CONFIG_LOG_CPU_MAX_BUF_SHIFT=12.
See log_buf_add_cpu() for details on why the effective size is larger than 1<<14.
(*2): One "kernel: Soft offlining pfn ..." line consumes 90 bytes.
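As a sanity check of the "affected" column, here is a tiny sketch that redoes the arithmetic;
the effective log_buf_len values are taken from the table above (including the per-CPU
adjustment from (*1)), and 90 bytes per line comes from (*2).
```
/* Recompute the "affected" column: an instance is affected when the
 * soft-offline messages alone exceed the effective log buffer size. */
#include <stdio.h>

int main(void)
{
	struct { const char *instance; long log_buf_len; long nr_lines; } t[] = {
		{ "a1.medium",   1L << 14, 320 },
		{ "c6g.8xlarge", 1L << 18, 800 },
		{ "c7g.xlarge",  1L << 14, 640 },
		{ "m8g.2xlarge", 1L << 16, 800 },
	};

	for (unsigned i = 0; i < sizeof(t) / sizeof(t[0]); i++) {
		long required = 90 * t[i].nr_lines;   /* 90 bytes per line */

		printf("%-12s required=%6ld log_buf_len=%6ld -> %s\n",
		       t[i].instance, required, t[i].log_buf_len,
		       required > t[i].log_buf_len ? "affected" : "ok");
	}
	return 0;
}
```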
[Next Step]
I'll add a UBUNTU SAUCE patch to b:aws-5.4 so that CONFIG_LOG_BUF_SHIFT=18 is properly propagated.
This is neither a critical nor an urgent issue.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/2085493/+subscriptions