[Bug 2085493] Re: futex_wait_bitset01, memfd_create0[1-4], copy_file_range0[1-3], statx[01-12], membarrier01, io_uring0[1-2], perf_event_open03 from ubuntu_ltp_syscalls failed on B-aws-5.4
** Tags added: focal
--
You received this bug notification because you are a member of Canonical
Platform QA Team, which is subscribed to ubuntu-kernel-tests.
https://bugs.launchpad.net/bugs/2085493
Title:
futex_wait_bitset01, memfd_create0[1-4], copy_file_range0[1-3],
statx[01-12], membarrier01, io_uring0[1-2], perf_event_open03 from
ubuntu_ltp_syscalls failed on B-aws-5.4
Status in ubuntu-kernel-tests:
New
Bug description:
Occasionally, multiple test cases from ubuntu_ltp_syscalls fail on
b:aws-5.4 with the same pattern. The following is the futex_wait_bitset01
failure log as an example.
17:15:00 ERROR| Exception escaping from test:
Traceback (most recent call last):
File "/home/ubuntu/autotest/client/shared/test.py", line 411, in _exec
_call_test_function(self.execute, *p_args, **p_dargs)
File "/home/ubuntu/autotest/client/shared/test.py", line 823, in _call_test_function
return func(*args, **dargs)
File "/home/ubuntu/autotest/client/shared/test.py", line 291, in execute
postprocess_profiled_run, args, dargs)
File "/home/ubuntu/autotest/client/shared/test.py", line 212, in _call_run_once
self.run_once(*args, **dargs)
File "/home/ubuntu/autotest/client/tests/ubuntu_ltp_syscalls/ubuntu_ltp_syscalls.py", line 155, in run_once
print(utils.system_output(cmd, verbose=False))
File "/home/ubuntu/autotest/client/shared/utils.py", line 1271, in system_output
verbose=verbose, args=args).stdout
File "/home/ubuntu/autotest/client/shared/utils.py", line 918, in run
"Command returned non-zero exit status")
CmdError: Command </opt/ltp/runltp -f /tmp/target -q -C /dev/null -l /dev/null -T /dev/null> failed, rc=-7, Command returned non-zero exit status
* Command:
/opt/ltp/runltp -f /tmp/target -q -C /dev/null -l /dev/null -T /dev/null
Exit status: -7
Duration: 0.00109004974365
17:15:00 INFO | post-test sysinfo error:
17:15:00 INFO | Traceback (most recent call last):
17:15:00 INFO | File "/home/ubuntu/autotest/client/shared/log.py", line 74, in decorated_func
17:15:00 INFO | fn(*args, **dargs)
17:15:00 INFO | File "/home/ubuntu/autotest/client/base_sysinfo.py", line 304, in log_after_each_test
17:15:00 INFO | keyval = self.log_test_keyvals(test_sysinfodir)
17:15:00 INFO | File "/home/ubuntu/autotest/client/base_sysinfo.py", line 413, in log_test_keyvals
17:15:00 INFO | keyval["sysinfo-phys-mbytes"] = utils_memory.rounded_memtotal() // 1024
17:15:00 INFO | File "/home/ubuntu/autotest/client/shared/utils_memory.py", line 25, in rounded_memtotal
17:15:00 INFO | usable_kbytes = memtotal()
17:15:00 INFO | File "/home/ubuntu/autotest/client/shared/utils_memory.py", line 16, in memtotal
17:15:00 INFO | return read_from_meminfo('MemTotal')
17:15:00 INFO | File "/home/ubuntu/autotest/client/shared/utils_memory.py", line 10, in read_from_meminfo
17:15:00 INFO | cmd_result = utils.run('grep %s /proc/meminfo' % key, verbose=False)
17:15:00 INFO | File "/home/ubuntu/autotest/client/shared/utils.py", line 918, in run
17:15:00 INFO | "Command returned non-zero exit status")
17:15:00 INFO | CmdError: Command <grep MemTotal /proc/meminfo> failed, rc=-7, Command returned non-zero exit status
17:15:00 INFO | * Command:
17:15:00 INFO | grep MemTotal /proc/meminfo
17:15:00 INFO | Exit status: -7
17:15:00 INFO | Duration: 0.00239610671997
17:15:00 INFO |
17:15:00 ERROR| child process failed
17:15:00 DEBUG| Traceback (most recent call last):
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/parallel.py", line 25, in fork_start
17:15:00 DEBUG| l()
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/job.py", line 505, in <lambda>
17:15:00 DEBUG| l = lambda: test.runtest(self, url, tag, args, dargs)
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/test.py", line 125, in runtest
17:15:00 DEBUG| job.sysinfo.log_after_each_iteration)
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/shared/test.py", line 913, in runtest
17:15:00 DEBUG| mytest._exec(args, dargs)
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/shared/test.py", line 411, in _exec
17:15:00 DEBUG| _call_test_function(self.execute, *p_args, **p_dargs)
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/shared/test.py", line 823, in _call_test_function
17:15:00 DEBUG| return func(*args, **dargs)
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/shared/test.py", line 291, in execute
17:15:00 DEBUG| postprocess_profiled_run, args, dargs)
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/shared/test.py", line 212, in _call_run_once
17:15:00 DEBUG| self.run_once(*args, **dargs)
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/tests/ubuntu_ltp_syscalls/ubuntu_ltp_syscalls.py", line 155, in run_once
17:15:00 DEBUG| print(utils.system_output(cmd, verbose=False))
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/shared/utils.py", line 1271, in system_output
17:15:00 DEBUG| verbose=verbose, args=args).stdout
17:15:00 DEBUG| File "/home/ubuntu/autotest/client/shared/utils.py", line 918, in run
17:15:00 DEBUG| "Command returned non-zero exit status")
17:15:00 DEBUG| CmdError: Command </opt/ltp/runltp -f /tmp/target -q -C /dev/null -l /dev/null -T /dev/null> failed, rc=-7, Command returned non-zero exit status
17:15:00 DEBUG| * Command:
17:15:00 DEBUG| /opt/ltp/runltp -f /tmp/target -q -C /dev/null -l /dev/null -T /dev/null
17:15:00 DEBUG| Exit status: -7
17:15:00 DEBUG| Duration: 0.00109004974365
17:15:00 DEBUG|
17:15:02 INFO | ERROR ubuntu_ltp_syscalls.futex_wait_bitset01 ubuntu_ltp_syscalls.futex_wait_bitset01 timestamp=1729617302 localtime=Oct 22 17:15:02 Command </opt/ltp/runltp -f /tmp/target -q -C /dev/null -l /dev/null -T /dev/null> failed, rc=-7, Command returned non-zero exit status
* Command:
/opt/ltp/runltp -f /tmp/target -q -C /dev/null -l /dev/null -T /dev/null
Exit status: -7
Duration: 0.00109004974365
-------------------------------------
[Summary]
- The problem can potentially occur when madvise01/07/11 misbehaves.
- Most likely it happens when madvise11 fails to unpoison due to a /dev/kmsg overrun.
- For b:aws-5.4, CONFIG_LOG_BUF_SHIFT has been kept at 14 for some reason.
- This is neither a regression nor a critical issue.
[Problem description]
Sometimes all the test cases after futex_wake04 fail (flaky).
- futex_wait_bitset01
- memfd_create01
- memfd_create02
- memfd_create03 [*]
- memfd_create04
- copy_file_range01
- copy_file_range02
- copy_file_range03
- statx01
- statx02
- statx03
- statx04
- statx05
- statx06
- statx07
- statx08
- statx09
- statx10
- statx11
- statx12
- membarrier01
- io_uring01
- io_uring02
- perf_event_open03
[*] Sometimes only memfd_create03 fails.
During futex_wake04, I observed that a bash process was killed with
SIGBUS.
* dmesg:
[...]
Oct 22 11:36:20 ip-172-31-21-201 kernel: BTRFS info (device loop7): creating UUID tree
Oct 22 11:38:56 ip-172-31-21-201 kernel: futex_wake04 (21315): drop_caches: 3
> Oct 22 11:38:56 ip-172-31-21-201 AutotestCrashHandler[21322]: Application bash, PID 21321 crashed
> Oct 22 11:38:56 ip-172-31-21-201 AutotestCrashHandler[21322]: Writing core files to \
['/home/ubuntu/autotest/client/results/default/ubuntu_ltp_syscalls.futex_wake04/debug/crash.bash.21321']
Oct 22 11:39:47 ip-172-31-21-201 sudo[2141]: pam_unix(sudo:session): session closed for user root
[...]
* where it crashed:
ubuntu@ip-172-31-21-201:~/_bin_bash.0.crash$ gdb -q /bin/bash -ex bt ./CoreDump
[...]
Core was generated by `/bin/bash -c grep MemTotal /proc/meminfo'.
Program terminated with signal SIGBUS, Bus error.
[...]
#0 0x0000ffff996921d4 in elf_machine_rela_relative (reloc_addr_arg=0xaaaae4c443c8 <static_shell_builtins>, \
reloc=0xaaaae4b60708, l_addr=187650958139392) at ../sysdeps/aarch64/dl-machine.h:376
#1 elf_dynamic_do_Rela (skip_ifunc=<optimized out>, lazy=0, nrelative=<optimized out>, relsize=<optimized out>, reladdr=<optimized out>, map=0xffff996b61b0) at do-rel.h:112
#2 _dl_relocate_object (scope=<optimized out>, reloc_mode=<optimized out>, consider_profiling=<optimized out>, consider_profiling@entry=0) at dl-reloc.c:258
#3 0x0000ffff9968ac90 in dl_main (phdr=<optimized out>, phnum=<optimized out>, user_entry=<optimized out>, auxv=<optimized out>) at rtld.c:2190
#4 0x0000ffff9969bd70 in _dl_sysdep_start (start_argptr=start_argptr@entry=0xffffff7b4870, dl_main=dl_main@entry=0xffff996890a0 <dl_main>) at ../elf/dl-sysdep.c:253
#5 0x0000ffff996888c4 in _dl_start_final (arg=0xffffff7b4870, arg@entry=0xffff996b4e68, info=info@entry=0xffffff7b4400) at rtld.c:414
#6 0x0000ffff99688b48 in _dl_start (arg=0xffff996b4e68) at rtld.c:523
#7 0x0000ffff996881c8 in _start () from /lib/ld-linux-aarch64.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
- source view
```
[glibc:import/2.27-3ubuntu1.6:sysdeps/aarch64/dl-machine.h]
369 inline void
370 __attribute__ ((always_inline))
371 elf_machine_rela_relative (ElfW(Addr) l_addr,
372 const ElfW(Rela) *reloc,
373 void *const reloc_addr_arg)
374 {
375 ElfW(Addr) *const reloc_addr = reloc_addr_arg;
> 376 *reloc_addr = l_addr + reloc->r_addend;
377 }
```
- asm view
```
> 0xffff996921d4 <_dl_relocate_object+2412> str x0, [x1, x22]
```
- reg view
```
(excerpt)
x0 0xaaaae4c28b40 # l_addr + reloc->r_addend
x1 0x10a3c8
x22 0xaaaae4b3a000 # l_addr
```
Apparently it is unusual for the dynamic loader to crash on such a normal code path / instruction.
From this point onward, the system gradually becomes unstable:
new ssh logins fail, journalctl aborts so logs cannot be viewed, etc.
[Cause]
After digging into this issue, I found that the madvise11 test case fails to unpoison soft-offlined pages.
While neither madvise01 nor madvise07 unpoisons either, the number of poisoned pages left
by madvise11 is relatively large, so it is more likely to result in unexpected process crashes later.
* madvise01 : 10 pfns hard-offlined (via MADV_HWPOISON)
* madvise07 : 1 pfn hard-offlined (via MADV_HWPOISON)
* madvise11 : 800 pfns soft-offlined
The reason madvise11 fails to unpoison is a /dev/kmsg overrun. Before it starts poisoning,
it echoes a marker to /dev/kmsg; after that, the kernel emits one line per soft-offlining,
800 lines in total.
[...]
kernel: Soft offlining pfn 0xb58bc4 at process virtual address 0xffff8dd4f000
[...]
Then /dev/kmsg overruns and the marker is no longer visible. madvise11 relies on the marker
in its cleanup stage to locate the pfns to unpoison. Because it fails to find the marker,
it does not unpoison any poisoned pfn at all. You can see the following line in the console:
16:33:28 INFO | madvise11.c:363: TINFO: Restore 0 Soft-offlined pages
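For illustration, here is a minimal stand-alone C sketch of this marker-in-/dev/kmsg pattern
(the marker text and flow are made up for the example, not taken from madvise11): inject a marker,
then scan the ring buffer for it. Once enough ~90-byte "Soft offlining pfn ..." lines have been
emitted, the buffer wraps, the marker is gone, and any cleanup that depends on it restores nothing.
```
/* Illustrative only: write a marker to /dev/kmsg, then scan the ring
 * buffer for it.  If the buffer overran in between, the marker is lost. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MARKER "MADVISE11_MARKER_EXAMPLE\n"   /* hypothetical marker text */

int main(void)
{
	char rec[8192];
	ssize_t n;
	int found = 0;

	/* Step 1: inject the marker (needs appropriate privileges). */
	int wfd = open("/dev/kmsg", O_WRONLY);
	if (wfd < 0 || write(wfd, MARKER, strlen(MARKER)) < 0)
		perror("write marker");
	if (wfd >= 0)
		close(wfd);

	/* ... the test would soft-offline pages here; each one adds a
	 * "Soft offlining pfn ..." line (~90 bytes) to the ring buffer ... */

	/* Step 2: scan the ring buffer record by record for the marker. */
	int rfd = open("/dev/kmsg", O_RDONLY | O_NONBLOCK);
	if (rfd < 0) {
		perror("open /dev/kmsg");
		return 1;
	}
	while ((n = read(rfd, rec, sizeof(rec) - 1)) > 0 ||
	       (n < 0 && errno == EPIPE)) {   /* EPIPE: we fell behind, keep reading */
		if (n > 0) {
			rec[n] = '\0';
			if (strstr(rec, "MADVISE11_MARKER_EXAMPLE"))
				found = 1;
		}
	}
	close(rfd);

	printf("marker %s\n", found ? "still present" :
	       "lost: ring buffer overran, cleanup would restore 0 pages");
	return 0;
}
```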
So, there seem to be two problems:
(A). madvise11 fails to unpoison soft-offlined pages.
(B). Neither of madvise01 nor madvise07 unpoisons hard-offlined pages.
Here, (A) is more likely to lead to random process crashes with an unexpected SIGBUS.
Roughly speaking, about 98% (= 800/(10+1+800)) of the occurrences of this issue
should have been caused by (A).
[Solution]
Possible solutions are described below.
Regarding (A),
- (a0). Propagate CONFIG_LOG_BUF_SHIFT=18 to arm64 b:aws-5.4
- (a1). Set the kernel parameter `log_buf_len=` to a sufficiently large value
- (a2). Reduce the number of pages to offline (800 -> 400) with a patch against LTP
- (a3). Modify the madvise11 code so that it does not rely on /dev/kmsg
(i.e., use pagemap and record the poisoned pfns internally for the cleanup phase; see the sketch after this list)
Regarding (B),
- (b1). Add cleanup (i.e., unpoison) code for hard-offlining (via MADV_HWPOISON)
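As an illustration of (a3), below is a minimal sketch, not the actual LTP patch: it records the pfn of
each target page from /proc/self/pagemap before soft-offlining it, then unpoisons the recorded pfns in
cleanup, so nothing depends on /dev/kmsg. It assumes the /sys/kernel/debug/hwpoison/unpoison-pfn
debugfs interface (CONFIG_HWPOISON_INJECT) is available and that the process has CAP_SYS_ADMIN; the
helper names are hypothetical.
```
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_SOFT_OFFLINE
# define MADV_SOFT_OFFLINE 101          /* from asm-generic/mman-common.h */
#endif

/* Read the pfn backing 'addr' from /proc/self/pagemap
 * (needs CAP_SYS_ADMIN to see non-zero pfns). */
static uint64_t vaddr_to_pfn(void *addr)
{
	uint64_t entry = 0;
	long pagesize = sysconf(_SC_PAGESIZE);
	int fd = open("/proc/self/pagemap", O_RDONLY);

	if (fd < 0)
		return 0;
	if (pread(fd, &entry, sizeof(entry),
		  ((uintptr_t)addr / pagesize) * sizeof(entry)) != sizeof(entry))
		entry = 0;
	close(fd);
	/* Bit 63 = page present, bits 0-54 = pfn. */
	return (entry & (1ULL << 63)) ? (entry & ((1ULL << 55) - 1)) : 0;
}

/* Unpoison one pfn through the debugfs hwpoison interface. */
static void unpoison_pfn(uint64_t pfn)
{
	FILE *f = fopen("/sys/kernel/debug/hwpoison/unpoison-pfn", "w");

	if (f) {
		fprintf(f, "%llu\n", (unsigned long long)pfn);
		fclose(f);
	}
}

int main(void)
{
	enum { NPAGES = 8 };                /* 800 in the real test */
	long pagesize = sysconf(_SC_PAGESIZE);
	uint64_t pfns[NPAGES];
	char *mem = mmap(NULL, NPAGES * pagesize, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (mem == MAP_FAILED)
		return 1;

	for (int i = 0; i < NPAGES; i++) {
		char *page = mem + (long)i * pagesize;

		page[0] = 1;                    /* fault the page in */
		pfns[i] = vaddr_to_pfn(page);   /* record before offlining */
		if (madvise(page, pagesize, MADV_SOFT_OFFLINE))
			perror("MADV_SOFT_OFFLINE");
	}

	/* Cleanup: restore every recorded pfn, no /dev/kmsg parsing involved. */
	for (int i = 0; i < NPAGES; i++)
		if (pfns[i])
			unpoison_pfn(pfns[i]);

	munmap(mem, NPAGES * pagesize);
	return 0;
}
```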
[Experiment]
I ran the whole ubuntu_ltp_syscalls suite four times each, with and
without the patch for (a2).
with this patch: 4 pass, 0 fail
without this patch: 2 pass [*], 2 fail
[*] Even though all the tests but splice07 passed, the system became unstable and
some programs crashed randomly (e.g., journalctl).
Of course just setting log_buf_len to a sufficiently large value
would give us the same result.
[Why it did not occur previously]
For a1.medium, it seems that the issue simply happened not to occur.
There was a line in the log for s2024.09.02 a1.medium:
01:52:44 INFO | madvise11.c:363: TINFO: Restore 0 Soft-offlined pages
The issue could potentially occur on a1.medium, c7g.xlarge, m8g.2xlarge.
arm64 instances | vCPU | log_buf_len | # soft-offlining | Required buf size (*2) | affected |
--------------------+------+-------------+------------------+------------------------+----------+
a1.medium | 1 | 16KiB | 320 | 90*320=28800 (>16KiB!) | o |
c6g.8xlarge | 32 | 256KiB (*1) | 800 | 90*800=72000 (<256KiB) | |
c7g.xlarge [NEW] | 4 | 16KiB | 640 | 90*640=57600 (>16KiB!) | o |
m8g.2xlarge [NEW] | 8 | 64KiB (*1) | 800 | 90*800=72000 (>64KiB!) | o |
(*1): CONFIG_LOG_BUF_SHIFT=14, CONFIG_LOG_CPU_MAX_BUF_SHIFT=12.
See log_buf_add_cpu() for details on why the effective size is larger than 1<<14.
(*2): One "kernel: Soft offlining pfn ..." line consumes 90 bytes.
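As a sanity check of the "affected" column, here is a tiny sketch that redoes the arithmetic;
the effective log_buf_len values are taken from the table above (including the per-CPU
adjustment from (*1)), and 90 bytes per line comes from (*2).
```
/* Recompute the "affected" column: an instance is affected when the
 * soft-offline messages alone exceed the effective log buffer size. */
#include <stdio.h>

int main(void)
{
	struct { const char *instance; long log_buf_len; long nr_lines; } t[] = {
		{ "a1.medium",   1L << 14, 320 },
		{ "c6g.8xlarge", 1L << 18, 800 },
		{ "c7g.xlarge",  1L << 14, 640 },
		{ "m8g.2xlarge", 1L << 16, 800 },
	};

	for (unsigned i = 0; i < sizeof(t) / sizeof(t[0]); i++) {
		long required = 90 * t[i].nr_lines;   /* 90 bytes per line */

		printf("%-12s required=%6ld log_buf_len=%6ld -> %s\n",
		       t[i].instance, required, t[i].log_buf_len,
		       required > t[i].log_buf_len ? "affected" : "ok");
	}
	return 0;
}
```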
[Next Step]
I'll add a UBUNTU SAUCE patch to b:aws-5.4 so that CONFIG_LOG_BUF_SHIFT=18 is properly propagated.
This is neither a critical nor an urgent issue.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/2085493/+subscriptions