canonical-ubuntu-qa team mailing list archive
Message #05890
[Bug 2088047] Re: log_check / kernel_tainted test from ubuntu_boot failed on Oracular AWS a1.metal
As of 2024/12/11, fresh EC2 instances of the above-mentioned types still
exhibit the issue, so I sent a patch to the kernel-team mailing list:
https://lists.ubuntu.com/archives/kernel-team/2024-December/155888.html
SRU Justification:
[Impact]
On AWS EC2 a1.metal and c6g.8xlarge instances, issues caused by unexpected
discrepancies in the ACPI table have been observed. They show up as kernel
warning messages and a tainted kernel but, more critically, leave end users
with fewer usable CPUs than expected:
* Oracular on a1.metal: paying for 16 CPUs but only 4 are usable
* Oracular on c6g.8xlarge: paying for 32 CPUs but only 16 are usable
Given the possibility for other problematic ACPI table patterns, disable
ACPI_HOTPLUG_CPU on arm64 instance types until the issues are fixed at the
firmware level.
Note: as of 2024/12/11, fresh EC2 instances of the above-mentioned types
still exhibit the issue.
[Test Plan]
With this patch applied, verify that the kernel warning messages no longer
appear. Additionally, confirm with a pre-installed tool such as htop that
all expected CPUs are usable.
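The CPU check above could also be scripted instead of read off htop. The
following is a minimal sketch, assuming that an unusable CPU shows up as a
missing cpuN directory under /sys/devices/system/cpu (the report does not state
this explicitly). The helper names registered_cpus and missing_cpus and the
expected count of 16 are illustrative and not part of the patch.

```python
import re

# Matches per-CPU sysfs directory names such as "cpu0" or "cpu15",
# but not siblings like "cpufreq", "cpuidle" or "online".
CPU_DIR = re.compile(r"^cpu(\d+)$")


def registered_cpus(entries):
    """Return the sorted CPU numbers found among sysfs directory names."""
    matches = (CPU_DIR.match(name) for name in entries)
    return sorted(int(m.group(1)) for m in matches if m)


def missing_cpus(entries, expected):
    """Return the CPU numbers in range(expected) that have no cpuN entry."""
    present = set(registered_cpus(entries))
    return [cpu for cpu in range(expected) if cpu not in present]


# Hypothetical listing resembling the a1.metal symptom (only CPU#0-3 present).
sample = ["cpu0", "cpu1", "cpu2", "cpu3", "cpufreq", "cpuidle", "online"]
print(missing_cpus(sample, 16))
```

On a real instance the listing would come from
os.listdir("/sys/devices/system/cpu"); an empty result would suggest that all
expected CPUs are registered.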
[Where problems could occur]
The likelihood of regression is minimal. It's also unlikely that end users
expect to perform virtual hot-unplugging for some instance types, especially
since they are basically billed on an hourly basis.
[Other Info]
This forcibly disables the option by editing the Kconfig file, because
ACPI_PROCESSOR=n did not work on real instances and HOTPLUG_CPU cannot be
disabled without side effects. That is why I added "UBUNTU: SAUCE" to the
Subject line rather than "UBUNTU: [Config]".
--
You received this bug notification because you are a member of Canonical
Platform QA Team, which is subscribed to ubuntu-kernel-tests.
https://bugs.launchpad.net/bugs/2088047
Title:
log_check / kernel_tainted test from ubuntu_boot failed on Oracular
AWS a1.metal
Status in ubuntu-kernel-tests:
New
Bug description:
Found on Oracular/6.11.0-11.11 boot testing on AWS a1.metal instance.
The relevant console log excerpts:
-----(snip)-----
06:55:12 INFO | 2024-11-09T06:51:17.584884+00:00 ip-172-31-6-235 kernel: cpuinfo: failed to register hotplug callbacks.
-----(snip)-----
06:55:12 INFO | 2024-11-09T06:51:17.584978+00:00 ip-172-31-6-235 kernel: ------------[ cut here ]------------
06:55:12 INFO | 2024-11-09T06:51:17.584980+00:00 ip-172-31-6-235 kernel: WARNING: CPU: 7 PID: 1 at fs/sysfs/group.c:128 internal_create_group+0xc4/0x380
06:55:12 INFO | 2024-11-09T06:51:17.584981+00:00 ip-172-31-6-235 kernel: Modules linked in:
06:55:12 INFO | 2024-11-09T06:51:17.584983+00:00 ip-172-31-6-235 kernel: CPU: 7 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.11.0-11-generic #11-Ubuntu
06:55:12 INFO | 2024-11-09T06:51:17.584984+00:00 ip-172-31-6-235 kernel: Hardware name: Amazon EC2 a1.metal/Not Specified, BIOS 1.0 10/16/2017
06:55:12 INFO | 2024-11-09T06:51:17.584985+00:00 ip-172-31-6-235 kernel: pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
06:55:12 INFO | 2024-11-09T06:51:17.584987+00:00 ip-172-31-6-235 kernel: pc : internal_create_group+0xc4/0x380
06:55:12 INFO | 2024-11-09T06:51:17.584989+00:00 ip-172-31-6-235 kernel: lr : sysfs_create_group+0x24/0x50
06:55:12 INFO | 2024-11-09T06:51:17.584993+00:00 ip-172-31-6-235 kernel: sp : ffff80008009bb90
06:55:12 INFO | 2024-11-09T06:51:17.584995+00:00 ip-172-31-6-235 kernel: x29: ffff80008009bba0 x28: 0000000000000000 x27: ffff19093bd33ca8
06:55:12 INFO | 2024-11-09T06:51:17.584997+00:00 ip-172-31-6-235 kernel: x26: 0000000000000000 x25: ffff436d28704000 x24: ffffd59c11b04a88
06:55:12 INFO | 2024-11-09T06:51:17.584998+00:00 ip-172-31-6-235 kernel: x23: 0000000000000000 x22: ffffd59c14046768 x21: ffffd59c1362fca8
06:55:12 INFO | 2024-11-09T06:51:17.585000+00:00 ip-172-31-6-235 kernel: x20: 0000000000000036 x19: 0000000000000004 x18: ffff800080095060
06:55:12 INFO | 2024-11-09T06:51:17.585001+00:00 ip-172-31-6-235 kernel: x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
06:55:12 INFO | 2024-11-09T06:51:17.585003+00:00 ip-172-31-6-235 kernel: x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
06:55:12 INFO | 2024-11-09T06:51:17.585006+00:00 ip-172-31-6-235 kernel: x11: 0000000000000000 x10: 0000000000000000 x9 : ffffd59c1128fc4c
06:55:12 INFO | 2024-11-09T06:51:17.585008+00:00 ip-172-31-6-235 kernel: x8 : 0101010101010101 x7 : 0000000000000000 x6 : 0000000000000000
06:55:12 INFO | 2024-11-09T06:51:17.585010+00:00 ip-172-31-6-235 kernel: x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff1902003fa280
06:55:12 INFO | 2024-11-09T06:51:17.585011+00:00 ip-172-31-6-235 kernel: x2 : ffffd59c12648f88 x1 : 0000000000000000 x0 : 0000000000000000
06:55:12 INFO | 2024-11-09T06:51:17.585012+00:00 ip-172-31-6-235 kernel: Call trace:
06:55:12 INFO | 2024-11-09T06:51:17.585013+00:00 ip-172-31-6-235 kernel: internal_create_group+0xc4/0x380
06:55:12 INFO | 2024-11-09T06:51:17.585014+00:00 ip-172-31-6-235 kernel: sysfs_create_group+0x24/0x50
06:55:12 INFO | 2024-11-09T06:51:17.585015+00:00 ip-172-31-6-235 kernel: topology_add_dev+0x28/0x50
06:55:12 INFO | 2024-11-09T06:51:17.585016+00:00 ip-172-31-6-235 kernel: cpuhp_invoke_callback+0x200/0x780
06:55:12 INFO | 2024-11-09T06:51:17.585021+00:00 ip-172-31-6-235 kernel: cpuhp_issue_call+0x100/0x198
06:55:12 INFO | 2024-11-09T06:51:17.585023+00:00 ip-172-31-6-235 kernel: __cpuhp_setup_state_cpuslocked+0x128/0x330
06:55:12 INFO | 2024-11-09T06:51:17.585024+00:00 ip-172-31-6-235 kernel: __cpuhp_setup_state+0x5c/0xa8
06:55:12 INFO | 2024-11-09T06:51:17.585025+00:00 ip-172-31-6-235 kernel: topology_sysfs_init+0x40/0x78
06:55:12 INFO | 2024-11-09T06:51:17.585026+00:00 ip-172-31-6-235 kernel: do_one_initcall+0x64/0x3a0
06:55:12 INFO | 2024-11-09T06:51:17.585027+00:00 ip-172-31-6-235 kernel: do_initcalls+0x19c/0x210
06:55:12 INFO | 2024-11-09T06:51:17.585028+00:00 ip-172-31-6-235 kernel: kernel_init_freeable+0x18c/0x1e8
06:55:12 INFO | 2024-11-09T06:51:17.585029+00:00 ip-172-31-6-235 kernel: kernel_init+0x3c/0x190
06:55:12 INFO | 2024-11-09T06:51:17.585031+00:00 ip-172-31-6-235 kernel: ret_from_fork+0x10/0x20
06:55:12 INFO | 2024-11-09T06:51:17.585035+00:00 ip-172-31-6-235 kernel: ---[ end trace 0000000000000000 ]---
06:55:12 INFO | 2024-11-09T06:51:17.585037+00:00 ip-172-31-6-235 kernel: sysfs: cannot create duplicate filename '/devices/cache'
06:55:12 INFO | 2024-11-09T06:51:17.585038+00:00 ip-172-31-6-235 kernel: CPU: 5 UID: 0 PID: 47 Comm: cpuhp/5 Tainted: G W 6.11.0-11-generic #11-Ubuntu
06:55:12 INFO | 2024-11-09T06:51:17.585039+00:00 ip-172-31-6-235 kernel: Tainted: [W]=WARN
06:55:12 INFO | 2024-11-09T06:51:17.585040+00:00 ip-172-31-6-235 kernel: Hardware name: Amazon EC2 a1.metal/Not Specified, BIOS 1.0 10/16/2017
06:55:12 INFO | 2024-11-09T06:51:17.585041+00:00 ip-172-31-6-235 kernel: Call trace:
06:55:12 INFO | 2024-11-09T06:51:17.585146+00:00 ip-172-31-6-235 kernel: dump_backtrace+0x104/0x160
06:55:12 INFO | 2024-11-09T06:51:17.585149+00:00 ip-172-31-6-235 kernel: show_stack+0x24/0x50
06:55:12 INFO | 2024-11-09T06:51:17.585150+00:00 ip-172-31-6-235 kernel: dump_stack_lvl+0x84/0xc0
06:55:12 INFO | 2024-11-09T06:51:17.585155+00:00 ip-172-31-6-235 kernel: dump_stack+0x1c/0x40
06:55:12 INFO | 2024-11-09T06:51:17.585191+00:00 ip-172-31-6-235 kernel: sysfs_warn_dup+0xa8/0xf0
06:55:12 INFO | 2024-11-09T06:51:17.585193+00:00 ip-172-31-6-235 kernel: sysfs_create_dir_ns+0x124/0x150
06:55:12 INFO | 2024-11-09T06:51:17.585194+00:00 ip-172-31-6-235 kernel: create_dir+0x30/0x120
06:55:12 INFO | 2024-11-09T06:51:17.585215+00:00 ip-172-31-6-235 kernel: kobject_add_internal+0x90/0x240
06:55:12 INFO | 2024-11-09T06:51:17.585218+00:00 ip-172-31-6-235 kernel: kobject_add+0xa0/0x140
06:55:12 INFO | 2024-11-09T06:51:17.585234+00:00 ip-172-31-6-235 kernel: device_add+0xd8/0x748
06:55:12 INFO | 2024-11-09T06:51:17.585236+00:00 ip-172-31-6-235 kernel: cpu_device_create+0x19c/0x1c0
06:55:12 INFO | 2024-11-09T06:51:17.585238+00:00 ip-172-31-6-235 kernel: cache_add_dev+0x84/0x428
06:55:12 INFO | 2024-11-09T06:51:17.585252+00:00 ip-172-31-6-235 kernel: cacheinfo_cpu_online+0x90/0x138
06:55:12 INFO | 2024-11-09T06:51:17.585254+00:00 ip-172-31-6-235 kernel: cpuhp_invoke_callback+0x200/0x780
06:55:12 INFO | 2024-11-09T06:51:17.585256+00:00 ip-172-31-6-235 kernel: cpuhp_thread_fun+0x140/0x358
06:55:12 INFO | 2024-11-09T06:51:17.585281+00:00 ip-172-31-6-235 kernel: smpboot_thread_fn+0x224/0x250
06:55:12 INFO | 2024-11-09T06:51:17.585287+00:00 ip-172-31-6-235 kernel: kthread+0xf4/0x108
06:55:12 INFO | 2024-11-09T06:51:17.585289+00:00 ip-172-31-6-235 kernel: ret_from_fork+0x10/0x20
06:55:12 INFO | 2024-11-09T06:51:17.585299+00:00 ip-172-31-6-235 kernel: kobject: kobject_add_internal failed for cache with -EEXIST, don't try to register things with the same name in the same directory.
This was also observed on 6.11.0-1004-aws and 6.11.0-1005-aws.
Note that Noble is not affected. See the [Affected Versions] section for more details.
-------------------------------------
[Summary]
- This is not a regression; it is caused by a problematic ACPI table on a1.metal.
- If the ACPI table is not fixed soon, one option is to add a workaround, at least in our tree. See the [Solution] section for details.
[Cause]
According to the warning messages, the following two calls are failing:
* cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "arm64/cpuinfo:online",
cpuid_cpu_online, cpuid_cpu_offline)
* cpuhp_setup_state(CPUHP_AP_BASE_CACHEINFO_ONLINE, "base/cacheinfo:online",
cacheinfo_cpu_online, cacheinfo_cpu_pre_down)
Note that other cpuhp callbacks are failing as well, as boot-time
tracing of cpuhp:* events reveals:
4) | /* cpuhp_enter: cpu: 0004 target: 238 step: 199 (cpu_capacity_sysctl_add) */
4) | /* cpuhp_exit: cpu: 0004 state: 238 step: 199 ret: -2 */
4) | /* cpuhp_enter: cpu: 0004 target: 238 step: 199 (cpuid_cpu_online) */
4) | /* cpuhp_exit: cpu: 0004 state: 238 step: 199 ret: -19 */
5) | /* cpuhp_enter: cpu: 0004 target: 238 step: 54 (topology_add_dev) */
5) | /* cpuhp_exit: cpu: 0004 state: 238 step: 54 ret: -22 */
5) | /* cpuhp_enter: cpu: 0005 target: 238 step: 193 (cacheinfo_cpu_online) */
5) | /* cpuhp_exit: cpu: 0005 state: 238 step: 193 ret: -17 */
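The negative ret values in the trace are kernel error codes. As a reading aid,
the sketch below pulls the failing cpuhp_exit events out of such trace lines
and names each code using Python's errno module. It assumes the script runs on
Linux, so that the host's errno numbering matches the kernel's; the regular
expression and the function name decode_cpuhp_exits are illustrative.

```python
import errno
import re

# Matches the body of a cpuhp_exit trace event as printed in the excerpt.
EXIT_RE = re.compile(
    r"cpuhp_exit:\s+cpu:\s+(\d+)\s+state:\s+\d+\s+step:\s+(\d+)\s+ret:\s+(-?\d+)"
)


def decode_cpuhp_exits(lines):
    """Return (cpu, step, ret, errno_name) for each failing cpuhp_exit event."""
    failures = []
    for line in lines:
        m = EXIT_RE.search(line)
        if not m:
            continue  # cpuhp_enter events and unrelated lines are skipped
        cpu, step, ret = (int(g) for g in m.groups())
        if ret < 0:
            failures.append((cpu, step, ret, errno.errorcode.get(-ret, "unknown")))
    return failures


trace = [
    "4) | /* cpuhp_exit: cpu: 0004 state: 238 step: 199 ret: -19 */",
    "5) | /* cpuhp_exit: cpu: 0005 state: 238 step: 193 ret: -17 */",
]
for failure in decode_cpuhp_exits(trace):
    print(failure)
```

Read this way, the four return values in the excerpt are -2 (ENOENT),
-19 (ENODEV), -22 (EINVAL) and -17 (EEXIST); the last one matches the
-EEXIST reported by kobject_add_internal in the console log.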
These failures occur because CPU#4-15 are not enabled, even though they are in cpu_possible_mask and online.
The underlying issue is that acpi_get_phys_id() fails to get phys_id for the processor devices (CPU#4-15) because of
discrepancies in the ACPI table.
-> acpi_processor_get_info
-> acpi_get_phys_id
-> map_mat_entry
-> map_madt_entry
Processor Device _UIDs are sequential numbers starting from 0, while Processor UIDs in MADT/PPTT
are non-sequential (0x0, 0x1, 0x2, 0x3, 0x100, 0x101, 0x102, 0x103, 0x200, 0x201, ...).
This results in the map_madt_entry() failure for CPU#4-15.
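The mismatch can be illustrated with a toy model. This is only a sketch: it
treats the lookup as a plain equality match between a Processor Device _UID and
the Processor UIDs in the table, which simplifies what map_madt_entry() does,
and it uses only the UID values quoted above (the source elides the rest). The
names MADT_UIDS and unmatched_device_uids are hypothetical.

```python
# Processor UIDs quoted above from MADT/PPTT; the remaining entries are
# elided in the report and deliberately not filled in here.
MADT_UIDS = [0x0, 0x1, 0x2, 0x3, 0x100, 0x101, 0x102, 0x103, 0x200, 0x201]


def unmatched_device_uids(device_uids, madt_uids):
    """Return the device _UIDs that have no equal Processor UID in the table."""
    table = set(madt_uids)
    return [uid for uid in device_uids if uid not in table]


# Processor Device _UIDs are sequential from 0; a1.metal has 16 CPUs.
device_uids = range(16)
print(unmatched_device_uids(device_uids, MADT_UIDS))
```

Under these assumptions only _UIDs 0 to 3 find a match, which lines up with
CPU#4-15 being the ones that fail.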
[Affected Versions]
* All Oracular kernels are affected at the moment.
* All Noble kernels are not affected at the moment.
This is because only Oracular sets CONFIG_ACPI_HOTPLUG_CPU=y, as a result of two upstream commits:
9d0873892f4d ("arm64: Kconfig: Enable hotplug CPU on arm64 if ACPI_PROCESSOR is enabled.")
46800e38ef0e ("arm64: Kconfig: Fix dependencies to enable ACPI_HOTPLUG_CPU")
which were originally included in its master kernel.
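Whether a given kernel has the option enabled can be checked from its config
file. The sketch below parses kernel config text in the usual format
("CONFIG_X=y" for a set option, "# CONFIG_X is not set" for an unset one). The
function name config_value is illustrative, and reading the text from
/boot/config-<release> is an assumption about where the config is installed.

```python
def config_value(config_text, option):
    """Return the option's value ("y", "m", ...), or "n" if unset or absent."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {option} is not set":
            return "n"
        if line.startswith(f"{option}="):
            return line.split("=", 1)[1]
    return "n"  # options missing from the file are treated as disabled


# Hypothetical config fragment, not copied from a real kernel package.
sample = "CONFIG_ACPI_PROCESSOR=y\nCONFIG_ACPI_HOTPLUG_CPU=y\n"
print(config_value(sample, "CONFIG_ACPI_HOTPLUG_CPU"))
```

A result of "y" would be expected on the affected Oracular kernels and "n" on
Noble, going by the description above.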
[Solution]
There are some options:
(a). override the ACPI table (while waiting for a firmware update)
(b). apply a workaround patch for o:aws
(c). set CONFIG_ACPI_HOTPLUG_CPU=n in some way
[Experiment]
Regarding (b), I put together a workaround patch (a dirty hack) and confirmed that acpi_processor_get_info()
now succeeds for all of CPU#4-15 and that the warning messages disappear. See the attachment.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/2088047/+subscriptions