canonical-ubuntu-qa team mailing list archive

[Bug 2076241] Re: ubuntu_ltp_* tests unable to finish properly with B-azure-fips

To: canonical-ubuntu-qa@xxxxxxxxxxxxxxxxxxx
From: Po-Hsu Lin <2076241@xxxxxxxxxxxxxxxxxx>
Date: Thu, 08 Aug 2024 16:34:15 -0000
Reply-to: Bug 2076241 <2076241@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx
** Description changed:

  In sru-s20240429 and sru-s20240610, the ubuntu_ltp_* tests were unable
  to finish properly with the B-azure-fips kernel, and eventually
  triggered the `sut-test` failure on them.
  
  Here is the result from sru-s20240610:
  * ubuntu_ltp
    - report cuts off at the fs:fs_fill test, failed on Standard_D4_v4 only.
  * ubuntu_ltp_controllers
    - report cuts off at the memcg_test_3 test, failed on Standard_B1ms
    - report cuts off at the memcg_stress test, failed on Standard_D4_v4, Standard_D4s_v3-gen2
  * ubuntu_ltp_cve
    - report cuts off at the cve-2016-8655 test, failed on Standard_B1ms, Standard_D4_v4
    - report cuts off at the cve-2018-18559 test, failed on Standard_D4s_v3-gen2
  * ubuntu_ltp_syscalls
    - report cuts off at the setsockopt06 test, failed on Standard_B1ms
    - report cuts off at the bind06 test, failed on Standard_D4_v4, Standard_D4s_v3-gen2
  
  The result from sru-s20240610 is quite similar, except that
  ubuntu_ltp_cve cuts off at the cve-2016-8655 test on
  Standard_D4s_v3-gen2 this time.
  
  Note that cve-2016-8655 is actually the setsockopt06 test, and
  cve-2018-18559 is the bind06 test.
  
  I have done some experiments on Standard_D4s_v3-gen2 with the kernel in sru-s20240610 (4.15.0-2088-azure-fips):
  * ubuntu_ltp_controllers:
    - If we skip the memcg_stress test, the suite is able to finish properly.
  * ubuntu_ltp_cve:
    - If we skip the cve-2016-8655 and cve-2018-18559 tests, the suite is able to finish properly.
+ * ubuntu_ltp_syscalls:
+   - If we skip the bind06 and writev03 tests, the suite is able to finish properly (setsockopt06 works fine in this case, not sure why).
+ 
+ Here is the code to skip a certain test:
+ diff --git a/ubuntu_ltp_syscalls/control b/ubuntu_ltp_syscalls/control
+ index 4f93c546..684a8ed2 100644
+ --- a/ubuntu_ltp_syscalls/control
+ +++ b/ubuntu_ltp_syscalls/control
+ @@ -24,6 +24,9 @@ if result == 'GOOD':
+                  # Special case for msgstress04 (lp:1943802 / lp:1943652)
+                  if testcase == 'msgstress04':
+                      timeout_threshold = 60*60
+ +                if testcase in ['bind06', 'writev03'] and platform.release() == '4.15.0-2088-azure-fips':
+ +                    print('skipping %s for testing purpose' % testcase)
+ +                    continue
+                  job.run_test_detail(NAME, test_name=testcase, tag=testcase, timeout=timeout_threshold)
+  else:
+      print("ERROR: test failed to build, skipping all the sub tests")
+ 
  
  During a manual test on Standard_D4_v4 with 4.15.0-2088-azure-fips, I noticed that an idle SSH session hangs after a certain period (I recorded one at about 7m21s). If the session is running something, like htop, it stays responsive.
  The setsockopt06 and bind06 tests can also pass without any immediate crash, so I'm not sure what causes the failure we see here.
  
  It's also worth noting that "running something" seems to be limited to
  commands that keep generating output. Commands like "dmesg -w" and
  "tail -f /var/log/syslog" hang too if there is no output to update.
  
  According to Magali, the last bionic fips openssh update is from
  January, so this might be something else in the kernel.
  
  == Original bug report ==
  On azure-fips platforms multiple tests in ubuntu_ltp, ubuntu_ltp_controllers, ubuntu_ltp_cve, and ubuntu_ltp_syscalls are causing the system to be unresponsive. When running locally the tests run to completion but the system hangs sometime after.

-- 
You received this bug notification because you are a member of Canonical
Platform QA Team, which is subscribed to ubuntu-kernel-tests.
https://bugs.launchpad.net/bugs/2076241

Title:
  ubuntu_ltp_* tests unable to finish properly with B-azure-fips

Status in ubuntu-kernel-tests:
  New

Bug description:
  In sru-s20240429 and sru-s20240610, the ubuntu_ltp_* tests were unable
  to finish properly with the B-azure-fips kernel, and eventually
  triggered the `sut-test` failure on them.

  Here is the result from sru-s20240610:
  * ubuntu_ltp
    - report cuts off at the fs:fs_fill test, failed on Standard_D4_v4 only.
  * ubuntu_ltp_controllers
    - report cuts off at the memcg_test_3 test, failed on Standard_B1ms
    - report cuts off at the memcg_stress test, failed on Standard_D4_v4, Standard_D4s_v3-gen2
  * ubuntu_ltp_cve
    - report cuts off at the cve-2016-8655 test, failed on Standard_B1ms, Standard_D4_v4
    - report cuts off at the cve-2018-18559 test, failed on Standard_D4s_v3-gen2
  * ubuntu_ltp_syscalls
    - report cuts off at the setsockopt06 test, failed on Standard_B1ms
    - report cuts off at the bind06 test, failed on Standard_D4_v4, Standard_D4s_v3-gen2

  The result from sru-s20240610 is quite similar, except that
  ubuntu_ltp_cve cuts off at the cve-2016-8655 test on
  Standard_D4s_v3-gen2 this time.

  Note that cve-2016-8655 is actually the setsockopt06 test, and
  cve-2018-18559 is the bind06 test.

  I have done some experiments on Standard_D4s_v3-gen2 with the kernel in sru-s20240610 (4.15.0-2088-azure-fips):
  * ubuntu_ltp_controllers:
    - If we skip the memcg_stress test, the suite is able to finish properly.
  * ubuntu_ltp_cve:
    - If we skip the cve-2016-8655 and cve-2018-18559 tests, the suite is able to finish properly.
  * ubuntu_ltp_syscalls:
    - If we skip the bind06 and writev03 tests, the suite is able to finish properly (setsockopt06 works fine in this case, not sure why).

  Here is the code to skip a certain test:
  diff --git a/ubuntu_ltp_syscalls/control b/ubuntu_ltp_syscalls/control
  index 4f93c546..684a8ed2 100644
  --- a/ubuntu_ltp_syscalls/control
  +++ b/ubuntu_ltp_syscalls/control
  @@ -24,6 +24,9 @@ if result == 'GOOD':
                   # Special case for msgstress04 (lp:1943802 / lp:1943652)
                   if testcase == 'msgstress04':
                       timeout_threshold = 60*60
  +                if testcase in ['bind06', 'writev03'] and platform.release() == '4.15.0-2088-azure-fips':
  +                    print('skipping %s for testing purpose' % testcase)
  +                    continue
                   job.run_test_detail(NAME, test_name=testcase, tag=testcase, timeout=timeout_threshold)
   else:
       print("ERROR: test failed to build, skipping all the sub tests")
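
  The per-suite skips from the experiments could also be kept in a table rather than as inline conditions. This is only a sketch and not what the control file does today: SKIP_TABLE and should_skip() are hypothetical names, and the entries simply restate the experiments listed in this report.

```python
import platform

# Kernel release used for the experiments in this report.
FIPS_RELEASE = '4.15.0-2088-azure-fips'

# Hypothetical table: (suite, kernel release) -> test cases to skip.
SKIP_TABLE = {
    ('ubuntu_ltp_controllers', FIPS_RELEASE): {'memcg_stress'},
    ('ubuntu_ltp_cve', FIPS_RELEASE): {'cve-2016-8655', 'cve-2018-18559'},
    ('ubuntu_ltp_syscalls', FIPS_RELEASE): {'bind06', 'writev03'},
}


def should_skip(suite, testcase, release=None):
    """Return True if testcase is listed as skipped for suite on release."""
    if release is None:
        # Default to the running kernel, as the inline check does.
        release = platform.release()
    return testcase in SKIP_TABLE.get((suite, release), set())
```

  In a control file loop, the check would then read something like "if should_skip(NAME, testcase): continue", assuming NAME holds the suite name.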

  
  During a manual test on Standard_D4_v4 with 4.15.0-2088-azure-fips, I noticed that an idle SSH session hangs after a certain period (I recorded one at about 7m21s). If the session is running something, like htop, it stays responsive.
  The setsockopt06 and bind06 tests can also pass without any immediate crash, so I'm not sure what causes the failure we see here.

  It's also worth noting that "running something" seems to be limited to
  commands that keep generating output. Commands like "dmesg -w"
  and "tail -f /var/log/syslog" hang too if there is no output to
  update.
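
  To time the hang without keeping the terminal busy, something like the following could be left running in the background on the instance. It is a rough sketch under assumptions: write_heartbeats() and survived_seconds() are hypothetical helpers, and the interval and file path are arbitrary. It writes timestamps to a file only, so the SSH session itself stays idle.

```python
import os
import time


def write_heartbeats(path, interval=5.0, count=None,
                     sleep=time.sleep, now=time.time):
    """Append one timestamp per interval to path, syncing each to disk.

    Nothing is printed to the terminal. With count=None the loop runs
    until the process is killed or the system stops responding.
    """
    written = 0
    with open(path, 'a') as f:
        while count is None or written < count:
            f.write('%.3f\n' % now())
            f.flush()
            os.fsync(f.fileno())  # make sure the stamp survives a hang
            written += 1
            if count is None or written < count:
                sleep(interval)
    return written


def survived_seconds(path):
    """Seconds between the first and last heartbeat recorded in path."""
    with open(path) as f:
        stamps = [float(line) for line in f if line.strip()]
    return stamps[-1] - stamps[0] if stamps else 0.0
```

  After regaining access, survived_seconds() on the heartbeat file gives roughly how long the instance kept writing, which may help tell a hung session apart from a hung system.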

  According to Magali, the last bionic fips openssh update is from
  January, so this might be something else in the kernel.

  == Original bug report ==
  On azure-fips platforms multiple tests in ubuntu_ltp, ubuntu_ltp_controllers, ubuntu_ltp_cve, and ubuntu_ltp_syscalls are causing the system to be unresponsive. When running locally the tests run to completion but the system hangs sometime after.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/2076241/+subscriptions
References

[Bug 2076241] [NEW] ubuntu_ltp_* tests completing but causes system to hang
From: Portia Stephens, 2024-08-07