
yahoo-eng-team team mailing list archive

[Bug 1567668] [NEW] Functional job sometimes hits global 2 hour limit and fails

 

Public bug reported:

Here's an example:
http://logs.openstack.org/13/302913/1/check/gate-neutron-dsvm-functional/91dd537/console.html

Logstash query:
build_name:"gate-neutron-dsvm-functional" AND build_status:"FAILURE" AND message:"Killed                  timeout -s 9"

45 hits in the last 7 days.

Ihar and I checked the timing, and the failures started when we merged:
https://review.openstack.org/#/c/298056/

There are a few problems here:
1) A test appears to be freezing up. We have a per-test timeout, defined by OS_TEST_TIMEOUT in tox.ini and enforced via a fixtures.Timeout fixture set up in the oslotest base class, but that timeout doesn't always seem to fire (see the sketch below this list).
2) When the global 2-hour job timeout is hit, the job doesn't perform post-test tasks such as copying over log files, which makes these problems a lot harder to troubleshoot.
3) And of course, there is likely some sort of issue with https://review.openstack.org/#/c/298056/.
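
To expand on problem 1: below is a minimal sketch, assuming the usual oslotest-style wiring (not the actual oslotest code), of how OS_TEST_TIMEOUT typically drives a fixtures.Timeout, and why a SIGALRM-based timeout can fail to interrupt a hung test:

    # Hedged sketch of the usual OS_TEST_TIMEOUT wiring; the real oslotest
    # base class may differ in details.
    import os

    import fixtures
    import testtools


    class BaseTestCase(testtools.TestCase):
        def setUp(self):
            super(BaseTestCase, self).setUp()
            try:
                timeout = int(os.environ.get('OS_TEST_TIMEOUT', 0))
            except ValueError:
                timeout = 0
            if timeout > 0:
                # gentle=True raises TimeoutException from a SIGALRM handler.
                # The interpreter only runs that handler between bytecodes, so
                # a test blocked inside a C extension or an uninterruptible
                # syscall may never see it -- one plausible reason the
                # per-test timeout "doesn't always work".
                self.useFixture(fixtures.Timeout(timeout, gentle=True))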

We could fix this with a revert, but that would increase the failure rate of
fullstack. Since I've been unable to reproduce the issue locally, I'd like to
hold off on a revert and try to get more information by tackling some
combination of problems 1 and 2, and then adding more logging to figure it
out.
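
On the "more logging" side, one option (purely a sketch of a possible approach, not the change proposed here) is to dump every thread's stack when the per-test alarm fires, so the console log shows where a hung test is stuck even if the job-level timeout later kills the run:

    # Hedged sketch: print all thread tracebacks on SIGALRM (the signal
    # fixtures.Timeout relies on). faulthandler writes from a C-level signal
    # handler, so it can still report a stack even when the Python-level
    # timeout never gets a chance to run. Stdlib on Python 3; a backport
    # exists for Python 2.
    import signal
    import sys

    import faulthandler

    # chain=True keeps the previously registered SIGALRM handler (the
    # Timeout fixture's) running after the tracebacks are printed.
    faulthandler.register(signal.SIGALRM, file=sys.stderr,
                          all_threads=True, chain=True)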

** Affects: neutron
     Importance: High
         Status: New


** Tags: functional-tests gate-failure

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1567668

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1567668/+subscriptions

