yahoo-eng-team team mailing list archive

[Bug 2052787] [NEW] SSH timeouts due to problems with metadata server in ML2/OVN backend

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Slawek Kaplonski <2052787@xxxxxxxxxxxxxxxxxx>
Date: Fri, 09 Feb 2024 09:58:31 -0000
Reply-to: Bug 2052787 <2052787@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx
Public bug reported:

Several jobs have already shown random tempest scenario tests failing due to a timeout while SSHing to the guest VM.
The VM's console log clearly shows a problem reaching the metadata server:

2024-02-02 17:37:28.665832 | controller | forked to background, child pid 250
2024-02-02 17:37:28.665857 | controller | OK
2024-02-02 17:37:28.665883 | controller | checking http://169.254.169.254/2009-04-04/instance-id
2024-02-02 17:37:28.665908 | controller | failed 1/20: up 26.07. request failed
2024-02-02 17:37:28.665933 | controller | failed 2/20: up 28.37. request failed
2024-02-02 17:37:28.665958 | controller | failed 3/20: up 30.67. request failed
2024-02-02 17:37:28.665983 | controller | failed 4/20: up 32.96. request failed
2024-02-02 17:37:28.666008 | controller | failed 5/20: up 82.24. request failed
2024-02-02 17:37:28.666033 | controller | failed 6/20: up 131.56. request failed
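
To make the retry timing in the excerpt above easier to read, here is a minimal Python sketch that extracts the guest uptime from each "failed N/20" line and computes the gap between consecutive attempts. The helper names (`parse_retries`, `gaps`) are made up for illustration and are not part of neutron or tempest; the regular expression only assumes the line format shown in the excerpt.

```python
import re

# Matches the retry lines shown in the console log excerpt, e.g.
# "failed 5/20: up 82.24. request failed"
RETRY_RE = re.compile(r"failed (\d+)/(\d+): up (\d+(?:\.\d+)?)\. request failed")


def parse_retries(console_text):
    """Return (attempt_number, guest_uptime_seconds) for each failed request."""
    return [
        (int(m.group(1)), float(m.group(3)))
        for m in RETRY_RE.finditer(console_text)
    ]


def gaps(retries):
    """Return the uptime difference, in seconds, between consecutive attempts."""
    uptimes = [uptime for _, uptime in retries]
    return [round(later - earlier, 2) for earlier, later in zip(uptimes, uptimes[1:])]


# Message text copied from the console log excerpt in this report.
CONSOLE = """\
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 26.07. request failed
failed 2/20: up 28.37. request failed
failed 3/20: up 30.67. request failed
failed 4/20: up 32.96. request failed
failed 5/20: up 82.24. request failed
failed 6/20: up 131.56. request failed
"""

if __name__ == "__main__":
    retries = parse_retries(CONSOLE)
    for (attempt, uptime), gap in zip(retries[1:], gaps(retries)):
        print(f"attempt {attempt}: up {uptime:.2f}s, {gap:.2f}s after previous")
```

For the excerpt quoted here, the gap between attempts is about 2.3 seconds for the first four attempts and about 49 seconds afterwards.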


Looking at the neutron-ovn-metadata-agent logs and then at the journal log, it seems to me that those requests are never delivered to the haproxy spawned in the ovnmeta-xxx namespace, because there is no log entry with the log-tag configured in haproxy for that network.
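
The check described above (searching the journal for entries carrying the haproxy log-tag of the affected network) could be sketched roughly as follows. This is an illustration only: `lines_with_tag` is a hypothetical helper, and both the tag value and the sample journal lines are invented placeholders. The real tag has to be read from the haproxy configuration generated for the network in question.

```python
def lines_with_tag(journal_lines, log_tag):
    """Return the journal lines that contain the given haproxy log-tag."""
    return [line for line in journal_lines if log_tag in line]


# Invented sample input; not taken from the failing jobs.
SAMPLE_JOURNAL = [
    "controller example-tag-net-a[1234]: GET /2009-04-04/instance-id",
    "controller some-other-service[42]: unrelated message",
]

if __name__ == "__main__":
    # Placeholder tags; look up the real value in the haproxy config.
    for tag in ("example-tag-net-a", "example-tag-net-b"):
        hits = lines_with_tag(SAMPLE_JOURNAL, tag)
        print(f"{tag}: {len(hits)} matching journal line(s)")
```

An empty result for the affected network's tag corresponds to the observation in this report: no request was logged by the haproxy for that network.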

Examples of failures like that:
https://3c8c3cc132d3ca41c1a0-8df332a8f6cbb54ee498032ff97f9d17.ssl.cf1.rackcdn.com/882350/2/check/cinder-plugin-ceph-tempest-mn-aa/df2995a/job-output.txt
https://ac3deee033df2f80309a-9b1010a8ed0ed23e4a7e66dfa043a295.ssl.cf5.rackcdn.com/907418/2/check/tempest-slow-py3/6dff044/job-output.txt

** Affects: neutron
     Importance: Critical
     Assignee: Slawek Kaplonski (slaweq)
         Status: Confirmed


** Tags: gate-failure tempest

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2052787

Title:
  SSH timeouts due to problems with metadata server in ML2/OVN backend

Status in neutron:
  Confirmed

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2052787/+subscriptions