
yahoo-eng-team team mailing list archive

[Bug 1839515] Re: Weird functional test failures hitting neutron API in unrelated resize flows since 8/5

 

Reviewed:  https://review.opendev.org/675553
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f875c9d12fa03e1eb0d6ac0f3dd95f502ae1e6a1
Submitter: Zuul
Branch:    master

commit f875c9d12fa03e1eb0d6ac0f3dd95f502ae1e6a1
Author: Balazs Gibizer <balazs.gibizer@xxxxxxxx>
Date:   Fri Aug 9 09:53:45 2019 +0200

    Prevent init_host test from interfering with other tests
    
    The test_migrate_disk_and_power_off_crash_finish_revert_migration test
    needs to simulate a compute host crash at a certain point. It stops the
    execution at that point by injecting a sleep and then simulating a
    compute restart. However, the sleep is only 30 seconds, which allows the
    stopped function to return while other functional tests are running in
    the same test worker process, making those tests fail in a weird way.
    
    One simple solution is to make the sleep long enough that it never
    returns before the whole functional test execution finishes. This patch
    proposes a million seconds, which is more than 277 hours, similar to how
    the other test in this test package works. This solution is hacky but
    simple. A better solution would be to further enhance the capabilities
    of the functional test env to support nova-compute service crash / kill
    + restart.
    
    Change-Id: Ib0d142806804e9113dd61d3a7ec15a98232775c8
    Closes-Bug: #1839515
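
The interference described in the commit message can be reproduced in miniature. The following is only a sketch, not nova code: it uses plain threads as a stand-in for the greenthreads the functional tests run on, a module-level list as a stand-in for the shared fake notifier, and sub-second sleeps in place of the real 30 seconds. Every name in it (NOTIFICATIONS, stubbed_driver_call, run_crash_test, run_unrelated_test, demo) is hypothetical.

```python
import threading
import time

# Stand-in for process-wide state shared by every test in one worker.
NOTIFICATIONS = []


def stubbed_driver_call(sleep_seconds):
    # The injected stub: block for a while, then let the interrupted
    # operation carry on and emit a notification.
    time.sleep(sleep_seconds)
    NOTIFICATIONS.append("notification from the abandoned operation")


def run_crash_test(sleep_seconds):
    # The "crash" test starts the operation and returns without waiting
    # for it, so the sleeping worker outlives the test.
    worker = threading.Thread(
        target=stubbed_driver_call, args=(sleep_seconds,), daemon=True)
    worker.start()


def run_unrelated_test(duration):
    # A later test in the same process: reset the shared list, emit one
    # notification, do some work, then count what it sees.
    NOTIFICATIONS.clear()
    NOTIFICATIONS.append("service.update")
    time.sleep(duration)
    return len(NOTIFICATIONS)


def demo(sleep_seconds, unrelated_duration=0.3):
    run_crash_test(sleep_seconds)
    return run_unrelated_test(unrelated_duration)


if __name__ == "__main__":
    # Short sleep: the abandoned worker wakes up inside the next test.
    print("short sleep, notifications seen:", demo(0.1))
    # Very long sleep: the worker never wakes before the process exits.
    print("long sleep, notifications seen:", demo(1_000_000))
```

With the short sleep the later test sees an extra notification, which matches the "1 != 2" mismatch in the bug report; with the very long sleep the daemon thread is simply discarded when the process exits.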


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1839515

Title:
  Weird functional test failures hitting neutron API in unrelated resize
  flows since 8/5

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Noticed here:

  https://logs.opendev.org/32/634832/43/check/nova-tox-functional-py36/d4f3be5/testr_results.html.gz

  With this test:

  nova.tests.functional.notification_sample_tests.test_service.TestServiceUpdateNotificationSampleLatest.test_service_disabled

  That's a simple test which disables a service and then asserts there
  is a service.update notification, but there is another notification
  happening as well:

  
  Traceback (most recent call last):
    File "/home/zuul/src/opendev.org/openstack/nova/nova/tests/functional/notification_sample_tests/test_service.py", line 122, in test_service_disabled
      'uuid': self.service_uuid})
    File "/home/zuul/src/opendev.org/openstack/nova/nova/tests/functional/notification_sample_tests/test_service.py", line 37, in _verify_notification
      base._verify_notification(sample_file_name, replacements, actual)
    File "/home/zuul/src/opendev.org/openstack/nova/nova/tests/functional/notification_sample_tests/notification_sample_base.py", line 148, in _verify_notification
      self.assertEqual(1, len(fake_notifier.VERSIONED_NOTIFICATIONS))
    File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/testtools/testcase.py", line 411, in assertEqual
      self.assertThat(observed, matcher, message)
    File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/testtools/testcase.py", line 498, in assertThat
      raise mismatch_error
  testtools.matchers._impl.MismatchError: 1 != 2

  And in the error output, we can see this weird traceback of a resize
  revert failure because the NeutronFixture isn't being used:

  2019-08-07 23:22:23,621 ERROR [nova.network.neutronv2.api] The [neutron] section of your nova configuration file must be configured for authentication with the networking service endpoint. See the networking service install guide for details: https://docs.openstack.org/neutron/latest/install/
  2019-08-07 23:22:23,634 ERROR [nova.compute.manager] Setting instance vm_state to ERROR
  Traceback (most recent call last):
    File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 8656, in _error_out_instance_on_exception
      yield
    File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/manager.py", line 4830, in _resize_instance
      migration_p)
    File "/home/zuul/src/opendev.org/openstack/nova/nova/network/neutronv2/api.py", line 2697, in migrate_instance_start
      client = _get_ksa_client(context, admin=True)
    File "/home/zuul/src/opendev.org/openstack/nova/nova/network/neutronv2/api.py", line 215, in _get_ksa_client
      auth_plugin = _get_auth_plugin(context, admin=admin)
    File "/home/zuul/src/opendev.org/openstack/nova/nova/network/neutronv2/api.py", line 151, in _get_auth_plugin
      _ADMIN_AUTH = _load_auth_plugin(CONF)
    File "/home/zuul/src/opendev.org/openstack/nova/nova/network/neutronv2/api.py", line 82, in _load_auth_plugin
      raise neutron_client_exc.Unauthorized(message=err_msg)
  neutronclient.common.exceptions.Unauthorized: Unknown auth type: None

  According to logstash this started showing up around 8/5:

  http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22ERROR%20%5Bnova.network.neutronv2.api%5D%20The%20%5Bneutron%5D%20section%20of%20your%20nova%20configuration%20file%20must%20be%20configured%20for%20authentication%20with%20the%20networking%20service%20endpoint.%5C%22%20AND%20tags%3A%5C%22console%5C%22&from=7d

  Which makes me think this change, which is restarting a compute
  service and sleeping in a stub:

  https://review.opendev.org/#/c/670393/

  Might be screwing up concurrently running tests.

  Looking at when that test runs and the one that fails:

  2019-08-07 23:21:54.157918 | ubuntu-bionic | {4} nova.tests.functional.compute.test_init_host.ComputeManagerInitHostTestCase.test_migrate_disk_and_power_off_crash_finish_revert_migration [4.063814s] ... ok

  2019-08-07 23:25:00.073443 | ubuntu-bionic | {4} nova.tests.functional.notification_sample_tests.test_service.TestServiceUpdateNotificationSampleLatest.test_service_disabled [160.155643s] ... FAILED

  We can see they are on the same worker process and run at about the
  same time.
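
  The timing can be checked with a little arithmetic. This is a rough sketch under two assumptions that the logs do not confirm: that the console timestamp on each result line is roughly when the test finished, and that the console clock and the test log clock agree. The 30 second figure is the stub sleep mentioned in the fix's commit message. The helper name window is made up for illustration.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"


def window(end_stamp, duration_seconds):
    # Derive (start, end) from a result line, treating its timestamp
    # as the approximate finish time of the test.
    end = datetime.strptime(end_stamp, FMT)
    return end - timedelta(seconds=duration_seconds), end


# Values copied from the two result lines for worker {4}.
crash_start, crash_end = window("2019-08-07 23:21:54.157918", 4.063814)
failed_start, failed_end = window("2019-08-07 23:25:00.073443", 160.155643)

# Timestamp of the neutron auth error in the failing test's output.
error_at = datetime.strptime("2019-08-07 23:22:23.621", FMT)

# The stub started sleeping at some point while the crash test ran,
# so it would wake somewhere in this window.
stub_sleep = timedelta(seconds=30)
earliest_wake = crash_start + stub_sleep
latest_wake = crash_end + stub_sleep

print("failing test ran from", failed_start, "to", failed_end)
print("stub would wake between", earliest_wake, "and", latest_wake)
print("error inside wake window:", earliest_wake <= error_at <= latest_wake)
print("error inside failing test:", failed_start <= error_at <= failed_end)
```

  Under those assumptions the failing test started at about 23:22:19.9, and the neutron error at 23:22:23.621 falls both inside that test's run and inside the window in which the 30 second stub would have returned.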

  Furthermore, we can see that
  TestServiceUpdateNotificationSampleLatest.test_service_disabled
  eventually times out after 160 seconds and this is in the error
  output:

  2019-08-07 23:24:59,911 ERROR [nova.compute.api] An error occurred while updating the COMPUTE_STATUS_DISABLED trait on compute node resource providers managed by host host1. The trait will be synchronized automatically by the compute service when the update_available_resource periodic task runs.
  Traceback (most recent call last):
    File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/api.py", line 5034, in _update_compute_provider_status
      self.rpcapi.set_host_enabled(context, service.host, enabled)
    File "/home/zuul/src/opendev.org/openstack/nova/nova/compute/rpcapi.py", line 996, in set_host_enabled
      return cctxt.call(ctxt, 'set_host_enabled', enabled=enabled)
    File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 181, in call
      transport_options=self.transport_options)
    File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/oslo_messaging/transport.py", line 129, in _send
      transport_options=transport_options)
    File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/oslo_messaging/_drivers/impl_fake.py", line 224, in send
      transport_options)
    File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/oslo_messaging/_drivers/impl_fake.py", line 208, in _send
      reply, failure = reply_q.get(timeout=timeout)
    File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/eventlet/queue.py", line 322, in get
      return waiter.wait()
    File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/eventlet/queue.py", line 141, in wait
      return get_hub().switch()
    File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
      return self.greenlet.switch()
    File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 350, in run
      self.wait(sleep_time)
    File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/eventlet/hubs/poll.py", line 77, in wait
      time.sleep(seconds)
    File "/home/zuul/src/opendev.org/openstack/nova/.tox/functional-py36/lib/python3.6/site-packages/fixtures/_fixtures/timeout.py", line 52, in signal_handler
      raise TimeoutException()
  fixtures._fixtures.timeout.TimeoutException

  So test_migrate_disk_and_power_off_crash_finish_revert_migration is
  probably not cleaning up properly.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1839515/+subscriptions

