[Bug 1453350] Re: race between neutron port create and nova boot

 

Reviewed:  https://review.openstack.org/181674
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b672c26cb42ad3d9a17ed049b506b5622601e891
Submitter: Jenkins
Branch:    master

commit b672c26cb42ad3d9a17ed049b506b5622601e891
Author: Kevin Benton <kevin@xxxxxxxxxx>
Date:   Fri Apr 15 06:05:56 2016 -0700

    Add provisioning blocks to status ACTIVE transition
    
    Sometimes an object requires multiple disjoint actors to complete
    a set of tasks before the status of the object should be transitioned
    to ACTIVE. The main example of this is when a port is being created.
    The L2 agent has to do its business to wire up the VIF, but at the same
    time the DHCP agent has to set up the DHCP reservation. This led to
    Nova booting the VM when the L2 agent was done even though the DHCP
    agent may have been nowhere near ready.
    
    This patch introduces a provisioning blocks mechanism that tracks the
    entities whose work must complete before an object's transition to
    ACTIVE can happen. See the devref in the dependent patch for a
    high-level view of how this works.
    
    The ML2 code is updated to use this new mechanism to prevent updating
    the port status to ACTIVE without both the DHCP agent and L2 agent
    reporting that the port is ready.
    
    The DHCP RPC API required a version bump to allow the port ready
    notification.
    
    This also adds a devref doc for the provisioning_blocks
    module with a high-level overview of how it works in addition
    to a detailed description of how it is used specifically with
    ML2, the L2 agents, and the DHCP agents.
    
    Closes-Bug: #1453350
    Change-Id: Id85ff6de1a14a550ab50baf4f79d3130af3680c8
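
For illustration, here is a minimal, self-contained sketch of the
provisioning-blocks idea. The class and method names below are
hypothetical and deliberately simplified; this is not the actual
neutron.db.provisioning_blocks API. Each interested entity registers a
block on the object, and the transition to ACTIVE fires only once every
entity has reported completion:

    # Toy model of the provisioning-blocks idea (hypothetical names,
    # not the real neutron.db.provisioning_blocks module).
    from collections import defaultdict

    class ProvisioningBlocks(object):
        """Track which entities still block an object's move to ACTIVE."""

        def __init__(self):
            self._blocks = defaultdict(set)  # object_id -> pending entities

        def add_block(self, object_id, entity):
            # Registered before work starts, e.g. 'L2' and 'DHCP' for a
            # port that is being created.
            self._blocks[object_id].add(entity)

        def provisioning_complete(self, object_id, entity):
            # Called by each entity when its part is done; returns True
            # only when no entities remain, i.e. the object may go ACTIVE.
            self._blocks[object_id].discard(entity)
            if not self._blocks[object_id]:
                self._blocks.pop(object_id, None)
                return True
            return False

    blocks = ProvisioningBlocks()
    blocks.add_block('port-1', 'L2')
    blocks.add_block('port-1', 'DHCP')
    assert not blocks.provisioning_complete('port-1', 'L2')  # DHCP pending
    assert blocks.provisioning_complete('port-1', 'DHCP')    # go ACTIVE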


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1453350

Title:
  race between neutron port create and nova boot

Status in neutron:
  Fix Released

Bug description:
  
  While load testing with tempest scenario tests, I am seeing what I think is a race condition between Neutron DHCP standup and Nova boot.  I believe the scenario I am seeing is a more general case of https://bugs.launchpad.net/neutron/+bug/1334447.

  Test environment: 5 compute nodes, 1 controller node running all API
  and Neutron services.  Ubuntu Juno, hand-patched with the fixes for
  bugs 1382064 and 1385257 plus my workaround from bug 1451492.
  Standard Neutron setup otherwise.

  If I run the tempest scenario test test_server_basic_ops 30 times in
  parallel, things consistently work fine. If I increase to 60 in
  parallel, I get lots of failures (see below). Upon investigation, it
  looks to me like Neutron's standup of the netns and its dnsmasq
  process is too slow and loses the race with the Nova boot, so the VM
  comes up without a (DHCP-provided) IP address, causing SSH to time
  out and fail.
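
  To make the suspected failure mode concrete, here is a toy simulation
  (hypothetical names; this is not tempest or Neutron code) of the
  pre-fix behavior: the port is reported ACTIVE as soon as the L2
  wiring finishes, so the guest can boot and ask for a lease before
  dnsmasq is up:

    # Toy race: 'boot' proceeds on L2 completion while DHCP setup lags.
    import threading
    import time

    port_active = threading.Event()

    def l2_agent():
        time.sleep(0.1)    # wire up the VIF (fast)
        port_active.set()  # pre-fix: port goes ACTIVE on L2 alone

    def dhcp_agent():
        time.sleep(0.5)    # stand up netns + dnsmasq (slow under load)

    threading.Thread(target=l2_agent).start()
    dhcp = threading.Thread(target=dhcp_agent)
    dhcp.start()

    port_active.wait()  # Nova boots the VM here
    print('DHCP ready at boot:', not dhcp.is_alive())  # False: no lease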

  
  Traceback (most recent call last):
    File "/home/aqua/tempest/tempest/test.py", line 125, in wrapper
      return f(self, *func_args, **func_kwargs)
    File "/home/aqua/tempest/tempest/scenario/test_server_basic_ops_38.py", line 105, in test_server_basicops
      self.verify_ssh()
    File "/home/aqua/tempest/tempest/scenario/test_server_basic_ops_38.py", line 95, in verify_ssh
      private_key=self.keypair['private_key'])
    File "/home/aqua/tempest/tempest/scenario/manager.py", line 310, in get_remote_client
      linux_client.validate_authentication()
    File "/home/aqua/tempest/tempest/common/utils/linux/remote_client.py", line 55, in validate_authentication
      self.ssh_client.test_connection_auth()
    File "/home/aqua/tempest/tempest/common/ssh.py", line 150, in test_connection_auth
      connection = self._get_ssh_connection()
    File "/home/aqua/tempest/tempest/common/ssh.py", line 87, in _get_ssh_connection
      password=self.password)
  tempest.exceptions.SSHTimeout: Connection to the 172.17.205.21 via SSH timed out.
  User: cirros, Password: None

  
  Ran 60 tests in 742.931s

  FAILED (failures=47)

  
  To reproduce test environment:
  1) checkout tempest and remove all tempest scenario tests except test_server_basic_ops
  2) run this command to make 59 copies of the test:
     for i in {1..59}; do
       cp -p test_server_basic_ops.py test_server_basic_ops_$i.py
       sed --in-place \
         -e "s/class TestServerBasicOps(manager.ScenarioTest):/class TestServerBasicOps$i(manager.ScenarioTest):/" \
         -e "s/        super(TestServerBasicOps, self).setUp()/        super(TestServerBasicOps$i, self).setUp()/" \
         -e "s/    @test.idempotent_id('7fff3fb3-91d8-4fd0-bd7d-0204f1f180ba')/    @test.idempotent_id(\'$(uuidgen)\')/" \
         test_server_basic_ops_$i.py
     done
  3) run 30 tests and observe successful run: OS_TEST_TIMEOUT=1200 ./run_tempest.sh tempest.scenario -- --concurrency=30
  4) run 60 tests and observe failures: OS_TEST_TIMEOUT=1200 ./run_tempest.sh tempest.scenario -- --concurrency=60

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1453350/+subscriptions

