yahoo-eng-team team mailing list archive
Message #51146
[Bug 1453350] Re: race between neutron port create and nova boot
Reviewed: https://review.openstack.org/181674
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b672c26cb42ad3d9a17ed049b506b5622601e891
Submitter: Jenkins
Branch: master
commit b672c26cb42ad3d9a17ed049b506b5622601e891
Author: Kevin Benton <kevin@xxxxxxxxxx>
Date: Fri Apr 15 06:05:56 2016 -0700
Add provisioning blocks to status ACTIVE transition
Sometimes an object requires multiple disjoint actors to complete
a set of tasks before the status of the object should be transitioned
to ACTIVE. The main example of this is when a port is being created.
The L2 agent has to do its business to wire up the VIF, but at the same
time the DHCP agent has to set up the DHCP reservation. This led to
Nova booting the VM when the L2 agent was done even though the DHCP
agent may have been nowhere near ready.
This patch introduces a provisioning blocks mechanism that tracks the
entities that must report in before a transition to ACTIVE can
happen. See the devref in the dependent patch for a high-level
view of how this works.
The ML2 code is updated to use this new mechanism to prevent updating
the port status to ACTIVE without both the DHCP agent and L2 agent
reporting that the port is ready.
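
The idea described above can be illustrated with a minimal in-memory sketch. This is a toy model written for this note, not Neutron's provisioning_blocks module: the class name ProvisioningTracker, its methods, and the entity names "L2" and "DHCP" are all hypothetical, and the real implementation may differ (for example, in how the blocks are stored).

```python
# Toy in-memory model of provisioning blocks; NOT Neutron's actual code.
# All names here (ProvisioningTracker, add_block, complete) are hypothetical.
from collections import defaultdict

L2_AGENT = "L2"      # illustrative entity names
DHCP_AGENT = "DHCP"


class ProvisioningTracker:
    """Tracks which entities still block an object's transition to ACTIVE."""

    def __init__(self):
        self._blocks = defaultdict(set)   # object_id -> entities still pending
        self.status = {}                  # object_id -> "DOWN" or "ACTIVE"

    def add_block(self, object_id, entity):
        """Register an entity that must report before the object goes ACTIVE."""
        self._blocks[object_id].add(entity)
        self.status[object_id] = "DOWN"

    def complete(self, object_id, entity):
        """Mark one entity as done; go ACTIVE only when no blocks remain."""
        self._blocks[object_id].discard(entity)
        if not self._blocks[object_id]:
            self.status[object_id] = "ACTIVE"
        return self.status[object_id]


tracker = ProvisioningTracker()
tracker.add_block("port-1", L2_AGENT)
tracker.add_block("port-1", DHCP_AGENT)
after_l2 = tracker.complete("port-1", L2_AGENT)      # DHCP is still pending
after_dhcp = tracker.complete("port-1", DHCP_AGENT)  # last block cleared
print(after_l2, after_dhcp)
```

In this sketch the port stays DOWN after the L2 agent reports and only becomes ACTIVE once the DHCP agent has reported too, which is the ordering the commit message describes.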
The DHCP RPC API required a version bump to allow the port ready
notification.
This also adds a devref doc for the provisioning_blocks
module with a high-level overview of how it works in addition
to a detailed description of how it is used specifically with
ML2, the L2 agents, and the DHCP agents.
Closes-Bug: #1453350
Change-Id: Id85ff6de1a14a550ab50baf4f79d3130af3680c8
** Changed in: neutron
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1453350
Title:
race between neutron port create and nova boot
Status in neutron:
Fix Released
Bug description:
I am doing load testing with tempest scenario tests and see what I think is a race condition between neutron dhcp standup and nova boot. I believe the scenario I am seeing to be a more general case of https://bugs.launchpad.net/neutron/+bug/1334447.
Test environment: 5 compute nodes, 1 controller node running all API
and neutron services. Ubuntu Juno, hand-patched with 1382064 and 1385257
and my workaround in 1451492. Standard neutron setup otherwise.
If I run the tempest scenario test test_server_basic_ops 30 times in
parallel, things consistently work fine. If I increase to 60 in
parallel, I get lots of failures (see below). Upon investigation, it
looks to me like neutron's standup of the netns and its dnsmasq process
is too slow and loses the race with nova boot, so the VM comes up without
a (DHCP-provided) IP address, causing ssh to time out and fail.
Traceback (most recent call last):
  File "/home/aqua/tempest/tempest/test.py", line 125, in wrapper
    return f(self, *func_args, **func_kwargs)
  File "/home/aqua/tempest/tempest/scenario/test_server_basic_ops_38.py", line 105, in test_server_basicops
    self.verify_ssh()
  File "/home/aqua/tempest/tempest/scenario/test_server_basic_ops_38.py", line 95, in verify_ssh
    private_key=self.keypair['private_key'])
  File "/home/aqua/tempest/tempest/scenario/manager.py", line 310, in get_remote_client
    linux_client.validate_authentication()
  File "/home/aqua/tempest/tempest/common/utils/linux/remote_client.py", line 55, in validate_authentication
    self.ssh_client.test_connection_auth()
  File "/home/aqua/tempest/tempest/common/ssh.py", line 150, in test_connection_auth
    connection = self._get_ssh_connection()
  File "/home/aqua/tempest/tempest/common/ssh.py", line 87, in _get_ssh_connection
    password=self.password)
tempest.exceptions.SSHTimeout: Connection to the 172.17.205.21 via SSH timed out.
User: cirros, Password: None
Ran 60 tests in 742.931s
FAILED (failures=47)
To reproduce test environment:
1) check out tempest and remove all tempest scenario tests except test_server_basic_ops
2) run this command to make 59 copies of the test:
   for i in {1..59}; do
     cp -p test_server_basic_ops.py test_server_basic_ops_$i.py
     sed --in-place \
       -e "s/class TestServerBasicOps(manager.ScenarioTest):/class TestServerBasicOps$i(manager.ScenarioTest):/" \
       -e "s/ super(TestServerBasicOps, self).setUp()/ super(TestServerBasicOps$i, self).setUp()/" \
       -e "s/ @test.idempotent_id('7fff3fb3-91d8-4fd0-bd7d-0204f1f180ba')/ @test.idempotent_id(\'$(uuidgen)\')/" \
       test_server_basic_ops_$i.py
   done
3) run 30 tests and observe successful run: OS_TEST_TIMEOUT=1200 ./run_tempest.sh tempest.scenario -- --concurrency=30
4) run 60 tests and observe failures: OS_TEST_TIMEOUT=1200 ./run_tempest.sh tempest.scenario -- --concurrency=60
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1453350/+subscriptions