yahoo-eng-team team mailing list archive
Message #60363
[Bug 1609738] Re: l3-ha: a router can be stuck in the ALLOCATING state
Reviewed: https://review.openstack.org/357966
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3e4c0ae223de776732f70626b387fba4de2e9c3f
Submitter: Jenkins
Branch: master
commit 3e4c0ae223de776732f70626b387fba4de2e9c3f
Author: John Schwarz <jschwarz@xxxxxxxxxx>
Date: Fri Aug 19 15:23:36 2016 +0100
Revert "Add ALLOCATING state to routers"
This reverts commit 9c3c19f07ce52e139d431aec54341c38a183f0b7.
Following the merge of Ie98d5e3760cdb17450aea546f4b61f5ba14baf1c, the
creation of a new router uses RouterL3AgentBinding and its new
binding_index attribute to ensure correctness of the resources. As such,
the ALLOCATING state (which was used to do just that) is no longer
needed and can be removed.
Closes-Bug: #1609738
Change-Id: Ib04e08df13ef4e6b94bd588854a5795163e2a617
** Changed in: neutron
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1609738
Title:
l3-ha: a router can be stuck in the ALLOCATING state
Status in neutron:
Fix Released
Bug description:
The scenario is a simple one: during the creation of a router, the
server that deals with the request crashes after creating the router
with the ALLOCATING state [1] but before it's changed to ACTIVE [2].
In this case, the router will be "stuck" in the ALLOCATING state, and
the only admin action that changes the router back to ACTIVE (and
allows it to be scheduled to agents) is:
1. set admin-state-up to False
2. set ha to False
3. set ha to True
4. set admin-state-up to True
That is, a full migration of the HA router to legacy and back to HA is
required. This will trigger the code in [3] and will fix this issue.
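The four steps above can be scripted. The following is only a sketch:
it builds the command lines without running them, it assumes the
unified `openstack` CLI and its `router set` flags (`--disable`,
`--no-ha`, `--ha`, `--enable`), and the helper name is hypothetical.

```python
def ha_reset_commands(router):
    """Return the four CLI invocations, in order, that migrate an HA router
    to legacy and back (hypothetical helper; nothing is executed here)."""
    steps = [
        "--disable",  # 1. set admin-state-up to False
        "--no-ha",    # 2. set ha to False
        "--ha",       # 3. set ha to True
        "--enable",   # 4. set admin-state-up to True
    ]
    return [["openstack", "router", "set", flag, router] for flag in steps]


if __name__ == "__main__":
    # "router1" is a placeholder router name.
    for cmd in ha_reset_commands("router1"):
        print(" ".join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` so that
a failing step stops the sequence instead of leaving the router half
migrated.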
However, these 4 steps aren't intuitive at all: why should a user
reset the router as HA to solve a weird state of the router?
Skipping steps 2 and 3 (only re-setting the admin-state-up) won't work
because, as mentioned before, the scheduling happens on steps 2 and 3
(i.e. when the router is set to ha=False it's unscheduled, and when
it's set to ha=True it is scheduled as if it's a new router). In fact,
this means that the problem is more severe: if the server crashed in
the middle of setting up the resources of an HA router, all 4 steps
must be done to ensure the router is made valid again.
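The crash window from the scenario can be illustrated with a toy model.
This is not neutron code: the dict-based "database", the `scheduled`
flag and the `ServerCrash` exception are assumptions made for the
example, and only the ALLOCATING and ACTIVE status names come from the
report.

```python
class ServerCrash(Exception):
    """Stands in for the API server dying in the middle of a request."""


def create_ha_router(db, router_id, crash_before_active=False):
    # Step [1]: the router row is stored with status ALLOCATING.
    db[router_id] = {"status": "ALLOCATING", "ha": True, "scheduled": False}
    if crash_before_active:
        # The server dies before reaching step [2]; the row stays as it is.
        raise ServerCrash(router_id)
    # Resources are set up and the router is scheduled, then step [2]:
    db[router_id]["scheduled"] = True
    db[router_id]["status"] = "ACTIVE"


db = {}
create_ha_router(db, "ok")
try:
    create_ha_router(db, "stuck", crash_before_active=True)
except ServerCrash:
    pass  # nothing ever revisits the half-created router

print(db["ok"]["status"], db["stuck"]["status"])  # ACTIVE ALLOCATING
```

Nothing in this model moves the second router out of ALLOCATING on its
own, which is the "stuck" condition the report describes.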
The proposed solution is to add a new state, such that if
admin-state-up is changed to False then the router's status will be
changed to "DOWN" (as opposed to the current "ACTIVE", which doesn't
make much sense since admin-state-up is False). This will help mitigate
the "stuck ALLOCATING status" portion of the problem.
In addition to changing the status, we will need to change the logic
such that a router is unscheduled on admin-state-up=False and
scheduled on admin-state-up=True. This will let us skip steps 2 and 3
and go straight for re-setting the admin-state-up attribute of a
router, which is more intuitive.
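A rough sketch of the proposed behaviour, again as a toy model rather
than neutron code. The dict fields and the function name are
hypothetical, and setting the status to ACTIVE once the router is
scheduled again is an assumption about how the proposal would play out.

```python
def set_admin_state_up(router, up):
    """Proposed behaviour (sketch): admin-state-up drives both the
    status and the scheduling of the router."""
    router["admin_state_up"] = up
    if up:
        router["scheduled"] = True   # scheduled on admin-state-up=True
        router["status"] = "ACTIVE"  # assumption: ACTIVE once scheduled
    else:
        router["scheduled"] = False  # unscheduled on admin-state-up=False
        router["status"] = "DOWN"    # the proposed new status
    return router


# A router left behind by a crashed create request.
stuck = {"status": "ALLOCATING", "admin_state_up": True, "scheduled": False}
set_admin_state_up(stuck, False)
set_admin_state_up(stuck, True)
print(stuck["status"], stuck["scheduled"])  # ACTIVE True
```

Under this model, toggling admin-state-up alone recovers the router,
which is what would make steps 2 and 3 unnecessary.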
[1]: https://github.com/openstack/neutron/blob/ff5b38071e7e134baa0dc7a52280f9bcbc06efaf/neutron/db/l3_hamode_db.py#L469
[2]: https://github.com/openstack/neutron/blob/ff5b38071e7e134baa0dc7a52280f9bcbc06efaf/neutron/db/l3_hamode_db.py#L485
[3]: https://github.com/openstack/neutron/blob/ff5b38071e7e134baa0dc7a52280f9bcbc06efaf/neutron/db/l3_hamode_db.py#L570
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1609738/+subscriptions