yahoo-eng-team team mailing list archive

[Bug 1609738] Re: l3-ha: a router can be stuck in the ALLOCATING state

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: OpenStack Infra <1609738@xxxxxxxxxxxxxxxxxx>
Date: Tue, 10 Jan 2017 18:12:04 -0000
Reply-to: Bug 1609738 <1609738@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

Reviewed:  https://review.openstack.org/357966
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3e4c0ae223de776732f70626b387fba4de2e9c3f
Submitter: Jenkins
Branch:    master

commit 3e4c0ae223de776732f70626b387fba4de2e9c3f
Author: John Schwarz <jschwarz@xxxxxxxxxx>
Date:   Fri Aug 19 15:23:36 2016 +0100

    Revert "Add ALLOCATING state to routers"
    
    This reverts commit 9c3c19f07ce52e139d431aec54341c38a183f0b7.
    
    Following the merge of Ie98d5e3760cdb17450aea546f4b61f5ba14baf1c, the
    creation of a new router uses RouterL3AgentBinding and its new
    binding_index attribute to ensure correctness of the resources. As such,
    the ALLOCATING state (which was used to do just that) is no longer
    needed and can be removed.
    
    Closes-Bug: #1609738
    Change-Id: Ib04e08df13ef4e6b94bd588854a5795163e2a617


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1609738

Title:
  l3-ha: a router can be stuck in the ALLOCATING state

Status in neutron:
  Fix Released

Bug description:
  The scenario is a simple one: during the creation of a router, the
  server that handles the request crashes after creating the router
  in the ALLOCATING state [1] but before changing it to ACTIVE [2].
  In this case, the router is "stuck" in the ALLOCATING state, and the
  only admin action that changes the router back to ACTIVE (and allows
  it to be scheduled to agents) is:

  1. set admin-state-up to False
  2. set ha to False
  3. set ha to True
  4. set admin-state-up to True

  That is, a full migration of the HA router to legacy and back to HA is
  required. This triggers the code in [3] and fixes the issue.
  However, these 4 steps aren't intuitive at all: why should a user have
  to re-set the router as HA to resolve a weird router state?

  Skipping steps 2 and 3 (only re-setting admin-state-up) won't work
  because, as mentioned before, the scheduling happens in steps 2 and 3
  (i.e. when the router is set to ha=False it is unscheduled, and when
  it is set to ha=True it is scheduled as if it were a new router). In
  fact, this means that the problem is more severe: if the server crashes
  in the middle of setting up the resources of an HA router, all 4 steps
  must be performed to make the router valid again.
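
  The difference between the two recovery paths can be sketched with a
  toy model. This is not Neutron code: the ToyHARouter class and both
  helper functions are hypothetical and only mirror the behaviour
  described above (ha=False unschedules, ha=True schedules the router as
  if it were new, and admin-state-up alone does neither).

```python
from dataclasses import dataclass


@dataclass
class ToyHARouter:
    """Toy stand-in for an HA router; none of this is Neutron code."""
    status: str = "ALLOCATING"    # state left behind by the crashed server
    admin_state_up: bool = True
    ha: bool = True
    scheduled: bool = False

    def set_admin_state_up(self, value: bool) -> None:
        # As described: toggling admin-state-up alone neither schedules
        # the router nor changes its status.
        self.admin_state_up = value

    def set_ha(self, value: bool) -> None:
        self.ha = value
        if value:
            # Migrating back to HA schedules the router as if it were new.
            self.scheduled = True
            self.status = "ACTIVE"
        else:
            # Migrating to legacy unschedules the router.
            self.scheduled = False


def toggle_admin_state_only(router: ToyHARouter) -> ToyHARouter:
    """Steps 1 and 4 only (skipping the HA migration)."""
    router.set_admin_state_up(False)
    router.set_admin_state_up(True)
    return router


def full_migration(router: ToyHARouter) -> ToyHARouter:
    """All four steps from the list above."""
    router.set_admin_state_up(False)   # step 1
    router.set_ha(False)               # step 2
    router.set_ha(True)                # step 3
    router.set_admin_state_up(True)    # step 4
    return router


stuck = toggle_admin_state_only(ToyHARouter())
fixed = full_migration(ToyHARouter())
print(stuck.status, stuck.scheduled)   # ALLOCATING False
print(fixed.status, fixed.scheduled)   # ACTIVE True
```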

  The proposed solution is to add a new state, such that if
  admin-state-up is changed to False then the router's status is changed
  to "DOWN" (as opposed to the current "ACTIVE", which doesn't make much
  sense since admin-state-up is False). This will help mitigate the
  "stuck ALLOCATING status" portion of the problem.

  In addition to changing the status, we will need to change the logic
  such that a router is unscheduled on admin-state-up=False and
  scheduled on admin-state-up=True. This will let us skip steps 2 and 3
  and go straight for re-setting the admin-state-up attribute of a
  router, which is more intuitive.
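
  The proposed behaviour can be sketched with the same kind of toy
  model. Again, this is a hypothetical illustration and not Neutron
  code: ProposedToyRouter is an invented name, and the return to
  "ACTIVE" once admin-state-up is set back to True is an assumption, as
  the proposal only names the "DOWN" status explicitly.

```python
from dataclasses import dataclass


@dataclass
class ProposedToyRouter:
    """Toy model of the proposed semantics; not Neutron code."""
    status: str = "ALLOCATING"    # state left behind by the crashed server
    admin_state_up: bool = True
    scheduled: bool = False

    def set_admin_state_up(self, value: bool) -> None:
        self.admin_state_up = value
        if value:
            # Proposed: admin-state-up=True schedules the router.
            self.scheduled = True
            # Assumption: a scheduled router then reports ACTIVE.
            self.status = "ACTIVE"
        else:
            # Proposed: admin-state-up=False unschedules the router and
            # moves its status to DOWN.
            self.scheduled = False
            self.status = "DOWN"


router = ProposedToyRouter()
router.set_admin_state_up(False)
status_while_down = router.status
router.set_admin_state_up(True)
print(status_while_down)                  # DOWN
print(router.status, router.scheduled)    # ACTIVE True
```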

  [1]: https://github.com/openstack/neutron/blob/ff5b38071e7e134baa0dc7a52280f9bcbc06efaf/neutron/db/l3_hamode_db.py#L469
  [2]: https://github.com/openstack/neutron/blob/ff5b38071e7e134baa0dc7a52280f9bcbc06efaf/neutron/db/l3_hamode_db.py#L485
  [3]: https://github.com/openstack/neutron/blob/ff5b38071e7e134baa0dc7a52280f9bcbc06efaf/neutron/db/l3_hamode_db.py#L570

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1609738/+subscriptions

References

[Bug 1609738] [NEW] l3-ha: a router can be stuck in the ALLOCATING state
From: John Schwarz, 2016-08-04