yahoo-eng-team team mailing list archive

[Bug 1360351] [NEW] FWaaS stuck in PENDING_CREATE when deploying with DVR

 

Public bug reported:

When a firewall is created in conjunction with a distributed router, the
firewall may fail to reach the ACTIVE state, as observed in [1]. The
reason is that, when the firewall create request comes in, the firewall
object's status is set to PENDING_CREATE (see [2]); the request is then
sent to the L3 agent, which works to make the state transition to ACTIVE
(see [3]). This state transition is predicated on the following
statements being true:

1) The tenant has firewalls
2) The tenant has routers
3) The routers' namespaces have been created on the L3 agent
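
The three conditions above can be read as a single predicate evaluated
on each L3 agent. The following is only a rough sketch of that reading,
not the code referenced in [3]: the function names, the parameters and
the "qrouter-" namespace naming are assumptions made for illustration,
as is the choice that one existing router namespace is enough.

```python
# Illustrative sketch only; none of these names are taken from Neutron.

def namespace_for(router_id):
    # Assumed naming scheme for a router's namespace.
    return "qrouter-" + router_id


def can_go_active(firewalls, router_ids, agent_namespaces):
    """Check conditions 1) to 3) as seen from a single L3 agent."""
    if not firewalls:       # 1) the tenant has firewalls
        return False
    if not router_ids:      # 2) the tenant has routers
        return False
    # 3) the routers' namespaces exist on this agent
    #    (assumption: a single existing namespace is enough)
    return any(namespace_for(r) in agent_namespaces for r in router_ids)


# Router whose namespace exists on the agent.
print(can_go_active(["fw1"], ["r1"], {"qrouter-r1"}))  # True
# Router whose namespace has not been created on this agent.
print(can_go_active(["fw1"], ["r1"], set()))           # False
```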

Now, in the DVR case, 3) may or may not be true, depending on the state
of the router or the cloud (see [4] for details). For instance, if the
router had an external gateway set, condition 3) would be true, and the
firewall state would transition to ACTIVE. This may lead the user to
believe that everything is correct when it actually is not. What makes
matters worse is that, in the DVR case, the firewall itself needs to be
distributed, which means that if we kept the same logic as outlined in
[2] and [3], the last L3 agent to update the state of the firewall would
overwrite any other update (last write wins), leading to potential
inconsistency.
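
The last-write-wins problem can be shown with a toy example. This is a
hypothetical illustration, not Neutron code: the dictionary, the agent
names and report_status() are made up, and the only point is that a
single shared status field keeps whatever the last agent reported.

```python
# Hypothetical illustration of last-write-wins; not Neutron code.
firewall = {"id": "fw1", "status": "PENDING_CREATE"}


def report_status(fw, agent, status):
    """Each agent blindly overwrites the single shared status field."""
    fw["status"] = status


# agent-1 fails to install the rules, agent-2 succeeds afterwards.
report_status(firewall, "agent-1", "ERROR")
report_status(firewall, "agent-2", "ACTIVE")

# The failure reported by agent-1 is lost.
print(firewall["status"])  # ACTIVE
```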

To start addressing this issue, it would be appropriate to tweak the
logic as follows:

a) When DVR is present, the firewall should be created directly in the CREATED state; the logic for the centralized case stays as is, with the firewall created in the PENDING_CREATE state.
b) When the L3 agents can install the right firewall rules, the server will need to collect the acknowledgments from all of them.
c) Only after all acknowledgments have been collected, and only if they are all positive, will the firewall state transition from CREATED to ACTIVE; otherwise it will transition to ERROR.
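
Steps a) to c) could be sketched on the server side roughly as follows.
This is a minimal sketch under stated assumptions, not a proposed patch:
initial_status(), AckTracker and the agent names are hypothetical, and
the sketch assumes that the server knows up front which agents must
answer and that it decides only once every one of them has reported.

```python
# Minimal sketch of steps a) to c); all names here are hypothetical.

def initial_status(distributed):
    """Step a): CREATED with DVR, PENDING_CREATE in the centralized case."""
    return "CREATED" if distributed else "PENDING_CREATE"


class AckTracker:
    """Collects one acknowledgment per expected L3 agent (steps b and c)."""

    def __init__(self, expected_agents):
        self.expected = set(expected_agents)
        self.acks = {}
        self.status = initial_status(distributed=True)

    def record(self, agent, ok):
        # Ignore reports from agents that were not expected to answer.
        if agent not in self.expected:
            return self.status
        self.acks[agent] = ok
        # Step c): decide only once every expected agent has reported.
        if set(self.acks) == self.expected:
            self.status = "ACTIVE" if all(self.acks.values()) else "ERROR"
        return self.status


tracker = AckTracker(["agent-1", "agent-2"])
print(tracker.record("agent-1", True))   # CREATED (still waiting for agent-2)
print(tracker.record("agent-2", True))   # ACTIVE
```

Because the outcome depends on the full set of acknowledgments rather
than on the order in which they arrive, a negative report from any one
agent can no longer be overwritten by a later positive one.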

In theory, step a) could be simplified by renaming PENDING_CREATE to
CREATED and leaving it at that; however, this would be a
backward-incompatible API change affecting the legacy case, and should
be discouraged.

[1] - http://logs.openstack.org/91/114691/2/experimental/check-tempest-dsvm-neutron-dvr/93b2ff0/logs/testr_results.html.gz
[2] - https://github.com/openstack/neutron/blob/master/neutron/services/firewall/fwaas_plugin.py#L227
[3] - https://github.com/openstack/neutron/blob/master/neutron/services/firewall/agents/l3reference/firewall_l3_agent.py#L194
[4] - https://review.openstack.org/#/c/116100/

** Affects: neutron
     Importance: Undecided
     Assignee: Armando Migliaccio (armando-migliaccio)
         Status: New


** Tags: l3-dvr-backlog

** Changed in: neutron
     Assignee: (unassigned) => Armando Migliaccio (armando-migliaccio)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1360351

Title:
  FWaaS stuck in PENDING_CREATE when deploying with DVR

Status in OpenStack Neutron (virtual network service):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1360351/+subscriptions

