yahoo-eng-team team mailing list archive
Message #56237
[Bug 1602614] Re: DVR + L3 HA loss during failover is higher than expected
Reviewed: https://review.openstack.org/255237
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=26d8702b9d7cc5a4293b97bc435fa85983be9f01
Submitter: Jenkins
Branch: master
commit 26d8702b9d7cc5a4293b97bc435fa85983be9f01
Author: venkata anil <anilvenkata@xxxxxxxxxx>
Date: Thu Aug 4 07:14:47 2016 +0000
l2pop fdb flows for HA router ports
This patch makes L3 HA failover independent of neutron components
during the failover itself.
All HA agents (active and backup) call update_device_up/down after wiring
the ports, but the l2pop driver is invoked only for the active agent,
because the port binding in the DB reflects the active agent. l2pop then
creates unicast and multicast flows for the active agent only.
On failover, flows to the new active agent have to be created. For this
to happen, the database, the messaging server, neutron-server and the
destination L3 agent must all be alive during the failover. This creates
two issues:
1) When any of the above components (e.g. neutron-server) is dead, flows
between the new master and the other agents are not created and L3 HA
failover does not work. In the same scenario, L3 HA failover does work
if l2pop is disabled.
2) Packet loss during failover is higher, because the above neutron
components have to interact multiple times before the l2 flows are
created.
In this change, the plugin is allowed to notify l2pop when
update_device_up/down is called by backup agents as well. l2pop then
creates flood flows to all HA agents (both active and backup). l2pop does
not create a unicast flow for this port; instead, the unicast flow is
created by the learning action of table 10 when keepalived sends a GARP
after assigning the IP address to the master router's qr-xx port. Since
the flood flows already exist and the unicast flow is added dynamically,
L3 HA failover no longer depends on l2pop.
This solves two issues:
1) With L3 HA + l2pop, failover works even if any of the above agents
or processes is dead.
2) Failover time is reduced because neutron is not needed to create
flows during failover.
The L3HARouterAgentPortBinding table is used to get all HA agents of a
router port. An HA router port on a backup agent is also taken into
account for l2pop's distributed_active_network_ports and
agent_network_active_port_count.
Closes-bug: #1522980
Closes-bug: #1602614
Change-Id: Ie1f5289390b3ff3f7f3ed7ffc8f6a8258ee8662e
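To make the behavior described in the commit message concrete, here is a
minimal sketch of the idea in Python. It is not the actual neutron ML2
l2pop mechanism driver code; the function name build_ha_fdb_entries, the
data layout and the example values are invented for illustration, under
the assumption that flood entries are built for every HA agent hosting the
router while the static unicast entry for the qr- port is deliberately
omitted (it is learned from the keepalived GARP instead).

# Minimal, hypothetical sketch of the l2pop idea above. Names and data
# layout are invented for illustration and do not match the real neutron
# ML2 l2pop mechanism driver API.
from collections import namedtuple

PortInfo = namedtuple('PortInfo', ['mac_address', 'ip_address'])
FLOODING_ENTRY = PortInfo('00:00:00:00:00:00', '0.0.0.0')


def build_ha_fdb_entries(network_id, segment, ha_agent_tunnel_ips):
    """Build FDB entries for one HA router port.

    ha_agent_tunnel_ips: tunnel endpoint IPs of *all* L3 HA agents
    hosting the router (active and backup), e.g. as found via the
    L3HARouterAgentPortBinding table.
    """
    ports = {}
    for agent_ip in ha_agent_tunnel_ips:
        # Flood entry towards every HA agent, so broadcast traffic
        # (including the keepalived GARP) reaches all of them.
        ports.setdefault(agent_ip, []).append(FLOODING_ENTRY)
        # Deliberately no static unicast entry for the qr- port MAC:
        # after failover the unicast flow is learned dynamically
        # (table 10 learn action) from the GARP of the new master.
    return {network_id: {'segment_id': segment['segmentation_id'],
                         'network_type': segment['network_type'],
                         'ports': ports}}


if __name__ == '__main__':
    fdb = build_ha_fdb_entries(
        'net-1',
        {'segmentation_id': 1001, 'network_type': 'vxlan'},
        ['192.0.2.10', '192.0.2.11', '192.0.2.12'])
    print(fdb)

With flood entries like these already present on all agents, the only
per-failover action left on the data path is the GARP-triggered learning,
which is why the failover no longer has to wait on neutron-server, the
database or the messaging bus.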
** Changed in: neutron
Status: In Progress => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1602614
Title:
DVR + L3 HA loss during failover is higher than expected
Status in neutron:
Fix Released
Bug description:
Scale environment: 3 controllers, 45 compute nodes. Mitaka, DVR + L3 HA.
When the active agent is stopped, connectivity takes longer to
re-establish than it does on the same environment with plain HA routers.
Steps to reproduce:
1. Create 2 routers:
neutron router-create router(1,2) --ha True --distributed True
2. Create 2 internal networks; one will be connected to router1, the second to router2.
3. Boot an instance in each network:
nova boot --image <image_id> --flavor <flavor_id> --nic net_id=<private_net_id> vm(1,2)
4. Assign a floating IP to the VM in the 2nd network.
5. Log in to VM1 using ssh or the VNC console.
6. Start pinging the floating IP of the 2nd VM and check that packets are not lost.
7. Check which agent is active for router1 with:
neutron l3-agent-list-hosting-router <router_id>
8. Stop the active l3 agent.
9. Wait until another agent becomes active in neutron l3-agent-list-hosting-router <router_id>.
10. Start the stopped agent.
11. Stop the ping and check the number of packets that were lost (one way to measure this is sketched after the results below).
12. Increase the number of routers and repeat steps 5-10.
Results for HA+DVR routers: http://paste.openstack.org/show/531271/
Note: for HA routers the number of lost packets in the same scenario is ~3.
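As a measurement aid for step 11, the following small helper is one
possible way to count lost pings around the failover window; the floating
IP, duration and interval are placeholders, not values from the bug
report.

#!/usr/bin/env python3
# Hypothetical helper to count lost pings during an L3 HA failover test.
import subprocess
import time


def count_lost_pings(target_ip, duration_s=120, interval_s=1.0):
    """Ping target_ip roughly once per interval; return (sent, lost)."""
    sent = lost = 0
    deadline = time.time() + duration_s
    while time.time() < deadline:
        sent += 1
        # One echo request with a 1 second reply timeout (Linux iputils).
        result = subprocess.run(
            ['ping', '-c', '1', '-W', '1', target_ip],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            lost += 1
        time.sleep(interval_s)
    return sent, lost


if __name__ == '__main__':
    sent, lost = count_lost_pings('203.0.113.10')  # placeholder floating IP
    print('sent=%d lost=%d' % (sent, lost))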
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1602614/+subscriptions