
yahoo-eng-team team mailing list archive

[Bug 1509184] [NEW] Enable openflow based dvr routing for east/west traffic

 

Public bug reported:

In the juno cycle dvr support was added to neutron to decentralise routing to the compute nodes.
This RFE bug proposes the introduction of a new dvr mode (dvr_local_openflow) to optimise the datapath
of east/west traffic.

-----------------------------------------------High level description-------------------------------
The current implementation of DVR with ovs utilizes linux network namespaces to instantiate l3
routers, the details of which are described here: http://docs.openstack.org/networking-guide/scenario_dvr_ovs.html

Fundamentally, a neutron router comprises 3 elements (sketched below):
- a router instance (network namespace)
- a router interface (tap device)
- a set of routing rules (kernel ip routes)
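For concreteness, a minimal sketch of those three elements, assuming illustrative names (qrouter-demo, qr-a, qr-b) and using veth pairs where neutron's l3 agent actually plumbs ovs internal ports through its own drivers:

    import subprocess

    def sh(*cmd):
        subprocess.check_call(cmd)

    # 1. router instance: a network namespace
    sh('ip', 'netns', 'add', 'qrouter-demo')

    # 2. router interfaces: one device per attached subnet, moved into
    #    the namespace and given the subnet gateway address
    for dev, peer, cidr in [('qr-a', 'qr-a-peer', '10.0.0.1/24'),
                            ('qr-b', 'qr-b-peer', '10.0.1.1/24')]:
        sh('ip', 'link', 'add', dev, 'type', 'veth', 'peer', 'name', peer)
        sh('ip', 'link', 'set', dev, 'netns', 'qrouter-demo')
        sh('ip', 'netns', 'exec', 'qrouter-demo',
           'ip', 'addr', 'add', cidr, 'dev', dev)
        sh('ip', 'netns', 'exec', 'qrouter-demo',
           'ip', 'link', 'set', dev, 'up')

    # 3. routing rules: the kernel installs a connected route per
    #    interface, so the namespace now routes between the two subnets
    sh('ip', 'netns', 'exec', 'qrouter-demo', 'ip', 'route')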

In the special case of routing east/west traffic, both the source and destination interfaces are known to neutron.
Because of that, neutron contains all the information required to logically route traffic from its origin to its destination,
enabling the path to be established proactively. This proposal suggests moving the instantiation of the dvr local router from the kernel ip stack to Open vSwitch (ovs) for east/west traffic.

Open vSwitch provides a logical programmable interface (openflow) to configure traffic forwarding and modification actions on arbitrary packet streams. When managed by the neutron openvswitch l2 agent, ovs operates as a simple mac learning switch with limited utilisation of its programmable dataplane. To utilise ovs to create an l3 router, the following mappings from the 3 fundamental elements can be made:
- a router instance (network namespace + an ovs bridge)
- a router interface (tap device + patch port pair)
- a set of routing rules (kernel ip routes + openflow rules)

----------------------------------------background context---------------------------------------------
TL;DR
Basic explanation of openflow, ovs bridges and patch ports.
Skip to the implementation section if familiar.

ovs implementation background:
In openvswitch, at the control layer, an ovs bridge is a unique logical domain of interfaces and flow rules.
Similarly, at the control layer, a patch port pair is a logical entity that interconnects two bridges (or logical domains).

From a dataplane perspective, each ovs bridge is first created as a separate instance of a dataplane.
If these separate bridges/dataplanes are interconnected by patch ports, ovs will collapse the independent dataplanes into a single
ovs dataplane instance. As a direct result of this implementation, a logical topology of 1 bridge with two interfaces is realised at the dataplane level identically to 2 bridges each with 1 interface interconnected by patch ports. This translates to zero dataplane overhead for the creation of multiple bridges, allowing arbitrary numbers of router instances to be created.
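As an illustration, two bridges joined by a patch port pair can be wired up as follows; a minimal python sketch shelling out to ovs-vsctl (bridge and port names are made up):

    import subprocess

    def sh(*cmd):
        subprocess.check_call(cmd)

    # two bridges: two separate logical domains at the control layer
    sh('ovs-vsctl', 'add-br', 'br-a')
    sh('ovs-vsctl', 'add-br', 'br-b')

    # a patch port pair interconnecting them; once wired up, ovs
    # collapses the two dataplane instances into one, so the patch hop
    # costs nothing at the dataplane level
    sh('ovs-vsctl', 'add-port', 'br-a', 'patch-to-b',
       '--', 'set', 'interface', 'patch-to-b', 'type=patch',
       'options:peer=patch-to-a')
    sh('ovs-vsctl', 'add-port', 'br-b', 'patch-to-a',
       '--', 'set', 'interface', 'patch-to-a', 'type=patch',
       'options:peer=patch-to-b')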

Openflow capability background:
The openflow protocol provides many capabilities, which can be generally summarised as packet match criteria and actions to apply
when the criteria are satisfied. In the case of l3 routing, the match criteria of relevance are the Ethernet type and the destination ip address. Similarly, the openflow actions required are mod_dl_dst, set_field, move, dec_ttl, output and drop.
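For example, a representative routing flow in ovs-ofctl syntax might look like the following; the addresses and output port are illustrative, not taken from the proposed implementation:

    # match: ethernet type ip + a specific destination ip
    # actions: rewrite source/destination macs, decrement ttl, output
    ROUTE_FLOW = ('priority=100,ip,nw_dst=10.0.1.5,'
                  'actions=mod_dl_src:fa:16:3e:00:00:01,'
                  'mod_dl_dst:fa:16:3e:00:00:22,dec_ttl,output:2')
    # installed with e.g.: ovs-ofctl add-flow <bridge> "<ROUTE_FLOW>"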

logical packet flow for a ping between two vms on the same host:
In the l2 case, if a vm tries to ping another vm in the same subnet, there are 4 stages (the rewrite in stage two is sketched after this list).
- first, it sends a broadcast arp request to learn the mac address for the destination ip of the remote vm.
- second, the destination vm receives the arp request and learns the source vm's mac, then replies as follows:
    a.) swap the source and destination ip of the arp packet
    b.) copy the source mac address to the destination mac address and set the source mac address to the local interface mac
    c.) set the arp op code from request to reply
    d.) transmit the reply via the interface it was received on
- third, on receiving the arp reply, the source vm transmits the icmp packet to the destination vm with the learned mac address.
- fourth, on receiving the icmp, the destination vm replies.
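A minimal sketch of the stage-two rewrite, modelling the packet as a plain dict with hypothetical field names (real stacks of course do this on the wire format):

    def arp_request_to_reply(pkt, local_mac):
        reply = dict(pkt)
        # a.) swap the source and destination ip of the arp packet
        reply['arp_spa'], reply['arp_tpa'] = pkt['arp_tpa'], pkt['arp_spa']
        # b.) copy the source mac to the destination mac and set the
        #     source mac to the local interface mac
        reply['eth_dst'], reply['arp_tha'] = pkt['eth_src'], pkt['arp_sha']
        reply['eth_src'] = reply['arp_sha'] = local_mac
        # c.) set the arp op code from request (1) to reply (2)
        reply['arp_op'] = 2
        # d.) the caller transmits the reply via the interface the
        #     request was received on
        return reply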

In the l3 case the packet flow is similar but slightly different.
- first, the source vm sends an arp request to the subnet gateway.
- second, the gateway router responds with its mac address.
- third, the source vm sends the icmp packet to the router.
- fourth, the router receives the icmp packet and sends an arp request to the destination vm.
- fifth, the destination vm sends an arp reply to the gateway.
- sixth, the router forwards the icmp to the destination vm.
- seventh, the destination vm replies to the router.
- eighth, the reply is received by the source vm.

----------------------------------current implementation---------------------------------------------------

l3 ping packet flow in dvr_local mode (simplified to ignore broadcast):
logical:
- the arp packet is received from the source vm and logically vlan tagged (tenant isolation)
- the arp packet is output to the router tap device (tap1), the vlan is stripped and the packet is copied from the ovs dataplane to the
  kernel networking stack in the router's linux namespace.
- the kernel network stack replies to the arp and the reply packet is copied to the ovs dataplane and logically vlan tagged
- the vlan is logically stripped and the arp reply switched to the source vm interface.
- the icmp packet is received from the source vm and logically vlan tagged (tenant isolation)
- the icmp packet is output to the router tap device, the vlan is stripped and the packet is copied from the ovs dataplane to the
  kernel networking stack in the router's linux namespace.
- the kernel generates an arp request to the destination vm, which follows the same path as the arp described above
- the kernel modifies the dest mac address, decrements the ttl and routes the packet to the appropriate tap device (tap2), where the packet is copied to the ovs dataplane and logically vlan tagged
- the vlan is logically stripped and the icmp packet switched to the destination vm interface.
- the reply path is similar and is summarised as follows:
   dest vm -> vlan tagged -> vlan stripped -> copied to kernel namespace via tap2 -> copied to ovs dataplane via tap1 -> vlan tagged -> vlan stripped -> received by source vm.

actual:
- arp from source vm -> tap1 (vlan tagging skipped) + broadcast to other ports
- tap1 -> kernel network stack
- kernel sends arp reply via tap1
- tap1 -> source vm (vlan tagging skipped)
- icmp from source vm -> tap1 (vlan tagging skipped)
- kernel receives icmp on tap1 and sends arp request to dest vm via tap2 (broadcast)
- arp via tap2 -> dest vm (vlan tagging skipped)
- dest vm replies -> tap2
- kernel updates dest mac, decrements ttl, then forwards the icmp packet to tap2
- tap2 -> dest vm -> dest vm replies -> tap2 (vlan tagging skipped)
- kernel updates dest mac, decrements ttl, then forwards the icmp reply packet to tap1
- tap1 -> source vm

-------------------------------------proposed change----------------------------------------------------------
Proposed change:
- a new class will be added to implement the new mode that subclasses the existing
  dvr_local router class.
- if the mode is dvr_local_openflow, a routing bridge will be created for each dvr router.
- when an internal network is added to the router, the following actions will be performed (see the sketch after this list):
  a.) the tap interface will be created in the router network namespace as normal but added
        to the routing bridge instead of the br-int (tap devices are only used for north/south traffic)
  b.) a patch port pair will be created between the br-int and the routing bridge
  c.) the attached-mac, iface-id and iface-status will be populated in the external-ids field of the br-int side of the patch port.
       this will enable the unmodified neutron l2 agent to correctly manage the patch port.
  d.) a low priority rule that sends all traffic from the patch port to the tap device will be added to the routing bridge.
  e.) a medium priority rule that replies to all arp requests addressed to the router will be added to the routing bridge.
        this rule will use openflow's move and set_field actions to rewrite the arp request into a reply and output it via in_port.
  f.) a high priority dest mac update and ttl decrement rule will be added to the routing bridge for each port
       on the internal network.
- when an external network is added to the router, the workflow will be unchanged and is inherited from the dvr_local
  implementation.
- the _update_arp_entry function will be extended to additionally populate and delete the high priority dest mac update rules
  as neutron ports are added/removed from connected networks.
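A minimal sketch of steps b-d and f, driven through the ovs-vsctl/ovs-ofctl CLIs. All names (br-router-1, patch-r1/patch-int, tap1, patch-int2, the macs, ips and port uuid) are illustrative assumptions, not the actual neutron code; the real agent would go through its ovs_lib wrappers. Step a is the namespace tap plumbing inherited from dvr_local, and the step e arp rule is sketched under "openvswitch support" below.

    import subprocess

    def sh(*cmd):
        subprocess.check_call(cmd)

    ROUTER_BRIDGE = 'br-router-1'          # per-router routing bridge
    ROUTER_MAC = 'fa:16:3e:00:00:01'       # hypothetical router port mac
    ROUTER_PORT_UUID = '11111111-2222-3333-4444-555555555555'  # hypothetical

    # b.) patch port pair between br-int and the routing bridge
    sh('ovs-vsctl', 'add-port', 'br-int', 'patch-r1',
       '--', 'set', 'interface', 'patch-r1', 'type=patch',
       'options:peer=patch-int')
    sh('ovs-vsctl', 'add-port', ROUTER_BRIDGE, 'patch-int',
       '--', 'set', 'interface', 'patch-int', 'type=patch',
       'options:peer=patch-r1')

    # c.) populate external-ids on the br-int side so the unmodified l2
    #     agent manages the patch port like any other neutron port
    sh('ovs-vsctl', 'set', 'interface', 'patch-r1',
       'external-ids:attached-mac=%s' % ROUTER_MAC,
       'external-ids:iface-id=%s' % ROUTER_PORT_UUID,
       'external-ids:iface-status=active')

    # d.) low priority: anything not matched below falls back to the
    #     tap device (and thus the kernel stack in the router namespace)
    sh('ovs-ofctl', 'add-flow', ROUTER_BRIDGE,
       'priority=1,in_port=patch-int,actions=output:tap1')

    # f.) high priority: per destination port, rewrite the macs and
    #     decrement the ttl, i.e. route in a single flow rule; here for
    #     a vm at 10.0.1.5 reachable via the (hypothetical) patch port
    #     of a second internal network
    sh('ovs-ofctl', 'add-flow', ROUTER_BRIDGE,
       'priority=100,ip,nw_dst=10.0.1.5,'
       'actions=mod_dl_src:%s,mod_dl_dst:fa:16:3e:00:00:22,'
       'dec_ttl,output:patch-int2' % ROUTER_MAC)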

l3 packet flow in dvr_local_openflow mode:

logical:
- the arp packet is received from the source vm and logically vlan tagged (tenant isolation)
- the arp packet is output to the router bridge patch port, the vlan is stripped
- the arp request is rewritten into a reply, sent back to the br-int and logically vlan tagged
- the vlan is logically stripped and the arp reply switched to the source vm interface.
- the icmp packet is received from the source vm and logically vlan tagged (tenant isolation)
- the icmp packet is output to the router bridge patch port, the vlan is stripped.
- the icmp packet matches the high priority rule, its destination mac is updated, then it is output to the second patch port and logically vlan tagged
- the vlan is logically stripped and the icmp packet switched to the destination vm interface.
- the reply path is similar and is summarised as follows:
   dest vm -> vlan tagged -> vlan stripped -> router bridge via patch 2 -> dest mac and ttl updated then output patch 1 -> vlan tagged -> vlan stripped -> received by source vm.

actual:
- arp from source vm -> arp rewritten to reply -> sent to source vm (single openflow rule)
- icmp from source vm -> destination mac updated, ttl decremented -> dest vm (single openflow rule)
- icmp from dest vm -> destination mac updated, ttl decremented -> source vm (single openflow rule)

other considerations:

- north/south
    as ovs cannot look up the destination mac dynamically via arp, it is not possible to optimise the
    north/south path as described above.

- openvswitch support
    this mechanism is compatible with both kernel and dpdk ovs.
    this mechanism requires the nicira extensions for the arp rewrite (sketched at the end of this section).
    the arp rewrite can be skipped for greater compatibility if required, as it will fall back to the tap device and kernel.
    icmp traffic addressed to the router interface will be handled by the tap device, as ovs currently does not
    support setting the icmp type code via the set_field or load openflow actions.

- performance
   the performance of l3 routing is expected to approach l2 performance for east/west traffic.
   performance is not expected to change for north/south.
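For reference, a sketch of the arp request-to-reply rewrite as a single flow rule using the nicira (NXM) move/load extension actions, in the same style as the arp responder flows the ovs agent already installs for l2pop; the bridge name, mac and ip are illustrative:

    import ipaddress
    import subprocess

    BRIDGE = 'br-router-1'             # illustrative routing bridge
    ROUTER_MAC = 'fa:16:3e:00:00:01'   # illustrative gateway mac
    ROUTER_IP = '10.0.0.1'             # illustrative gateway ip

    # load actions take the mac/ip as hex integers
    mac_hex = '0x' + ROUTER_MAC.replace(':', '')
    ip_hex = hex(int(ipaddress.IPv4Address(ROUTER_IP)))

    match = 'priority=50,arp,arp_tpa=%s' % ROUTER_IP
    actions = (
        'move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],'  # reply to the requester
        'mod_dl_src:{mac},'                         # from the router mac
        'load:0x2->NXM_OF_ARP_OP[],'                # op code: request -> reply
        'move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],'  # target hw addr = requester
        'move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],'  # target ip = requester
        'load:{mac_hex}->NXM_NX_ARP_SHA[],'         # source hw addr = router
        'load:{ip_hex}->NXM_OF_ARP_SPA[],'          # source ip = router
        'IN_PORT'                                   # out the port it arrived on
    ).format(mac=ROUTER_MAC, mac_hex=mac_hex, ip_hex=ip_hex)

    subprocess.check_call(
        ['ovs-ofctl', 'add-flow', BRIDGE,
         '%s,actions=%s' % (match, actions)])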

** Affects: neutron
     Importance: Undecided
     Assignee: sean mooney (sean-k-mooney)
         Status: New


** Tags: rfe

** Changed in: neutron
     Assignee: (unassigned) => sean mooney (sean-k-mooney)

https://bugs.launchpad.net/bugs/1509184