yahoo-eng-team team mailing list archive

[Bug 1806390] [NEW] [RFE] Distributed DHCP agent

 

Public bug reported:

This is a very old issue that was closed as invalid, but I could not
find an ideal solution, so I am raising it again. I wonder what others
think of it.

It is closely related to the old issue
(https://bugs.launchpad.net/neutron/+bug/1468236), which I reconstruct
here from my understanding.

Problems
- A giant shared provider network with more than 10000 ports in a single network.
- Several DHCP agents for the network, even one per hypervisor as in the Calico project.
- A scalability issue occurs (the DHCP lease file is not updated after the VM has started).

Solutions from the reporter
1. Add a distributed flag for the DHCP agent, and provision a DHCP agent on every compute node.
2. Change the DHCP agent notifier to target the DHCP agent per host.
3. Do not spread the DHCP flow outside of the local hypervisor.

Conclusion
- Rejected because:
  - Solution step (2) adds a lot of complexity to the agent notifier RPC.
  - (3) is not a general solution.
  - It is even worse for migration. There were many side effects that we would have to care about.
  - There were building blocks with which we could achieve the purpose. (This was mentioned on IRC, but I still do not understand which building blocks were meant.)

Our private cluster is very much like Calico. We have a giant
provider network made routable with quagga, and there are DHCP
agents on each compute node. I believe the community has formed some
consensus that this kind of architecture handles scale issues well,
judging by approaches like Routed networks.

To achieve this architecture without L2, modifying the DHCP agent
cannot be avoided, since its default HA behavior causes critical
DB performance issues.

At the same time, I absolutely agree with the comment that raised
concerns about unnecessary complexity in a distributed approach like DVR.

So what I suggest is
- Do not modify current DHCP agent behaviors such as the notifier-side API. This does not harm migration logic.
- Do not change the DHCP HA concept or the L2 agent at all.
- Just add a distributed flag for the DHCP agent, and add host filtering logic to the handler-side RPC (get_active_network_info, get_network_info) only when the DHCP agent is distributed.
- Operators get a slightly new concept of distributed DHCP, in which the agent serves only ports within its local hypervisor.
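
To make the host filtering idea concrete, here is a minimal sketch in
plain Python. It is not Neutron code: the Port class, the binding_host
field (standing in for the port's binding host) and the
ports_for_agent() helper are all hypothetical names I made up for
illustration. It only shows the intended rule: a non-distributed agent
keeps today's behavior, and a distributed agent receives only the ports
bound to its own host.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Port:
    """Hypothetical stand-in for a port; only the fields the filter needs."""
    id: str
    network_id: str
    binding_host: str  # assumption: models the host the port is bound to


def ports_for_agent(ports: Iterable[Port], agent_host: str,
                    distributed: bool) -> List[Port]:
    """Return the ports a DHCP agent would get from the handler-side RPC.

    Non-distributed agents get every port (current behavior).
    Distributed agents get only the ports bound to their own host.
    """
    if not distributed:
        return list(ports)
    return [p for p in ports if p.binding_host == agent_host]
```

The notifier side is untouched in this sketch; only the data returned to
an agent that asks for network info would differ.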

What we can achieve with this change
- Reduce the performance overhead. I found that the performance penalty is on the DB side (getting ports with get_active_info() and completing the provisioning step with dhcp_ready_on_ports()). The RPC fanout is minor.
- Introduce a new concept in which the DHCP agent failure domain is split.
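
As a rough back-of-the-envelope illustration of the DB-side saving, the
sketch below counts how many port records the server would return when
every agent syncs once. It is a simplified model, not a measurement: the
function name is hypothetical, the agent count of 200 in the usage note
is an assumed figure, and it assumes every port is bound to exactly one
host that runs an agent.

```python
def ports_returned_per_sync(total_ports: int, agents: int,
                            distributed: bool) -> int:
    """Total port records returned when every agent syncs the network once."""
    if distributed:
        # Each port is returned once, to the agent on its own host.
        return total_ports
    # Every agent hosting the network receives every port.
    return total_ports * agents
```

With 10000 ports and an assumed 200 per-hypervisor agents, this model
gives 2000000 port records without filtering and 10000 with it.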

Any comments are appreciated.

** Affects: neutron
     Importance: Undecided
         Status: New


** Tags: rfe

** Tags added: rfe

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1806390

Title:
  [RFE] Distributed DHCP agent

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1806390/+subscriptions

