yahoo-eng-team team mailing list archive
Message #96431
[Bug 2124215] [NEW] [RFE] Implement more graceful handling of dhcp_lease_duration reduction
Public bug reported:
If dhcp_lease_duration is reduced by more than half its previous
value (e.g., from the default of 24h to 8h), a problematic scenario
is set up.
Once the neutron-dhcp-agent process is restarted to pick up the new
config value, the existing dnsmasq lease information (leases file) is
discarded. A new init leases file is created and dnsmasq is seeded with
new lease information using the new dhcp_lease_duration setting of 8h.
If a VM had just acquired a 24h lease prior to the neutron-dhcp-agent
restart, it won’t be due to renew its lease for 12h (half the lease
duration). However, dnsmasq will expire the lease as written to the
init leases file 8h after agent startup. When the VM does try to
renew after an additional 4h has passed (12h after agent startup),
dnsmasq will issue a NAK upon the renewal attempt since there is no
active lease for that IP, forcing the client to perform a full DORA
(Discover, Offer, Request, Acknowledge) cycle to acquire a new lease
(for the same IP).
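The timing described above can be checked with a small sketch. It is only an illustration of the arithmetic, not Neutron or dnsmasq code: the function name `renewal_outcome` and its parameters are hypothetical, and it assumes (as the report does) that the client renews at half its original lease and that dnsmasq NAKs a renewal once the seeded lease has expired.

```python
HOUR = 3600


def renewal_outcome(old_lease, new_lease, lease_age_at_restart=0):
    """Model the first renewal after a neutron-dhcp-agent restart.

    All values are in seconds. Returns (reply, renew_at), where renew_at
    is the number of seconds after the restart at which the client renews.
    Hypothetical helper for illustration only.
    """
    # The client still follows the lease it was given before the restart,
    # so it renews at half of the OLD duration, counted from acquisition.
    if not 0 <= lease_age_at_restart < old_lease // 2:
        raise ValueError("lease age must be below the renewal point")
    renew_at = old_lease // 2 - lease_age_at_restart

    # The init leases file is seeded with restart time + NEW duration.
    seeded_expiry = new_lease

    # Assumption from the report: an expired lease leads to a NAK.
    reply = "ACK" if renew_at < seeded_expiry else "NAK"
    return reply, renew_at


# 24h -> 8h, lease acquired just before the restart: renewal comes at 12h,
# 4h after the seeded lease expired.
print(renewal_outcome(24 * HOUR, 8 * HOUR))
# 24h -> 16h: the seeded lease is still active at the 12h renewal.
print(renewal_outcome(24 * HOUR, 16 * HOUR))
```

Under these assumptions, a lease that was already 6h old at restart would renew 6h after startup, before the 8h seeded expiry, so only recently acquired leases are exposed.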
Linux handles this NAK gracefully as it retains the current IP
(retaining all active connections) while it performs a DORA cycle to
acquire a new lease for the current IP.
However, Windows does not handle this NAK gracefully. Upon receipt of
the NAK, Windows immediately releases the IP (dropping all active
connections) while it performs a DORA cycle to acquire a new lease for
the IP.
This is likely a rare edge case, as the dhcp_lease_duration setting
isn’t usually modified, but the impact can be very large, especially
for Windows VMs, if it’s reduced too much, too quickly, without
enough time for clients to renew their lease and pick up the new,
shorter lease duration.
This brings up a couple of questions:
- Should we expect operators to “know better” when updating this value?
- Is there an opportunity to more gracefully handle the init leases file to help with this situation?
Instead of seeding the lease timeout value with `int(time.time()) +
self.conf.dhcp_lease_duration`, could we default it to 0 (infinite) in
all cases? Doing so would ensure that dnsmasq doesn’t expire the lease
prematurely. Upon the VM’s next DHCP renewal, dnsmasq would update the
lease time with the correct duration based on current config, same as is
done presently.
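As a rough sketch of the proposal, the difference comes down to which timestamp is written at the start of each seeded lease entry. The helper below is hypothetical and is not the actual Neutron implementation; the `infinite_seed` flag and the `now` parameter are assumptions added for illustration, and the line layout (expiry, MAC, IP, then `*` placeholders for hostname and client ID) follows the dnsmasq leases file format.

```python
import time


def seed_lease_line(mac, ip, lease_duration, infinite_seed=False, now=None):
    """Build one entry for the init leases file (hypothetical helper).

    lease_duration is in seconds. With infinite_seed=True the expiry is
    written as 0, which the report describes as an infinite lease.
    """
    now = int(time.time()) if now is None else int(now)
    if infinite_seed:
        # Proposed behavior: never expire the seeded entry; the client's
        # next renewal replaces it with the configured duration.
        expiry = 0
    else:
        # Current behavior per the report: restart time + new duration.
        expiry = now + lease_duration
    return f"{expiry} {mac} {ip} * *"


# Example values are made up for illustration.
print(seed_lease_line("fa:16:3e:00:00:01", "10.0.0.5", 8 * 3600, now=1000))
print(seed_lease_line("fa:16:3e:00:00:01", "10.0.0.5", 8 * 3600,
                      infinite_seed=True))
```

Passing `now` explicitly keeps the sketch deterministic for testing; a real change would presumably live wherever the agent writes the init leases file.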
The downside would be that lease entries for active/valid ports which do
not have an active VM would remain in the leases file (active)
indefinitely as there would be no VM to renew the lease and update the
lease expiration. However, given the specific way that Neutron uses
dnsmasq, this may not be of concern as only active ports are seeded in
the init leases file. Furthermore, once a Neutron port is deleted, the
lease is released as currently implemented in _release_unused_leases().
Initializing the lease timeout to 0 seems like it may help guard against
this undesired behavior.
I'm interested to understand if there's a desire to implement this or a
similar change to improve this situation.
** Affects: neutron
Importance: Undecided
Status: New
** Tags: l3-ipam-dhcp
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2124215
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2124215/+subscriptions