yahoo-eng-team team mailing list archive
Message #83621
[Bug 1892200] [NEW] Make keepalived healthcheck more configurable
Public bug reported:
Since the Newton release, users of HA routers have had a keepalived
healthcheck that fails if a single ping gets no response or if the
expected tenant network address is not configured in the local
namespace being watched. This works in most cases where the environment
is stable, but it appears to cause a lot of instability as soon as the
environment comes under load or a node fails and failovers occur. One
example appears to be when the transition of the MASTER to a new node
takes a little longer than it should. We have seen in the field that,
under heavy load, the external network address that keepalived is
tracking can, for a very short period of time, be configured on two
interfaces/hosts at once. While neutron is still sending its GARP
updates, a ping from the new master router can fail to get a response
for 50% of requests, since the switch may still send the reply to
either the new master or the old one.
To keep transient problems like this from causing further instability,
we would like to make the healthcheck a little more tolerant of them.
Currently the healthcheck script is generated by Neutron for each
router and its contents are not configurable. It would be great to be
able to change, e.g., the number of pings that it will do before
declaring a failure.
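To illustrate the kind of tolerance being requested, here is a minimal sketch of a check that reports a failure only after several consecutive pings go unanswered. It is not the script Neutron generates. The function names, the `attempts` parameter and the injectable `probe` are hypothetical, and the probability helper assumes that ping losses are independent, which may not hold during a real failover.

```python
import subprocess
from typing import Callable


def ping_once(address: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo request and report whether a reply arrived.

    Assumes the Linux iputils `ping` (-c count, -W reply timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0


def gateway_is_healthy(
    address: str,
    attempts: int = 3,
    probe: Callable[[str], bool] = ping_once,
) -> bool:
    """Return True as soon as one probe succeeds, False if all fail.

    `attempts` is the hypothetical tunable: with attempts=1 this behaves
    like the single-ping check described in the report.
    """
    for _ in range(attempts):
        if probe(address):
            return True
    return False


def false_failure_probability(loss_rate: float, attempts: int) -> float:
    """Chance that every probe is lost, assuming independent losses."""
    return loss_rate ** attempts
```

Under the independence assumption, the 50% loss described above would fail a single-ping check half the time, while `false_failure_probability(0.5, 3)` gives 0.125 for three attempts. A real implementation would also need to keep the total run time below the interval at which keepalived invokes the check.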
** Affects: neutron
Importance: Undecided
Status: New
** Description changed:
- Since the Newton release we have had a keepalived healthcheck that fails
- if it doesn't get a response to a single ping or if the expected tenant
- network address is not configured in the local namespace being watched.
- While this works for most cases where an environment is stable it
- appears to produce a lot of instability as soon as an environment gets
- loaded or a node fails and transitions/failovers occur. An example of
- this appears to be where transitions of the MASTER to a new node take a
- little longer than they should. For example we have seen in the field
- that under heavy load a node can, for a very short period of time, have
- the external network address that keepalived is tracking be configured
- on two interfaces/hosts at once and while neutron is still doing its
- garp updates it is possible that a ping from the new master router can
- fail to get a response for 50% of requests since the switch may still
- send the reply to either the new master or the old one.
+ Since the Newton release, users of HA routers have had a keepalived
+ healthcheck that fails if it doesn't get a response to a single ping or
+ if the expected tenant network address is not configured in the local
+ namespace being watched. While this works for most cases where an
+ environment is stable it appears to produce a lot of instability as soon
+ as an environment gets loaded or a node fails and transitions/failovers
+ occur. An example of this appears to be where transitions of the MASTER
+ to a new node take a little longer than they should. For example we have
+ seen in the field that under heavy load a node can, for a very short
+ period of time, have the external network address that keepalived is
+ tracking be configured on two interfaces/hosts at once and while neutron
+ is still doing its garp updates it is possible that a ping from the new
+ master router can fail to get a response for 50% of requests since the
+ switch may still send the reply to either the new master or the old one.
In order to avoid transient problems like this from causing further
instability we would like to be able to make the healthcheck a little
more tolerant of transient issues. Currently the healthcheck script is
generated by Neutron for each router and its contents are not
configurable. It would be great to be able to change e.g. the number of
pings that it will do before declaring a failure.
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1892200
Title:
Make keepalived healthcheck more configurable
Status in neutron:
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1892200/+subscriptions