yahoo-eng-team team mailing list archive

[Bug 1892200] [NEW] Make keepalived healthcheck more configurable

 

Public bug reported:

Since the Newton release, HA routers have used a keepalived healthcheck
that fails if a single ping gets no response or if the expected tenant
network address is not configured in the local namespace being watched.
This works in most cases where the environment is stable, but it appears
to produce a lot of instability as soon as the environment comes under
load or a node fails and transitions/failovers occur. One example is
when the transition of the MASTER to a new node takes a little longer
than it should. We have seen in the field that, under heavy load, the
external network address that keepalived is tracking can briefly be
configured on two interfaces/hosts at once. While Neutron is still
sending its GARP updates, a ping from the new master router can fail to
get a response for 50% of requests, since the switch may still send the
reply to either the new master or the old one.
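
To put a rough number on the 50% figure above, the following sketch
estimates how often a healthcheck would wrongly declare failure if it
sent several pings instead of one. It assumes each ping is lost
independently with the same probability, which is a simplification and
not something measured in this report; the function name is
hypothetical.

```python
def false_failure_probability(loss_rate: float, pings: int) -> float:
    """Chance that every one of `pings` echo requests goes unanswered.

    Assumes (for illustration only) that each ping is lost independently
    with probability `loss_rate`.
    """
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    if pings < 1:
        raise ValueError("pings must be at least 1")
    return loss_rate ** pings


if __name__ == "__main__":
    # Compare a single-ping check with checks that send more pings.
    for count in (1, 3, 5):
        print(count, false_failure_probability(0.5, count))
```

Under that assumption, a check that only fails when all pings are lost
becomes much less likely to trip during the short window in which
replies may go to either node.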

To keep transient problems like this from causing further instability,
we would like to make the healthcheck a little more tolerant of
transient issues. Currently the healthcheck script is generated by
Neutron for each router and its contents are not configurable. It would
be great to be able to change, for example, the number of pings it
sends before declaring a failure.
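
As an illustration of the kind of tolerance being requested, here is a
minimal sketch of a check that only reports failure after a
configurable number of failed attempts. It is not the script Neutron
generates; `ping_once`, `tolerant_check`, `attempts`, and `delay_s` are
hypothetical names, and the `ping -c 1 -w <seconds>` invocation assumes
a Linux iputils-style `ping`.

```python
import subprocess
import time
from typing import Callable


def ping_once(address: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo request; True if ping exits successfully."""
    result = subprocess.run(
        ["ping", "-c", "1", "-w", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def tolerant_check(
    probe: Callable[[], bool], attempts: int = 3, delay_s: float = 0.0
) -> bool:
    """Return True as soon as one probe succeeds, False if all fail.

    `attempts` is the hypothetical configurable knob: the number of
    probes tried before the check is declared failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        if probe():
            return True
        # Pause between attempts, but not after the final one.
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

A caller could use it as `tolerant_check(lambda: ping_once(gateway_ip),
attempts=3)`, where `gateway_ip` stands for whatever address the check
is meant to reach. With `attempts=1` this behaves like the current
single-ping check.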

** Affects: neutron
     Importance: Undecided
         Status: New

** Description changed:

- Since the Newton release we have had a keepalived healthcheck that fails
- if it doesn't get a response to a single ping or if the expected tenant
- network address is not configured in the local namespace being watched.
- While this works for most cases where an environment is stable it
- appears to produce a lot of instability as soon as an environment gets
- loaded or a node fails and transitions/failovers occur. An example of
- this appears to be where transitions of the MASTER to a new node take a
- little longer than they should. For example we have seen in the field
- that under heavy load a node can, for a very short period of time, have
- the external network address that keepalived is tracking be configured
- on two interfaces/hosts at once and while neutron is still doing its
- garp updates it is possible that a ping from the new master router can
- fail to get a response for 50% of requests since the switch may still
- send the reply to either the new master or the old one.
+ Since the Newton release, users of HA routers have had a keepalived
+ healthcheck that fails if it doesn't get a response to a single ping or
+ if the expected tenant network address is not configured in the local
+ namespace being watched. While this works for most cases where an
+ environment is stable it appears to produce a lot of instability as soon
+ as an environment gets loaded or a node fails and transitions/failovers
+ occur. An example of this appears to be where transitions of the MASTER
+ to a new node take a little longer than they should. For example we have
+ seen in the field that under heavy load a node can, for a very short
+ period of time, have the external network address that keepalived is
+ tracking be configured on two interfaces/hosts at once and while neutron
+ is still doing its garp updates it is possible that a ping from the new
+ master router can fail to get a response for 50% of requests since the
+ switch may still send the reply to either the new master or the old one.
  
  In order to avoid transient problems like this from causing further
  instability we would like to be able to make the healthcheck a little
  more tolerant of transient issues. Currently the healthcheck script is
  generated by Neutron for each router and its contents are not
  configurable. It would be great to be able to change e.g. the number of
  pings that it will do before declaring a failure.

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1892200

Title:
  Make keepalived healthcheck more configurable

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1892200/+subscriptions