[Bug 1951010] [NEW] Restarting Neutron floods Nova with segment aggregates calls
Public bug reported:
* High level description:
Whenever we restart neutron-server, we see a huge number of requests
(hundreds of thousands, overwhelming our control plane over the course
of ~8 hours) going from neutron-server to nova-api. These requests are
related to the segment aggregates. More specifically, for each of our
hypervisors and for each segment, we see a GET to [URL] to fetch the
aggregate id for the segment, followed by a POST to [URL] to (try to)
add the hypervisor to the aggregate, which fails because the host is
already a member of that aggregate. These calls originate from the
"_add_host_to_aggregate" method:
https://opendev.org/openstack/neutron/src/commit/ee67324c17c7d46bf110da977d999796951a6a2d/neutron/services/segments/plugin.py#L403
and we see the 'Host %(host)s already exists in aggregate for routed
network segment %(segment_id)s' message in our logs.
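In terms of traffic, each (hypervisor, segment) pair therefore costs two
requests to nova-api. As a rough illustration of the shape of those calls
(a python-novaclient sketch for clarity only; the endpoints are elided as
[URL] in this report and the credentials below are placeholders, so this
is not the exact code path inside the Neutron segments plugin):

    # Rough illustration of the two requests per (hypervisor, segment) pair,
    # expressed with python-novaclient. Placeholder credentials; not the
    # exact Neutron code path.
    from keystoneauth1 import loading, session as ks_session
    from novaclient import client as nova_client

    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(auth_url='http://keystone:5000/v3',
                                    username='neutron', password='secret',
                                    project_name='service',
                                    user_domain_name='Default',
                                    project_domain_name='Default')
    nova = nova_client.Client('2.1', session=ks_session.Session(auth=auth))

    def add_host_to_segment_aggregate(aggregate_id, host):
        # Request 1: GET the aggregate matching the segment's aggregate id
        aggregate = nova.aggregates.get_details(aggregate_id)
        # Request 2: POST to add the host; Nova answers 409 Conflict when the
        # host is already a member, which is what we hit after every restart
        nova.aggregates.add_host(aggregate.id, host)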
To be more exact, we see more than one such log per (host id, segment id) pair: it appears several times, frequently at the beginning and then more and more rarely, until it stops appearing altogether. We determined that this is because each RPC worker process goes through the same procedure independently (a simplified sketch of the gating logic is given right after the list below). The procedure is the following:
- When a process starts, the variable "reported_hosts" here https://opendev.org/openstack/neutron/src/commit/ee67324c17c7d46bf110da977d999796951a6a2d/neutron/services/segments/db.py#L252 gets initialized to the empty set.
- Each time a neutron agent on a hypervisor (in our case the openvswitch_agent) makes its periodic state report, it sends a message over RabbitMQ
- This message gets picked up by one of the neutron-server worker processes, essentially at random. We enter the method "_update_segment_host_mapping_for_agent" https://opendev.org/openstack/neutron/src/commit/ee67324c17c7d46bf110da977d999796951a6a2d/neutron/services/segments/db.py#L283. After a few checks, we reach the line "if host in reported_hosts and not start_flag: return" (https://opendev.org/openstack/neutron/src/commit/ee67324c17c7d46bf110da977d999796951a6a2d/neutron/services/segments/db.py#L298). start_flag is false (we verified this), but because "reported_hosts" is initially empty the condition does not hold, so we do not return. We add the host to the "reported_hosts" set (https://opendev.org/openstack/neutron/src/commit/ee67324c17c7d46bf110da977d999796951a6a2d/neutron/services/segments/db.py#L300) and eventually reach the "_add_host_to_aggregate" method in plugin.py, triggering one series of calls for the agent (hence hypervisor) that sent the message, targeting each segment aggregate.
- Then, the next time the agent for this same hypervisor sends a state report message, there are two possibilities:
- case 1: the message is picked up by the same RPC worker as before. We reach the line "if host in reported_hosts and not start_flag: return", but this time the hypervisor is already in the reported_hosts set, so we return and no request is sent.
- case 2: the message is picked up by a different RPC worker. We then go through the same scenario as for the previous message.
At the beginning, just after neutron-server has been restarted, case 1 is highly improbable (say we have 50 RPC workers over our whole control plane: for a given hypervisor, the first follow-up message has a 1/50 chance of hitting a worker that has already seen it, the next one a 2/50 chance, etc.). Case 2, on the other hand, is highly probable, for the same reason (nearly all RPC workers still have a "reported_hosts" set that does not contain this hypervisor, since it starts out empty). This is why we see an enormous number of requests at the beginning and only a trickle at the end.
- Ultimately, the hypervisor is present in the "reported_hosts" set of every RPC worker and the calls stop. This is the stable state of the system, and the reason the flood only starts when we restart neutron-server.
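For readability, here is a simplified paraphrase of the gating logic
described above (argument names and the surrounding machinery are
simplified; the key point is that the set is local to each worker process):

    # Paraphrase of the per-worker gating; every RPC worker process holds
    # its own copy of this module-level set, which is empty after a restart.
    reported_hosts = set()

    def update_segment_host_mapping_for_agent(host, start_flag):
        # Skip the Nova notification only if *this* worker has already seen
        # the host and the agent did not just (re)start. Right after a
        # neutron-server restart the set is empty in every worker, so every
        # worker falls through here once per host.
        if host in reported_hosts and not start_flag:
            return
        reported_hosts.add(host)
        # ... update host/segment mappings, then notify Nova, ending up in
        # _add_host_to_aggregate() => one GET + one POST per segment.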
Attached is an excerpt of logs produced by a debug line that we added
to our code in our test environment; it shows the "reported_hosts" set
growing in parallel in each RPC worker process.
This whole procedure happens in parallel for all hosts. The large
number of times the same series of calls is repeated for the same
hypervisor comes from the fact that the "reported_hosts" set is a plain
Python variable, purely local to each RPC worker process. In our case
we have 4 neutron-server instances, each with 9 regular RPC workers and
9 state report RPC workers, hence 4×18=72 processes and as many copies
of the variable.
We calculated that a rolling restart of our neutron-server instances
(say, during an update deployment) will ultimately generate 300
hypervisors × 100 networks × 72 processes × 2 requests (one GET, then
one POST) = 4 320 000 calls to Nova.
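A quick back-of-the-envelope check of those figures (plain Python,
numbers taken from the description above):

    # Sanity check of the numbers quoted above.
    instances = 4                      # neutron-server instances
    workers_per_instance = 9 + 9       # regular + state report RPC workers
    processes = instances * workers_per_instance   # 72 copies of the set

    hypervisors = 300
    routed_networks = 100              # one segment aggregate per network here
    requests_per_pair = 2              # one GET, then one POST

    total = hypervisors * routed_networks * processes * requests_per_pair
    print(processes, total)            # 72 4320000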
* Workaround
Included is a minimal patch that we are considering applying to our
internal branch. It disables the whole "re-register to Nova at each
neutron-server restart" logic, while still keeping the possibility of
triggering this re-registration on demand by restarting the neutron
openvswitch agent on the hypervisors (thanks to the start_flag), as
sketched below.
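The general idea of the patch is to make the re-registration depend only
on the agent's start_flag. An illustrative sketch of what such a change
could look like (this is not the attached patch, and the helper name is
made up for the example):

    # Illustrative sketch only (hypothetical helper, not the attached patch):
    # re-register a host with Nova only when the agent reports start_flag,
    # i.e. when the agent itself has just (re)started, instead of every time
    # a neutron-server worker starts with an empty "reported_hosts" set.
    def should_notify_nova(host, start_flag, reported_hosts):
        if not start_flag:
            # Do not re-register just because this worker's freshly
            # initialized "reported_hosts" set does not contain the host yet.
            return False
        reported_hosts.add(host)
        return True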
* Pre-conditions: what is the initial state of your system? Please
consider enumerating resources available in the system, if useful in
diagnosing the problem. Who are you? A regular tenant or a super-user?
Are you describing service-to-service interaction?
We're operating a fairly large deployment. Our biggest region, where
this issue is the most impactful, has 300 hypervisors across 2 AZs, 100
provider routed networks (we don't do SDN), and 4 baremetal control
plane nodes with 24 hyperthreads and 250GB RAM each, on which all
OpenStack services are deployed (notably Nova, Neutron, Glance,
RabbitMQ and HAProxy) except MySQL Galera, which runs alone on another
4 baremetal nodes.
* Step-by-step reproduction steps:
$ docker restart neutron_server
* Expected output: nothing
* Actual output: Hundreds of thousands of calls are emitted from
neutron-server to nova-api, crippling our control plane for hours.
* Version:
** OpenStack version: Ussuri
** Linux distro, kernel: CentOS 7 for the host, CentOS 8 for the Kolla containers
** DevStack or other _deployment_ mechanism: Kolla-ansible 10.2
* Perceived severity: every time we restart neutron-server, the call
flood essentially takes our control plane down under the load for the
first 4 hours and severely degrades its performance for the next 4
hours. As a result we are severely constrained in our ability to
restart neutron-server, which should be a non-event.
** Affects: neutron
Importance: Undecided
Status: New
** Attachment added: "Logs showing the evolution of "reported_hosts""
https://bugs.launchpad.net/bugs/1951010/+attachment/5541058/+files/investigation_neutron.txt