[Bug 1940950] [NEW] [ovn] neutron api worker gets overloaded processing chassis_private updates
Public bug reported:
This was tested with the stable/ussuri branch with
https://review.opendev.org/c/openstack/neutron/+/752795/ backported.
The test setup was 3 controllers, each with 10 API workers and RPC
workers, and 250 chassis running ovn-controller. There are 1k networks
and 10k ports in total (4k VM ports, 2k ports for FIPs, 4k ports for
routers), 1k routers connected to the same external network, and 2k VMs
(2 VMs per network, with all VMs additionally connected to a single
shared network between them). The northbound DB is 15 MB and the
southbound DB is 100 MB.
When a change is made in neutron, an update in OVN is created and the
NB_Global.nb_cfg field is incremented. This translates into an
SB_Global.nb_cfg change, which is picked up by all ovn-controllers,
each of which in turn updates its own entry in Chassis_Private,
incrementing Chassis_Private.nb_cfg.
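For reference, those Chassis_Private changes reach neutron as ovsdbapp row
events. Below is a minimal sketch of such a handler, assuming ovsdbapp's
RowEvent interface; the class name and the update_agent_timestamp() helper
are illustrative and not the actual driver code:

    from ovsdbapp.backend.ovs_idl import event as row_event

    class ChassisPrivateNbCfgEvent(row_event.RowEvent):
        """Fires for every Chassis_Private row whose nb_cfg column changed."""

        def __init__(self, driver):
            self.driver = driver
            # Watch only updates to the Chassis_Private table.
            super().__init__((self.ROW_UPDATE,), 'Chassis_Private', None)

        def match_fn(self, event, row, old):
            # 'old' only carries the changed columns, so this is true only
            # when nb_cfg changed; with 250 chassis that is still 250 events
            # per NB_Global.nb_cfg bump.
            return hasattr(old, 'nb_cfg') and old.nb_cfg != row.nb_cfg

        def run(self, event, row, old):
            # Hypothetical helper: the real driver refreshes agent liveness
            # information for this chassis here.
            self.driver.update_agent_timestamp(row.name, row.nb_cfg)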
After that, the southbound ovsdb sends an update to neutron, either due to
https://review.opendev.org/c/openstack/neutron/+/752795/29/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#249
or
https://review.opendev.org/c/openstack/neutron/+/752795/29/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#264
which is then handled by the Hash Ring implementation to dispatch the
update to the appropriate worker.
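As a rough illustration of that dispatch step (not the actual neutron code),
each event is hashed onto a ring of registered workers, so a single nb_cfg
bump fans 250 Chassis_Private events out across the ring. A minimal sketch,
assuming tooz's HashRing (which the OVN hash ring manager builds on); the
worker names and the use of the row UUID as the hash key are assumptions:

    import uuid
    from tooz import hashring

    # One ring member per API worker across the 3 controllers (30 workers).
    workers = ['worker-%02d' % i for i in range(30)]
    ring = hashring.HashRing(workers)

    # Each Chassis_Private update is hashed (here on its row UUID) onto the
    # ring; only the selected worker is supposed to process that event.
    for _ in range(250):
        row_uuid = uuid.uuid4()
        target = ring.get_nodes(row_uuid.bytes)  # set with a single worker
        print(sorted(target)[0])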
In my testing, when that happened, all neutron API workers stopped
processing API requests until all Chassis_Private events were handled,
which took around 30 seconds on each nb_cfg update. This could be due to
the controller nodes in the test environment not being scaled up
properly, but it looks like a potential scaling issue.
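One way to observe the stall is to poll a cheap API call while nb_cfg gets
bumped and log any unusually long gaps. A rough sketch; the endpoint, token
handling and the 5-second threshold are assumptions:

    import time
    import requests

    NEUTRON_URL = 'http://controller:9696/v2.0/networks?limit=1'
    HEADERS = {'X-Auth-Token': '<token>'}  # placeholder token

    prev = time.monotonic()
    while True:
        requests.get(NEUTRON_URL, headers=HEADERS, timeout=60)
        now = time.monotonic()
        if now - prev > 5:
            # Gaps well above the normal response time suggest the API
            # workers are busy draining Chassis_Private events.
            print('API call took %.1f seconds' % (now - prev))
        prev = now
        time.sleep(1)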
** Affects: neutron
Importance: Undecided
Status: New