[Bug 2111498] [NEW] [OVN] Mechanism driver fails to fix router_ports: KeyError 'neutron:provnet-network-type'
Public bug reported:
## General description:
After an OpenStack upgrade from Zed to Antelope (2023.1), and then from
Antelope to Caracal (2024.1), we found ourselves with a few Neutron
resources stuck in a bad state. More specifically, these resources are
in an invalid state and can no longer be fixed through the usual APIs.
From what I found out, there seem to be discrepancies between the
Neutron and OVN databases that the OVN mechanism driver is unable to fix
automatically, which in turn prevents the affected resources from being
updated by the neutron-server service.
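For context: judging from the traceback further below, this is the ML2/OVN maintenance task, which periodically looks for Neutron resources whose OVN counterpart is out of sync and replays the corresponding create/update into the OVN northbound database. A rough sketch of that loop, as I understand it (my own simplification, not the actual neutron code):

```python
# Rough, simplified sketch of the retry loop implied by the traceback further
# below; only check_for_inconsistencies / _fix_create_update / 'ovn_update'
# appear in that traceback, everything else here is hypothetical.
def check_for_inconsistencies(inconsistent_rows, res_map, context, log):
    for row in inconsistent_rows:  # resources whose Neutron/OVN state diverged
        try:
            # Replay the update against the OVN northbound database.
            res_map['ovn_update'](context, row)
        except Exception as exc:
            # A resource that cannot be fixed (like the KeyError below) simply
            # fails again on every maintenance pass.
            log.error('Maintenance task: Failed to fix resource %s: %s', row, exc)
```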
## Context on the deployment & bug
OpenStack version: 2024.1 (Caracal)
Control-plane deployment: on Kubernetes, using the community Helm charts
Compute deployment: initramfs-booted (no state is kept between two boots, except the compute_id, which is laid down by custom services during initialization)
It may be important to note that, due to our deployment method for the
compute nodes, our upgrade procedure for them is a simple reboot, plus
the required steps to disable the node in Nova. We currently do nothing
with Neutron's agents, which may mean the operation is brutal from
Neutron/OVN's point of view, akin to an unexpected server crash.
## Traces and useful outputs:
Initially, a user reported being unable to update a router's routes
because the service answered with an HTTP 500. The neutron-server logs
showed nothing useful about the error apart from the access log line.
While going through the logs of all Neutron services, I eventually found
a number of entries similar to the following (always the same KeyError,
on various router_ports) in the neutron-server logs:
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance [None req-0e6a2a8f-cdbe-40a5-8d27-a224690040c5 - - - - - -] Maintenance task: Failed to fix resource 9f036aca-78e1-4246-97f9-b98c5cd48011 (type: router_ports): KeyError: 'neutron:provnet-network-type'
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance Traceback (most recent call last):
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance File "/var/lib/openstack/lib/python3.10/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/maintenance.py", line 377, in check_for_inconsistencies
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance self._fix_create_update(admin_context, row)
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance File "/var/lib/openstack/lib/python3.10/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/maintenance.py", line 286, in _fix_create_update
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance res_map['ovn_update'](context, n_obj)
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance File "/var/lib/openstack/lib/python3.10/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py", line 1878, in update_router_port
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance self._update_lrouter_port(context, port, if_exists=if_exists,
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance File "/var/lib/openstack/lib/python3.10/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py", line 1863, in _update_lrouter_port
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance options=self._gen_router_port_options(port),
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance File "/var/lib/openstack/lib/python3.10/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py", line 1701, in _gen_router_port_options
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance network_type = ls.external_ids[ovn_const.OVN_NETTYPE_EXT_ID_KEY]
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance KeyError: 'neutron:provnet-network-type'
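To make the failure concrete: the last frame reads the 'neutron:provnet-network-type' key from the Logical_Switch's external_ids with a plain dictionary subscript, so a switch row that lacks the key (apparently the case for the network behind these router_ports) raises this KeyError on every pass. A minimal illustration (my own sketch, not neutron code; the sample external_ids values are hypothetical):

```python
# Minimal sketch of the failing lookup; only the key name comes from the
# traceback above, the sample external_ids dictionaries are hypothetical.
OVN_NETTYPE_EXT_ID_KEY = 'neutron:provnet-network-type'

def network_type_of(ls_external_ids):
    # What the traceback effectively does (raises KeyError when the key is missing):
    #     return ls_external_ids[OVN_NETTYPE_EXT_ID_KEY]
    # A tolerant variant, handy when inspecting switches manually:
    return ls_external_ids.get(OVN_NETTYPE_EXT_ID_KEY)

healthy_switch = {'neutron:network_name': 'public',
                  'neutron:provnet-network-type': 'flat'}
broken_switch = {'neutron:network_name': 'public'}  # key absent

print(network_type_of(healthy_switch))  # 'flat'
print(network_type_of(broken_switch))   # None -- the driver raises KeyError here instead
```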
I found that one of these errors was related to the resource update issue I had been called about, by checking the OVN northbound database:
> switch fbb1b26a-b915-454e-81e8-6600e1a70811 (neutron-31727984-5c9f-42fd-94f9-d0ccd98f19ba) (aka public)
> ...
> port 550b72c8-f0b7-4b1b-b20e-97fe7a8d4cc3
> type: router
> router-port: lrp-550b72c8-f0b7-4b1b-b20e-97fe7a8d4cc3
> ...
> router 91c7b5e2-ca25-445a-93f3-9b80c4a43d20 (neutron-575d7f62-f79e-40e7-b496-d81c9f525b78) (aka prod)
> port lrp-526e9d00-ccbd-4548-843d-6e84634e50b7
> mac: "REDACTED"
> networks: ["REDACTED"]
> port lrp-550b72c8-f0b7-4b1b-b20e-97fe7a8d4cc3
> mac: "REDACTED"
> networks: ["REDACTED"]
> gateway chassis: [COMPUTE1 COMPUTE2 COMPUTE3 COMPUTE4 COMPUTE5]
> nat 1c2d6bcd-bcf4-441c-9d3d-66cad4b5e6c3
> external ip: "REDACTED"
> logical ip: "REDACTED"
> type: "snat"
Here, one of the five gateway chassis listed for the router (COMPUTE5) is a compute node that has been unavailable since the upgrade.
I wondered whether that could be a factor, but the OVN southbound
database shows that it is not the active chassis for the port:
> Chassis COMPUTE3
> hostname: COMPUTE3
> Encap geneve
> ip: "REDACTED"
> options: {csum="true"}
> Port_Binding cr-lrp-550b72c8-f0b7-4b1b-b20e-97fe7a8d4cc3
All I can tell so far is that the Neutron router we are unable to update is tied to an OVN router_port that the neutron-server cannot automatically fix.
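For cross-referencing the dumps above with the Neutron side, the names appear to follow the usual ML2/OVN conventions: the Logical_Switch is named after the Neutron network UUID, the Logical_Router_Port after the Neutron port UUID, and the gateway Port_Binding adds a cr- prefix. A small sketch with the IDs from this report (the mapping itself is my assumption about the naming scheme; the UUIDs are taken from the dumps):

```python
# Name-mapping sketch, assuming the standard ML2/OVN naming scheme; the UUIDs
# are the ones visible in the NB/SB dumps above.
network_id = '31727984-5c9f-42fd-94f9-d0ccd98f19ba'   # Neutron network (aka public)
port_id = '550b72c8-f0b7-4b1b-b20e-97fe7a8d4cc3'      # Neutron router gateway port

logical_switch = f'neutron-{network_id}'             # NB Logical_Switch name
logical_router_port = f'lrp-{port_id}'               # NB Logical_Router_Port name
chassis_redirect_port = f'cr-{logical_router_port}'  # SB Port_Binding for the gateway

print(logical_switch)          # neutron-31727984-5c9f-42fd-94f9-d0ccd98f19ba
print(logical_router_port)     # lrp-550b72c8-f0b7-4b1b-b20e-97fe7a8d4cc3
print(chassis_redirect_port)   # cr-lrp-550b72c8-f0b7-4b1b-b20e-97fe7a8d4cc3
```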
## Reproduction steps:
Sadly, I am not knowledgeable enough about Neutron/OVN to tell what
could have gone wrong. All I can say is that we noticed these errors
after upgrading our OpenStack deployment. It is possible that our
reboots are too brutal for Neutron/OVN as is, and that we need to
refine our upgrade procedure.
## Perceived Severity
A few existing resources are now unusable for our users, but new resources are unaffected.
Given that our users expect to be able to rely on existing resources, I would rate this issue as a blocker, though for the community it may rank as High priority instead.
## Expectations
I hope to learn what we may have done wrong, if our procedure is the
cause, and how I can fix the current situation manually to unblock our
users.
I will stay available on this bug report to provide any additional
insight I can. Hopefully this initial version provides enough
information.
** Affects: neutron
Importance: Undecided
Status: New
** Tags: ovn
https://bugs.launchpad.net/bugs/2111498