
yahoo-eng-team team mailing list archive

[Bug 2111498] [NEW] [OVN] Mechanism driver fails to fix router_ports: KeyError 'neutron:provnet-network-type'

 

Public bug reported:

## General description:

After an OpenStack upgrade from Zed (2022.2) to Antelope (2023.1), then
from Antelope to Caracal (2024.1), we found ourselves with a few
resources stuck in bad states in Neutron. More specifically, these
resources are in an invalid state and can no longer be fixed through
the usual APIs.

From what I have found, there seem to be discrepancies between the
Neutron and OVN databases which the mechanism driver is unable to fix
automatically, and which in turn prevent the resources from being
updated by the neutron-server service.

## Context on the deployment & bug

OpenStack version: 2024.1 (Caracal)
Control-plane deployment: on Kubernetes, using the community Helm charts
Compute deployment: initramfs-booted (nodes keep no state between two boots, except the compute_id, which is laid down by custom services during initialization)

It may be important to note that, due to our deployment method for the
compute nodes, our upgrade procedure for them is a simple reboot,
preceded by the required steps to disable the node in Nova. We currently
do not handle Neutron's agents, which might mean that the operation is
brutal from Neutron/OVN's point of view, akin to an unexpected server
crash.

## Traces and useful outputs:

Initially, a user reported being unable to update a router's routes
because the service answered with an HTTP 500. Looking at the
neutron-server logs, nothing useful was shown about the error apart
from the access log line.

While going through the logs of all Neutron services, I found a number
of entries similar to the following in the neutron-server logs (always
the same KeyError, on various router_ports):


> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance [None req-0e6a2a8f-cdbe-40a5-8d27-a224690040c5 - - - - - -] Maintenance task: Failed to fix resource 9f036aca-78e1-4246-97f9-b98c5cd48011 (type: router_ports): KeyError: 'neutron:provnet-network-type'
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance Traceback (most recent call last):
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance   File "/var/lib/openstack/lib/python3.10/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/maintenance.py", line 377, in check_for_inconsistencies
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance     self._fix_create_update(admin_context, row)
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance   File "/var/lib/openstack/lib/python3.10/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/maintenance.py", line 286, in _fix_create_update
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance     res_map['ovn_update'](context, n_obj)
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance   File "/var/lib/openstack/lib/python3.10/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py", line 1878, in update_router_port
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance     self._update_lrouter_port(context, port, if_exists=if_exists,
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance   File "/var/lib/openstack/lib/python3.10/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py", line 1863, in _update_lrouter_port
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance     options=self._gen_router_port_options(port),
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance   File "/var/lib/openstack/lib/python3.10/site-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py", line 1701, in _gen_router_port_options
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance     network_type = ls.external_ids[ovn_const.OVN_NETTYPE_EXT_ID_KEY]
> 2025-05-22 11:47:06.959 16 ERROR neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.maintenance KeyError: 'neutron:provnet-network-type'
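
For reference, the lookup that fails is the 'neutron:provnet-network-type' key in the external_ids map of the Logical_Switch backing the port's network. The following is only a small diagnostic sketch (not anything from Neutron itself) showing how one could list which OVN logical switches lack that key; it assumes ovn-nbctl can reach the northbound database from where it runs:

```
#!/usr/bin/env python3
# Diagnostic sketch (hypothetical, not part of Neutron): list the OVN
# logical switches whose external_ids map lacks the key that
# _gen_router_port_options() reads. Assumes ovn-nbctl can reach the NB DB.
import subprocess

MISSING_KEY = "neutron:provnet-network-type"

def nbctl(*args):
    return subprocess.run(["ovn-nbctl", *args], check=True,
                          capture_output=True, text=True).stdout

# With --bare and a single column, each record prints its name on its
# own line, with blank lines between records.
names = [n for n in nbctl("--bare", "--columns=name",
                          "list", "Logical_Switch").splitlines() if n]

for name in names:
    # external_ids is printed as a map, e.g. {"neutron:network_name"=public, ...}
    ext_ids = nbctl("get", "Logical_Switch", name, "external_ids")
    if MISSING_KEY not in ext_ids:
        print(f"{name}: missing {MISSING_KEY}")
```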


I ended up finding that one of these errors was related to the resource update issue I had been called about, by checking the OVN northbound database:


> switch fbb1b26a-b915-454e-81e8-6600e1a70811 (neutron-31727984-5c9f-42fd-94f9-d0ccd98f19ba) (aka public)
> ...
>     port 550b72c8-f0b7-4b1b-b20e-97fe7a8d4cc3
>         type: router
>         router-port: lrp-550b72c8-f0b7-4b1b-b20e-97fe7a8d4cc3
> ...
> router 91c7b5e2-ca25-445a-93f3-9b80c4a43d20 (neutron-575d7f62-f79e-40e7-b496-d81c9f525b78) (aka prod)
>     port lrp-526e9d00-ccbd-4548-843d-6e84634e50b7
>         mac: "REDACTED"
>         networks: ["REDACTED"]
>     port lrp-550b72c8-f0b7-4b1b-b20e-97fe7a8d4cc3
>         mac: "REDACTED"
>         networks: ["REDACTED"]
>         gateway chassis: [COMPUTE1 COMPUTE2 COMPUTE3 COMPUTE4 COMPUTE5]
>     nat 1c2d6bcd-bcf4-441c-9d3d-66cad4b5e6c3
>         external ip: "REDACTED"
>         logical ip: "REDACTED"
>         type: "snat"
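
To cross-check the Neutron side against the OVN output above, something along these lines (again just a sketch; it assumes the openstack CLI and ovn-nbctl are both available, and uses the network ID taken from the switch name above) compares the provider network type Neutron reports with whatever external_ids the logical switch carries:

```
#!/usr/bin/env python3
# Cross-check sketch (hypothetical): compare the provider network type that
# Neutron reports for the external network with the external_ids stored on
# the corresponding OVN logical switch (named "neutron-<network-id>").
import json
import subprocess

NET_ID = "31727984-5c9f-42fd-94f9-d0ccd98f19ba"  # network ID from the NB output above

def run(*cmd):
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

net = json.loads(run("openstack", "network", "show", NET_ID, "-f", "json"))
print("Neutron provider:network_type =", net.get("provider:network_type"))

ext_ids = run("ovn-nbctl", "get", "Logical_Switch", f"neutron-{NET_ID}", "external_ids")
print("OVN Logical_Switch external_ids =", ext_ids.strip())
```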


Here, one of the five gateway chassis listed for the router port (COMPUTE5) is a compute node that has been unavailable since the upgrade.

I wondered whether that could be a factor, but when I checked the OVN
southbound database for the port, I found that COMPUTE5 is not the
active chassis for it:


> Chassis COMPUTE3
>     hostname: COMPUTE3
>     Encap geneve
>         ip: "REDACTED"
>         options: {csum="true"}
>     Port_Binding cr-lrp-550b72c8-f0b7-4b1b-b20e-97fe7a8d4cc3
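
Rather than eyeballing the full ovn-sbctl show output, the binding can also be looked up directly; here is a rough sketch (assuming ovn-sbctl can reach the southbound database) of how one can check which chassis currently claims the cr-lrp port:

```
#!/usr/bin/env python3
# Sketch (hypothetical): find which chassis currently claims the cr-lrp
# Port_Binding, instead of scanning the whole 'ovn-sbctl show' output.
import subprocess

LRP = "cr-lrp-550b72c8-f0b7-4b1b-b20e-97fe7a8d4cc3"

def sbctl(*args):
    return subprocess.run(["ovn-sbctl", *args], check=True,
                          capture_output=True, text=True).stdout.strip()

# The 'chassis' column of the Port_Binding row references the chassis
# that has claimed the port (empty when the port is unbound).
chassis_uuid = sbctl("--bare", "--columns=chassis",
                     "find", "Port_Binding", f"logical_port={LRP}")

if chassis_uuid:
    # Resolve the chassis UUID to its name and hostname.
    print(sbctl("--bare", "--columns=name,hostname", "list", "Chassis", chassis_uuid))
else:
    print(f"{LRP} is not bound to any chassis")
```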


All I can tell so far is that the Neutron router we are unable to update is tied to an OVN router_port that neutron-server cannot automatically fix.

## Reproduction steps:

Sadly, I am not knowledgeable enough about Neutron/OVN to tell what
could have gone wrong. All I can say is that we noticed these errors
after upgrading our OpenStack deployment. It is possible that our
reboots are too brutal for Neutron/OVN as-is, and that we need to
refine our upgrade procedure.

## Perceived Severity

A few existing resources are now unusable for our users, but new resources are unaffected.
Given that our users rely on those existing resources, I would rate this issue as a blocker for us, though to the community it probably sounds like a High priority instead.

## Expectations

If our procedure is the cause, I hope to learn what we did wrong, and
in any case how I can fix the current situation manually in order to
unblock our users.

I will stay available on this bug report to provide any additional
insight I can. Hopefully this initial version provides enough
information.

** Affects: neutron
     Importance: Undecided
         Status: New


** Tags: ovn


-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2111498

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2111498/+subscriptions