
[Bug 2078846] [NEW] OVN fails to transition router port to ACTIVE, claims L2 provisioning block is still present


Public bug reported:

This happened in gate:
https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_1df/927631/1/check/neutron-ovn-grenade-multinode/1dfe841/testr_results.html

From the tempest perspective, the failure looks like the neutron API timed out.

When I inspect one of the failed requests (both are create_router calls), I can
see that the request is stuck for more than a minute and then returns 201,
but by then tempest has already given up.

In the neutron-server log, we see that a router gateway port is created and scheduled:

```
Aug 30 18:47:37.411410 np0038347053 neutron-server[151218]: DEBUG neutron.scheduler.l3_ovn_scheduler [None req-716bd7b7-04a7-4e8a-a1a5-9d6f66e0e6ae None None] Gateway lrp-d368456e-fb18-44b6-8b1b-bfe5b17efcab scheduled on chassis ['9cda8b5d-48b4-470c-9200-595fc41cc4b0'] {{(pid=151218) _schedule_gateway /opt/stack/new/neutron/neutron/scheduler/l3_ovn_scheduler.py:90}}
Aug 30 18:47:37.411841 np0038347053 neutron-server[151218]: INFO neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.commands [None req-716bd7b7-04a7-4e8a-a1a5-9d6f66e0e6ae None None] Schedule LRP lrp-d368456e-fb18-44b6-8b1b-bfe5b17efcab on gateway 9cda8b5d-48b4-470c-9200-595fc41cc4b0 with priority 1
```
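
For orientation: the scheduler's job at this point is just to rank the candidate gateway chassis and write them onto the LRP as Gateway_Chassis entries with priorities. Conceptually it does something like the sketch below (not the literal l3_ovn_scheduler code); with a single candidate, the sole chassis simply gets priority 1, which is what the log shows:

```python
# Conceptual sketch of gateway chassis scheduling, NOT the literal
# neutron/scheduler/l3_ovn_scheduler.py code.
def schedule_gateway(candidates, max_gateways=5):
    # 'candidates' is ordered most-preferred first (e.g. least loaded).
    selected = candidates[:max_gateways]
    # OVN treats the highest Gateway_Chassis priority as the active
    # location, so the most-preferred chassis gets the largest number;
    # with one candidate this degenerates to {chassis: 1}.
    return {chassis: prio
            for prio, chassis in enumerate(reversed(selected), start=1)}
```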

ovn-controller claims the port, as expected:


```
2024-08-30T18:47:37.424Z|00124|binding|INFO|Claiming lport cr-lrp-d368456e-fb18-44b6-8b1b-bfe5b17efcab for this chassis.
2024-08-30T18:47:37.424Z|00125|binding|INFO|cr-lrp-d368456e-fb18-44b6-8b1b-bfe5b17efcab: Claiming fa:16:3e:6c:61:ec 172.24.5.195/24 2001:db8::378/64
```

neutron-server then receives the OVN event about the claim:

```
Aug 30 18:47:37.446442 np0038347053 neutron-server[151218]: DEBUG ovsdbapp.backend.ovs_idl.event [None req-716bd7b7-04a7-4e8a-a1a5-9d6f66e0e6ae None None] Matched UPDATE: LogicalSwitchPortUpdateUpEvent(events=('update',), table='Logical_Switch_Port', conditions=None, old_conditions=None), priority=20 to row=Logical_Switch_Port(up=[True], name=d368456e-fb18-44b6-8b1b-bfe5b17efcab, external_ids={'neutron:cidrs': '172.24.5.195/24 2001:db8::378/64', 'neutron:device_id': '7094f59f-6a10-4481-aaa8-55d8fb688b35', 'neutron:device_owner': 'network:router_gateway', 'neutron:mtu': '', 'neutron:network_name': 'neutron-90a590ed-ede0-48d0-92d2-786c7f66b733', 'neutron:port_capabilities': '', 'neutron:port_name': '', 'neutron:project_id': '', 'neutron:revision_number': '1', 'neutron:security_group_ids': '', 'neutron:subnet_pool_addr_scope4': '', 'neutron:subnet_pool_addr_scope6': '', 'neutron:vnic_type': 'normal'}, ha_chassis_group=[], options={'exclude-lb-vips-from-garp': 'true', 'nat-addresses': 'router', 'router-port': 'lrp-d368456e-fb18-44b6-8b1b-bfe5b17efcab'}, parent_name=[], enabled=[True], type=router, dhcpv6_options=[], tag_request=[], port_security=[], addresses=['router'], dynamic_addresses=[], dhcpv4_options=[], tag=[]) old=Logical_Switch_Port(up=[False]) {{(pid=151218) matches /usr/local/lib/python3.10/dist-packages/ovsdbapp/backend/ovs_idl/event.py:55}}
Aug 30 18:47:37.447796 np0038347053 neutron-server[151218]: INFO neutron.plugins.ml2.drivers.ovn.mech_driver.mech_driver [None req-a3babd5e-e8b3-4826-8588-9e30fc718f49 None None] OVN reports status up for port: d368456e-fb18-44b6-8b1b-bfe5b17efcab
```
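
The wiring behind that event looks roughly like this (a sketch built on the public ovsdbapp RowEvent API; the real neutron class lives in the OVN mech driver's ovsdb monitoring code and differs in details):

```python
# Sketch of the "LSP went up" event handler, based on the ovsdbapp
# RowEvent API; details of the real neutron class may differ.
from ovsdbapp.backend.ovs_idl import event as row_event


class LogicalSwitchPortUpdateUpEvent(row_event.RowEvent):
    def __init__(self, driver):
        self.driver = driver
        super().__init__((self.ROW_UPDATE,), 'Logical_Switch_Port', None)

    def match_fn(self, event, row, old):
        # 'old' carries only the changed columns; fire only on an actual
        # up: [False] -> [True] transition.
        return getattr(old, 'up', None) == [False] and row.up == [True]

    def run(self, event, row, old):
        # row.name is the neutron port ID. The driver logs "OVN reports
        # status up for port: ..." and then runs provisioning_complete()
        # for the L2 entity.
        self.driver.set_port_status_up(row.name)
```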

It then proceeds with completing the L2 provisioning:

```
Aug 30 18:47:37.457253 np0038347053 neutron-server[151218]: DEBUG neutron.db.provisioning_blocks [None req-a3babd5e-e8b3-4826-8588-9e30fc718f49 None None] Provisioning complete for port d368456e-fb18-44b6-8b1b-bfe5b17efcab triggered by entity L2. {{(pid=151218) provisioning_complete /opt/stack/new/neutron/neutron/db/provisioning_blocks.py:144}}
...
Aug 30 18:47:37.457763 np0038347053 neutron-server[151218]: INFO neutron.plugins.ml2.plugin [None req-a3babd5e-e8b3-4826-8588-9e30fc718f49 None None] Attempt 1 to provision port d368456e-fb18-44b6-8b1b-bfe5b17efcab
```

Looking at the code, the first message is logged when provisioning_complete
executes. But if a block had actually been removed from the database, we
would also see:


```
        if remove_provisioning_component(context, object_id, object_type,
                                         entity, standard_attr_id):
            LOG.debug("Provisioning for %(otype)s %(oid)s completed by entity "
                      "%(entity)s.", log_dict)
```

And we don't. Was the provisioning block record not in the database to begin with?
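
For context, the surrounding control flow in provisioning_blocks.py is roughly the following (a simplified paraphrase, not the literal code; _get_provisioning_blocks stands in for the actual DB query):

```python
# Simplified paraphrase of neutron.db.provisioning_blocks.provisioning_complete
# (abridged; see the neutron tree for the authoritative version).
def provisioning_complete(context, object_id, object_type, entity):
    log_dict = {'oid': object_id, 'entity': entity, 'otype': object_type}
    standard_attr_id = _get_standard_attr_id(context, object_id, object_type)
    if not standard_attr_id:
        return
    if remove_provisioning_component(context, object_id, object_type,
                                     entity, standard_attr_id):
        # The "completed by entity" message that is MISSING from our log:
        LOG.debug("Provisioning for %(otype)s %(oid)s completed by entity "
                  "%(entity)s.", log_dict)
    # If no block rows remain, provisioning is considered done; this is
    # the "Provisioning complete ... triggered by entity L2" message we
    # DO see.
    if not _get_provisioning_blocks(context, standard_attr_id):
        LOG.debug("Provisioning complete for %(otype)s %(oid)s triggered "
                  "by entity %(entity)s.", log_dict)
        registry.publish(object_type, PROVISIONING_COMPLETE, entity,
                         payload=events.DBEventPayload(
                             context, resource_id=object_id))
```

If the code has this shape, seeing the "triggered by" line without the "completed by" line means remove_provisioning_component found no row to delete while the remaining-count was already zero, i.e. the L2 block was not in the database when the OVN event arrived.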


Then, later, it fails to switch the port to ACTIVE in the neutron database because:

```
Aug 30 18:47:37.499272 np0038347053 neutron-server[151218]: DEBUG neutron.plugins.ml2.plugin [None req-a3babd5e-e8b3-4826-8588-9e30fc718f49 None None] Port d368456e-fb18-44b6-8b1b-bfe5b17efcab cannot update to ACTIVE because it is not bound. {{(pid=151218) _port_provisioned /opt/stack/new/neutron/neutron/plugins/ml2/plugin.py:375}}
```

I think this is fine, since it's a router port. The provisioning hook,
_port_provisioned, then exits early (since it's not a compute-owned port).
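
That early exit looks roughly like the following (a simplified sketch of Ml2Plugin._port_provisioned; the real method handles more cases):

```python
# Simplified sketch of Ml2Plugin._port_provisioned in
# neutron/plugins/ml2/plugin.py; not the literal code.
def _port_provisioned(self, rtype, event, trigger, payload=None):
    port_id = payload.resource_id
    port = db.get_port(payload.context, port_id)
    if not port or not port.port_bindings:
        return
    binding = port.port_bindings[0]
    if binding.vif_type in (portbindings.VIF_TYPE_BINDING_FAILED,
                            portbindings.VIF_TYPE_UNBOUND):
        # Our case: the gateway port is not bound yet, so the status
        # update is skipped and the port stays DOWN.
        LOG.debug("Port %s cannot update to ACTIVE because it is not "
                  "bound.", port_id)
        return
    self.update_port_status(payload.context, port_id,
                            const.PORT_STATUS_ACTIVE)
```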


Later, the driver complains that the port's transition will remain blocked by the L2 provisioning block:

```
Aug 30 18:47:37.771402 np0038347053 neutron-server[151219]: DEBUG neutron.db.provisioning_blocks [None req-0a3bd173-d64e-4a84-8704-59063a3d0959 None None] Transition to ACTIVE for port object d368456e-fb18-44b6-8b1b-bfe5b17efcab will not be triggered until provisioned by entity L2. {{(pid=151219) add_provisioning_component /opt/stack/new/neutron/neutron/db/provisioning_blocks.py:74}}
```

This message comes from add_provisioning_component. Why is the block
being added right when we were about to remove it because of the L2
claim?

The message just above suggests it's from an update call on the port:

```
Aug 30 18:47:37.763123 np0038347053 neutron-server[151219]: DEBUG neutron_lib.callbacks.manager [None req-0a3bd173-d64e-4a84-8704-59063a3d0959 None None] Publish callbacks ['neutron.services.qos.qos_plugin.QoSPlugin._validate_update_port_callback-263536017'] for port (d368456e-fb18-44b6-8b1b-bfe5b17efcab), precommit_update {{(pid=151219) _notify_loop /usr/local/lib/python3.10/dist-packages/neutron_lib/callbacks/manager.py:184}}
```

But I couldn't figure out immediately which update it was.
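
One plausible source (a hedged guess; port_needs_binding below is hypothetical shorthand, while add_provisioning_component and L2_AGENT_ENTITY are the real neutron names): ML2 re-adds the L2 block whenever a port update leads it to (re)bind the port, roughly:

```python
# Hedged sketch of the ML2 update path that re-adds the block; the
# port_needs_binding predicate is hypothetical shorthand for whatever
# condition the plugin actually evaluates.
if port_needs_binding(updated_port):
    provisioning_blocks.add_provisioning_component(
        context, updated_port['id'], resources.PORT,
        provisioning_blocks.L2_AGENT_ENTITY)
```

Note also that the "port up" handling above ran in worker 151218 while this update ran in worker 151219, so the re-add can race with the provisioning_complete call.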


A bit later:

```
Aug 30 18:47:52.291827 np0038347053 neutron-server[151218]: WARNING neutron.plugins.ml2.drivers.ovn.mech_driver.ovsdb.ovn_client [None req-a3babd5e-e8b3-4826-8588-9e30fc718f49 None None] No hosting information found for port d368456e-fb18-44b6-8b1b-bfe5b17efcab: RuntimeError: No hosting information found for port d368456e-fb18-44b6-8b1b-bfe5b17efcab
```

Eventually, almost a minute after the requests started:

```
Aug 30 18:48:35.235725 np0038347053 neutron-server[151218]: INFO neutron.db.ovn_revision_numbers_db [None req-95b12168-c978-4195-b1cf-658e50efc7bd admin admin] Successfully bumped revision number for resource d368456e-fb18-44b6-8b1b-bfe5b17efcab (type: router_ports) to 1
...
Aug 30 18:48:36.048995 np0038347053 neutron-server[151219]: DEBUG neutron.plugins.ml2.managers [None req-0a3bd173-d64e-4a84-8704-59063a3d0959 None None] Attempting to bind port d368456e-fb18-44b6-8b1b-bfe5b17efcab on host np0038347053 for vnic_type normal with profile  {{(pid=151219) bind_port /opt/stack/new/neutron/neutron/plugins/ml2/managers.py:810}}
Aug 30 18:48:36.049143 np0038347053 neutron-server[151219]: DEBUG neutron.plugins.ml2.managers [None req-0a3bd173-d64e-4a84-8704-59063a3d0959 None None] Attempting to bind port d368456e-fb18-44b6-8b1b-bfe5b17efcab by drivers ovn,logger on host np0038347053 at level 0 using segments [{'id': '0f624452-4286-4ce4-9bc0-aa5c3726df31', 'network_type': 'flat', 'physical_network': 'public', 'segmentation_id': None, 'network_id': '90a590ed-ede0-48d0-92d2-786c7f66b733'}] {{(pid=151219) _bind_port_level /opt/stack/new/neutron/neutron/plugins/ml2/managers.py:835}}
...
Aug 30 18:48:36.052615 np0038347053 neutron-server[151219]: DEBUG neutron.plugins.ml2.managers [None req-0a3bd173-d64e-4a84-8704-59063a3d0959 None None] Bound port: d368456e-fb18-44b6-8b1b-bfe5b17efcab, host: np0038347053, vif_type: ovs, vif_details: {"port_filter": true, "connectivity": "l2", "bridge_name": "br-int", "datapath_type": "system"}, binding_levels: [{'bound_driver': 'ovn', 'bound_segment': {'id': '0f624452-4286-4ce4-9bc0-aa5c3726df31', 'network_type': 'flat', 'physical_network': 'public', 'segmentation_id': None, 'network_id': '90a590ed-ede0-48d0-92d2-786c7f66b733'}}] {{(pid=151219) _bind_port_level /opt/stack/new/neutron/neutron/plugins/ml2/managers.py:947}}
```
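
To pin down whether the L2 block row really existed during that window, something like the following could be run against the neutron database (a hypothetical diagnostic helper; it assumes the provisioningblocks table joins to ports via standard_attr_id, which is worth double-checking against the schema):

```python
import sqlalchemy as sa


def leftover_provisioning_blocks(session, port_id):
    """List provisioning-block entities still attached to a port.

    Hypothetical diagnostic; assumes neutron's 'provisioningblocks'
    table is keyed by the port's standard_attr_id.
    """
    return session.execute(
        sa.text(
            "SELECT pb.entity FROM provisioningblocks pb "
            "JOIN ports p ON p.standard_attr_id = pb.standard_attr_id "
            "WHERE p.id = :port_id"),
        {"port_id": port_id},
    ).fetchall()
```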


---


This happened in a grenade job, but I don't think it's grenade-specific.

** Affects: neutron
     Importance: Undecided
         Status: New


** Tags: gate-failure ovn

** Tags added: ovn

** Tags added: gate-failure

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2078846
