
yahoo-eng-team team mailing list archive

[Bug 1969354] Re: ovn-controller doesn't update new flows

 

It looks more like a configuration or scaling problem in your
environment than a bug in ovn-controller. If ovn-controller is
disconnected from the databases, it cannot process new flows.

Questions:
1. ovsdb is disconnecting after 5 seconds; how can I fix it?
Yes, that is the default. You can increase the probe interval by running 'ovs-vsctl set open . external_ids:ovn-remote-probe-interval="60000"'.
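The ovsdb-server logs below also show the server side dropping clients after 5 seconds, so the server-side inactivity probes are worth raising as well. A rough sketch, assuming each database has a single Connection record and that 60 seconds is an acceptable interval; the values and the Manager-table note are assumptions about your deployment, not a verified kolla-ansible recipe:

  # ovn-controller -> OVN SB probe (client side, per compute node)
  $ ovs-vsctl set open . external_ids:ovn-remote-probe-interval="60000"

  # OVN SB/NB -> client probes (server side, where the OVN DBs run)
  $ ovn-sbctl set connection . inactivity_probe=60000
  $ ovn-nbctl set connection . inactivity_probe=60000

  # Local ovsdb-server on 127.0.0.1:6640: if its ptcp remote is configured
  # through the Manager table, the same idea applies there
  $ ovs-vsctl set manager . inactivity_probe=60000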

2. ovn-controller reports 'OVNSB commit failed, force recompute next time.' What does that mean?
It means ovn-controller attempted a write to the OVN Southbound database and the transaction failed. It will recompute all OpenFlow flows on the next iteration.
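If needed, a recompute can also be triggered by hand once the SB connection is stable again, to check that the chassis catches up. A sketch; the exact command name depends on the OVN release (it was renamed at some point), so treat both forms as candidates:

  # Newer releases
  $ ovs-appctl -t ovn-controller inc-engine/recompute

  # Older releases
  $ ovs-appctl -t ovn-controller recompute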

3. The ovn-controller log shows 100% CPU usage. It looks like it only uses one CPU for processing; maybe that CPU on the compute node is also being used by a VM. How can I fix it? Is it possible for it to use more CPUs?
No, ovn-controller is a single-process, single-threaded daemon.
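Since it is single-threaded, adding CPUs will not help by itself; what matters is that the one core it runs on is not also saturated by VM vCPUs. A quick way to check, assuming standard Linux tools on the compute host (the commands are illustrative):

  # Per-thread CPU usage of ovn-controller
  $ top -H -p $(pidof ovn-controller)

  # Which core the process last ran on (PSR column) and its CPU share
  $ ps -o pid,psr,pcpu,comm -p $(pidof ovn-controller)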

4. Why does a live migration bring ovn-controller down? Is it a flow conflict? Is there any way to deal with this situation?
It looks like the whole environment is unstable; you should check whether the Southbound database is overloaded.
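A few ways to gauge the Southbound database, as a sketch; the control socket and database file paths below are typical defaults and may differ inside kolla containers:

  # Memory and table statistics of the SB ovsdb-server
  $ ovs-appctl -t /var/run/ovn/ovnsb_db.ctl memory/show

  # If the SB DB is clustered (RAFT), check cluster health
  $ ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound

  # Rough on-disk size of the SB database
  $ ls -lh /var/lib/ovn/ovnsb_db.db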

** Changed in: neutron
       Status: New => Invalid

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1969354

Title:
  ovn-controller doesn't update new flows

Status in neutron:
  Invalid

Bug description:
  Problem: ovn-controller is down and cannot update new flows

  Description:
  - Running OpenStack Victoria, built by kolla-ansible
  - We have 2 compute nodes (A & B), with ~70 VMs and >80k flows per compute node
  - I live-migrated some VMs from compute A to compute B
  - `openstack network agent list` shows the ovn-controller agent on compute B as down
  - On ovn-nb-db, I checked `ovn-nbctl list NB_Global`. It shows hv_cfg < nb_cfg
  - On ovn-sb-db, I checked `ovn-sbctl list Chassis`. It shows that nb_cfg of compute A equals nb_cfg of ovn_nb, but nb_cfg of compute B < nb_cfg of ovn_nb
  ==> Neutron therefore reports the ovn-controller on compute B as down
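  (The checks above in command form, as a sketch; on newer OVN releases nb_cfg is reported in the Chassis_Private table rather than Chassis:)

    $ openstack network agent list                        # agent state per chassis
    $ ovn-nbctl list NB_Global | grep -E 'nb_cfg|hv_cfg'
    $ ovn-sbctl list Chassis | grep -E 'hostname|nb_cfg'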

  Workaround:
  - Flush all flows
  - Restart ovn-controller
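  (Roughly, assuming br-int is the integration bridge and the kolla-ansible container is named ovn_controller; note that flushing flows interrupts dataplane traffic until ovn-controller reinstalls them:)

    $ ovs-ofctl del-flows br-int      # flush all OpenFlow flows on the integration bridge
    $ docker restart ovn_controller   # restart ovn-controller so it reinstalls flows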

  Reproduce:
  - Cannot reproduce it with live migration

  Logs:
  1. ovn-controller on compute B:
  2022-04-15T09:53:22.156Z|12616545|timeval|WARN|Unreasonably long 9077ms poll interval (8944ms user, 131ms system)
  2022-04-15T09:53:22.156Z|12616546|timeval|WARN|faults: 69006 minor, 0 major
  2022-04-15T09:53:22.156Z|12616547|timeval|WARN|disk: 0 reads, 8 writes
  2022-04-15T09:53:22.156Z|12616548|timeval|WARN|context switches: 0 voluntary, 19 involuntary
  2022-04-15T09:53:22.156Z|12616549|poll_loop|INFO|Dropped 280 log messages in last 10 seconds (most recently, 9 seconds ago) due to excessive rate
  2022-04-15T09:53:22.156Z|12616550|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (127.0.0.1:30242<->127.0.0.1:6640) at lib/stream-fd.c:157 (100% CPU usage)
  2022-04-15T09:53:22.156Z|12616551|jsonrpc|WARN|tcp:127.0.0.1:6640: send error: Broken pipe
  2022-04-15T09:53:22.156Z|12616552|jsonrpc|WARN|tcp:172.19.20.11:6642: send error: Broken pipe
  2022-04-15T09:53:22.160Z|12616553|reconnect|WARN|tcp:127.0.0.1:6640: connection dropped (Broken pipe)
  2022-04-15T09:53:22.160Z|12616554|reconnect|WARN|tcp:172.19.20.11:6642: connection dropped (Broken pipe)
  2022-04-15T09:53:23.160Z|12616555|reconnect|INFO|tcp:127.0.0.1:6640: connecting...
  2022-04-15T09:53:23.160Z|12616556|reconnect|INFO|tcp:172.19.20.10:6642: connecting...
  2022-04-15T09:53:23.163Z|12616557|reconnect|INFO|tcp:127.0.0.1:6640: connected
  2022-04-15T09:53:23.163Z|12616558|reconnect|INFO|tcp:172.19.20.10:6642: connected
  2022-04-15T09:53:23.165Z|12616559|main|INFO|OVNSB commit failed, force recompute next time.
  2022-04-15T09:53:31.991Z|12616569|timeval|WARN|Unreasonably long 8805ms poll interval (8708ms user, 94ms system)
  2022-04-15T09:53:31.991Z|12616570|timeval|WARN|faults: 69251 minor, 0 major
  2022-04-15T09:53:31.991Z|12616571|timeval|WARN|context switches: 0 voluntary, 19 involuntary
  2022-04-15T09:53:31.991Z|12616572|poll_loop|INFO|Dropped 406 log messages in last 10 seconds (most recently, 9 seconds ago) due to excessive rate
  2022-04-15T09:53:31.991Z|12616573|poll_loop|INFO|wakeup due to [POLLIN] on fd 16 (127.0.0.1:30540<->127.0.0.1:6640) at lib/stream-fd.c:157 (100% CPU usage)
  2022-04-15T09:53:31.991Z|12616574|poll_loop|INFO|wakeup due to [POLLIN] on fd 19 (172.19.20.100:8744<->172.19.20.10:6642) at lib/stream-fd.c:157 (100% CPU usage)

  2. ovsdb-server on compute B:
  2022-04-15T09:15:50.489Z|79819|jsonrpc|WARN|tcp:127.0.0.1:31970: receive error: Connection reset by peer
  2022-04-15T09:15:50.489Z|79820|reconnect|WARN|tcp:127.0.0.1:31970: connection dropped (Connection reset by peer)
  2022-04-15T09:17:50.763Z|79821|reconnect|ERR|tcp:127.0.0.1:32952: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:19:19.271Z|79822|reconnect|ERR|tcp:127.0.0.1:34554: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:19:50.806Z|79823|reconnect|ERR|tcp:127.0.0.1:35858: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:20:49.662Z|79824|reconnect|ERR|tcp:127.0.0.1:36300: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:21:21.746Z|79825|reconnect|ERR|tcp:127.0.0.1:36722: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:22:01.375Z|79826|reconnect|ERR|tcp:127.0.0.1:36914: no response to inactivity probe after 5.01 seconds, disconnecting
  2022-04-15T09:22:57.564Z|79827|reconnect|ERR|tcp:127.0.0.1:37582: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:23:44.051Z|79828|reconnect|ERR|tcp:127.0.0.1:38234: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:24:27.935Z|79829|reconnect|ERR|tcp:127.0.0.1:39244: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:25:03.414Z|79830|reconnect|ERR|tcp:127.0.0.1:39484: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:25:58.860Z|79831|reconnect|ERR|tcp:127.0.0.1:39700: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:27:34.469Z|79832|reconnect|ERR|tcp:127.0.0.1:40224: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:29:07.546Z|79833|reconnect|ERR|tcp:127.0.0.1:41346: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:29:52.822Z|79834|reconnect|ERR|tcp:127.0.0.1:41930: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:30:40.555Z|79835|reconnect|ERR|tcp:127.0.0.1:42470: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:31:19.660Z|79836|reconnect|ERR|tcp:127.0.0.1:42872: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:32:07.061Z|79837|reconnect|ERR|tcp:127.0.0.1:43214: no response to inactivity probe after 5 seconds, disconnecting

  3. ovn-sb-db on controller openstack:
  2022-04-15T09:55:37.516Z|322313|jsonrpc|WARN|tcp:172.19.20.12:63278: receive error: Connection reset by peer
  2022-04-15T09:55:37.516Z|322314|reconnect|WARN|tcp:172.19.20.12:63278: connection dropped (Connection reset by peer)
  2022-04-15T09:55:46.857Z|322315|reconnect|ERR|tcp:172.19.20.11:35978: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:55:46.974Z|322316|reconnect|ERR|tcp:172.19.20.12:22952: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:55:49.367Z|322317|reconnect|ERR|tcp:172.19.20.101:12810: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:55:50.302Z|322318|reconnect|ERR|tcp:172.19.20.31:59604: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:55:55.287Z|322319|reconnect|ERR|tcp:172.19.20.102:61964: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:56:30.760Z|322320|reconnect|ERR|tcp:172.19.20.102:62190: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:56:48.494Z|322321|reconnect|ERR|tcp:172.19.20.105:53684: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:56:48.765Z|322322|reconnect|ERR|tcp:172.19.20.10:22338: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:57:05.311Z|322323|reconnect|ERR|tcp:172.19.20.102:62422: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:57:13.860Z|322324|reconnect|ERR|tcp:172.19.20.30:38406: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:57:39.726Z|322325|reconnect|ERR|tcp:172.19.20.102:62642: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:58:12.186Z|322326|reconnect|ERR|tcp:172.19.20.12:32468: no response to inactivity probe after 5 seconds, disconnecting
  2022-04-15T09:58:14.320Z|322327|reconnect|ERR|tcp:172.19.20.102:62820: no response to inactivity probe after 5 seconds, disconnecting

  4. ovn-sb-db on controller:
  2022-04-15T01:28:50.733Z|22383|jsonrpc|WARN|tcp:172.19.20.12:25398: receive error: Connection reset by peer
  2022-04-15T01:28:50.733Z|22384|reconnect|WARN|tcp:172.19.20.12:25398: connection dropped (Connection reset by peer)
  2022-04-15T03:47:16.056Z|22385|jsonrpc|WARN|tcp:172.19.20.11:14554: receive error: Connection reset by peer
  2022-04-15T03:47:16.056Z|22386|reconnect|WARN|tcp:172.19.20.11:14554: connection dropped (Connection reset by peer)
  2022-04-15T03:47:31.910Z|22387|jsonrpc|WARN|tcp:172.19.20.11:15666: receive error: Connection reset by peer
  2022-04-15T03:47:31.910Z|22388|reconnect|WARN|tcp:172.19.20.11:15666: connection dropped (Connection reset by peer)
  2022-04-15T04:06:03.251Z|22389|jsonrpc|WARN|tcp:172.19.20.10:60386: receive error: Connection reset by peer
  2022-04-15T04:06:03.251Z|22390|reconnect|WARN|tcp:172.19.20.10:60386: connection dropped (Connection reset by peer)
  2022-04-15T05:54:59.704Z|22391|jsonrpc|WARN|tcp:172.19.20.12:59450: receive error: Connection reset by peer
  2022-04-15T05:54:59.704Z|22392|reconnect|WARN|tcp:172.19.20.12:59450: connection dropped (Connection reset by peer)
  2022-04-15T07:16:21.438Z|22393|jsonrpc|WARN|tcp:172.19.20.12:59508: receive error: Connection reset by peer
  2022-04-15T07:16:21.438Z|22394|reconnect|WARN|tcp:172.19.20.12:59508: connection dropped (Connection reset by peer)
  2022-04-15T07:16:31.218Z|22395|jsonrpc|WARN|tcp:172.19.20.10:43826: receive error: Connection reset by peer
  2022-04-15T07:16:31.218Z|22396|reconnect|WARN|tcp:172.19.20.10:43826: connection dropped (Connection reset by peer)
  2022-04-15T07:16:31.356Z|22397|jsonrpc|WARN|tcp:172.19.20.11:50460: receive error: Connection reset by peer
  2022-04-15T07:16:31.356Z|22398|reconnect|WARN|tcp:172.19.20.11:50460: connection dropped (Connection reset by peer)
  2022-04-15T08:20:10.313Z|22399|jsonrpc|WARN|tcp:172.19.20.10:25824: receive error: Connection reset by peer
  2022-04-15T08:20:10.313Z|22400|reconnect|WARN|tcp:172.19.20.10:25824: connection dropped (Connection reset by peer)
  2022-04-15T08:20:19.600Z|22401|jsonrpc|WARN|tcp:172.19.20.12:28378: receive error: Connection reset by peer
  2022-04-15T08:20:19.600Z|22402|reconnect|WARN|tcp:172.19.20.12:28378: connection dropped (Connection reset by peer)
  2022-04-15T08:52:14.280Z|22403|jsonrpc|WARN|tcp:172.19.20.11:19300: receive error: Connection reset by peer
  2022-04-15T08:52:14.280Z|22404|reconnect|WARN|tcp:172.19.20.11:19300: connection dropped (Connection reset by peer)

  
  Questions:
  1. ovsdb is disconnecting after 5 seconds; how can I fix it?
  2. ovn-controller reports 'OVNSB commit failed, force recompute next time.' What does that mean?
  3. The ovn-controller log shows 100% CPU usage. It looks like it only uses one CPU for processing; maybe that CPU on the compute node is also being used by a VM. How can I fix it? Is it possible for it to use more CPUs?
  4. Why does a live migration bring ovn-controller down? Is it a flow conflict? Is there any way to deal with this situation?

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1969354/+subscriptions


