[Bug 1988039] Re: ovs idl not monitor tables after reconnect

 

** Also affects: neutron
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1988039

Title:
  ovs idl not monitor tables after reconnect

Status in neutron:
  New
Status in ovsdbapp:
  In Progress

Bug description:
  I came across a strange phenomenon. When I restart a node (the node
  runs the OVN nb, sb and mariadb services), then after a period of time
  "openstack network agent list" sometimes shows all agents as down.

  By observing the logs, I found that all agents are reported as down
  only when the request is handled by certain processes.

  Packet capture shows that the SB database does not notify this process
  of changes to the Chassis_Private table; it only notifies it of changes
  to the Database table.
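
  For illustration, the expected behaviour can be cross-checked with a
  minimal standalone IDL that registers only Chassis_Private (a sketch;
  the SB endpoint below is an assumed example value):

  import time

  from ovs.db import idl
  from ovsdbapp.backend.ovs_idl import idlutils

  SB_CONNECTION = 'tcp:127.0.0.1:6642'  # assumed SB endpoint

  # Fetch the schema and register only the table we care about.
  helper = idlutils.get_schema_helper(SB_CONNECTION, 'OVN_Southbound')
  helper.register_table('Chassis_Private')

  sb_idl = idl.Idl(SB_CONNECTION, helper)
  seqno = sb_idl.change_seqno
  while True:
      sb_idl.run()
      if sb_idl.change_seqno != seqno:
          seqno = sb_idl.change_seqno
          print('Chassis_Private changed, seqno=%d' % seqno)
      time.sleep(1)

  A healthy process keeps receiving these updates; the affected process
  stops receiving them after the failed reconnect described below.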

  After checking the process log, I found that an exception was printed
  [1] the last time the process connected to the SB database.

  The relevant code is shown below [2]:

  def run(self):
      errors = 0
      while self.is_running:
          # If we fail in an Idl call, we could have missed an update
          # from the server, leaving us out of sync with ovsdb-server.
          # It is not safe to continue without restarting the connection.
          # Though it is likely that the error is unrecoverable, keep trying
          # indefinitely just in case.
          try:
              self.idl.wait(self.poller)
              self.poller.fd_wait(self.txns.alert_fileno, poller.POLLIN)
              self.poller.block()
              with self.lock:
                  self.idl.run()                              # -------- point-1
          except Exception as e:
              # This shouldn't happen, but is possible if there is a bug
              # in python-ovs
              errors += 1
              LOG.exception(e)
              with self.lock:
                  self.idl.force_reconnect()                  # -------- point-2
                  try:
                      idlutils.wait_for_change(self.idl, self.timeout)  # -------- point-3
                  except Exception as e:
                      # This could throw the same exception as idl.run()
                      # or Exception("timeout"), either way continue
                      LOG.exception(e)
              sleep = min(2 ** errors, 60)
              LOG.info("Trying to recover, sleeping %s seconds", sleep)

  When the notification is processed, if an exception occurs ("Unable to
  connect to the database"), it is thrown at point-1 and we reconnect at
  point-2. At point-3 the IDL sends the server monitor request
  (send_server_monitor) and handles the changes of the Database table. If
  we still cannot connect to the database at this point, we handle the
  exception [3]. As a result, the following action (send_monitor) is never
  performed, so the data tables are no longer monitored after the reconnect.
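
  For reference, the idlutils.wait_for_change used at point-3 does roughly
  the following (a paraphrased sketch, not the exact ovsdbapp code): it
  only drives idl.run() until change_seqno moves or the timeout expires.

  import time

  from ovs import poller

  def wait_for_change(_idl, timeout, seqno=None):
      if seqno is None:
          seqno = _idl.change_seqno
      stop = time.time() + timeout
      # Loop until the IDL reports a change or the timeout is reached;
      # nothing here verifies that the data-table monitors were re-sent.
      while _idl.change_seqno == seqno and not _idl.run():
          ovs_poller = poller.Poller()
          _idl.wait(ovs_poller)
          ovs_poller.timer_wait(timeout * 1000)
          ovs_poller.block()
          if time.time() > stop:
              raise Exception("timeout")

  Whether this returns or raises, the connection thread simply continues
  its loop, so there is no point at which the re-registration of the data
  table monitors (send_monitor) is confirmed.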

  
  [1] https://opendev.org/openstack/ovsdbapp/src/commit/96cf8d6288587423e65d5149016e07fb51430724/ovsdbapp/backend/ovs_idl/connection.py#L121
  [2] https://opendev.org/openstack/ovsdbapp/src/commit/96cf8d6288587423e65d5149016e07fb51430724/ovsdbapp/backend/ovs_idl/connection.py#L95-L123

  [3]
  https://opendev.org/openstack/neutron/src/commit/7dfe41ab8f9ecf6266c7a51c0223ff8f8822c16f/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovsdb_monitor.py#L719

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1988039/+subscriptions