yahoo-eng-team team mailing list archive

[Bug 1247603] Re: nova-conductor process can't create consumer connection to qpid after HeartbeatTimeout in heavy workload

 

** Changed in: nova
       Status: Incomplete => Opinion

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1247603

Title:
  nova-conductor process can't create consumer connection to qpid after
  HeartbeatTimeout in heavy workload

Status in OpenStack Compute (Nova):
  Opinion

Bug description:
  nova-conductor loses its queue and can no longer receive requests
  after running a workload for some time. The same problem occurs in
  the nova-compute process; both share the same impl_qpid.py.

  When nova-conductor is under heavy workload, a HeartbeatTimeout
  exception is raised. The exception is caught and the process tries
  to reconnect to the qpid server. The logs show that the reconnect
  fails in method iterconsume but succeeds in method publisher_send.
  That means we can still send messages to the qpid queue, but can't
  receive messages from it.

  impl_qpid.py:
      def ensure(self, error_callback, method, *args, **kwargs):
          while True:
              try:
                  # HeartbeatTimeout is raised here
                  return method(*args, **kwargs)
              except (qpid_exceptions.Empty,
                      qpid_exceptions.ConnectionError), e:
                  if error_callback:
                      error_callback(e)
                  # reconnect, then retry the call
                  self.reconnect()

  
  The ensure method is used in iterconsume:

      def iterconsume(self, limit=None, timeout=None):
          """Return an iterator that will consume from all queues/consumers"""

          def _error_callback(exc):
              if isinstance(exc, qpid_exceptions.Empty):
                  LOG.debug(_('Timed out waiting for RPC response: %s') %
                            str(exc))
                  raise rpc_common.Timeout()
              else:
                  LOG.exception(_('Failed to consume message from queue: %s') %
                                str(exc))

          def _consume():
              nxt_receiver = self.session.next_receiver(timeout=timeout)
              try:
                  self._lookup_consumer(nxt_receiver).consume()
              except Exception:
                  LOG.exception(_("Error processing message.  Skipping it."))

          for iteration in itertools.count(0):
              if limit and iteration >= limit:
                  raise StopIteration
              # the reconnect fails on this path
              yield self.ensure(_error_callback, _consume)

  
  and in publisher_send:

      def publisher_send(self, cls, topic, msg):
          """Send to a publisher based on the publisher class"""

          def _connect_error(exc):
              log_info = {'topic': topic, 'err_str': str(exc)}
              LOG.exception(_("Failed to publish message to topic "
                            "'%(topic)s': %(err_str)s") % log_info)

          def _publisher_send():
              publisher = cls(self.conf, self.session, topic)
              publisher.send(msg)

          # the reconnect succeeds on this path
          return self.ensure(_connect_error, _publisher_send)
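
  For anyone who wants to step through the retry logic outside nova,
  here is a minimal sketch of the ensure() loop quoted above. It is
  not the nova or qpid code: FakeConnectionError, FakeHeartbeatTimeout,
  FakeConnection and make_flaky are hypothetical stand-ins, and it
  assumes HeartbeatTimeout is handled as a ConnectionError, as the
  description implies. It does not reproduce the failure; it only shows
  the path a caught exception is expected to take (callback, reconnect,
  retry).

```python
class FakeConnectionError(Exception):
    """Hypothetical stand-in for qpid_exceptions.ConnectionError."""


class FakeHeartbeatTimeout(FakeConnectionError):
    """Hypothetical stand-in for HeartbeatTimeout (assumed to be a
    kind of connection error, as the report implies)."""


class FakeConnection(object):
    """Hypothetical stand-in for the Connection class in impl_qpid.py."""

    def __init__(self):
        self.reconnects = 0

    def reconnect(self):
        # The real method re-opens the qpid connection and session;
        # here we only count how often it is called.
        self.reconnects += 1

    def ensure(self, error_callback, method, *args, **kwargs):
        # Same shape as the loop quoted above: call, and on a
        # connection error notify the callback, reconnect and retry.
        while True:
            try:
                return method(*args, **kwargs)
            except FakeConnectionError as e:
                if error_callback:
                    error_callback(e)
                self.reconnect()


def make_flaky(failures):
    """Return a callable that raises FakeHeartbeatTimeout `failures`
    times and then succeeds."""
    state = {"left": failures}

    def method():
        if state["left"] > 0:
            state["left"] -= 1
            raise FakeHeartbeatTimeout("heartbeat timeout")
        return "ok"

    return method


conn = FakeConnection()
seen = []  # exceptions passed to the error callback
result = conn.ensure(seen.append, make_flaky(2))
print(result, conn.reconnects, len(seen))
```

  In this sketch both callers would recover the same way, so the
  difference seen in the logs between iterconsume and publisher_send
  has to come from something the sketch leaves out, such as what
  reconnect() does with the existing consumers.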

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1247603/+subscriptions