yahoo-eng-team team mailing list archive

[Bug 2060758] [NEW] nova compute service will down when ceph public network down

 

Public bug reported:

Description
===========
We use Ceph as the backend storage for nova. If the compute node's Ceph public network goes down, the nova-compute process hangs and stops reporting its heartbeat, so the compute node's nova-compute service is marked down.

Cause of the problem:
A nova-compute periodic task checks disk usage. To do so it connects to the Ceph cluster through the rados client, and that connection attempt hangs when the Ceph public network is down.
The relevant code:
    def _connect_to_rados(self, pool=None):
        client = rados.Rados(rados_id=self.rbd_user,
                                  conffile=self.ceph_conf)
        try:
            client.connect(timeout=self.rbd_connect_timeout)
            pool_to_open = pool or self.pool
            # NOTE(luogangyi): open_ioctx >= 10.1.0 could handle unicode
            # arguments perfectly as part of Python 3 support.
            # Therefore, when we turn to Python 3, it's safe to remove
            # str() conversion.
            ioctx = client.open_ioctx(str(pool_to_open))
            return client, ioctx
        except rados.Error:
            # shutdown cannot raise an exception
            client.shutdown()
            raise
The timeout parameter of client.connect() has been ignored since the Ceph Nautilus release; the client_mount_timeout option in ceph.conf is used instead. So if the storage public network is down, the rados client falls back to its default timeout mechanism: each attempt times out after 5 minutes and is retried 10 times, for a total timeout of 50 minutes.
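
To put the figures above in perspective, here is a small sketch of the arithmetic. The 5-minute and 10-attempt values come from this report; the 60-second value is an assumption based on nova's default service_down_time and may differ in your nova.conf.

```python
from datetime import timedelta

# Figures taken from the report above.
PER_ATTEMPT = timedelta(minutes=5)
ATTEMPTS = 10

# Assumption: nova's service_down_time, 60 seconds by default.
SERVICE_DOWN_TIME = timedelta(seconds=60)


def worst_case_block(per_attempt, attempts):
    """Longest time the connect call can block if every attempt times out."""
    return per_attempt * attempts


total = worst_case_block(PER_ATTEMPT, ATTEMPTS)
print("worst-case block:", total)
print("exceeds service_down_time:", total > SERVICE_DOWN_TIME)
```

Under these assumptions the periodic task can block far longer than the window in which a heartbeat is expected, which matches the service being reported as down.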

We should set the client_mount_timeout parameter in ceph.conf to
resolve this issue.
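
As a rough illustration of that change, the sketch below adds client_mount_timeout to a ceph.conf text using Python's standard configparser. The helper name, the 30-second value, the sample file contents, and the choice of the [client] section are all assumptions made for illustration; pick a value and section that suit your deployment.

```python
import configparser
import io


def set_client_mount_timeout(conf_text, seconds):
    """Return conf_text with client_mount_timeout set in the [client] section.

    Hypothetical helper: it is not part of nova or Ceph.
    """
    # Disable interpolation so '%' characters in existing values are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("client"):
        parser.add_section("client")
    # The option value is expressed in seconds.
    parser.set("client", "client_mount_timeout", str(seconds))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Made-up sample configuration, for illustration only.
example = (
    "[global]\n"
    "fsid = 00000000-0000-0000-0000-000000000000\n"
    "mon_host = 192.0.2.10\n"
)
updated = set_client_mount_timeout(example, 30)
print(updated)
```

Note that configparser drops comments when it rewrites a file, so for a real ceph.conf editing by hand or through your configuration management tool may be preferable.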

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: ceph nova

** Tags added: ceph nova

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2060758

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2060758/+subscriptions