yahoo-eng-team team mailing list archive

Thread
Date

[Bug 2052718] [NEW] Nova Compute Service status goes up and down abnormally

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: VO LE HUY <2052718@xxxxxxxxxxxxxxxxxx>
Date: Thu, 08 Feb 2024 16:39:50 -0000
Reply-to: Bug 2052718 <2052718@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx

Public bug reported:

Hi community,

When you type:
$ watch openstack nova compute service list

You will see the status of components such as "nova-conductor", "nova-
scheduler" and "nova-compute" continuously jump between "up" and "down",
this is not normal and it has absolutely no end. This happens and a
chain of functions are affected such as live-migrate, creating virtual
machines,... while checking, the compute server still works.

After some debugging I discovered that the current time "now" retrieved
from the system is smaller than the past time "last_heartbeat"!!! And
the mistake of not pointing this out in the source code cost me a lot of
time to understand. In fact, as soon as the current time is received,
there should be a comparison calculation with the last heartbeat. If it
is small, we need to log the warning to easily configure time
synchronization for the cluster.

Around the abs(elapsed) line of code ->
https://github.com/openstack/nova/blob/stable/2023.2/nova/servicegroup/drivers/db.py

...
...
def is_up(self, service_ref):
        ...
        ...
        # Timestamps in DB are UTC.
        elapsed = timeutils.delta_seconds(last_heartbeat, timeutils.utcnow())
        is_up = abs(elapsed) <= self.service_down_time
        if not is_up:
            LOG.debug('Seems service %(binary)s on host %(host)s is down. '
                      'Last heartbeat was %(lhb)s. Elapsed time is %(el)s',
                      {'binary': service_ref.get('binary'),
                       'host': service_ref.get('host'),
                       'lhb': str(last_heartbeat), 'el': str(elapsed)})
        return is_up
...
...

** Affects: nova
     Importance: Undecided
         Status: New

** Attachment added: "Up/down continuously"
   https://bugs.launchpad.net/bugs/2052718/+attachment/5745271/+files/MicrosoftTeams-image.png

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2052718

Title:
  Nova Compute Service status goes up and down abnormally

Status in OpenStack Compute (nova):
  New

Bug description:
  Hi community,

  When you type:
  $ watch openstack nova compute service list

  You will see the status of components such as "nova-conductor", "nova-
  scheduler" and "nova-compute" continuously jump between "up" and
  "down", this is not normal and it has absolutely no end. This happens
  and a chain of functions are affected such as live-migrate, creating
  virtual machines,... while checking, the compute server still works.

  After some debugging I discovered that the current time "now"
  retrieved from the system is smaller than the past time
  "last_heartbeat"!!! And the mistake of not pointing this out in the
  source code cost me a lot of time to understand. In fact, as soon as
  the current time is received, there should be a comparison calculation
  with the last heartbeat. If it is small, we need to log the warning to
  easily configure time synchronization for the cluster.

  Around the abs(elapsed) line of code ->
  https://github.com/openstack/nova/blob/stable/2023.2/nova/servicegroup/drivers/db.py

  ...
  ...
  def is_up(self, service_ref):
          ...
          ...
          # Timestamps in DB are UTC.
          elapsed = timeutils.delta_seconds(last_heartbeat, timeutils.utcnow())
          is_up = abs(elapsed) <= self.service_down_time
          if not is_up:
              LOG.debug('Seems service %(binary)s on host %(host)s is down. '
                        'Last heartbeat was %(lhb)s. Elapsed time is %(el)s',
                        {'binary': service_ref.get('binary'),
                         'host': service_ref.get('host'),
                         'lhb': str(last_heartbeat), 'el': str(elapsed)})
          return is_up
  ...
  ...

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2052718/+subscriptions

Follow ups

[Bug 2052718] Re: Compute service status still up with nagative elapsed time
From: OpenStack Infra, 2024-02-09
[Bug 2052718] Re: Nova Compute Service status goes up and down abnormally
From: sean mooney, 2024-02-08