[Bug 1899541] [NEW] archive_deleted_rows archives pci_devices records as residue because of 'instance_uuid'
Public bug reported:
This is based on a bug reported downstream [1] where, after a random
amount of time, update_available_resource began to fail with the
following traceback on nodes with PCI devices:
"traceback": [
"Traceback (most recent call last):",
" File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 7447, in update_available_resource_for_node",
" rt.update_available_resource(context, nodename)",
" File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 706, in update_available_resource",
" self._update_available_resource(context, resources)",
" File \"/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py\", line 274, in inner",
" return f(*args, **kwargs)",
" File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 782, in _update_available_resource",
" self._update(context, cn)",
" File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 926, in _update",
" self.pci_tracker.save(context)",
" File \"/usr/lib/python2.7/site-packages/nova/pci/manager.py\", line 92, in save",
" dev.save()",
" File \"/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py\", line 210, in wrapper",
" ctxt, self, fn.__name__, args, kwargs)",
" File \"/usr/lib/python2.7/site-packages/nova/conductor/rpcapi.py\", line 245, in object_action",
" objmethod=objmethod, args=args, kwargs=kwargs)",
" File \"/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py\", line 174, in call",
" retry=self.retry)",
" File \"/usr/lib/python2.7/site-packages/oslo_messaging/transport.py\", line 131, in _send",
" timeout=timeout, retry=retry)",
" File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 559, in send",
" retry=retry)",
" File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 550, in _send",
" raise result",
"RemoteError: Remote error: DBError (pymysql.err.IntegrityError) (1048, u\"Column 'compute_node_id' cannot be null\") [SQL: u'INSERT INTO pci_devices (created_at, updated_at, deleted_at, deleted, uuid, compute_node_id, address, vendor_id, product_id, dev_type, dev_id, label, status, request_id, extra_info, instance_uuid, numa_node, parent_addr) VALUES (%(created_at)s, %(updated_at)s, %(deleted_at)s, %(deleted)s, %(uuid)s, %(compute_node_id)s, %(address)s, %(vendor_id)s, %(product_id)s, %(dev_type)s, %(dev_id)s, %(label)s, %(status)s, %(request_id)s, %(extra_info)s, %(instance_uuid)s, %(numa_node)s, %(parent_addr)s)'] [parameters: {'status': u'available', 'instance_uuid': None, 'dev_type': None, 'uuid': None, 'dev_id': None, 'parent_addr': None, 'numa_node': None, 'created_at': datetime.datetime(2020, 8, 7, 11, 51, 19, 643044), 'vendor_id': None, 'updated_at': None, 'label': None, 'deleted': 0, 'extra_info': '{}', 'compute_node_id': None, 'request_id': None, 'deleted_at': None, 'address': None, 'product_id': None}] (Background on this error at: http://sqlalche.me/e/gkpj)",
Here ^ we see an attempt to insert a nearly empty (all NULL fields)
record into the pci_devices table. Inspection of the code shows that
this can occur if we fail to look up the pci_devices record we want and
then try to create a new one [2]:
@pick_context_manager_writer
def pci_device_update(context, node_id, address, values):
    query = model_query(context, models.PciDevice, read_deleted="no").\
        filter_by(compute_node_id=node_id).\
        filter_by(address=address)
    if query.update(values) == 0:
        device = models.PciDevice()
        device.update(values)
        context.session.add(device)
    return query.one()
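To make the failure mode concrete, here is a minimal, self-contained
sketch, not nova code: plain SQLAlchemy (1.4+) against an in-memory
SQLite table with a schema cut down from pci_devices. It reproduces the
same IntegrityError once the row has been swept away:

import sqlalchemy as sa
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class PciDevice(Base):
    # Cut-down stand-in for nova's models.PciDevice.
    __tablename__ = 'pci_devices'
    id = sa.Column(sa.Integer, primary_key=True)
    compute_node_id = sa.Column(sa.Integer, nullable=False)
    address = sa.Column(sa.String(12), nullable=False)
    instance_uuid = sa.Column(sa.String(36))
    status = sa.Column(sa.String(36))

engine = sa.create_engine('sqlite://')
Base.metadata.create_all(engine)

with Session(engine) as session:
    # The row we want to update is gone: archive_deleted_rows swept it.
    query = session.query(PciDevice).filter_by(
        compute_node_id=1, address='0000:81:00.0')
    # `values` holds only the fields being saved, not the identifying
    # columns, mirroring the partial updates sent down by dev.save().
    values = {'status': 'available', 'instance_uuid': None}
    if query.update(values) == 0:
        # Fallback path: build a "new" device from the partial dict.
        # compute_node_id was never in `values`, so the INSERT fails
        # just like the (1048, "Column 'compute_node_id' cannot be
        # null") error in the traceback above.
        session.add(PciDevice(**values))
        try:
            session.commit()
        except sa.exc.IntegrityError as exc:
            print('IntegrityError:', exc.orig)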
It turns out that when a request came in to delete an instance that had
a PCI device allocated, if the archive_deleted_rows cron job fired at
just the right (wrong) moment, it would sweep away the pci_devices
record matching the instance_uuid, because archive treats any table
with an 'instance_uuid' column as instance "residue" needing cleanup.
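For illustration, here is a hypothetical sketch of that heuristic (the
helper name find_residue_tables is made up; this is not nova's actual
archive code). Any table carrying an 'instance_uuid' column is picked
up, which is how pci_devices gets caught in the sweep:

import sqlalchemy as sa

metadata = sa.MetaData()
sa.Table('pci_devices', metadata,
         sa.Column('id', sa.Integer, primary_key=True),
         sa.Column('instance_uuid', sa.String(36)))
sa.Table('block_device_mapping', metadata,
         sa.Column('id', sa.Integer, primary_key=True),
         sa.Column('instance_uuid', sa.String(36)))

def find_residue_tables(md):
    # Overly broad heuristic: any table with an 'instance_uuid' column
    # is assumed to hold per-instance residue to archive on delete.
    return [t.name for t in md.sorted_tables
            if 'instance_uuid' in t.columns]

print(find_residue_tables(metadata))  # picks up pci_devices too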
So after the pci_devices record was swept away, we tried to update the
resource tracker as part of the _complete_deletion method in the
compute manager, and that failed because we could not locate the
pci_devices record to free the PCI device (that is, to null out its
instance_uuid field).
What we need to do here is stop treating pci_devices table records as
instance residue. The records in pci_devices are not tied to instance
lifecycles at all; they are managed independently by the PCI tracker,
as sketched below.
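One possible shape of the fix, continuing the hypothetical sketch above
(again, a sketch of the direction, not the merged nova patch): keep an
explicit exclusion list so that tables whose rows outlive any one
instance are never swept as residue.

# Tables with an 'instance_uuid' column whose records nonetheless have
# their own lifecycle (here, managed by the PCI tracker).
EXCLUDED_RESIDUE_TABLES = {'pci_devices'}

def find_residue_tables(md):
    return [t.name for t in md.sorted_tables
            if 'instance_uuid' in t.columns
            and t.name not in EXCLUDED_RESIDUE_TABLES]

With an exclusion like this, archive_deleted_rows would leave the
pci_devices records alone and the PCI tracker would remain the sole
owner of their state.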
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1867124
[2] https://github.com/openstack/nova/blob/261de76104ca67bed3ea6cdbcaaab0e44030f1e2/nova/db/sqlalchemy/api.py#L4406-L4409
** Affects: nova
Importance: Medium
Assignee: melanie witt (melwitt)
Status: In Progress
** Tags: compute db pci
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1899541