← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1899541] Re: archive_deleted_rows archives pci_devices records as residue because of 'instance_uuid'

 

** Changed in: nova/victoria
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1899541

Title:
  archive_deleted_rows archives pci_devices records as residue because
  of 'instance_uuid'

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) queens series:
  In Progress
Status in OpenStack Compute (nova) rocky series:
  In Progress
Status in OpenStack Compute (nova) stein series:
  In Progress
Status in OpenStack Compute (nova) train series:
  In Progress
Status in OpenStack Compute (nova) ussuri series:
  In Progress
Status in OpenStack Compute (nova) victoria series:
  Fix Released

Bug description:
  This is based on a bug reported downstream [1] where after a random
  amount of time, update_available_resource began to fail with the
  following trace on nodes with PCI devices:

    "traceback": [
      "Traceback (most recent call last):",
      "  File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 7447, in update_available_resource_for_node",
      "    rt.update_available_resource(context, nodename)",
      "  File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 706, in update_available_resource",
      "    self._update_available_resource(context, resources)",
      "  File \"/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py\", line 274, in inner",
      "    return f(*args, **kwargs)",
      "  File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 782, in _update_available_resource",
      "    self._update(context, cn)",
      "  File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 926, in _update",
      "    self.pci_tracker.save(context)",
      "  File \"/usr/lib/python2.7/site-packages/nova/pci/manager.py\", line 92, in save",
      "    dev.save()",
      "  File \"/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py\", line 210, in wrapper",
      "    ctxt, self, fn.__name__, args, kwargs)",
      "  File \"/usr/lib/python2.7/site-packages/nova/conductor/rpcapi.py\", line 245, in object_action",
      "    objmethod=objmethod, args=args, kwargs=kwargs)",
      "  File \"/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py\", line 174, in call",
      "    retry=self.retry)",
      "  File \"/usr/lib/python2.7/site-packages/oslo_messaging/transport.py\", line 131, in _send",
      "    timeout=timeout, retry=retry)",
      "  File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 559, in send",
      "    retry=retry)",
      "  File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 550, in _send",
      "    raise result",
      "RemoteError: Remote error: DBError (pymysql.err.IntegrityError) (1048, u\"Column 'compute_node_id' cannot be null\") [SQL: u'INSERT INTO pci_devices (created_at, updated_at, deleted_at, deleted, uuid, compute_node_id, address, vendor_id, product_id, dev_type, dev_id, label, status, request_id, extra_info, instance_uuid, numa_node, parent_addr) VALUES (%(created_at)s, %(updated_at)s, %(deleted_at)s, %(deleted)s, %(uuid)s, %(compute_node_id)s, %(address)s, %(vendor_id)s, %(product_id)s, %(dev_type)s, %(dev_id)s, %(label)s, %(status)s, %(request_id)s, %(extra_info)s, %(instance_uuid)s, %(numa_node)s, %(parent_addr)s)'] [parameters: {'status': u'available', 'instance_uuid': None, 'dev_type': None, 'uuid': None, 'dev_id': None, 'parent_addr': None, 'numa_node': None, 'created_at': datetime.datetime(2020, 8, 7, 11, 51, 19, 643044), 'vendor_id': None, 'updated_at': None, 'label': None, 'deleted': 0, 'extra_info': '{}', 'compute_node_id': None, 'request_id': None, 'deleted_at': None, 'address': None, 'product_id': None}] (Background on this error at: http://sqlalche.me/e/gkpj)",


  Here ^ we see an attempt to insert a nearly empty (NULL fields) record
  into the pci_devices table. Inspection of the code shows that the way
  this can occur is if we fail to lookup the pci_devices record we want
  and then we try to create a new one [2]:


  @pick_context_manager_writer
  def pci_device_update(context, node_id, address, values):
      query = model_query(context, models.PciDevice, read_deleted="no").\
                      filter_by(compute_node_id=node_id).\
                      filter_by(address=address)
      if query.update(values) == 0:
          device = models.PciDevice()
          device.update(values)
          context.session.add(device)
      return query.one()


  Turns out what was happening was when a request came in to delete an
  instance that had allocated a PCI device, if the archive_deleted_rows
  cron job fired at just the right (wrong) moment, it would sweep away
  the pci_devices record matching the instance_uuid because archive is
  treating any table with an 'instance_uuid' column as instance
  "residue" needing cleanup.

  So after the pci_devices record was swept away, we tried to update the
  resource tracker as part of the _complete_deletion method in the
  compute manager and that failed because we could not locate the
  pci_devices record to free the PCI device (null out the instance_uuid
  field).

  What we need to do here is not to treat the pci_devices table records
  as instance residue. The records in pci_devices are not tied to
  instance lifecycles at all and they are managed independently by the
  PCI trackers.

  [1] https://bugzilla.redhat.com/show_bug.cgi?id=1867124
  [2] https://github.com/openstack/nova/blob/261de76104ca67bed3ea6cdbcaaab0e44030f1e2/nova/db/sqlalchemy/api.py#L4406-L4409

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1899541/+subscriptions


References