[Bug 1899541] [NEW] archive_deleted_rows archives pci_devices records as residue because of 'instance_uuid'
Public bug reported:
This is based on a bug reported downstream [1] where, after a random
amount of time, update_available_resource began to fail with the
following traceback on nodes with PCI devices:
"traceback": [
"Traceback (most recent call last):",
" File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 7447, in update_available_resource_for_node",
" rt.update_available_resource(context, nodename)",
" File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 706, in update_available_resource",
" self._update_available_resource(context, resources)",
" File \"/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py\", line 274, in inner",
" return f(*args, **kwargs)",
" File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 782, in _update_available_resource",
" self._update(context, cn)",
" File \"/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py\", line 926, in _update",
" self.pci_tracker.save(context)",
" File \"/usr/lib/python2.7/site-packages/nova/pci/manager.py\", line 92, in save",
" dev.save()",
" File \"/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py\", line 210, in wrapper",
" ctxt, self, fn.__name__, args, kwargs)",
" File \"/usr/lib/python2.7/site-packages/nova/conductor/rpcapi.py\", line 245, in object_action",
" objmethod=objmethod, args=args, kwargs=kwargs)",
" File \"/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py\", line 174, in call",
" retry=self.retry)",
" File \"/usr/lib/python2.7/site-packages/oslo_messaging/transport.py\", line 131, in _send",
" timeout=timeout, retry=retry)",
" File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 559, in send",
" retry=retry)",
" File \"/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py\", line 550, in _send",
" raise result",
"RemoteError: Remote error: DBError (pymysql.err.IntegrityError) (1048, u\"Column 'compute_node_id' cannot be null\") [SQL: u'INSERT INTO pci_devices (created_at, updated_at, deleted_at, deleted, uuid, compute_node_id, address, vendor_id, product_id, dev_type, dev_id, label, status, request_id, extra_info, instance_uuid, numa_node, parent_addr) VALUES (%(created_at)s, %(updated_at)s, %(deleted_at)s, %(deleted)s, %(uuid)s, %(compute_node_id)s, %(address)s, %(vendor_id)s, %(product_id)s, %(dev_type)s, %(dev_id)s, %(label)s, %(status)s, %(request_id)s, %(extra_info)s, %(instance_uuid)s, %(numa_node)s, %(parent_addr)s)'] [parameters: {'status': u'available', 'instance_uuid': None, 'dev_type': None, 'uuid': None, 'dev_id': None, 'parent_addr': None, 'numa_node': None, 'created_at': datetime.datetime(2020, 8, 7, 11, 51, 19, 643044), 'vendor_id': None, 'updated_at': None, 'label': None, 'deleted': 0, 'extra_info': '{}', 'compute_node_id': None, 'request_id': None, 'deleted_at': None, 'address': None, 'product_id': None}] (Background on this error at: http://sqlalche.me/e/gkpj)",
Here ^ we see an attempt to insert a nearly empty (all NULL fields)
record into the pci_devices table. Inspection of the code shows that
this can occur if we fail to look up the pci_devices record we want and
then try to create a new one [2]:
@pick_context_manager_writer
def pci_device_update(context, node_id, address, values):
    query = model_query(context, models.PciDevice, read_deleted="no").\
        filter_by(compute_node_id=node_id).\
        filter_by(address=address)
    if query.update(values) == 0:
        device = models.PciDevice()
        device.update(values)
        context.session.add(device)
    return query.one()
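To make the failure mode concrete, here is a minimal, self-contained
sketch, not nova code: plain SQLAlchemy (1.4+) against an in-memory
SQLite table with a schema cut down from pci_devices. It reproduces the
same IntegrityError once the row has been swept away:

import sqlalchemy as sa
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class PciDevice(Base):
    # Cut-down stand-in for nova's models.PciDevice.
    __tablename__ = 'pci_devices'
    id = sa.Column(sa.Integer, primary_key=True)
    compute_node_id = sa.Column(sa.Integer, nullable=False)
    address = sa.Column(sa.String(12), nullable=False)
    instance_uuid = sa.Column(sa.String(36))
    status = sa.Column(sa.String(36))

engine = sa.create_engine('sqlite://')
Base.metadata.create_all(engine)

with Session(engine) as session:
    # The row we want to update is gone: archive_deleted_rows swept it.
    query = session.query(PciDevice).filter_by(
        compute_node_id=1, address='0000:81:00.0')
    # `values` holds only the fields being saved, not the identifying
    # columns, mirroring the partial updates sent down by dev.save().
    values = {'status': 'available', 'instance_uuid': None}
    if query.update(values) == 0:
        # Fallback path: build a "new" device from the partial dict.
        # compute_node_id was never in `values`, so the INSERT fails
        # just like the (1048, "Column 'compute_node_id' cannot be
        # null") error in the traceback above.
        session.add(PciDevice(**values))
        try:
            session.commit()
        except sa.exc.IntegrityError as exc:
            print('IntegrityError:', exc.orig)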
It turns out that when a request came in to delete an instance that had
a PCI device allocated, if the archive_deleted_rows cron job fired at
just the right (wrong) moment, it would sweep away the pci_devices
record matching the instance_uuid, because archive treats any table
with an 'instance_uuid' column as instance "residue" needing cleanup.
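For illustration, here is a hypothetical sketch of that heuristic (the
helper name find_residue_tables is made up; this is not nova's actual
archive code). Any table carrying an 'instance_uuid' column is picked
up, which is how pci_devices gets caught in the sweep:

import sqlalchemy as sa

metadata = sa.MetaData()
sa.Table('pci_devices', metadata,
         sa.Column('id', sa.Integer, primary_key=True),
         sa.Column('instance_uuid', sa.String(36)))
sa.Table('block_device_mapping', metadata,
         sa.Column('id', sa.Integer, primary_key=True),
         sa.Column('instance_uuid', sa.String(36)))

def find_residue_tables(md):
    # Overly broad heuristic: any table with an 'instance_uuid' column
    # is assumed to hold per-instance residue to archive on delete.
    return [t.name for t in md.sorted_tables
            if 'instance_uuid' in t.columns]

print(find_residue_tables(metadata))  # picks up pci_devices too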
So after the pci_devices record was swept away, we tried to update the
resource tracker as part of the _complete_deletion method in the
compute manager, and that failed because we could not locate the
pci_devices record to free the PCI device (that is, to null out its
instance_uuid field).
What we need to do here is stop treating pci_devices table records as
instance residue. The records in pci_devices are not tied to instance
lifecycles at all; they are managed independently by the PCI tracker,
as sketched below.
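One possible shape of the fix, continuing the hypothetical sketch above
(again, a sketch of the direction, not the merged nova patch): keep an
explicit exclusion list so that tables whose rows outlive any one
instance are never swept as residue.

# Tables with an 'instance_uuid' column whose records nonetheless have
# their own lifecycle (here, managed by the PCI tracker).
EXCLUDED_RESIDUE_TABLES = {'pci_devices'}

def find_residue_tables(md):
    return [t.name for t in md.sorted_tables
            if 'instance_uuid' in t.columns
            and t.name not in EXCLUDED_RESIDUE_TABLES]

With an exclusion like this, archive_deleted_rows would leave the
pci_devices records alone and the PCI tracker would remain the sole
owner of their state.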
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1867124
[2] https://github.com/openstack/nova/blob/261de76104ca67bed3ea6cdbcaaab0e44030f1e2/nova/db/sqlalchemy/api.py#L4406-L4409
** Affects: nova
Importance: Medium
Assignee: melanie witt (melwitt)
Status: In Progress
** Tags: compute db pci
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1899541