[Bug 2033247] [NEW] PCI Leaks when multiple detach operations performed in parallel
Public bug reported:
We are using the OpenStack Yoga release and need to attach/detach ports to a VM dynamically. We are observing PCI leaks when multiple detach operations are performed simultaneously.
PCIs start leaking once the OpenStack table “instance_extra” is exhausted, i.e. the serialized data no longer fits in its “pci_requests” column. Once that happens, OpenStack can no longer attach/detach a port and starts leaking PCIs, because every subsequent operation fails with an exception. This table is used by OpenStack to record all historical “PCIRequest” records for the interfaces attached to a VM.
DBDataError (pymysql.err.DataError) (1406, "Data too long for column 'pci_requests' at row 1")
[SQL: UPDATE instance_extra SET updated_at=%(updated_at)s, device_metadata=%(device_metadata)s, numa_topology=%(numa_topology)s, pci_requests=%(pci_requests)s, flavor=%(flavor)s WHERE instance_extra.deleted = %(deleted_1)s AND instance_extra.instance_uuid = %(instance_uuid_1)s]
[parameters: {'updated_at': datetime.datetime(2023, 8, 3, 14, 39, 56, 116791), 'device_metadata': '{"nova_object.name": "InstanceDeviceMetadata", "nova_object.namespace": "nova", "nova_object.version": "1.0", "nova_object.data": {"devices": [{"nova ... (6168 characters truncated) ... :00:14.0"}, "nova_object.changes": ["address"]}}, "nova_object.changes": ["bus", "vf_trusted", "mac", "vlan"]}]}, "nova_object.changes": ["devices"]}', 'numa_topology': '{"nova_object.name": "InstanceNUMATopology", "nova_object.namespace": "nova", "nova_object.version": "1.3", "nova_object.data": {"cells": [{"nova_obj ... (946 characters truncated) ... nges": ["id", "cpu_pinning_raw", "cpuset_reserved"]}], "emulator_threads_policy": null}, "nova_object.changes": ["emulator_threads_policy", "cells"]}', 'pci_requests': '[{"count": 1, "spec": [{"physical_network": "sriov3", "remote_managed": "False"}], "alias_name": null, "is_new": false, "numa_policy": null, "request ... (65464 characters truncated) ... "is_new": false, "numa_policy": null, "request_id": "2d5ef4bd-d499-4e62-a617-75ed4535c930", "requester_id": "f4fabf3b-ccc1-4117-bb30-de53a9a55d66"}]', 'flavor': '{"cur": {"nova_object.name": "Flavor", "nova_object.namespace": "nova", "nova_object.version": "1.2", "nova_object.data": {"id": 71, "name": "SOLTEST ... (481 characters truncated) ... "2023-07-04T05:45:02Z", "updated_at": null, "deleted_at": null, "deleted": false}, "nova_object.changes": ["extra_specs"]}, "old": null, "new": null}', 'deleted_1': 0, 'instance_uuid_1': 'dd5d0568-1aad-47ed-8418-78a6c75363cc'}]
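For reference, the error above (1406, "Data too long for column") means the serialized pci_requests payload no longer fits in the column. A minimal sketch of how the column limit can be checked, assuming direct read access to the nova database via pymysql (host and credentials below are placeholders):
import pymysql

# Placeholder connection details; adjust for your deployment.
conn = pymysql.connect(host="controller", user="nova", password="***", database="nova")
with conn.cursor() as cur:
    # Look up the declared type and maximum length of instance_extra.pci_requests.
    cur.execute("""
        SELECT COLUMN_TYPE, CHARACTER_MAXIMUM_LENGTH
        FROM information_schema.COLUMNS
        WHERE TABLE_SCHEMA = 'nova'
          AND TABLE_NAME = 'instance_extra'
          AND COLUMN_NAME = 'pci_requests'
    """)
    print(cur.fetchone())
conn.close()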
I validated the count of “PCIRequest” records stored in the “pci_requests” field of the “instance_extra” table and found that 260 records are stored, which roughly matches the number of attach operations performed on this node before we started seeing the PCI leak reported by Mohit in his mail below. This also indicates that the “PciRequest” record for a port created via the operator is not cleaned up even after that port is detached/deleted.
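A minimal sketch of how this count can be reproduced, assuming direct read access to the nova database (the instance UUID is taken from the error above; connection details are placeholders):
import json
import pymysql

conn = pymysql.connect(host="controller", user="nova", password="***", database="nova")
with conn.cursor() as cur:
    cur.execute(
        "SELECT pci_requests FROM instance_extra "
        "WHERE instance_uuid = %s AND deleted = 0",
        ("dd5d0568-1aad-47ed-8418-78a6c75363cc",),
    )
    row = cur.fetchone()
    requests = json.loads(row[0]) if row and row[0] else []
    # In our case this printed ~260, one entry per historical attach.
    print(len(requests))
conn.close()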
I suspected that this is another issue in OpenStack when it handles the parallel detach requests received from the operator, and I did an exercise to prove it. In my test case I had one pod with 2 SR-IOV vNICs and performed attach/detach operations in a loop; after multiple attach/detach iterations we hit the PCI leak issue. My hunch was that OpenStack responds to a detach request immediately, while the detachment of that interface has not yet completed in the backend, and the nova service cannot handle another detach operation for SR-IOV ports while the previous one is still in progress, which leads to one of the OpenStack tables backing up. There was barely any gap between two successive detach requests sent to OpenStack, because the operator sends the next detach request immediately after the previous one.
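A rough sketch of that reproduction loop, using openstacksdk (the cloud name, server name and port IDs are placeholders, and waits for the attach to complete are omitted for brevity; in our setup the equivalent calls are issued by the operator):
import openstack

conn = openstack.connect(cloud="mycloud")       # placeholder cloud name
server = conn.compute.find_server("test-vm")    # placeholder server
ports = ["<sriov-port-1>", "<sriov-port-2>"]    # placeholder SR-IOV port IDs

for i in range(600):
    # Attach both SR-IOV vNICs.
    for port_id in ports:
        conn.compute.create_server_interface(server, port_id=port_id)
    # Detach them back to back, with no wait in between; this is the pattern
    # that eventually overflows the pci_requests column.
    for port_id in ports:
        conn.compute.delete_server_interface(port_id, server=server)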
To prove this, we introduced a delay of 10 seconds in our code to serialize the detach operations, so that a detach request is never issued while the previous one may still be pending in the OpenStack services (a sketch of this workaround follows the list below). After that I was able to successfully execute the following test cases:
• 600 attach/detach operations for a single pod with 2 SR-IOV vNICs
• 400 attach/detach operations for 4 pods, each with 2 SR-IOV vNICs
• 320 attach/detach operations for 8 pods, each with 2 SR-IOV vNICs
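A minimal sketch of the serialized detach with the 10-second delay, again using openstacksdk with placeholder names:
import time
import openstack

conn = openstack.connect(cloud="mycloud")       # placeholder cloud name
server = conn.compute.find_server("test-vm")    # placeholder server

def detach_serialized(port_ids, delay=10):
    for port_id in port_ids:
        conn.compute.delete_server_interface(port_id, server=server)
        # Wait before issuing the next detach so the previous one can finish
        # in the backend; this is the 10-second delay mentioned above.
        time.sleep(delay)

detach_serialized(["<sriov-port-1>", "<sriov-port-2>"])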
In total we performed ~1300 vNIC attach/detach operations and I do not see any leak with these changes. The PCI pool is completely available after that many attach/detach operations.
This proves that OpenStack in the Yoga release is not able to handle simultaneous detach operations.
** Affects: nova
Importance: Undecided
Status: New