yahoo-eng-team team mailing list archive
Message #93189
[Bug 2045921] [NEW] scheduler.utils.set_vm_state_and_notify() race between ERROR state save() and compute_utils.add_instance_fault_from_exc()
Public bug reported:
Maybe "race" isn't the right word, but the ordering of the
Instance.save() of ERROR state and the creation of the instance.fault
record can result in the instance.fault not having been created yet
after the instance is visibly showing ERROR state.
Seen in the gate:
Symptom: test_qos_min_bw_allocation_basic fails because the expected 'fault' field is not found in the server returned by GET /servers/{server_id}
testr_results.html:
testtools.matchers._impl.MismatchError: 'fault' not in {'id':
'19b6c95f-b91b-4949-b72e-3f7fea0d1a49', 'name':
'tempest-MinBwAllocationPlacementTest-server-1096644195', 'status': 'ERROR',
'tenant_id': '1f52110a4e8649d78861c38daca6f179', 'user_id':
'8ef9e2bc05034b03af2d7323155cb71f', 'metadata': {}, 'hostId': '',
'image': {'id': '3cb38f9c-a86e-47c8-984f-74efc924120c', 'links':
[{'rel': 'bookmark', 'href':
'https://199.19.213.27/compute/images/3cb38f9c-a86e-47c8-984f-74efc924120c'}]},
'flavor': {'vcpus': 1, 'ram': 128, 'disk': 1, 'ephemeral': 0, 'swap': 0,
'original_name': 'm1.nano', 'extra_specs': {'hw_rng:allowed': 'True'}},
'created': '2023-12-07T15:00:24Z', 'updated': '2023-12-07T15:00:30Z',
'addresses': {}, 'accessIPv4': '', 'accessIPv6': '', 'links': [{'rel':
'self', 'href':
'https://199.19.213.27/compute/v2.1/servers/19b6c95f-b91b-4949-b72e-3f7fea0d1a49'},
{'rel': 'bookmark', 'href':
'https://199.19.213.27/compute/servers/19b6c95f-b91b-4949-b72e-3f7fea0d1a49'}],
'OS-DCF:diskConfig': 'MANUAL', 'OS-EXT-AZ:availability_zone': '',
'config_drive': '', 'key_name': None, 'OS-SRV-USG:launched_at': None,
'OS-SRV-USG:terminated_at': None, 'OS-EXT-STS:task_state': None,
'OS-EXT-STS:vm_state': 'error', 'OS-EXT-STS:power_state': 0,
'os-extended-volumes:volumes_attached': [], 'locked': False, 'description': None,
'tags': [], 'trusted_image_certificates': None, 'server_groups': []}
screen-placement-api.txt:
found no providers with 2147483647 NET_BW_IGR_KILOBIT_PER_SEC
this ^ is expected for this part of the test
OpenSearch query:
message:"testtools.matchers._impl.MismatchError: 'fault' not in"
Comments:
This may be a race because the ERROR state is set on the instance and saved with save() before the 'fault' record is created
https://github.com/openstack/nova/blob/302e286408cce2c8df43d6742ca490405a20011d/nova/scheduler/utils.py#L902-L910
and the test waits for ERROR state before checking for the 'fault' field, so it may sometimes GET the instance before the fault has been added
https://github.com/openstack/tempest/blob/a0b161bbde6d7734833a26ced76ca44b888fe152/tempest/scenario/test_network_qos_placement.py#L269-L276
Code:
    vm_state = updates['vm_state']
    properties = request_spec.get('instance_properties', {})
    notifier = rpc.get_notifier(service)
    state = vm_state.upper()
    LOG.warning('Setting instance to %s state.', state,
                instance_uuid=instance_uuid)

    instance = objects.Instance(context=context, uuid=instance_uuid,
                                **updates)
    instance.obj_reset_changes(['uuid'])
    instance.save()
    compute_utils.add_instance_fault_from_exc(
        context, instance, ex, sys.exc_info())
I wonder if it would be legit to swap the ordering to do compute_utils.add_instance_fault_from_exc() before instance.save() of ERROR state?
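To make the ordering question concrete, here is a minimal, self-contained sketch that does not use nova at all. It models the server as a plain dict and records what a concurrent GET could observe after each write. The helper names (save_error_state, add_fault, run_steps, inconsistent) and the fault message are hypothetical stand-ins for illustration only. Whether the real add_instance_fault_from_exc() can safely run before instance.save() is not something this sketch establishes.

```python
from typing import Callable, Dict, List, Optional

Server = Dict[str, Optional[str]]


def save_error_state(server: Server) -> None:
    """Hypothetical stand-in for instance.save() with vm_state set to ERROR."""
    server["status"] = "ERROR"


def add_fault(server: Server) -> None:
    """Hypothetical stand-in for compute_utils.add_instance_fault_from_exc()."""
    server["fault"] = "example fault message"  # illustrative value only


def run_steps(steps: List[Callable[[Server], None]]) -> List[Server]:
    """Apply each write in order and snapshot what a GET could see after it."""
    server: Server = {"status": "BUILD", "fault": None}
    observed: List[Server] = []
    for step in steps:
        step(server)
        observed.append(dict(server))
    return observed


def inconsistent(snapshots: List[Server]) -> List[Server]:
    """Snapshots where a client waiting for ERROR would find no fault yet."""
    return [s for s in snapshots
            if s["status"] == "ERROR" and s["fault"] is None]


# Ordering as in the excerpt above: ERROR is visible before the fault exists.
current_order = run_steps([save_error_state, add_fault])
# Proposed ordering: the fault exists by the time ERROR becomes visible.
swapped_order = run_steps([add_fault, save_error_state])

print(len(inconsistent(current_order)), len(inconsistent(swapped_order)))
```

In this toy model, the current ordering leaves one window in which the status is ERROR but no fault is present, and the swapped ordering leaves none. A client that waits for ERROR and then immediately reads 'fault' only fails in the first case.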
** Affects: nova
Importance: Undecided
Status: New
** Tags: gate-failure
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2045921
Title:
scheduler.utils.set_vm_state_and_notify() race between ERROR state
save() and compute_utils.add_instance_fault_from_exc()
Status in OpenStack Compute (nova):
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2045921/+subscriptions