yahoo-eng-team team mailing list archive

Thread
Date
[Bug 2045921] [NEW] scheduler.utils.set_vm_state_and_notify() race between ERROR state save() and compute_utils.add_instance_fault_from_ex()

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: melanie witt <2045921@xxxxxxxxxxxxxxxxxx>
Date: Thu, 07 Dec 2023 18:11:14 -0000
Reply-to: Bug 2045921 <2045921@xxxxxxxxxxxxxxxxxx>
Sender: noreply@xxxxxxxxxxxxx
Public bug reported:

Maybe "race" isn't the right word, but the ordering of the
Instance.save() of ERROR state and the creation of the instance.fault
record can result in the instance.fault not having been created yet
after the instance is visibly showing ERROR state.

Seen in the gate:

test_qos_min_bw_allocation_basic fails because the expected 'fault' field is missing in the server
Symptom: test_qos_min_bw_allocation_basic fails because the expected 'fault' field is not found in the server from GET /servers/{server_id}

testr_results.html:

testtools.matchers._impl.MismatchError: 'fault' not in {'id':
'19b6c95f-b91b-4949-b72e-3f7fea0d1a49', 'name': 'tempest-
MinBwAllocationPlacementTest-server-1096644195', 'status': 'ERROR',
'tenant_id': '1f52110a4e8649d78861c38daca6f179', 'user_id':
'8ef9e2bc05034b03af2d7323155cb71f', 'metadata': {}, 'hostId': '',
'image': {'id': '3cb38f9c-a86e-47c8-984f-74efc924120c', 'links':
[{'rel': 'bookmark', 'href':
'https://199.19.213.27/compute/images/3cb38f9c-a86e-47c8-984f-74efc924120c'}]},
'flavor': {'vcpus': 1, 'ram': 128, 'disk': 1, 'ephemeral': 0, 'swap': 0,
'original_name': 'm1.nano', 'extra_specs': {'hw_rng:allowed': 'True'}},
'created': '2023-12-07T15:00:24Z', 'updated': '2023-12-07T15:00:30Z',
'addresses': {}, 'accessIPv4': '', 'accessIPv6': '', 'links': [{'rel':
'self', 'href':
'https://199.19.213.27/compute/v2.1/servers/19b6c95f-b91b-4949-b72e-3f7fea0d1a49'},
{'rel': 'bookmark', 'href':
'https://199.19.213.27/compute/servers/19b6c95f-b91b-4949-b72e-3f7fea0d1a49'}],
'OS-DCF:diskConfig': 'MANUAL', 'OS-EXT-AZ:availability_zone': '',
'config_drive': '', 'key_name': None, 'OS-SRV-USG:launched_at': None,
'OS-SRV-USG:terminated_at': None, 'OS-EXT-STS:task_state': None, 'OS-
EXT-STS:vm_state': 'error', 'OS-EXT-STS:power_state': 0, 'os-extended-
volumes:volumes_attached': [], 'locked': False, 'description': None,
'tags': [], 'trusted_image_certificates': None, 'server_groups': []}

screen-placement-api.txt:

found no providers with 2147483647 NET_BW_IGR_KILOBIT_PER_SEC
this ^ is expected for this part of the test

OpenSearch query:

message:"testtools.matchers._impl.MismatchError: 'fault' not in"

Comments:

This may be a race because ERROR state is set on the instance and save()'ed before the 'fault' record is created
https://github.com/openstack/nova/blob/302e286408cce2c8df43d6742ca490405a20011d/nova/scheduler/utils.py#L902-L910
and the test waits for ERROR state before checking for the 'fault' field, so maybe sometimes it GETs the instance before the fault was able to be added
https://github.com/openstack/tempest/blob/a0b161bbde6d7734833a26ced76ca44b888fe152/tempest/scenario/test_network_qos_placement.py#L269-L276

Code:

    vm_state = updates['vm_state']
    properties = request_spec.get('instance_properties', {})
    notifier = rpc.get_notifier(service)
    state = vm_state.upper()
    LOG.warning('Setting instance to %s state.', state,
                instance_uuid=instance_uuid)

    instance = objects.Instance(context=context, uuid=instance_uuid,
                                **updates)
    instance.obj_reset_changes(['uuid'])
    instance.save()
    compute_utils.add_instance_fault_from_exc(
        context, instance, ex, sys.exc_info())


I wonder if it would be legit to swap the ordering to do compute_utils.add_instance_fault_from_exc() before instance.save() of ERROR state?

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: gate-failure

** Description changed:

  Maybe "race" isn't the right word, but the ordering of the
  Instance.save() of ERROR state and the creation of the instance.fault
  record can result in the instance.fault not having been created yet
  after the instance is visibly showing ERROR state.
  
  Seen in the gate:
  
  test_qos_min_bw_allocation_basic fails because the expected 'fault' field is missing in the server
  Symptom: test_qos_min_bw_allocation_basic fails because the expected 'fault' field is not found in the server from GET /servers/{server_id}
  
  testr_results.html:
  
  testtools.matchers._impl.MismatchError: 'fault' not in {'id':
  '19b6c95f-b91b-4949-b72e-3f7fea0d1a49', 'name': 'tempest-
  MinBwAllocationPlacementTest-server-1096644195', 'status': 'ERROR',
  'tenant_id': '1f52110a4e8649d78861c38daca6f179', 'user_id':
  '8ef9e2bc05034b03af2d7323155cb71f', 'metadata': {}, 'hostId': '',
  'image': {'id': '3cb38f9c-a86e-47c8-984f-74efc924120c', 'links':
  [{'rel': 'bookmark', 'href':
  'https://199.19.213.27/compute/images/3cb38f9c-a86e-47c8-984f-74efc924120c'}]},
  'flavor': {'vcpus': 1, 'ram': 128, 'disk': 1, 'ephemeral': 0, 'swap': 0,
  'original_name': 'm1.nano', 'extra_specs': {'hw_rng:allowed': 'True'}},
  'created': '2023-12-07T15:00:24Z', 'updated': '2023-12-07T15:00:30Z',
  'addresses': {}, 'accessIPv4': '', 'accessIPv6': '', 'links': [{'rel':
  'self', 'href':
  'https://199.19.213.27/compute/v2.1/servers/19b6c95f-b91b-4949-b72e-3f7fea0d1a49'},
  {'rel': 'bookmark', 'href':
  'https://199.19.213.27/compute/servers/19b6c95f-b91b-4949-b72e-3f7fea0d1a49'}],
  'OS-DCF:diskConfig': 'MANUAL', 'OS-EXT-AZ:availability_zone': '',
  'config_drive': '', 'key_name': None, 'OS-SRV-USG:launched_at': None,
  'OS-SRV-USG:terminated_at': None, 'OS-EXT-STS:task_state': None, 'OS-
  EXT-STS:vm_state': 'error', 'OS-EXT-STS:power_state': 0, 'os-extended-
  volumes:volumes_attached': [], 'locked': False, 'description': None,
  'tags': [], 'trusted_image_certificates': None, 'server_groups': []}
  
  screen-placement-api.txt:
  
  found no providers with 2147483647 NET_BW_IGR_KILOBIT_PER_SEC
  this ^ is expected for this part of the test
  
  OpenSearch query:
  
  message:"testtools.matchers._impl.MismatchError: 'fault' not in"
  
  Comments:
  
  This may be a race because ERROR state is set on the instance and save()'ed before the 'fault' record is created
  https://github.com/openstack/nova/blob/302e286408cce2c8df43d6742ca490405a20011d/nova/scheduler/utils.py#L902-L910
  and the test waits for ERROR state before checking for the 'fault' field, so maybe sometimes it GETs the instance before the fault was able to be added
  https://github.com/openstack/tempest/blob/a0b161bbde6d7734833a26ced76ca44b888fe152/tempest/scenario/test_network_qos_placement.py#L269-L276
+ 
+ Code:
+ 
+     vm_state = updates['vm_state']
+     properties = request_spec.get('instance_properties', {})
+     notifier = rpc.get_notifier(service)
+     state = vm_state.upper()
+     LOG.warning('Setting instance to %s state.', state,
+                 instance_uuid=instance_uuid)
+ 
+     instance = objects.Instance(context=context, uuid=instance_uuid,
+                                 **updates)
+     instance.obj_reset_changes(['uuid'])
+     instance.save()
+     compute_utils.add_instance_fault_from_exc(
+         context, instance, ex, sys.exc_info())
+ 
+ 
+ I wonder if it would be legit to swap the ordering to do compute_utils.add_instance_fault_from_exc() before instance.save() of ERROR state?

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2045921

Title:
  scheduler.utils.set_vm_state_and_notify() race between ERROR state
  save() and compute_utils.add_instance_fault_from_ex()

Status in OpenStack Compute (nova):
  New

Bug description:
  Maybe "race" isn't the right word, but the ordering of the
  Instance.save() of ERROR state and the creation of the instance.fault
  record can result in the instance.fault not having been created yet
  after the instance is visibly showing ERROR state.

  Seen in the gate:

  test_qos_min_bw_allocation_basic fails because the expected 'fault' field is missing in the server
  Symptom: test_qos_min_bw_allocation_basic fails because the expected 'fault' field is not found in the server from GET /servers/{server_id}

  testr_results.html:

  testtools.matchers._impl.MismatchError: 'fault' not in {'id':
  '19b6c95f-b91b-4949-b72e-3f7fea0d1a49', 'name': 'tempest-
  MinBwAllocationPlacementTest-server-1096644195', 'status': 'ERROR',
  'tenant_id': '1f52110a4e8649d78861c38daca6f179', 'user_id':
  '8ef9e2bc05034b03af2d7323155cb71f', 'metadata': {}, 'hostId': '',
  'image': {'id': '3cb38f9c-a86e-47c8-984f-74efc924120c', 'links':
  [{'rel': 'bookmark', 'href':
  'https://199.19.213.27/compute/images/3cb38f9c-a86e-47c8-984f-74efc924120c'}]},
  'flavor': {'vcpus': 1, 'ram': 128, 'disk': 1, 'ephemeral': 0, 'swap':
  0, 'original_name': 'm1.nano', 'extra_specs': {'hw_rng:allowed':
  'True'}}, 'created': '2023-12-07T15:00:24Z', 'updated':
  '2023-12-07T15:00:30Z', 'addresses': {}, 'accessIPv4': '',
  'accessIPv6': '', 'links': [{'rel': 'self', 'href':
  'https://199.19.213.27/compute/v2.1/servers/19b6c95f-b91b-4949-b72e-3f7fea0d1a49'},
  {'rel': 'bookmark', 'href':
  'https://199.19.213.27/compute/servers/19b6c95f-b91b-4949-b72e-3f7fea0d1a49'}],
  'OS-DCF:diskConfig': 'MANUAL', 'OS-EXT-AZ:availability_zone': '',
  'config_drive': '', 'key_name': None, 'OS-SRV-USG:launched_at': None,
  'OS-SRV-USG:terminated_at': None, 'OS-EXT-STS:task_state': None, 'OS-
  EXT-STS:vm_state': 'error', 'OS-EXT-STS:power_state': 0, 'os-extended-
  volumes:volumes_attached': [], 'locked': False, 'description': None,
  'tags': [], 'trusted_image_certificates': None, 'server_groups': []}

  screen-placement-api.txt:

  found no providers with 2147483647 NET_BW_IGR_KILOBIT_PER_SEC
  this ^ is expected for this part of the test

  OpenSearch query:

  message:"testtools.matchers._impl.MismatchError: 'fault' not in"

  Comments:

  This may be a race because ERROR state is set on the instance and save()'ed before the 'fault' record is created
  https://github.com/openstack/nova/blob/302e286408cce2c8df43d6742ca490405a20011d/nova/scheduler/utils.py#L902-L910
  and the test waits for ERROR state before checking for the 'fault' field, so maybe sometimes it GETs the instance before the fault was able to be added
  https://github.com/openstack/tempest/blob/a0b161bbde6d7734833a26ced76ca44b888fe152/tempest/scenario/test_network_qos_placement.py#L269-L276

  Code:

      vm_state = updates['vm_state']
      properties = request_spec.get('instance_properties', {})
      notifier = rpc.get_notifier(service)
      state = vm_state.upper()
      LOG.warning('Setting instance to %s state.', state,
                  instance_uuid=instance_uuid)

      instance = objects.Instance(context=context, uuid=instance_uuid,
                                  **updates)
      instance.obj_reset_changes(['uuid'])
      instance.save()
      compute_utils.add_instance_fault_from_exc(
          context, instance, ex, sys.exc_info())

  
  I wonder if it would be legit to swap the ordering to do compute_utils.add_instance_fault_from_exc() before instance.save() of ERROR state?

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2045921/+subscriptions