[Bug 1807723] Re: swap multiattach volume intermittently fails when servers are on different hosts

 

** No longer affects: cinder

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1807723

Title:
  swap multiattach volume intermittently fails when servers are on
  different hosts

Status in OpenStack Compute (nova):
  Confirmed

Bug description:
  This is found from debugging the
  tempest.api.compute.admin.test_volume_swap.TestMultiAttachVolumeSwap
  failures in the (multinode) tempest-slow job here:

  https://review.openstack.org/#/c/606981/

  The failure ultimately happens during teardown of the test class, which
  fails to delete the "old" multiattach volume:

  http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/controller/logs/screen-c-vol.txt.gz?level=TRACE#_Dec_08_01_45_41_380161

  Dec 08 01:45:41.380161 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: ERROR cinder.volume.targets.tgt [req-4aa23d64-309a-44a4-a7c9-381a6fbef330 req-f47e0ea9-7e80-4481-bce1-7a1060c6728b tempest-TestMultiAttachVolumeSwap-1803326020 None] Failed to remove iscsi target for Volume ID: 2a48ae08-2833-488c-8771-eea86f41c7b8: Unexpected error while running command.
  Dec 08 01:45:41.380475 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Command: sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --force --delete iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8
  Dec 08 01:45:41.380783 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Exit code: 22
  Dec 08 01:45:41.381094 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Stdout: u'Command:\n\ttgtadm -C 0 --mode target --op delete --tid=1\nexited with code: 22.\n'
  Dec 08 01:45:41.381386 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Stderr: u'tgtadm: this target is still active\n': ProcessExecutionError: Unexpected error while running command.
  Dec 08 01:45:49.587081 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: ERROR cinder.volume.targets.tgt [None req-9dddc5b9-7e06-4ab9-83a6-ef211e0d4c48 tempest-TestMultiAttachVolumeSwap-1803326020 None] Failed to remove iscsi target for Volume ID: 2a48ae08-2833-488c-8771-eea86f41c7b8: Unexpected error while running command.
  Dec 08 01:45:49.587355 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Command: sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --force --delete iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8

  After digging through the logs, I can see at least one set of issues
  in the ComputeManager.swap_volume code here:

  https://github.com/openstack/nova/blob/ae3064b7a820ea02f7fc8a1aa4a41f35a06534f1/nova/compute/manager.py#L5798-L5806

  That code updates the "old" BDM with values from the new volume,
  including changing bdm.volume_id from the old volume's ID to the new
  one, which likely orphans the old record or breaks the cleanup (see
  the sketch below).
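
  To make that concrete, here is a paraphrased sketch of the update,
  using plain dicts rather than the actual Nova BDM object; the field
  names and values mirror the "Updating volume ... BDM record" DEBUG log
  quoted further down in this report, they are not copied from the Nova
  source:

  import json

  VOLUME1 = '2a48ae08-2833-488c-8771-eea86f41c7b8'  # old volume
  VOLUME2 = 'fa56e173-3bd4-4701-bc71-cb4810c53c51'  # new volume

  # BDM row describing server1's attachment of the OLD volume before
  # the swap (placeholder values, for illustration only).
  old_bdm = {
      'volume_id': VOLUME1,
      'attachment_id': '<old attachment id>',
      'connection_info': '<old volume connection_info JSON>',
  }

  # At the end of swap_volume the same row is overwritten with the NEW
  # volume's data, including volume_id itself.
  old_bdm.update({
      'volume_id': VOLUME2,
      'attachment_id': '769b75cc-4283-4b46-a975-243faae6e263',
      'connection_info': json.dumps({'volume_id': VOLUME2,
                                     'serial': VOLUME2}),  # abbreviated
      'source_type': 'volume',
      'destination_type': 'volume',
      'snapshot_id': None,
      'no_device': None,
  })

  assert old_bdm['volume_id'] == VOLUME2
  # i.e. after the swap, no BDM row for server1 references volume1 at all.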

  My detailed notes follow:

  server1:

  {"server": {"OS-EXT-STS:task_state": null, "addresses": {}, "links":
  [{"href":
  "https://149.202.185.218/compute/v2.1/servers/d46fba31-9469-4799-b2bf-
  1fbad4369a9a", "rel": "self"}, {"href":
  "https://149.202.185.218/compute/servers/d46fba31-9469-4799-b2bf-
  1fbad4369a9a", "rel": "bookmark"}], "image": {"id": "f55b4d12-f159
  -453f-86cb-8965684121c8", "links": [{"href":
  "https://149.202.185.218/compute/images/f55b4d12-f159-453f-86cb-
  8965684121c8", "rel": "bookmark"}]}, "OS-EXT-STS:vm_state": "active",
  "OS-SRV-USG:launched_at": "2018-12-08T01:43:11.000000", "flavor":
  {"ephemeral": 0, "ram": 64, "original_name": "m1.nano", "vcpus": 1,
  "extra_specs": {}, "swap": 0, "disk": 0}, "id": "d46fba31-9469-4799
  -b2bf-1fbad4369a9a", "description": null, "user_id":
  "9bc61597b6794ca49c79ce599c26b223", "OS-DCF:diskConfig": "MANUAL",
  "accessIPv4": "", "accessIPv6": "", "progress": 0, "OS-EXT-
  STS:power_state": 1, "OS-EXT-AZ:availability_zone": "nova",
  "config_drive": "", "status": "ACTIVE", "updated":
  "2018-12-08T01:43:11Z", "hostId":
  "f4600b0fc614ce95d5be30fc1fc82bdb150c47ff8c84b22262be6d19", "OS-SRV-
  USG:terminated_at": null, "tags": [], "key_name": null, "locked":
  false, "name": "tempest-TestMultiAttachVolumeSwap-server-226416218-1",
  "created": "2018-12-08T01:43:04Z", "tenant_id":
  "5db98765c5d94a73bae8aba2f13c7762", "os-extended-
  volumes:volumes_attached": [], "metadata": {}}}

  server1.hostId = f4600b0fc614ce95d5be30fc1fc82bdb150c47ff8c84b22262be6d19
  server1 builds on the subnode (ubuntu-xenial-ovh-gra1-0001066278)

  server2:

  {"server": {"OS-EXT-STS:task_state": null, "addresses": {}, "links":
  [{"href": "https://149.202.185.218/compute/v2.1/servers/e0b74695-f9a0
  -4f4c-8a08-a0de3420cc6c", "rel": "self"}, {"href":
  "https://149.202.185.218/compute/servers/e0b74695-f9a0-4f4c-
  8a08-a0de3420cc6c", "rel": "bookmark"}], "image": {"id":
  "f55b4d12-f159-453f-86cb-8965684121c8", "links": [{"href":
  "https://149.202.185.218/compute/images/f55b4d12-f159-453f-86cb-
  8965684121c8", "rel": "bookmark"}]}, "OS-EXT-STS:vm_state": "active",
  "OS-SRV-USG:launched_at": "2018-12-08T01:43:10.000000", "flavor":
  {"ephemeral": 0, "ram": 64, "original_name": "m1.nano", "vcpus": 1,
  "extra_specs": {}, "swap": 0, "disk": 0}, "id": "e0b74695-f9a0-4f4c-
  8a08-a0de3420cc6c", "description": null, "user_id":
  "9bc61597b6794ca49c79ce599c26b223", "OS-DCF:diskConfig": "MANUAL",
  "accessIPv4": "", "accessIPv6": "", "progress": 0, "OS-EXT-
  STS:power_state": 1, "OS-EXT-AZ:availability_zone": "nova",
  "config_drive": "", "status": "ACTIVE", "updated":
  "2018-12-08T01:43:11Z", "hostId":
  "5f1a8abdf427d6cbda8003bd178ef40c605f24f626ec88c3a02816df", "OS-SRV-
  USG:terminated_at": null, "tags": [], "key_name": null, "locked":
  false, "name": "tempest-TestMultiAttachVolumeSwap-server-226416218-2",
  "created": "2018-12-08T01:43:04Z", "tenant_id":
  "5db98765c5d94a73bae8aba2f13c7762", "os-extended-
  volumes:volumes_attached": [], "metadata": {}}}

  server2.hostId = 5f1a8abdf427d6cbda8003bd178ef40c605f24f626ec88c3a02816df
  server2 builds on the primary node (ubuntu-xenial-ovh-gra1-0001066161)

  So clearly the two servers are on different hosts. Also, volume1 is on
  the primary node (same as server2).

  attach volume1 to server1:

  Body: {"volumeAttachment": {"device": "/dev/vdb", "serverId":
  "d46fba31-9469-4799-b2bf-1fbad4369a9a", "id":
  "2a48ae08-2833-488c-8771-eea86f41c7b8", "volumeId":
  "2a48ae08-2833-488c-8771-eea86f41c7b8"}} _log_request_full
  tempest/lib/common/rest_client.py:437

  attach volume1 to server2:

  Body: {"volumeAttachment": {"device": "/dev/vdb", "serverId":
  "e0b74695-f9a0-4f4c-8a08-a0de3420cc6c", "id":
  "2a48ae08-2833-488c-8771-eea86f41c7b8", "volumeId":
  "2a48ae08-2833-488c-8771-eea86f41c7b8"}} _log_request_full
  tempest/lib/common/rest_client.py:437

  swap volume1 (2a48ae08-2833-488c-8771-eea86f41c7b8) to volume2
  (fa56e173-3bd4-4701-bc71-cb4810c53c51) on server1
  (d46fba31-9469-4799-b2bf-1fbad4369a9a):

  2018-12-08 01:43:28.336 8815 INFO tempest.lib.common.rest_client [req-b6cb2f89-ebf6-47b5-9757-8692dd91c812 ] Request (TestMultiAttachVolumeSwap:test_volume_swap_with_multiattach): 202 PUT https://149.202.185.218/compute/v2.1/servers/d46fba31-9469-4799-b2bf-1fbad4369a9a/os-volume_attachments/2a48ae08-2833-488c-8771-eea86f41c7b8 0.618s
  2018-12-08 01:43:28.337 8815 DEBUG tempest.lib.common.rest_client [req-b6cb2f89-ebf6-47b5-9757-8692dd91c812 ] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>', 'X-OpenStack-Nova-API-Version': '2.60'}
          Body: {"volumeAttachment": {"volumeId": "fa56e173-3bd4-4701-bc71-cb4810c53c51"}}
      Response - Headers: {u'x-openstack-request-id': 'req-b6cb2f89-ebf6-47b5-9757-8692dd91c812', u'content-length': '0', u'vary': 'OpenStack-API-Version,X-OpenStack-Nova-API-Version', u'server': 'Apache/2.4.18 (Ubuntu)', u'x-compute-request-id': 'req-b6cb2f89-ebf6-47b5-9757-8692dd91c812', u'content-type': 'application/json', 'content-location': 'https://149.202.185.218/compute/v2.1/servers/d46fba31-9469-4799-b2bf-1fbad4369a9a/os-volume_attachments/2a48ae08-2833-488c-8771-eea86f41c7b8', u'date': 'Sat, 08 Dec 2018 01:43:27 GMT', 'status': '202', u'openstack-api-version': 'compute 2.60', u'connection': 'close', u'x-openstack-nova-api-version': '2.60'}

  The test now waits for the swap to complete and for the attachments to
  all be recorded correctly; once the swap is done, volume1 should be
  attached only to server2 and volume2 should be attached to server1.
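
  For reference, a minimal sketch of checking that end state out of band
  via the os-volume_attachments API (the token is a placeholder and the
  assertions are mine; the endpoint and IDs are the ones quoted in this
  report):

  import requests

  COMPUTE_URL = 'https://149.202.185.218/compute/v2.1'
  TOKEN = '<keystone token>'  # placeholder
  SERVER1 = 'd46fba31-9469-4799-b2bf-1fbad4369a9a'
  SERVER2 = 'e0b74695-f9a0-4f4c-8a08-a0de3420cc6c'
  VOLUME1 = '2a48ae08-2833-488c-8771-eea86f41c7b8'
  VOLUME2 = 'fa56e173-3bd4-4701-bc71-cb4810c53c51'

  def attached_volumes(server_id):
      # GET /servers/{id}/os-volume_attachments lists the volumes nova
      # believes are attached to the server.
      resp = requests.get(
          '%s/servers/%s/os-volume_attachments' % (COMPUTE_URL, server_id),
          headers={'X-Auth-Token': TOKEN})
      resp.raise_for_status()
      return {a['volumeId'] for a in resp.json()['volumeAttachments']}

  # Expected once the swap has finished:
  assert attached_volumes(SERVER1) == {VOLUME2}
  assert attached_volumes(SERVER2) == {VOLUME1}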

  This is the connection_info for the old/new volume during the swap:

  Dec 08 01:43:31.321263 ubuntu-xenial-ovh-gra1-0001066278 nova-
  compute[20206]: DEBUG nova.compute.manager [None req-
  b6cb2f89-ebf6-47b5-9757-8692dd91c812 tempest-
  TestMultiAttachVolumeSwap-477435682 tempest-
  TestMultiAttachVolumeSwap-477435682] [instance: d46fba31-9469-4799
  -b2bf-1fbad4369a9a] swap_volume: Calling driver volume swap with
  connection infos:

  new: {'status': u'reserved', 'multiattach': True, 'detached_at': u'',
  u'volume_id': u'fa56e173-3bd4-4701-bc71-cb4810c53c51', 'attach_mode':
  u'null', 'driver_volume_type': u'iscsi', 'instance':
  u'd46fba31-9469-4799-b2bf-1fbad4369a9a', 'attached_at': u'', 'serial':
  u'fa56e173-3bd4-4701-bc71-cb4810c53c51', 'data': {u'access_mode':
  u'rw', u'target_discovered': False, u'encrypted': False, u'qos_specs':
  None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-
  fa56e173-3bd4-4701-bc71-cb4810c53c51', u'target_portal':
  u'213.32.74.238:3260', u'volume_id':
  u'fa56e173-3bd4-4701-bc71-cb4810c53c51', u'target_lun': 1,
  u'auth_password': u'***', u'auth_username': u'oAJ38A88Rom7xrwQ5A4Y',
  u'auth_method': u'CHAP'}};

  old: {u'status': u'reserved', u'instance': u'd46fba31-9469-4799-b2bf-
  1fbad4369a9a', u'multiattach': True, u'attached_at': u'',
  u'attach_mode': u'null', u'driver_volume_type': u'iscsi',
  u'detached_at': u'', u'volume_id':
  u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'serial':
  u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'data': {u'device_path':
  u'/dev/sda', u'target_discovered': False, u'encrypted': False,
  u'qos_specs': None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-
  2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_portal':
  u'149.202.185.218:3260', u'volume_id':
  u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'auth_password': u'***',
  u'target_lun': 1, u'access_mode': u'rw', u'auth_username':
  u'HMHytc239biDCT6NTgMa', u'auth_method': u'CHAP'}} {{(pid=20206)
  _swap_volume /opt/stack/nova/nova/compute/manager.py:5646}}

  When the test cleans up and tries to delete volume1, it fails trying
  to clean up iqn.2010-10.org.openstack:volume-
  2a48ae08-2833-488c-8771-eea86f41c7b8 from the subnode (which is the
  old connection to server1).
  volume1 connection is supposed to be detached from server1 here:

  http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/compute1/logs/screen-n-cpu.txt.gz#_Dec_08_01_45_36_699753

  Dec 08 01:45:36.699753 ubuntu-xenial-ovh-gra1-0001066278 nova-
  compute[20206]: DEBUG os_brick.initiator.connectors.iscsi [None req-
  b6cb2f89-ebf6-47b5-9757-8692dd91c812 tempest-
  TestMultiAttachVolumeSwap-477435682 tempest-
  TestMultiAttachVolumeSwap-477435682] ==> disconnect_volume: call
  u"{'args': (<os_brick.initiator.connectors.iscsi.ISCSIConnector object
  at 0x7fe1aa096d50>, {u'device_path': u'/dev/sda',
  u'target_discovered': False, u'encrypted': False, u'qos_specs': None,
  u'target_iqn': u'iqn.2010-10.org.openstack:volume-
  2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_portal':
  u'149.202.185.218:3260', u'volume_id':
  u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'auth_password': u'***',
  u'target_lun': 1, u'access_mode': u'rw', u'auth_username':
  u'HMHytc239biDCT6NTgMa', u'auth_method': u'CHAP'}, None), 'kwargs':
  {}}" {{(pid=20206) trace_logging_wrapper /usr/local/lib/python2.7
  /dist-packages/os_brick/utils.py:146}}

  It says it logged out:

  Dec 08 01:45:37.336156 ubuntu-xenial-ovh-gra1-0001066278 nova-
  compute[20206]: Logout of [sid: 1, target: iqn.2010-10.org.openstack
  :volume-2a48ae08-2833-488c-8771-eea86f41c7b8, portal:
  149.202.185.218,3260] successful.

  This definitely looks wrong (post-swap, we are updating the old BDM
  record): the BDM for volume1 (2a48ae08-2833-488c-8771-eea86f41c7b8) is
  being updated with the connection_info for volume2
  (fa56e173-3bd4-4701-bc71-cb4810c53c51). For the new-style attach flow
  the connection_info might not matter, since we likely don't use it (we
  just work with the attachment records), but if the attachment_id
  (769b75cc-4283-4b46-a975-243faae6e263) belongs to the wrong volume,
  that could be a problem:

  Dec 08 01:45:37.699221 ubuntu-xenial-ovh-gra1-0001066278 nova-
  compute[20206]: DEBUG nova.compute.manager [None req-
  b6cb2f89-ebf6-47b5-9757-8692dd91c812 tempest-
  TestMultiAttachVolumeSwap-477435682 tempest-
  TestMultiAttachVolumeSwap-477435682] [instance: d46fba31-9469-4799
  -b2bf-1fbad4369a9a] swap_volume: Updating volume
  2a48ae08-2833-488c-8771-eea86f41c7b8 BDM record with {'attachment_id':
  u'769b75cc-4283-4b46-a975-243faae6e263', 'no_device': None,
  'connection_info': '{"status": "reserved", "multiattach": true,
  "detached_at": "", "volume_id":
  "fa56e173-3bd4-4701-bc71-cb4810c53c51", "attach_mode": "null",
  "driver_volume_type": "iscsi", "instance": "d46fba31-9469-4799-b2bf-
  1fbad4369a9a", "attached_at": "", "serial":
  "fa56e173-3bd4-4701-bc71-cb4810c53c51", "data": {"access_mode": "rw",
  "target_discovered": false, "encrypted": false, "qos_specs": null,
  "target_iqn": "iqn.2010-10.org.openstack:volume-
  fa56e173-3bd4-4701-bc71-cb4810c53c51", "target_portal":
  "213.32.74.238:3260", "volume_id":
  "fa56e173-3bd4-4701-bc71-cb4810c53c51", "target_lun": 1,
  "device_path": "/dev/sdb", "auth_password": "Ko6DLxWNfrbqWQ7E",
  "auth_username": "oAJ38A88Rom7xrwQ5A4Y", "auth_method": "CHAP"}}',
  'volume_id': u'fa56e173-3bd4-4701-bc71-cb4810c53c51', 'source_type':
  'volume', 'snapshot_id': None, 'destination_type': 'volume'}
  {{(pid=20206) swap_volume
  /opt/stack/nova/nova/compute/manager.py:5819}}

  Probably the bigger problem is that the BDM for volume1 is updated to
  have volume_id fa56e173-3bd4-4701-bc71-cb4810c53c51. Changing the
  volume_id on volume1's BDM may be what orphans volume1's connections
  when we later try to delete it, since deleting server2 (which still
  has volume1 attached) likely doesn't clean up volume1 properly.

  The other things I'm wondering about are:

  1. Is nova-compute deleting the old volume1/server1 attachment record
  here:

  https://github.com/openstack/nova/blob/ae3064b7a820ea02f7fc8a1aa4a41f35a06534f1/nova/compute/manager.py#L5691

  I ask because I didn't see anything about an attachment delete in the
  logs (one way to check from the cinder side is sketched after these
  questions).

  2. Is there a periodic task in the cinder-volume service that
  re-creates the target connection for volume1 and server1 after we swap
  the volume and nova-compute disconnects the connection, but before
  nova-compute deletes the old attachment record?
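
  Regarding question 1, a minimal sketch of checking from the cinder
  side which attachment records still exist for volume1 (the endpoint
  layout and token are assumptions for a devstack-style deployment; the
  project id is the tenant_id from the server dumps above):

  import requests

  VOLUME_URL = ('https://149.202.185.218/volume/v3/'
                '5db98765c5d94a73bae8aba2f13c7762')  # assumed endpoint layout
  TOKEN = '<keystone token>'  # placeholder
  VOLUME1 = '2a48ae08-2833-488c-8771-eea86f41c7b8'

  resp = requests.get('%s/volumes/%s' % (VOLUME_URL, VOLUME1),
                      headers={'X-Auth-Token': TOKEN})
  resp.raise_for_status()
  for att in resp.json()['volume']['attachments']:
      # If the old server1 attachment record (and its attachment_id) is
      # still listed here after the swap, nova-compute never deleted it.
      print(att.get('attachment_id'), att.get('server_id'))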

  The first and most obvious thing to do is fix that bdm update code at
  the end of the ComputeManager.swap_volume method. I'm not exactly sure
  yet how the multi-host nature of this comes into play, except possibly
  in the libvirt driver volume disconnect code here:

  https://github.com/openstack/nova/blob/ae3064b7a820ea02f7fc8a1aa4a41f35a06534f1/nova/virt/libvirt/driver.py#L1221

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1807723/+subscriptions

