[Bug 1807723] Re: swap multiattach volume intermittently fails when servers are on different hosts
** No longer affects: cinder
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1807723
Title:
swap multiattach volume intermittently fails when servers are on
different hosts
Status in OpenStack Compute (nova):
Confirmed
Bug description:
This is found from debugging the
tempest.api.compute.admin.test_volume_swap.TestMultiAttachVolumeSwap
failures in the (multinode) tempest-slow job here:
https://review.openstack.org/#/c/606981/
The failure ultimately happens during teardown of the test class, when
it fails to delete the "old" multiattach volume:
http://logs.openstack.org/81/606981/4/check/tempest-
slow/fafde23/controller/logs/screen-c-vol.txt.gz?level=TRACE#_Dec_08_01_45_41_380161
Dec 08 01:45:41.380161 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: ERROR cinder.volume.targets.tgt [req-4aa23d64-309a-44a4-a7c9-381a6fbef330 req-f47e0ea9-7e80-4481-bce1-7a1060c6728b tempest-TestMultiAttachVolumeSwap-1803326020 None] Failed to remove iscsi target for Volume ID: 2a48ae08-2833-488c-8771-eea86f41c7b8: Unexpected error while running command.
Dec 08 01:45:41.380475 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Command: sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --force --delete iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8
Dec 08 01:45:41.380783 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Exit code: 22
Dec 08 01:45:41.381094 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Stdout: u'Command:\n\ttgtadm -C 0 --mode target --op delete --tid=1\nexited with code: 22.\n'
Dec 08 01:45:41.381386 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Stderr: u'tgtadm: this target is still active\n': ProcessExecutionError: Unexpected error while running command.
Dec 08 01:45:49.587081 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: ERROR cinder.volume.targets.tgt [None req-9dddc5b9-7e06-4ab9-83a6-ef211e0d4c48 tempest-TestMultiAttachVolumeSwap-1803326020 None] Failed to remove iscsi target for Volume ID: 2a48ae08-2833-488c-8771-eea86f41c7b8: Unexpected error while running command.
Dec 08 01:45:49.587355 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Command: sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --force --delete iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8
After digging through the logs, I can see at least one set of issues
in the ComputeManager.swap_volume code here:
https://github.com/openstack/nova/blob/ae3064b7a820ea02f7fc8a1aa4a41f35a06534f1/nova/compute/manager.py#L5798-L5806
That code updates the "old" bdm with values from the new volume,
including changing bdm.volume_id from the old volume to the new one,
which likely orphans the old record somehow, or at least messes up the
cleanup.
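To make the effect concrete, here is a minimal self-contained sketch
(plain dicts, no Nova imports) of what that update does to the old BDM
row, using the UUIDs and the "Updating volume ... BDM record" log
quoted further down; the 'old-attachment-uuid' placeholder is mine:

    # Minimal sketch of the in-place BDM update done by swap_volume,
    # using values from the logs below; 'old-attachment-uuid' is a
    # placeholder, not a real record.
    old_volume_id = '2a48ae08-2833-488c-8771-eea86f41c7b8'  # volume1
    new_volume_id = 'fa56e173-3bd4-4701-bc71-cb4810c53c51'  # volume2

    # The BDM row that was created when volume1 was attached to server1.
    bdm_for_volume1 = {
        'volume_id': old_volume_id,
        'attachment_id': 'old-attachment-uuid',
        'connection_info': '{"target_iqn": "...volume-2a48ae08-..."}',
    }

    # swap_volume overwrites that same row with the new volume's values.
    bdm_for_volume1.update({
        'volume_id': new_volume_id,
        'attachment_id': '769b75cc-4283-4b46-a975-243faae6e263',
        'connection_info': '{"target_iqn": "...volume-fa56e173-..."}',
    })

    # After this, no BDM row on server1 references volume1 any more, even
    # though volume1's iSCSI connection from the subnode may still exist,
    # which lines up with the "tgtadm: this target is still active"
    # failure when the test later deletes volume1.
    assert bdm_for_volume1['volume_id'] == new_volume_id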
My detailed notes follow:
server1:
{"server": {"OS-EXT-STS:task_state": null, "addresses": {}, "links":
[{"href":
"https://149.202.185.218/compute/v2.1/servers/d46fba31-9469-4799-b2bf-
1fbad4369a9a", "rel": "self"}, {"href":
"https://149.202.185.218/compute/servers/d46fba31-9469-4799-b2bf-
1fbad4369a9a", "rel": "bookmark"}], "image": {"id": "f55b4d12-f159
-453f-86cb-8965684121c8", "links": [{"href":
"https://149.202.185.218/compute/images/f55b4d12-f159-453f-86cb-
8965684121c8", "rel": "bookmark"}]}, "OS-EXT-STS:vm_state": "active",
"OS-SRV-USG:launched_at": "2018-12-08T01:43:11.000000", "flavor":
{"ephemeral": 0, "ram": 64, "original_name": "m1.nano", "vcpus": 1,
"extra_specs": {}, "swap": 0, "disk": 0}, "id": "d46fba31-9469-4799
-b2bf-1fbad4369a9a", "description": null, "user_id":
"9bc61597b6794ca49c79ce599c26b223", "OS-DCF:diskConfig": "MANUAL",
"accessIPv4": "", "accessIPv6": "", "progress": 0, "OS-EXT-
STS:power_state": 1, "OS-EXT-AZ:availability_zone": "nova",
"config_drive": "", "status": "ACTIVE", "updated":
"2018-12-08T01:43:11Z", "hostId":
"f4600b0fc614ce95d5be30fc1fc82bdb150c47ff8c84b22262be6d19", "OS-SRV-
USG:terminated_at": null, "tags": [], "key_name": null, "locked":
false, "name": "tempest-TestMultiAttachVolumeSwap-server-226416218-1",
"created": "2018-12-08T01:43:04Z", "tenant_id":
"5db98765c5d94a73bae8aba2f13c7762", "os-extended-
volumes:volumes_attached": [], "metadata": {}}}
server1.hostId = f4600b0fc614ce95d5be30fc1fc82bdb150c47ff8c84b22262be6d19
server1 builds on the subnode (ubuntu-xenial-ovh-gra1-0001066278)
server2:
{"server": {"OS-EXT-STS:task_state": null, "addresses": {}, "links":
[{"href": "https://149.202.185.218/compute/v2.1/servers/e0b74695-f9a0
-4f4c-8a08-a0de3420cc6c", "rel": "self"}, {"href":
"https://149.202.185.218/compute/servers/e0b74695-f9a0-4f4c-
8a08-a0de3420cc6c", "rel": "bookmark"}], "image": {"id":
"f55b4d12-f159-453f-86cb-8965684121c8", "links": [{"href":
"https://149.202.185.218/compute/images/f55b4d12-f159-453f-86cb-
8965684121c8", "rel": "bookmark"}]}, "OS-EXT-STS:vm_state": "active",
"OS-SRV-USG:launched_at": "2018-12-08T01:43:10.000000", "flavor":
{"ephemeral": 0, "ram": 64, "original_name": "m1.nano", "vcpus": 1,
"extra_specs": {}, "swap": 0, "disk": 0}, "id": "e0b74695-f9a0-4f4c-
8a08-a0de3420cc6c", "description": null, "user_id":
"9bc61597b6794ca49c79ce599c26b223", "OS-DCF:diskConfig": "MANUAL",
"accessIPv4": "", "accessIPv6": "", "progress": 0, "OS-EXT-
STS:power_state": 1, "OS-EXT-AZ:availability_zone": "nova",
"config_drive": "", "status": "ACTIVE", "updated":
"2018-12-08T01:43:11Z", "hostId":
"5f1a8abdf427d6cbda8003bd178ef40c605f24f626ec88c3a02816df", "OS-SRV-
USG:terminated_at": null, "tags": [], "key_name": null, "locked":
false, "name": "tempest-TestMultiAttachVolumeSwap-server-226416218-2",
"created": "2018-12-08T01:43:04Z", "tenant_id":
"5db98765c5d94a73bae8aba2f13c7762", "os-extended-
volumes:volumes_attached": [], "metadata": {}}}
server2.hostId = 5f1a8abdf427d6cbda8003bd178ef40c605f24f626ec88c3a02816df
server2 builds on the primary node (ubuntu-xenial-ovh-gra1-0001066161)
So clearly the two servers are on different hosts. Also, volume1 is on
the primary node (same as server2).
attach volume1 to server1:
Body: {"volumeAttachment": {"device": "/dev/vdb", "serverId":
"d46fba31-9469-4799-b2bf-1fbad4369a9a", "id":
"2a48ae08-2833-488c-8771-eea86f41c7b8", "volumeId":
"2a48ae08-2833-488c-8771-eea86f41c7b8"}} _log_request_full
tempest/lib/common/rest_client.py:437
attach volume1 to server2:
Body: {"volumeAttachment": {"device": "/dev/vdb", "serverId":
"e0b74695-f9a0-4f4c-8a08-a0de3420cc6c", "id":
"2a48ae08-2833-488c-8771-eea86f41c7b8", "volumeId":
"2a48ae08-2833-488c-8771-eea86f41c7b8"}} _log_request_full
tempest/lib/common/rest_client.py:437
swap volume1 (2a48ae08-2833-488c-8771-eea86f41c7b8) to volume2
(fa56e173-3bd4-4701-bc71-cb4810c53c51) on server1 (d46fba31-9469-4799
-b2bf-1fbad4369a9a):
2018-12-08 01:43:28.336 8815 INFO tempest.lib.common.rest_client [req-b6cb2f89-ebf6-47b5-9757-8692dd91c812 ] Request (TestMultiAttachVolumeSwap:test_volume_swap_with_multiattach): 202 PUT https://149.202.185.218/compute/v2.1/servers/d46fba31-9469-4799-b2bf-1fbad4369a9a/os-volume_attachments/2a48ae08-2833-488c-8771-eea86f41c7b8 0.618s
2018-12-08 01:43:28.337 8815 DEBUG tempest.lib.common.rest_client [req-b6cb2f89-ebf6-47b5-9757-8692dd91c812 ] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>', 'X-OpenStack-Nova-API-Version': '2.60'}
Body: {"volumeAttachment": {"volumeId": "fa56e173-3bd4-4701-bc71-cb4810c53c51"}}
Response - Headers: {u'x-openstack-request-id': 'req-b6cb2f89-ebf6-47b5-9757-8692dd91c812', u'content-length': '0', u'vary': 'OpenStack-API-Version,X-OpenStack-Nova-API-Version', u'server': 'Apache/2.4.18 (Ubuntu)', u'x-compute-request-id': 'req-b6cb2f89-ebf6-47b5-9757-8692dd91c812', u'content-type': 'application/json', 'content-location': 'https://149.202.185.218/compute/v2.1/servers/d46fba31-9469-4799-b2bf-1fbad4369a9a/os-volume_attachments/2a48ae08-2833-488c-8771-eea86f41c7b8', u'date': 'Sat, 08 Dec 2018 01:43:27 GMT', 'status': '202', u'openstack-api-version': 'compute 2.60', u'connection': 'close', u'x-openstack-nova-api-version': '2.60'}
Now the test waits for the swap to complete and for the attachments to
all be recorded correctly; once the swap is done, volume1 should be
attached only to server2 and volume2 should be attached to server1.
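As a minimal sketch of that expected end state (UUIDs taken from the
requests above):

    # Expected attachment state after the swap, per the test's intent;
    # UUIDs are the ones from the requests above.
    server1 = 'd46fba31-9469-4799-b2bf-1fbad4369a9a'
    server2 = 'e0b74695-f9a0-4f4c-8a08-a0de3420cc6c'
    volume1 = '2a48ae08-2833-488c-8771-eea86f41c7b8'
    volume2 = 'fa56e173-3bd4-4701-bc71-cb4810c53c51'

    # Before the swap: volume1 is attached to both servers.
    attachments_before = {volume1: {server1, server2}, volume2: set()}
    # After the swap on server1: volume1 stays attached only to server2,
    # and volume2 is attached to server1.
    attachments_after = {volume1: {server2}, volume2: {server1}}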
This is the connection_info for the old/new volume during the swap:
Dec 08 01:43:31.321263 ubuntu-xenial-ovh-gra1-0001066278 nova-
compute[20206]: DEBUG nova.compute.manager [None req-
b6cb2f89-ebf6-47b5-9757-8692dd91c812 tempest-
TestMultiAttachVolumeSwap-477435682 tempest-
TestMultiAttachVolumeSwap-477435682] [instance: d46fba31-9469-4799
-b2bf-1fbad4369a9a] swap_volume: Calling driver volume swap with
connection infos:
new: {'status': u'reserved', 'multiattach': True, 'detached_at': u'',
u'volume_id': u'fa56e173-3bd4-4701-bc71-cb4810c53c51', 'attach_mode':
u'null', 'driver_volume_type': u'iscsi', 'instance':
u'd46fba31-9469-4799-b2bf-1fbad4369a9a', 'attached_at': u'', 'serial':
u'fa56e173-3bd4-4701-bc71-cb4810c53c51', 'data': {u'access_mode':
u'rw', u'target_discovered': False, u'encrypted': False, u'qos_specs':
None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-
fa56e173-3bd4-4701-bc71-cb4810c53c51', u'target_portal':
u'213.32.74.238:3260', u'volume_id':
u'fa56e173-3bd4-4701-bc71-cb4810c53c51', u'target_lun': 1,
u'auth_password': u'***', u'auth_username': u'oAJ38A88Rom7xrwQ5A4Y',
u'auth_method': u'CHAP'}};
old: {u'status': u'reserved', u'instance': u'd46fba31-9469-4799-b2bf-
1fbad4369a9a', u'multiattach': True, u'attached_at': u'',
u'attach_mode': u'null', u'driver_volume_type': u'iscsi',
u'detached_at': u'', u'volume_id':
u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'serial':
u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'data': {u'device_path':
u'/dev/sda', u'target_discovered': False, u'encrypted': False,
u'qos_specs': None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-
2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_portal':
u'149.202.185.218:3260', u'volume_id':
u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'auth_password': u'***',
u'target_lun': 1, u'access_mode': u'rw', u'auth_username':
u'HMHytc239biDCT6NTgMa', u'auth_method': u'CHAP'}} {{(pid=20206)
_swap_volume /opt/stack/nova/nova/compute/manager.py:5646}}
When the test cleans up and tries to delete volume1, it fails trying
to clean up iqn.2010-10.org.openstack:volume-
2a48ae08-2833-488c-8771-eea86f41c7b8 from the subnode (which is the
old connection to server1).
volume1 connection is supposed to be detached from server1 here:
http://logs.openstack.org/81/606981/4/check/tempest-
slow/fafde23/compute1/logs/screen-n-cpu.txt.gz#_Dec_08_01_45_36_699753
Dec 08 01:45:36.699753 ubuntu-xenial-ovh-gra1-0001066278 nova-
compute[20206]: DEBUG os_brick.initiator.connectors.iscsi [None req-
b6cb2f89-ebf6-47b5-9757-8692dd91c812 tempest-
TestMultiAttachVolumeSwap-477435682 tempest-
TestMultiAttachVolumeSwap-477435682] ==> disconnect_volume: call
u"{'args': (<os_brick.initiator.connectors.iscsi.ISCSIConnector object
at 0x7fe1aa096d50>, {u'device_path': u'/dev/sda',
u'target_discovered': False, u'encrypted': False, u'qos_specs': None,
u'target_iqn': u'iqn.2010-10.org.openstack:volume-
2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_portal':
u'149.202.185.218:3260', u'volume_id':
u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'auth_password': u'***',
u'target_lun': 1, u'access_mode': u'rw', u'auth_username':
u'HMHytc239biDCT6NTgMa', u'auth_method': u'CHAP'}, None), 'kwargs':
{}}" {{(pid=20206) trace_logging_wrapper /usr/local/lib/python2.7
/dist-packages/os_brick/utils.py:146}}
It says it logged out:
Dec 08 01:45:37.336156 ubuntu-xenial-ovh-gra1-0001066278 nova-
compute[20206]: Logout of [sid: 1, target: iqn.2010-10.org.openstack
:volume-2a48ae08-2833-488c-8771-eea86f41c7b8, portal:
149.202.185.218,3260] successful.
This definitely looks wrong (post-swap, updating the old BDM record): it is updating the BDM for volume1 (2a48ae08-2833-488c-8771-eea86f41c7b8) but with the connection_info of volume2 (fa56e173-3bd4-4701-bc71-cb4810c53c51). For the new volume attach flow the connection_info might not matter, since we likely don't use it (we just work with the attachment records), but if
the attachment_id (769b75cc-4283-4b46-a975-243faae6e263) belongs to the wrong volume, that could be a problem:
Dec 08 01:45:37.699221 ubuntu-xenial-ovh-gra1-0001066278 nova-
compute[20206]: DEBUG nova.compute.manager [None req-
b6cb2f89-ebf6-47b5-9757-8692dd91c812 tempest-
TestMultiAttachVolumeSwap-477435682 tempest-
TestMultiAttachVolumeSwap-477435682] [instance: d46fba31-9469-4799
-b2bf-1fbad4369a9a] swap_volume: Updating volume
2a48ae08-2833-488c-8771-eea86f41c7b8 BDM record with {'attachment_id':
u'769b75cc-4283-4b46-a975-243faae6e263', 'no_device': None,
'connection_info': '{"status": "reserved", "multiattach": true,
"detached_at": "", "volume_id":
"fa56e173-3bd4-4701-bc71-cb4810c53c51", "attach_mode": "null",
"driver_volume_type": "iscsi", "instance": "d46fba31-9469-4799-b2bf-
1fbad4369a9a", "attached_at": "", "serial":
"fa56e173-3bd4-4701-bc71-cb4810c53c51", "data": {"access_mode": "rw",
"target_discovered": false, "encrypted": false, "qos_specs": null,
"target_iqn": "iqn.2010-10.org.openstack:volume-
fa56e173-3bd4-4701-bc71-cb4810c53c51", "target_portal":
"213.32.74.238:3260", "volume_id":
"fa56e173-3bd4-4701-bc71-cb4810c53c51", "target_lun": 1,
"device_path": "/dev/sdb", "auth_password": "Ko6DLxWNfrbqWQ7E",
"auth_username": "oAJ38A88Rom7xrwQ5A4Y", "auth_method": "CHAP"}}',
'volume_id': u'fa56e173-3bd4-4701-bc71-cb4810c53c51', 'source_type':
'volume', 'snapshot_id': None, 'destination_type': 'volume'}
{{(pid=20206) swap_volume
/opt/stack/nova/nova/compute/manager.py:5819}}
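One way to sanity-check that attachment_id would be to ask Cinder
which volume it belongs to; a rough sketch using a plain REST call
against the attachments API (requires microversion >= 3.27; the base
URL and token below are placeholders, not values from this job):

    # Sketch: look up which volume/instance an attachment_id belongs to
    # via the Cinder attachments API (microversion >= 3.27). The base
    # URL and token are placeholders, not values from this job.
    import requests

    cinder_base = 'https://<cinder-host>/volume/v3/<project_id>'  # placeholder
    token = '<keystone-token>'  # placeholder
    attachment_id = '769b75cc-4283-4b46-a975-243faae6e263'

    resp = requests.get(
        '%s/attachments/%s' % (cinder_base, attachment_id),
        headers={'X-Auth-Token': token,
                 'OpenStack-API-Version': 'volume 3.27'})
    resp.raise_for_status()
    attachment = resp.json()['attachment']
    # If this shows volume1 (2a48ae08-...) rather than volume2
    # (fa56e173-...), the BDM above recorded an attachment_id for the
    # wrong volume.
    print(attachment['volume_id'], attachment['instance'])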
Probably the bigger problem is that the BDM for volume1 is updated to
have volume_id fa56e173-3bd4-4701-bc71-cb4810c53c51. Changing the
volume_id on volume1's BDM might have something to do with orphaning
volume1's connections when we go to delete it, since deleting server2
(which still has volume1 attached) likely doesn't clean up volume1
properly.
The other things I'm wondering about are:
1. Is nova-compute deleting the old volume1/server1 attachment record
here:
https://github.com/openstack/nova/blob/ae3064b7a820ea02f7fc8a1aa4a41f35a06534f1/nova/compute/manager.py#L5691
I didn't see anything about an attachment delete in the logs (see the
sketch after this list).
2. Is there a periodic task in the cinder-volume service that
recreates the target connection for volume1 and server1 in the window
after nova-compute disconnects the old connection during the swap but
before it deletes the old attachment record?
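For question 1, the kind of call I would expect to see logged around
the linked line is roughly this (a from-memory sketch of the new-style
attach flow, wrapped in a hypothetical helper, not verbatim Nova
source):

    # From-memory sketch (not verbatim Nova source) of the old-attachment
    # cleanup expected around the linked _swap_volume code; the helper
    # name is hypothetical.
    def cleanup_old_attachment(volume_api, context, old_attachment_id):
        if old_attachment_id is not None:
            # nova.volume.cinder.API.attachment_delete() issues
            # DELETE /v3/attachments/{id}, which should remove the
            # volume1/server1 attachment record and let Cinder tear down
            # the corresponding export/target.
            volume_api.attachment_delete(context, old_attachment_id)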
The first and most obvious thing to do is fix that bdm update code at
the end of the ComputeManager.swap_volume method. I'm not exactly sure
yet how the multi-host nature of this comes into play, except possibly
in the libvirt driver volume disconnect code here:
https://github.com/openstack/nova/blob/ae3064b7a820ea02f7fc8a1aa4a41f35a06534f1/nova/virt/libvirt/driver.py#L1221
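From memory, the logic around that line decides whether to actually
disconnect a multiattach volume from the host by checking whether any
other instance on the same host still uses it; a rough illustrative
sketch (the names are mine, not the exact Nova code):

    # Illustrative sketch (names are mine) of the multiattach disconnect
    # check in the libvirt driver: only disconnect the volume from the
    # host if no other instance on this same host still uses it.
    def should_disconnect_volume(volume_id, attachments_on_this_host):
        """attachments_on_this_host: iterable of (instance_uuid, volume_id)."""
        in_use = sum(1 for _, vol in attachments_on_this_host
                     if vol == volume_id)
        # In this test, volume1 is attached to server1 on the subnode and
        # to server2 on the primary node, so each host only sees its own
        # attachment when making this decision.
        return in_use <= 1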
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1807723/+subscriptions