yahoo-eng-team team mailing list archive

[Bug 1681998] [NEW] Bypass the dirty BDM entry no matter how it is produced

 

Public bug reported:

Sometimes a dirty BDM entry (row 1 below) can be seen in the database:
multiple BDMs exist with the same volume_id and instance_uuid.

mysql> select * from block_device_mapping where volume_id='153bcab4-1f88-440c-9782-3c661a7502a8' \G
*************************** 1. row ***************************
           created_at: 2017-02-02 02:28:45
           updated_at: NULL
           deleted_at: NULL
                   id: 9754
          device_name: /dev/vdb
delete_on_termination: 0
          snapshot_id: NULL
            volume_id: 153bcab4-1f88-440c-9782-3c661a7502a8
          volume_size: NULL
            no_device: NULL
      connection_info: NULL
        instance_uuid: b52f9264-d8b3-406a-bf9b-d7d7471b13fc
              deleted: 0
          source_type: volume
     destination_type: volume
         guest_format: NULL
          device_type: NULL
             disk_bus: NULL
           boot_index: NULL
             image_id: NULL
*************************** 2. row ***************************
           created_at: 2017-02-02 02:29:31
           updated_at: 2017-02-27 10:59:42
           deleted_at: NULL
                   id: 9757
          device_name: /dev/vdc
delete_on_termination: 0
          snapshot_id: NULL
            volume_id: 153bcab4-1f88-440c-9782-3c661a7502a8
          volume_size: NULL
            no_device: NULL
      connection_info: {"driver_volume_type": "rbd", "serial": "153bcab4-1f88-440c-9782-3c661a7502a8", "data": {"secret_type": "ceph", "name": "cinder-ceph/volume-153bcab4-1f88-440c-9782-3c661a7502a8", "secret_uuid": null, "qos_specs": null, "hosts": ["10.7.1.202", "10.7.1.203", "10.7.1.204"], "auth_enabled": true, "access_mode": "rw", "auth_username": "cinder-ceph", "ports": ["6789", "6789", "6789"]}}
        instance_uuid: b52f9264-d8b3-406a-bf9b-d7d7471b13fc
              deleted: 0
          source_type: volume
     destination_type: volume
         guest_format: NULL
          device_type: disk
             disk_bus: virtio
           boot_index: NULL
             image_id: NULL

This then causes the volume detach to fail with the following error,
since the connection_info of row 1 is NULL.

2017-03-23 13:28:05.360 1865733 TRACE oslo_messaging.rpc.dispatcher self._detach_volume(context, instance, bdm)
2017-03-23 13:28:05.360 1865733 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 4801, in _detach_volume
2017-03-23 13:28:05.360 1865733 TRACE oslo_messaging.rpc.dispatcher connection_info = jsonutils.loads(bdm.connection_info)
2017-03-23 13:28:05.360 1865733 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_serialization/jsonutils.py", line 215, in loads
2017-03-23 13:28:05.360 1865733 TRACE oslo_messaging.rpc.dispatcher return json.loads(encodeutils.safe_decode(s, encoding), **kwargs)
2017-03-23 13:28:05.360 1865733 TRACE oslo_messaging.rpc.dispatcher File "/usr/lib/python2.7/dist-packages/oslo_utils/encodeutils.py", line 33, in safe_decode
2017-03-23 13:28:05.360 1865733 TRACE oslo_messaging.rpc.dispatcher raise TypeError("%s can't be decoded" % type(text))
2017-03-23 13:28:05.360 1865733 TRACE oslo_messaging.rpc.dispatcher TypeError: <type 'NoneType'> can't be decoded
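
For reference, this failure is easy to reproduce outside nova. Below is a minimal sketch that mirrors the safe_decode() behaviour shown in the traceback (assumption: this is a simplified stand-in, not the exact oslo code):

import json

def safe_decode(text, encoding='utf-8'):
    # Like oslo's safe_decode, reject anything that is not bytes/str --
    # which is exactly what a NULL connection_info column (None) runs into.
    if not isinstance(text, (bytes, str)):
        raise TypeError("%s can't be decoded" % type(text))
    return text.decode(encoding) if isinstance(text, bytes) else text

def loads(s):
    return json.loads(safe_decode(s))

loads(None)  # raises TypeError: ... can't be decoded, as in the trace above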

This kind of dirty data can be produced when the volume_bdm.destroy() call in _attach_volume() [1] fails to run (see the sketch after this list). I think the following conditions may cause that to happen:
1. losing the database connection during the volume_bdm.destroy() operation
2. losing the MQ connection, or the RPC timing out, during the volume_bdm.destroy() operation
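
For illustration, here is a simplified, hypothetical sketch of that window in _attach_volume() [1] (assumption: the class and helper names are stand-ins, not the real nova code):

class VolumeBDM(object):
    def __init__(self, volume_id):
        self.volume_id = volume_id
        self.connection_info = None  # only populated once the attach succeeds

    def destroy(self):
        # In nova this deletes the DB row. If the DB or MQ connection is
        # lost right here, the NULL-connection_info row is never removed
        # and becomes the dirty row 1 shown above.
        print('deleting BDM row for volume %s' % self.volume_id)

def attach_volume(volume_id, do_attach):
    volume_bdm = VolumeBDM(volume_id)  # the row is created before the attach
    try:
        do_attach(volume_bdm)
    except Exception:
        volume_bdm.destroy()  # the cleanup this bug depends on failing
        raise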

If you lose the database during any operation, things are going to be
bad, so in general I'm not sure how realistic guarding against that case
is. Losing the MQ connection or the RPC timing out is probably more
realistic. The fix [2] seems to be trying to solve point 2.

However, I'm wondering whether we can bypass a dirty BDM entry whenever
its connection_info is NULL, no matter how it was produced.
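
A minimal sketch of what such a bypass could look like (assumption: the guard and all names here are hypothetical, illustrating the idea rather than an actual nova patch):

import json
import logging

LOG = logging.getLogger(__name__)

def detach_volume(bdm, disconnect):
    # Bypass dirty rows: a NULL connection_info means the attach never
    # completed, so there is nothing on the hypervisor side to disconnect.
    if bdm['connection_info'] is None:
        LOG.warning('Skipping dirty BDM %s: connection_info is NULL',
                    bdm['id'])
        return
    connection_info = json.loads(bdm['connection_info'])
    disconnect(connection_info)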


[1] https://github.com/openstack/nova/blob/master/nova/compute/api.py#L3724
[2] https://review.openstack.org/#/c/290793

** Affects: nova
     Importance: Undecided
         Status: New
