[Bug 1493760] Re: rbd backend reports wrong 'local_gb_used' for compute node
If you look at https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L4628
you can see the three different functions used for getting available disk space:
- 'get_volume_group_info'
- 'get_pool_info'
- 'get_fs_info'
All of these methods return the ACTUAL disk space used, rather than
the theoretical maximum of all the instance sizes. This is because
locally stored disks are kept as sparse qcow images, and LVM disks are
sparse volumes.
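As a quick illustration (a self-contained sketch, not nova code), a
sparse file's apparent size and its actually allocated size can differ
by orders of magnitude; a sparse qcow image behaves the same way:

    import os

    # Create a 1 GiB sparse file: the apparent size is 1 GiB, but almost
    # no blocks are actually allocated until data is written into it.
    path = '/tmp/sparse-demo.img'
    with open(path, 'wb') as f:
        f.truncate(1024 ** 3)

    st = os.stat(path)
    print('apparent size :', st.st_size)          # 1073741824 bytes
    print('allocated size:', st.st_blocks * 512)  # close to 0 bytes

    os.unlink(path)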
I believe that the intention of 'local_gb_used' is to report the actual
disk space used.
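For the file-backed case this is exactly what a statvfs-based helper
reports; a minimal sketch in the same spirit as 'get_fs_info' (not the
actual nova code, and the path is just an example) shows why the number
reflects real, not provisioned, usage:

    import os

    def get_fs_info(path):
        # statvfs reports real block usage, so sparse files only count
        # their allocated blocks, not their apparent size.
        st = os.statvfs(path)
        total = st.f_frsize * st.f_blocks
        free = st.f_frsize * st.f_bavail
        return {'total': total, 'free': free, 'used': total - free}

    print(get_fs_info('/'))  # e.g. /var/lib/nova/instances on a compute node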
** Changed in: nova
Status: New => Invalid
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1493760
Title:
rbd backend reports wrong 'local_gb_used' for compute node
Status in OpenStack Compute (nova):
Invalid
Bug description:
When an instance's disk is on the rbd backend, the compute node reports
the whole Ceph cluster status, which makes sense. We get the local_gb
usage in
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/storage/rbd_utils.py#L313
def get_pool_info(self):
    with RADOSClient(self) as client:
        stats = client.cluster.get_cluster_stats()
        return {'total': stats['kb'] * units.Ki,
                'free': stats['kb_avail'] * units.Ki,
                'used': stats['kb_used'] * units.Ki}
This reports the same disk usage as the command 'ceph -s', for example:
[root@node-1 ~]# ceph -s
    cluster e598930a-0807-491b-b191-d57244d3c8e2
     health HEALTH_OK
     monmap e1: 1 mons at {node-1=192.168.0.1:6789/0}, election epoch 1, quorum 0 node-1
     osdmap e28: 2 osds: 2 up, 2 in
      pgmap v3985: 576 pgs, 5 pools, 295 MB data, 57 objects
            21149 MB used, 76479 MB / 97628 MB avail
                 576 active+clean
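These numbers are the cluster-wide statistics; a small check with the
python rados bindings (the conffile path is an assumption for the
example) prints the same totals that get_pool_info() converts with
units.Ki:

    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        stats = cluster.get_cluster_stats()
        # Same fields used by get_pool_info(), shown here in MB.
        print('total MB:', stats['kb'] // 1024)
        print('used  MB:', stats['kb_used'] // 1024)
        print('avail MB:', stats['kb_avail'] // 1024)
    finally:
        cluster.shutdown()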
[root@node-1 ~]# rbd -p compute ls
45892200-97cb-4fa4-9c29-35a0ff9e16f6_disk
8c6a5555-394f-4c24-b7ff-e05fdf322155_disk
944d9028-ac59-45fd-9be3-69066c8bc4e5
9ea375dc-f0b8-472e-ba53-4d83e5721771_disk
9fce4606-6871-40ca-bf8f-6146c05068e6_disk
cedce585-8747-4798-885f-0c47337f0f6f_disk
e17c9391-2032-4144-8fa1-85b092239e66_disk
e19143c7-228c-4f89-9735-c27c333adce4_disk
f9caf4a7-2b62-46c2-b2e1-f99cb4ce3f57_disk
[root@node-1 ~]# rbd -p compute info 45892200-97cb-4fa4-9c29-35a0ff9e16f6_disk
rbd image '45892200-97cb-4fa4-9c29-35a0ff9e16f6_disk':
    size 20480 MB in 2560 objects
    order 23 (8192 kB objects)
    block_name_prefix: rbd_data.39ab250fe98b
    format: 2
    features: layering
    parent: compute/944d9028-ac59-45fd-9be3-69066c8bc4e5@snap
    overlap: 40162 kB
In the above example, we have two compute nodes and can create 4
instances with a 20G disk on each compute node. The interesting thing
is that the total local_gb is 95G, yet 160G is allocated to instances.
The root cause is that client.cluster.get_cluster_stats() returns the actual used size, which means a 20G instance disk may occupy only about 200M. This is dangerous when instances use all of their disk.
An alternative solution is to calculate all instances' disk sizes in
some way and report that as local_gb_used.
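A rough sketch of that idea with the python rbd bindings (the pool name
and ceph.conf path are assumptions, and this is not nova code) would
sum the provisioned size of every image in the pool:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('compute')
    try:
        total_provisioned = 0
        for name in rbd.RBD().list(ioctx):
            image = rbd.Image(ioctx, name)
            try:
                # size() is the provisioned (virtual) size, not the space
                # the thin-provisioned image actually consumes.
                total_provisioned += image.size()
            finally:
                image.close()
        print('provisioned GB:', total_provisioned // (1024 ** 3))
    finally:
        ioctx.close()
        cluster.shutdown()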
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1493760/+subscriptions