
yahoo-eng-team team mailing list archive

[Bug 1493760] [NEW] rbd backend reports wrong 'local_gb_used' for compute node

 

Public bug reported:

When an instance's disk is stored in the rbd backend, the compute node reports the whole Ceph cluster's capacity, which makes sense. The local_gb usage is obtained in
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/storage/rbd_utils.py#L313

    def get_pool_info(self):
        with RADOSClient(self) as client:
            stats = client.cluster.get_cluster_stats()
            return {'total': stats['kb'] * units.Ki,
                    'free': stats['kb_avail'] * units.Ki,
                    'used': stats['kb_used'] * units.Ki}

This reports the same disk usage as the command 'ceph -s', for example:
[root@node-1 ~]# ceph -s
    cluster e598930a-0807-491b-b191-d57244d3c8e2
     health HEALTH_OK
     monmap e1: 1 mons at {node-1=192.168.0.1:6789/0}, election epoch 1, quorum 0 node-1
     osdmap e28: 2 osds: 2 up, 2 in
      pgmap v3985: 576 pgs, 5 pools, 295 MB data, 57 objects
            21149 MB used, 76479 MB / 97628 MB avail
                 576 active+clean

[root@node-1 ~]#  rbd -p compute ls
45892200-97cb-4fa4-9c29-35a0ff9e16f6_disk
8c6a5555-394f-4c24-b7ff-e05fdf322155_disk
944d9028-ac59-45fd-9be3-69066c8bc4e5
9ea375dc-f0b8-472e-ba53-4d83e5721771_disk
9fce4606-6871-40ca-bf8f-6146c05068e6_disk
cedce585-8747-4798-885f-0c47337f0f6f_disk
e17c9391-2032-4144-8fa1-85b092239e66_disk
e19143c7-228c-4f89-9735-c27c333adce4_disk
f9caf4a7-2b62-46c2-b2e1-f99cb4ce3f57_disk
[root@node-1 ~]#  rbd -p compute info 45892200-97cb-4fa4-9c29-35a0ff9e16f6_disk
rbd image '45892200-97cb-4fa4-9c29-35a0ff9e16f6_disk':
	size 20480 MB in 2560 objects
	order 23 (8192 kB objects)
	block_name_prefix: rbd_data.39ab250fe98b
	format: 2
	features: layering
	parent: compute/944d9028-ac59-45fd-9be3-69066c8bc4e5@snap
	overlap: 40162 kB

In the above example we have two compute nodes, and can create 4 instances with a 20G disk on each of them. The interesting thing is that the total local_gb is only 95G, yet 2 nodes x 4 instances x 20G = 160G is allocated for instances.


The root cause is that client.cluster.get_cluster_stats() returns the actual used size: since rbd images are thin-provisioned, a 20G instance disk may only occupy around 200M. This is dangerous once instances fill up their disks.

An alternative solution is to calculate all instances' provisioned disk sizes in some
way and report the sum as local_gb_used.
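
As a rough illustration of that alternative, the sketch below sums the provisioned
(virtual) size of every image in the pool instead of relying on get_cluster_stats().
The helper name get_provisioned_usage is made up for this example, and it assumes the
RADOSClient helper shown above also exposes the pool's I/O context as client.ioctx;
it otherwise only uses the python-rbd bindings.

    import rbd
    from oslo_utils import units

    def get_provisioned_usage(self):
        # Hypothetical helper: sum the virtual (provisioned) size of every
        # rbd image in the pool, i.e. what instances could grow into,
        # rather than the bytes actually written so far.
        total_bytes = 0
        with RADOSClient(self) as client:  # assumes client.ioctx is available
            for name in rbd.RBD().list(client.ioctx):
                image = rbd.Image(client.ioctx, name, read_only=True)
                try:
                    total_bytes += image.size()  # provisioned size in bytes
                finally:
                    image.close()
        return total_bytes // units.Gi  # report in GB, like local_gb_used

With something like this, local_gb_used would grow by the full 20G per instance in the
example above instead of by the few hundred MB actually written. Note it would still
over-count space shared by copy-on-write clones, so the exact accounting is open for
discussion.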

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1493760

Title:
  rbd backend reports wrong 'local_gb_used' for compute node

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1493760/+subscriptions

