openstack team mailing list archive

[NOVA] Snapshotting may require significant disk space (in /tmp). How to properly solve disk space issues?

 

Hi Stackers,

So, in diagnosing a few things on TryStack yesterday, I ran into an interesting problem with snapshotting that I'm hoping to get some advice on.

== The Problem ==

TryStack runs the Diablo codebase; however, I believe the code involved in this particular problem is the same in Essex.

A user was attempting to snapshot a tiny instance (512MB/1-core) through the dashboard. The dashboard returned and reported that a snapshot had been created and was in Queued status.

The snapshot never left Queued status, so I logged into the compute node that housed the instance in question to see if I could figure out what was going on.

Grepping through the compute log, I found the following:

(nova.rpc): TRACE: Traceback (most recent call last):
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 628, in _process_data
(nova.rpc): TRACE:     rval = node_func(context=ctxt, **node_args)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 100, in wrapped
(nova.rpc): TRACE:     return f(*args, **kw)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 687, in snapshot_instance
(nova.rpc): TRACE:     self.driver.snapshot(context, instance_ref, image_id)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 100, in wrapped
(nova.rpc): TRACE:     return f(*args, **kw)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/connection.py", line 479, in snapshot
(nova.rpc): TRACE:     utils.execute(*qemu_img_cmd)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/utils.py", line 190, in execute
(nova.rpc): TRACE:     cmd=' '.join(cmd))
(nova.rpc): TRACE: ProcessExecutionError: Unexpected error while running command.
(nova.rpc): TRACE: Command: qemu-img convert -f qcow2 -O raw -s e7ba4fb5f6f04f99b07d1d222ada0219 /opt/openstack/nova/instances/instance-00000548/disk /tmp/tmpIuOQo0/e7ba4fb5f6f04f99b07d1d222ada0219
(nova.rpc): TRACE: Exit code: 1
(nova.rpc): TRACE: Stdout: ''
(nova.rpc): TRACE: Stderr: 'qemu-img: error while writing\n'

QEMU was unhelpfully returning a vague error message of "error while writing".

It turned out, after speaking with a couple of folks on IRC (thx vishy and rmk!), that the snapshot process (the qemu-img convert command above) stores its output (the snapshot) in a temporary directory created with tempfile.mkdtemp() in nova/virt/libvirt/connection.py.

As it turns out, the base operating system we install on our compute nodes in TryStack has a (very) small root partition, only 2GB in size. (We use the devstack build_pxe_env.sh script to create the base Ubuntu image that is netbooted on the compute nodes.)

Looking at the free disk space on the compute node in question, the problem was apparent:

root@freecloud102:/var/log/nova# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/ram0             2.0G  1.4G  535M  73% /
devtmpfs               48G  240K   48G   1% /dev
none                   48G     0   48G   0% /dev/shm
none                   48G  212K   48G   1% /var/run
none                   48G     0   48G   0% /var/lock
/dev/md0              5.4T   93G  5.1T   2% /opt/openstack

There simply isn't enough free space on the root partition (which is where /tmp is housed) for the snapshot to be created.

== Possible Solutions ==

So, there are a number of solutions that we can work on here, and I'm wondering what the preference would be. Here are the solutions I have come up with, along with a no-brainer improvement to Nova that would help in diagnosing this problem:

The no-brainer: Before attempting a snapshot, detect whether there is enough space on the target device to perform the operation and, if not, throw a useful error message up the stack.
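
A minimal sketch of such a pre-flight check, assuming a POSIX host where os.statvfs() is available. The names InsufficientDiskSpace, free_bytes and ensure_free_space are hypothetical and are not existing Nova code; how the required size is estimated (for example, from the disk's virtual size) is left to the caller.

```python
import os


class InsufficientDiskSpace(Exception):
    """Hypothetical exception for this sketch; not an existing Nova class."""


def free_bytes(path):
    """Return the bytes available to unprivileged users on path's filesystem."""
    stats = os.statvfs(path)
    return stats.f_bavail * stats.f_frsize


def ensure_free_space(path, required_bytes):
    """Raise InsufficientDiskSpace if path's filesystem cannot hold required_bytes."""
    available = free_bytes(path)
    if available < required_bytes:
        raise InsufficientDiskSpace(
            "Need %d bytes in %s but only %d are available"
            % (required_bytes, path, available))
    return available
```

Called on the temporary directory just before qemu-img convert runs, a check like this would turn "error while writing" into a message that names the directory and the shortfall.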

Solutions to the disk space problem:

(1) Silly Jay, change the damn size of the root partition in your PXE base OS install!

Now, I'm no expert in creating customized base disk images, but from looking at the build_pxe_env.sh script in devstack [1], it seems pretty trivial to change the ramdisk_size parameter in the startup options to something larger than 2109600. We could do this and reimage the compute nodes one by one.

(2) Make the location in which the snapshot is made configurable.

Right now, as mentioned above, tempfile.mkdtemp() is used, which creates a directory in the user's TMPDIR (typically /tmp, which is usually on the root partition).

We could add an option (--libvirt-snapshot-dir?) that would allow nova-compute to override where that snapshot is built.
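
Roughly, this could look like the sketch below. It relies on the standard dir argument of tempfile.mkdtemp(); the snapshot_dir parameter and the snapshot_workdir helper are hypothetical names standing in for whatever the new option would be called.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def snapshot_workdir(snapshot_dir=None):
    """Yield a scratch directory for building a snapshot, then remove it.

    snapshot_dir is a hypothetical setting. None falls back to the
    default temporary location, which is what a bare mkdtemp() uses.
    """
    temp_dir = tempfile.mkdtemp(dir=snapshot_dir)
    try:
        yield temp_dir
    finally:
        # Clean up the scratch space whether or not the snapshot succeeded.
        shutil.rmtree(temp_dir, ignore_errors=True)
```

Pointing snapshot_dir at a directory on the large /opt/openstack volume would keep the converted image off the root partition without touching TMPDIR for anything else nova-compute does.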

(3) Change the TMPDIR setting for the user running nova-compute to something other than /tmp on the root partition.
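
To illustrate why option (3) works without code changes: Python's tempfile module consults the TMPDIR environment variable when it first picks a default directory, and mkdtemp() then uses that default. The helper below is a hypothetical demonstration only; in practice TMPDIR would be set in the environment that starts nova-compute, and the directory must exist and be writable or tempfile falls back to its other candidates.

```python
import os
import tempfile


def default_tmpdir_with(tmpdir):
    """Return the directory tempfile would pick if TMPDIR were set to tmpdir."""
    saved_env = os.environ.get("TMPDIR")
    saved_cache = tempfile.tempdir
    os.environ["TMPDIR"] = tmpdir
    tempfile.tempdir = None  # drop the cached default so TMPDIR is re-read
    try:
        return tempfile.gettempdir()
    finally:
        # Restore the previous environment and cached default.
        if saved_env is None:
            del os.environ["TMPDIR"]
        else:
            os.environ["TMPDIR"] = saved_env
        tempfile.tempdir = saved_cache
```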

Thoughts?
-jay

[1] https://github.com/openstack-dev/devstack/blob/stable/diablo/tools/build_pxe_env.sh
