openstack team mailing list archive
Message #08764
[NOVA] Snapshotting may require significant disk space (in /tmp). How to properly solve disk space issues?
Hi Stackers,
So, in diagnosing a few things on TryStack yesterday, I ran into an
interesting problem with snapshotting that I'm hoping to get some advice on.
== The Problem ==
TryStack runs the Diablo codebase, but I believe the code involved in
this particular problem is the same in Essex.
A user was attempting to snapshot a tiny instance (512MB/1-core) through
the dashboard. The dashboard returned and reported that a snapshot had
been created and was in Queued status. The snapshot never left Queued
status, so I logged into the compute node hosting the instance to see if
I could figure out what was going on.
Grepping through the compute log, I found the following:
(nova.rpc): TRACE: Traceback (most recent call last):
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 628, in _process_data
(nova.rpc): TRACE: rval = node_func(context=ctxt, **node_args)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 100, in wrapped
(nova.rpc): TRACE: return f(*args, **kw)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 687, in snapshot_instance
(nova.rpc): TRACE: self.driver.snapshot(context, instance_ref, image_id)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/exception.py", line 100, in wrapped
(nova.rpc): TRACE: return f(*args, **kw)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/connection.py", line 479, in snapshot
(nova.rpc): TRACE: utils.execute(*qemu_img_cmd)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/utils.py", line 190, in execute
(nova.rpc): TRACE: cmd=' '.join(cmd))
(nova.rpc): TRACE: ProcessExecutionError: Unexpected error while running command.
(nova.rpc): TRACE: Command: qemu-img convert -f qcow2 -O raw -s e7ba4fb5f6f04f99b07d1d222ada0219 /opt/openstack/nova/instances/instance-00000548/disk /tmp/tmpIuOQo0/e7ba4fb5f6f04f99b07d1d222ada0219
(nova.rpc): TRACE: Exit code: 1
(nova.rpc): TRACE: Stdout: ''
(nova.rpc): TRACE: Stderr: 'qemu-img: error while writing\n'
QEMU was unhelpfully returning a vague error message of "error while
writing".
It turned out, after speaking with a couple folks on IRC (thx vishy and
rmk!) that the snapshot process (qemu-img convert ... above) is storing
the output of the process (the snapshot) in a temporary directory
created using tempfile.mkdtemp() in the nova/virt/libvirt/connection.py
file.
As it turns out, the base operating system we install on our compute
nodes in TryStack has a (very) small root partition -- only 2GB in size
(we use the devstack build_pxe_env.sh script to create the base Ubuntu
image that is netbooted on the compute nodes).
Looking at the free disk space on the compute node in question, the
problem was apparent:
root@freecloud102:/var/log/nova# df -h
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/ram0   2.0G  1.4G   535M   73%  /
devtmpfs     48G  240K    48G    1%  /dev
none         48G     0    48G    0%  /dev/shm
none         48G  212K    48G    1%  /var/run
none         48G     0    48G    0%  /var/lock
/dev/md0    5.4T   93G   5.1T    2%  /opt/openstack
There simply isn't enough free space on the root partition (which is
where /tmp is housed) for the snapshot to be created.
== Possible Solutions ==
So, there are a number of solutions that we can work on here, and I'm
wondering what the preference would be. Here are the solutions I have
come up with, along with a no-brainer improvement to Nova that would
help in diagnosing this problem:
The no-brainer: Before attempting a snapshot, detect whether there is
enough space on the target device to perform the operation, and if not,
throw a useful error message up the stack.
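To make the no-brainer concrete, here is a minimal sketch of such a pre-flight check. It is written for current Python 3 rather than the Python 2.7 that Diablo runs on, and it is not existing Nova code: the function name ensure_free_space, the InsufficientDiskSpace exception, and the idea that the caller supplies an estimate of the required bytes are all assumptions made for illustration.

```python
import shutil


class InsufficientDiskSpace(Exception):
    """Hypothetical error type for this sketch; not part of Nova."""


def ensure_free_space(directory, required_bytes):
    """Return the free bytes on the filesystem holding `directory`.

    Raises InsufficientDiskSpace with a descriptive message when fewer
    than `required_bytes` are available.
    """
    free = shutil.disk_usage(directory).free
    if free < required_bytes:
        raise InsufficientDiskSpace(
            "Need %d bytes in %s for the snapshot, but only %d are free"
            % (required_bytes, directory, free)
        )
    return free
```

Called just before the qemu-img convert step with the snapshot's scratch directory, a check like this would have turned "error while writing" into a message naming the directory and the shortfall. How to estimate required_bytes (for example, from the instance's virtual disk size) is left open here.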
Solutions to the disk space problem:
(1) Silly Jay, change the damn size of the root partition in your PXE
base OS install!
Now, I'm no expert in creating customized base disk images, but from
looking at the build_pxe_env.sh script in devstack [1], it seems pretty
trivial to change the ramdisk_size parameter in the startup options to
something larger than 2109600. We could do this and reimage the compute
nodes one by one.
(2) Make the location in which the snapshot is made configurable.
Right now, as mentioned above, tempfile.mkdtemp() is used, which creates
a directory in the user's TMPDIR (typically /tmp, which is usually on
the root partition).
We could add an option (--libvirt-snapshot-dir?) that would allow
nova-compute to override where that snapshot is built.
(3) Change the TMPDIR setting for the user running nova-compute to
something other than /tmp on the root partition.
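Options (2) and (3) both come down to where tempfile.mkdtemp() creates its directory, which the following sketch tries to illustrate in current Python 3. The function names and the snapshot_dir parameter (standing in for a hypothetical config option) are assumptions for illustration and do not exist in Nova.

```python
import os
import tempfile


def make_snapshot_dir(snapshot_dir=None):
    """Option (2): create the scratch directory under a configured path.

    `snapshot_dir` stands in for a hypothetical configuration option.
    When it is None, tempfile uses its default location (TMPDIR if set,
    typically /tmp otherwise).
    """
    if snapshot_dir is not None:
        os.makedirs(snapshot_dir, exist_ok=True)
    return tempfile.mkdtemp(dir=snapshot_dir)


def temp_root_with_tmpdir(path):
    """Option (3): report the directory tempfile picks when TMPDIR=path."""
    old_env = os.environ.get("TMPDIR")
    old_cached = tempfile.tempdir
    os.environ["TMPDIR"] = path
    tempfile.tempdir = None  # drop the cached value so TMPDIR is re-read
    try:
        return tempfile.gettempdir()
    finally:
        # Restore the previous environment and cached default.
        if old_env is None:
            del os.environ["TMPDIR"]
        else:
            os.environ["TMPDIR"] = old_env
        tempfile.tempdir = old_cached
```

With option (2) only the snapshot code path changes, while option (3) moves every temporary file that the nova-compute process creates through tempfile. Note that tempfile only honors TMPDIR when the directory exists and is writable; otherwise it quietly falls back to other candidates, and the variable has to be present in the service's environment when it starts.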
Thoughts?
-jay
[1] https://github.com/openstack-dev/devstack/blob/stable/diablo/tools/build_pxe_env.sh