yahoo-eng-team team mailing list archive
Message #84706
[Bug 1906798] [NEW] image cache manager removes used backing files on NFS shared storage
Public bug reported:
Description
===========
After site electrical maintenance (power off for two days), most of the instances using ephemeral storage fail to start with "Error : Image <id> could not be found."
The backing files for these instances are missing from the "/var/lib/nova/instances/_base" folder.
We are using ephemeral storage shared on NFS. Glance images are rebuilt
every day, so most instances do not share a common image.
Cause: after power-on, the first compute to come up runs _run_image_cache_manager_pass.
First (storage_users.register_storage_use), the compute registers itself in the /var/lib/nova/instances/compute_nodes file.
Then (storage_users.get_storage_users), it reads the compute_nodes file. Since it is the first host to register in the file in more than 24 hours, it concludes that it is the only one using the storage and removes all backing files not attached to instances it runs.
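The sequence above can be illustrated with a minimal, self-contained sketch. It is a simplified model of the behaviour described in this report, not nova's actual storage_users code: the two function names mirror those cited above, but the `now` parameter, the `demo` helper and the host names are assumptions made purely for illustration.

```python
import json
import os
import tempfile
import time

TWENTY_FOUR_HOURS = 24 * 3600  # staleness window described in this report


def register_storage_use(storage_path, hostname, now=None):
    """Record `hostname` with a timestamp in the shared compute_nodes file."""
    now = time.time() if now is None else now
    id_path = os.path.join(storage_path, "compute_nodes")
    registry = {}
    if os.path.exists(id_path):
        with open(id_path) as f:
            registry = json.load(f)
    registry[hostname] = now
    with open(id_path, "w") as f:
        json.dump(registry, f)


def get_storage_users(storage_path, now=None):
    """Return the hosts whose registration is less than 24 hours old."""
    now = time.time() if now is None else now
    id_path = os.path.join(storage_path, "compute_nodes")
    if not os.path.exists(id_path):
        return []
    with open(id_path) as f:
        registry = json.load(f)
    return sorted(h for h, ts in registry.items() if now - ts < TWENTY_FOUR_HOURS)


def demo():
    """Three hosts register, the site is off for two days, one host returns."""
    with tempfile.TemporaryDirectory() as storage:
        before_outage = 1_000_000.0  # arbitrary illustrative timestamp
        for host in ("compute1", "compute2", "compute3"):
            register_storage_use(storage, host, now=before_outage)
        after_outage = before_outage + 2 * TWENTY_FOUR_HOURS
        # Only the first host to boot has refreshed its entry so far.
        register_storage_use(storage, "compute1", now=after_outage)
        return get_storage_users(storage, now=after_outage)
```

In this model, `demo()` reports only compute1 as a storage user even though compute2 and compute3 still have instances on the share; their entries are simply older than 24 hours.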
I think we should wait for at least "image_cache_manager_interval" to
allow time for the other hosts to register in compute_nodes before
actually removing base files.
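A rough sketch of that suggestion, assuming the cache manager knows when its own service started. The function name, its parameters and the set-based bookkeeping are hypothetical and are not taken from nova; the 2400-second default corresponds to the 40-minute image_cache_manager_interval mentioned in this report.

```python
def base_files_to_remove(all_base_files, files_in_use, service_started_at,
                         now, image_cache_manager_interval=2400):
    """Return the base files that may be deleted on this pass.

    Hypothetical guard: during the first interval after service start,
    other hosts may not have re-registered in compute_nodes yet, so
    nothing is removed.
    """
    if now - service_started_at < image_cache_manager_interval:
        return set()
    # After the grace period, unused files are candidates as before.
    return set(all_base_files) - set(files_in_use)
```

Whether a single interval is long enough depends on how quickly the other computes restart and run their own pass, so the length of the grace period is a design choice rather than a given.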
Steps to reproduce
==================
Use shared NFS ephemeral storage on all computes.
1. create an instance on each compute, each from a different glance image
2. remove all these images from glance
3. stop all instances
4. stop all nova_compute services
5. wait for 24 hours
(alternatively, echo '{}' > /var/lib/nova/instances/compute_nodes && touch -d "$(date '+%Y-%m-%d %H:%M:%S' -d '1 day ago')" /var/lib/nova/instances/_base/* )
6. start all nova_compute services
7. wait for the image cache manager to trigger (~ image_cache_manager_interval, default 40 minutes)
8. start all instances
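For test environments where a script is more convenient, the shortcut given as an alternative to the 24-hour wait can be sketched in Python. This is an assumed equivalent of the shell command in the steps above, not something tested in the original report; the function name and the `now` parameter are hypothetical.

```python
import os
import time

ONE_DAY = 24 * 3600


def simulate_day_long_outage(instances_path, now=None):
    """Empty the compute_nodes registry and backdate base files by one day."""
    now = time.time() if now is None else now
    # Equivalent of: echo '{}' > <instances_path>/compute_nodes
    with open(os.path.join(instances_path, "compute_nodes"), "w") as f:
        f.write("{}\n")
    base_dir = os.path.join(instances_path, "_base")
    backdated = now - ONE_DAY
    touched = []
    for name in sorted(os.listdir(base_dir)):
        if name.startswith("."):
            continue  # the shell glob "*" also skips hidden files
        path = os.path.join(base_dir, name)
        # Equivalent of: touch -d '1 day ago' (sets both atime and mtime)
        os.utime(path, (backdated, backdated))
        touched.append(name)
    return touched
```

It would be called as `simulate_day_long_outage("/var/lib/nova/instances")` on a host that mounts the shared storage, with all nova_compute services stopped.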
Expected result
===============
All instances start
Actual result
=============
All the instances fail to start, except on one compute
Environment
===========
Tested with nova versions 10.1.0 and 13.0.2
on libvirt KVM and shared NFS NetApp storage
** Affects: nova
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1906798
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1906798/+subscriptions