yahoo-eng-team team mailing list archive

[Bug 1906798] [NEW] image cache manager removes used backing files on NFS shared storage

 

Public bug reported:

Description
===========
After site electrical maintenance (power off for two days), most instances using ephemeral storage fail to start with "Error : Image <id> could not be found."
The backing files for these instances are missing from the "/var/lib/nova/instances/_base" folder.

We are using ephemeral storage shared on NFS. Glance images are rebuilt
every day, so most instances do not share a common image.

Cause: after power-on, the first compute to come up runs _run_image_cache_manager_pass.
First (storage_users.register_storage_use), the compute registers itself in the /var/lib/nova/instances/compute_nodes file.
Then (storage_users.get_storage_users), it reads the compute_nodes file. Since it is the first host to register in the file in more than 24 hours, it concludes that it is the only one using the storage and removes all backing files not attached to instances it runs.
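
The sequence above can be illustrated with a small model. This is a simplified sketch, not nova's code: the helper names (register_host, recent_hosts, removable_base_files), the host names and the base file names are hypothetical, and the JSON layout of the registry file is an assumption. Only the 24-hour window and the "first host to register" behaviour are taken from the report.

```python
import json
import os
import tempfile
import time

TWENTY_FOUR_HOURS = 24 * 3600  # staleness window described in the report


def register_host(path, host, now):
    """Record `host` with timestamp `now` in the shared registry (hypothetical helper)."""
    try:
        with open(path) as f:
            users = json.load(f)
    except FileNotFoundError:
        users = {}
    users[host] = now
    with open(path, "w") as f:
        json.dump(users, f)


def recent_hosts(path, now):
    """Return the hosts that registered within the last 24 hours."""
    with open(path) as f:
        users = json.load(f)
    return sorted(h for h, ts in users.items() if now - ts < TWENTY_FOUR_HOURS)


def removable_base_files(base_files, in_use_by_host, active_hosts):
    """Base files not referenced by any host believed to be active."""
    used = set()
    for host in active_hosts:
        used |= in_use_by_host.get(host, set())
    return sorted(set(base_files) - used)


def demo():
    now = time.time()
    two_days_ago = now - 2 * 24 * 3600
    with tempfile.TemporaryDirectory() as d:
        registry = os.path.join(d, "compute_nodes")
        # Before the outage, both hosts had registered.
        register_host(registry, "compute1", two_days_ago)
        register_host(registry, "compute2", two_days_ago)
        # After power-on, compute1 is the first to run its cache pass.
        register_host(registry, "compute1", now)
        active = recent_hosts(registry, now)
    in_use = {"compute1": {"base_a"}, "compute2": {"base_b"}}
    return active, removable_base_files(["base_a", "base_b"], in_use, active)
```

In this model, demo() reports compute1 as the only active host, so base_b (still needed by compute2) is considered removable, which matches the behaviour observed after the power-off.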

I think we should wait for at least "image_cache_manager_interval" to
allow time for the other hosts to register in compute_nodes before
actually removing base files.
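
One way the proposed delay might look is sketched below. This is only an illustration of the idea, not a patch against nova: the RemovalGuard class and its method are hypothetical, and the 2400-second grace period simply mirrors the 40-minute default interval mentioned in this report.

```python
import time


class RemovalGuard:
    """Block base file removal until a grace period has elapsed since start-up.

    Hypothetical helper: `grace_period` is in seconds and would be set to
    at least image_cache_manager_interval, as suggested above.
    """

    def __init__(self, grace_period, clock=time.monotonic):
        self._grace = grace_period
        self._clock = clock
        self._started = clock()  # moment this host started (and registered)

    def removal_allowed(self):
        """True once other hosts have had `grace_period` seconds to register."""
        return self._clock() - self._started >= self._grace


# Usage: create the guard at service start, then consult it in each cache pass.
guard = RemovalGuard(grace_period=2400)
if not guard.removal_allowed():
    pass  # skip removal of unused base files on this pass
```

The clock is injectable so that the waiting logic can be exercised without sleeping. A cache pass that runs before the grace period ends would still age and track files, but would defer deletion.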

Steps to reproduce
==================
Use shared NFS ephemeral storage on all computes.
1. create an instance on each compute, each time from a different glance image
2. remove all these images from glance
3. stop all instances
4. stop all nova_compute services
5. wait for 24 hours
   (alternatively, echo '{}' > /var/lib/nova/instances/compute_nodes && touch -d "$(date '+%Y-%m-%d %H:%M:%S' -d '1 day ago')"  /var/lib/nova/instances/_base/* )
6. start all nova_compute services
7. wait for the image cache manager to trigger (~ image_cache_manager_interval, default 40 minutes)
8. start all instances

Expected result
===============
All instances start

Actual result
=============
All instances fail to start, except those on one compute

Environment
===========
Tested with nova versions 10.1.0 and 13.0.2
on libvirt KVM and shared NFS NetApp storage

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1906798
