yahoo-eng-team team mailing list archive

[Bug 1460308] [NEW] nova instance network_info is missed after a nova instance hard reboot.

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: TonyWang <tw_nova@xxxxxxx>
Date: Sat, 30 May 2015 14:55:32 -0000
Reply-to: Bug 1460308 <1460308@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx
Public bug reported:

Symptom:
1. A nova instance created by a Heat stack lost all of its vNICs after a hard reboot.
2. The vNICs are lost because the "network_info" data in the database table "instance_info_caches" is empty (the value is "[]").
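
A minimal sketch of how the empty cache from symptom 2 could be detected, assuming the "network_info" column holds a JSON-encoded list (the "[]" value suggests this). The helper name cache_is_empty is hypothetical and not part of nova:

```python
import json


def cache_is_empty(network_info_column):
    """Return True when the serialized network_info holds no VIF entries.

    Assumption: the column value is a JSON-encoded list, or None/empty
    when nothing has been cached yet.
    """
    if not network_info_column:
        return True
    return len(json.loads(network_info_column)) == 0
```

For example, cache_is_empty("[]") is true, while a value holding one serialized port entry is not.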

I looked for the root cause and found a race condition in the nova code:
nova-compute: ComputeManager.build_and_run_instance
--> ComputeManager._allocate_network
    --> ComputeManager._allocate_network_async
        --> network.neutronv2.api.API.allocate_for_instance
            --> send "update_port" request to neutron-server.
                --> neutron-server sends nova notification after
                    "update_port" is done.
                --> nova-api: ServerExternalEventsController.create
                    --> objects.Instance.get_by_uuid
                        --> objects.Instance._from_db_object
                            --> objects.InstanceInfoCache._from_db_object
                                --> read "network_info" from database.
                                   (task2)

                --> nova-compute: ComputeManager.external_instance_event
                    --> network.neutronv2.api.API.get_instance_nw_info
                        --> save InstanceInfoCache.network_info into
                            database. (task3)

   --> nova-compute: network.neutronv2.api.API.get_instance_nw_info
       --> network.neutronv2.api.API._get_instance_nw_info
       --> update_instance_cache_with_nw_info
           --> save InstanceInfoCache.network_info into database. (task1)

"task1" races against the "task2" + "task3" sequence.

If "task2" runs before "task1", nova-api reads an empty network_info
from the database and notifies nova-compute to save that empty
network_info into the database.

If "task3" then runs after "task1", nova-compute overwrites the correct
value with the empty network_info.

The order that triggers the race is:
- task2
- task1
- task3
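
The ordering above can be replayed with a toy model. This is only a sketch under stated assumptions: a dict stands in for the "instance_info_caches" table, REAL_NW_INFO is a made-up VIF entry, and the three branches model the tasks as this report describes them, not the actual nova code paths.

```python
# Hypothetical VIF entry; real network_info entries carry many more fields.
REAL_NW_INFO = [{"id": "port-1"}]


def replay(order):
    """Apply task1/task2/task3 in the given order and return the final cache."""
    db = {"network_info": []}  # the cache starts out empty
    snapshot = None            # what nova-api read in task2

    for step in order:
        if step == "task2":
            # nova-api reads network_info while loading the instance.
            snapshot = list(db["network_info"])
        elif step == "task1":
            # nova-compute saves the freshly built network_info.
            db["network_info"] = list(REAL_NW_INFO)
        elif step == "task3":
            # nova-compute saves whatever the external event carried.
            if snapshot is None:
                raise ValueError("task3 cannot run before task2")
            db["network_info"] = list(snapshot)
        else:
            raise ValueError("unknown step: %s" % step)
    return db["network_info"]


racy = replay(["task2", "task1", "task3"])  # stale snapshot wins
safe = replay(["task1", "task2", "task3"])  # snapshot already has the port
```

In this model the racy ordering leaves the cache empty, while the ordering in which "task1" finishes first keeps the port entry.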

I have hit this issue many times in my OpenStack environment, which
runs the Juno release.

** Affects: nova
     Importance: Undecided
         Status: New


** Tags: nova

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1460308

Title:
  nova instance network_info is missed after a nova instance hard
  reboot.

Status in OpenStack Compute (Nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1460308/+subscriptions