yahoo-eng-team team mailing list archive
Message #33274
[Bug 1460308] [NEW] nova instance network_info is missed after a nova instance hard reboot.
Public bug reported:
Symptoms:
1. A nova instance lost all of its vNICs after a hard reboot. The instance was created by a Heat stack.
2. The vNICs are lost because the "network_info" data in the database table "instance_info_caches" is empty (the value is "[]").
I tried to find the root cause and found a race condition in the nova code:
nova-compute: ComputeManager.build_and_run_instance
  --> ComputeManager._allocate_network
    --> ComputeManager._allocate_network_async
      --> network.neutronv2.api.API.allocate_for_instance
        --> send "update_port" request to neutron-server.
          --> neutron-server sends nova notification after "update_port" is done.
            --> nova-api: ServerExternalEventsController.create
              --> objects.Instance.get_by_uuid
                --> objects.Instance._from_db_object
                  --> objects.InstanceInfoCache._from_db_object
                    --> read "network_info" from database. (task2)
              --> nova-compute: ComputeManager.external_instance_event
                --> network.neutronv2.api.API.get_instance_nw_info
                  --> save InstanceInfoCache.network_info into database. (task3)
        --> nova-compute: network.neutronv2.api.API.get_instance_nw_info
          --> network.neutronv2.api.API._get_instance_nw_info
            --> update_instance_cache_with_nw_info
              --> save InstanceInfoCache.network_info into database. (task1)
"task1" races with the pair "task2" + "task3".
If "task2" runs before "task1", nova-api reads an empty network_info from the database and notifies nova-compute to save that empty network_info into the database.
If "task3" then runs after "task1", nova-compute overwrites the correct value with the empty network_info.
The order that triggers the bug is:
- task2
- task1
- task3
I have hit this issue many times in my OpenStack environment, which runs the Juno release.
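The ordering described above can be replayed with a small, deterministic model. This is only a sketch of the report's description, not nova code: the dictionary standing in for the "instance_info_caches" row, the "event" copy passed from nova-api to nova-compute, the run() helper and the sample port data are all hypothetical names chosen for illustration.

```python
# Simplified model of the reported race. None of these names are real nova APIs.
CORRECT_NW_INFO = [{"id": "port-1"}]  # hypothetical data for one allocated port


def run(order):
    """Run the three tasks in the given order and return the final cached value."""
    db = {"network_info": []}  # stands in for the instance_info_caches row
    event = {}                 # copy that nova-api hands to nova-compute

    def task1():
        # nova-compute saves the freshly built network_info.
        db["network_info"] = list(CORRECT_NW_INFO)

    def task2():
        # nova-api reads whatever the database holds at this moment.
        event["network_info"] = list(db["network_info"])

    def task3():
        # nova-compute saves the copy that came with the external event.
        db["network_info"] = list(event["network_info"])

    tasks = {"task1": task1, "task2": task2, "task3": task3}
    for name in order:
        tasks[name]()
    return db["network_info"]


print(run(["task2", "task1", "task3"]))  # order from the report
print(run(["task1", "task2", "task3"]))  # task1 completes before task2 reads
```

With the order from the report, the model ends with an empty list, because the stale copy read by task2 is written back by task3 after task1 has stored the correct value. When task1 completes before task2 reads, the correct value survives. In this model task3 must always follow task2, since it depends on the copy that task2 produces.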
** Affects: nova
Importance: Undecided
Status: New
** Tags: nova
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1460308
Title:
nova instance network_info is missed after a nova instance hard
reboot.
Status in OpenStack Compute (Nova):
New
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1460308/+subscriptions