yahoo-eng-team team mailing list archive
Message #83029
[Bug 1832814] Re: Placement API appears to have issues when compute host replaced
[Expired for OpenStack Compute (nova) because there has been no activity
for 60 days.]
** Changed in: nova
Status: Incomplete => Expired
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1832814
Title:
Placement API appears to have issues when compute host replaced
Status in OpenStack Compute (nova):
Expired
Bug description:
We have been upgrading our sites from RDO to OSA. This process
involved live migrating all VMs from a compute host before
reinstalling it with OSA playbooks.
Note: the compute host is not "removed" from OpenStack in any way; the
new OSA node is the *same* hardware with the same hostname, etc. It is
just reinstalled as OSA.
This appears to have consequences for the way the placement API works:
we have noticed that when live migrating, the scheduler will often
choose a heavily loaded node even when an empty node exists. For
example, in the output below from my live migration script, the VM is
being migrated from cc-compute04-kna1; the scheduler has chosen
cc-compute01-kna1 as the target despite the load it currently has, and
despite compute09, 15 and 18 all being empty:
Migration Destination: cc-compute01-kna1
Migration ID: 12993
+-------------------+----------------------------------+-----+-----------+---------+
| Host | Project | CPU | Memory MB | Disk GB |
+-------------------+----------------------------------+-----+-----------+---------+
| cc-compute04-kna1 | (used_now) | 124 | 254976 | 2790 |
| cc-compute01-kna1 | (used_now) | 230 | 466432 | 8210 |
+-------------------+----------------------------------+-----+-----------+---------+
| cc-compute03-kna1 | (used_now) | 174 | 327680 | 4740 |
| cc-compute05-kna1 | (used_now) | 198 | 457728 | 4430 |
| cc-compute06-kna1 | (used_now) | 163 | 366592 | 4650 |
| cc-compute07-kna1 | (used_now) | 170 | 415744 | 4460 |
| cc-compute08-kna1 | (used_now) | 178 | 382464 | 4750 |
| cc-compute09-kna1 | (used_now) | 0 | 2048 | 0 |
| cc-compute11-kna1 | (used_now) | 131 | 313856 | 3100 |
| cc-compute12-kna1 | (used_now) | 176 | 392704 | 4800 |
| cc-compute13-kna1 | (used_now) | 173 | 390656 | 5470 |
| cc-compute14-kna1 | (used_now) | 2 | 4096 | 50 |
| cc-compute15-kna1 | (used_now) | 0 | 2048 | 0 |
| cc-compute16-kna1 | (used_now) | 170 | 355840 | 5410 |
| cc-compute17-kna1 | (used_now) | 281 | 646656 | 5370 |
| cc-compute18-kna1 | (used_now) | 0 | 2048 | 0 |
| cc-compute19-kna1 | (used_now) | 207 | 517120 | 4860 |
| cc-compute20-kna1 | (used_now) | 223 | 560640 | 5150 |
| cc-compute23-kna1 | (used_now) | 184 | 406528 | 6350 |
| cc-compute24-kna1 | (used_now) | 190 | 585216 | 4820 |
| cc-compute25-kna1 | (used_now) | 235 | 491520 | 5500 |
| cc-compute26-kna1 | (used_now) | 283 | 610304 | 9390 |
| cc-compute27-kna1 | (used_now) | 200 | 573440 | 6730 |
| cc-compute28-kna1 | (used_now) | 269 | 587264 | 6600 |
| cc-compute29-kna1 | (used_now) | 245 | 494080 | 8480 |
+-------------------+----------------------------------+-----+-----------+---------+
This is not an isolated case; we have seen it frequently enough that
we now override the scheduler and use targeted migrations to achieve
better load balancing.
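As an illustration of the expected behaviour, even a trivial least-loaded choice over the table above would never land on cc-compute01-kna1. This is a minimal sketch under that assumption, not the actual nova weigher logic; the (host, used VCPUs) pairs are taken from the table above:

```python
def least_loaded(hosts):
    """Pick the host with the fewest used VCPUs -- roughly what a
    least-loaded weighing policy would be expected to prefer."""
    return min(hosts, key=lambda h: h[1])

# (host, used VCPUs) pairs from the table above
hosts = [
    ("cc-compute01-kna1", 230),
    ("cc-compute09-kna1", 0),
    ("cc-compute15-kna1", 0),
    ("cc-compute18-kna1", 0),
]
# min() returns the first minimal entry, i.e. an empty host,
# never the heavily loaded cc-compute01-kna1
chosen = least_loaded(hosts)
```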
Interrogating the placement API for a compute host (09) prior to
reinstallation, I can find its UUID:
{
"generation": 480003,
"links": [
{
"href": "/resource_providers/d6aeeeb0-0cab-4e3f-a070-9808801b94a5",
"rel": "self"
},
{
"href": "/resource_providers/d6aeeeb0-0cab-4e3f-a070-9808801b94a5/inventories",
"rel": "inventories"
},
{
"href": "/resource_providers/d6aeeeb0-0cab-4e3f-a070-9808801b94a5/usages",
"rel": "usages"
},
{
"href": "/resource_providers/d6aeeeb0-0cab-4e3f-a070-9808801b94a5/aggregates",
"rel": "aggregates"
}
],
"name": "cc-compute09-kna1",
"uuid": "d6aeeeb0-0cab-4e3f-a070-9808801b94a5"
},
After the node is reinstalled, it has a new UUID:
{
"generation": 71,
"links": [
{
"href": "/resource_providers/d7f483ff-3b91-4d13-9900-0ec24c3a06a4",
"rel": "self"
},
{
"href": "/resource_providers/d7f483ff-3b91-4d13-9900-0ec24c3a06a4/inventories",
"rel": "inventories"
},
{
"href": "/resource_providers/d7f483ff-3b91-4d13-9900-0ec24c3a06a4/usages",
"rel": "usages"
},
{
"href": "/resource_providers/d7f483ff-3b91-4d13-9900-0ec24c3a06a4/aggregates",
"rel": "aggregates"
}
],
"name": "compute09.openstack.local",
"uuid": "d7f483ff-3b91-4d13-9900-0ec24c3a06a4"
},
This new resource provider shows 0 consumed resources:
curl -g -X GET http://********:8780/resource_providers/d7f483ff-3b91-4d13-9900-0ec24c3a06a4/usages -H "Accept: application/json" -H "OpenStack-API-Version: placement 1.2" -H "User-Agent: openstacksdk/0.31.0 keystoneauth1/3.14.0 python-requests/2.22.0 CPython/2.7.12" -H "X-Auth-Token:************" | python -m json.tool
{
"resource_provider_generation": 72,
"usages": {
"DISK_GB": 0,
"MEMORY_MB": 0,
"VCPU": 0
}
}
Investigating the resource_providers table shows potential duplicate
entries:
MariaDB [nova_api]> select * from resource_providers;
+---------------------+---------------------+-----+--------------------------------------+------------------------------+------------+----------+------------------+--------------------+
| created_at | updated_at | id | uuid | name | generation | can_host | root_provider_id | parent_provider_id |
+---------------------+---------------------+-----+--------------------------------------+------------------------------+------------+----------+------------------+--------------------+
| 2018-04-25 21:25:32 | 2019-04-17 07:08:24 | 1 | cbb2c235-ed5f-4f63-9015-1edfe91d63c8 | cc-compute02-kna1 | 195067 | 0 | 1 | NULL |
| 2018-04-25 21:44:17 | 2019-05-02 13:23:34 | 2 | 6125fdeb-370f-4139-9d1c-369e9eb4e620 | cc-compute-lsd01-kna1 | 41 | 0 | 2 | NULL |
| 2018-04-25 22:13:01 | 2019-05-20 13:11:55 | 3 | 452b7f99-a178-4dc7-9fea-e9d9ab6a3e99 | cc-compute05-kna1 | 450192 | 0 | 3 | NULL |
| 2018-04-25 22:13:08 | 2019-06-10 12:28:41 | 4 | 03b420df-79fb-4f0a-aede-bdbd62ce9ce3 | cc-compute03-kna1 | 424867 | 0 | 4 | NULL |
| 2018-04-25 22:13:08 | 2019-06-14 06:29:47 | 5 | 9386d418-339c-4010-baa5-18e2aa601a3c | cc-compute04-kna1 | 479160 | 0 | 5 | NULL |
| 2018-04-25 22:46:46 | 2019-05-20 13:39:00 | 6 | 7b0580e3-7592-4c3a-a0e9-a8d23f3550d7 | cc-compute07-kna1 | 441489 | 0 | 6 | NULL |
| 2018-04-25 22:46:47 | 2019-04-19 18:53:45 | 7 | 98e1b299-239f-488c-a7a0-3f78e76c8f6b | cc-compute06-kna1 | 396721 | 0 | 7 | NULL |
| 2018-04-25 22:46:50 | 2019-05-24 07:28:59 | 8 | 64c2b0fb-4d7e-4d5f-92bc-69e00a3cb85e | cc-compute08-kna1 | 449994 | 0 | 8 | NULL |
| 2018-04-26 00:47:56 | 2019-06-11 20:43:47 | 11 | 61708a8f-77fd-47dc-9140-6ea613509506 | cc-compute14-kna1 | 474210 | 0 | 11 | NULL |
| 2018-04-26 00:48:01 | 2019-05-09 12:20:15 | 12 | 9e082274-568d-49a2-9801-05b2390f7dfa | cc-compute16-kna1 | 432294 | 0 | 12 | NULL |
| 2018-04-26 00:48:04 | 2019-06-11 20:11:28 | 14 | 396bb173-2e46-4d35-963e-9b49acf0add8 | cc-compute22-kna1 | 448545 | 0 | 14 | NULL |
| 2018-04-26 00:48:06 | 2019-05-21 13:07:23 | 15 | 80e5f3a7-e4a3-43d1-a7a8-4c118fba7792 | cc-compute12-kna1 | 450359 | 0 | 15 | NULL |
| 2018-04-26 00:48:20 | 2019-05-16 14:32:54 | 18 | b86db974-5787-4012-a7df-26aeb8e73574 | cc-compute20-kna1 | 425960 | 0 | 18 | NULL |
| 2018-04-26 00:48:20 | 2019-06-12 12:24:24 | 19 | dfb35aab-2af9-4d86-bccb-76959c7f68ed | cc-compute18-kna1 | 435686 | 0 | 19 | NULL |
| 2018-04-26 00:48:22 | 2019-05-07 10:55:46 | 20 | 4decfcd0-cca2-4ba5-9f83-a86b8f2a8e4d | cc-compute17-kna1 | 418818 | 0 | 20 | NULL |
| 2018-10-31 12:04:48 | 2019-04-24 08:32:36 | 28 | 266e5266-f811-4b24-949f-3ed9e841c479 | cc-compute10-kna1 | 166818 | NULL | 28 | NULL |
| 2018-11-01 18:59:56 | 2019-06-14 06:29:52 | 34 | 5180de9c-c964-4661-bfbd-893cdfc19f32 | compute25.openstack.local | 271667 | NULL | 34 | NULL |
| 2018-11-01 18:59:56 | 2019-06-14 06:29:47 | 37 | 3a456de2-68ea-4472-95dd-2db1c7b29661 | compute24.openstack.local | 283689 | NULL | 37 | NULL |
| 2019-02-06 19:45:50 | 2019-06-14 06:29:39 | 43 | 0e5e6b94-2992-4075-a922-320bbe8b1bbb | compute26.openstack.local | 165203 | NULL | 43 | NULL |
| 2019-02-06 19:45:50 | 2019-06-14 06:27:26 | 46 | 008c7549-b638-4130-8e79-858556a787c2 | compute27.openstack.local | 166810 | NULL | 46 | NULL |
| 2019-02-10 17:45:03 | 2019-06-14 06:29:16 | 52 | 1fe21d2b-e6f1-4820-b341-a490cf9704d8 | compute29.openstack.local | 161380 | NULL | 52 | NULL |
| 2019-02-10 17:45:03 | 2019-06-14 06:29:08 | 55 | e636f01c-b5da-4886-8a60-1baa5371bcc5 | compute28.openstack.local | 159388 | NULL | 55 | NULL |
| 2019-04-30 09:53:45 | 2019-06-14 06:29:36 | 76 | 34381a1c-1b4e-4716-b7ba-ea72956b92f7 | compute19.openstack.local | 56127 | NULL | 76 | NULL |
| 2019-04-30 13:20:12 | 2019-06-14 06:29:37 | 79 | 946fa4f1-5f1d-47be-b65c-038a7e20c42b | compute06.openstack.local | 56068 | NULL | 79 | NULL |
| 2019-05-08 08:26:45 | 2019-06-14 06:30:01 | 84 | 30a5e17b-96d3-4806-849f-2d814085b130 | compute01.openstack.local | 46162 | NULL | 84 | NULL |
| 2019-05-08 08:27:01 | 2019-06-14 06:29:45 | 87 | 62f85460-4244-429e-9831-357032a8f5e7 | compute17.openstack.local | 46258 | NULL | 87 | NULL |
| 2019-05-13 11:37:50 | 2019-06-14 06:29:36 | 93 | 4e39206e-b00a-41d9-a2d1-a18085a576a7 | compute23.openstack.local | 31555 | NULL | 93 | NULL |
| 2019-05-13 11:37:51 | 2019-06-14 06:29:46 | 96 | 6db0004d-7bcb-4758-accd-52ef580d967b | compute16.openstack.local | 40197 | NULL | 96 | NULL |
| 2019-05-17 11:50:50 | 2019-06-14 06:29:38 | 102 | 18a0a9f5-c9e7-49a2-8e50-d221aec0a9f0 | compute20.openstack.local | 31563 | NULL | 102 | NULL |
| 2019-05-17 11:50:50 | 2019-06-14 06:29:16 | 105 | 97a16a89-055a-4533-86e5-1285ff1911ff | compute07.openstack.local | 31495 | NULL | 105 | NULL |
| 2019-05-29 11:20:15 | 2019-06-14 06:29:05 | 117 | e088c323-c8cb-4dc6-bb11-675a40cd1fcf | compute12.openstack.local | 19449 | NULL | 117 | NULL |
| 2019-05-29 11:20:16 | 2019-06-14 06:29:27 | 120 | 58f85279-1103-42b6-b01d-e1c8de83b8d2 | compute08.openstack.local | 19407 | NULL | 120 | NULL |
| 2019-05-29 11:20:32 | 2019-06-14 06:29:52 | 123 | 58ac9048-eca2-4f51-8d12-b6165f686cf7 | compute05.openstack.local | 19392 | NULL | 123 | NULL |
| 2019-06-11 09:15:59 | 2019-06-14 06:29:29 | 126 | 882f5ad3-f20f-489f-9a20-e2654fcfa925 | compute13.openstack.local | 3873 | NULL | 126 | NULL |
| 2019-06-11 09:16:23 | 2019-06-14 06:29:23 | 129 | 80e266f2-13f2-439c-b04e-736754fd27cd | compute03.openstack.local | 3823 | NULL | 129 | NULL |
| 2019-06-11 09:16:24 | 2019-06-14 06:29:25 | 132 | 09ef46fa-b9e7-429b-8d5b-f4f46ead3c85 | compute11.openstack.local | 3844 | NULL | 132 | NULL |
| 2019-06-12 12:31:49 | 2019-06-14 06:29:08 | 138 | ebc9a09f-08bb-4839-ab56-c4d06bcc6ed4 | vrtx01-lsd01.openstack.local | 362 | NULL | 138 | NULL |
| 2019-06-12 12:32:32 | 2019-06-14 06:29:53 | 141 | d982e5bb-a7d9-40af-b667-43c2f8f2001c | vrtx01-lsd02.openstack.local | 355 | NULL | 141 | NULL |
| 2019-06-13 19:42:01 | 2019-06-14 06:30:00 | 147 | ba89a743-b86f-4bb8-8cfa-3f08fc016c6a | compute15.openstack.local | 612 | NULL | 147 | NULL |
| 2019-06-13 19:42:24 | 2019-06-14 06:29:44 | 150 | 68f6b408-ab9f-4fe7-be9c-7e690086f631 | compute18.openstack.local | 611 | NULL | 150 | NULL |
| 2019-06-13 19:42:24 | 2019-06-14 06:29:21 | 153 | f981737a-d8f8-4b0e-8631-eedb95c85907 | compute22.openstack.local | 592 | NULL | 153 | NULL |
| 2019-06-13 19:42:25 | 2019-06-14 06:29:17 | 156 | d7f483ff-3b91-4d13-9900-0ec24c3a06a4 | compute09.openstack.local | 604 | NULL | 156 | NULL |
| 2019-06-13 19:42:26 | 2019-06-14 06:29:09 | 159 | bc05c643-a2db-442d-b721-39db8665f923 | compute14.openstack.local | 598 | NULL | 159 | NULL |
+---------------------+---------------------+-----+--------------------------------------+------------------------------+------------+----------+------------------+--------------------+
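The duplicate pattern in the table (the same physical host registered once under its old RDO name and once under its new OSA FQDN) can be detected mechanically. The sketch below is hypothetical: the `normalize()` rule encodes this site's two naming schemes and would need adjusting elsewhere:

```python
from collections import defaultdict

def find_duplicate_providers(rows, normalize):
    """Group (name, uuid) resource-provider rows by a normalized host
    key and return only the groups with more than one provider."""
    groups = defaultdict(list)
    for name, uuid in rows:
        groups[normalize(name)].append((name, uuid))
    return {key: providers for key, providers in groups.items()
            if len(providers) > 1}

def normalize(name):
    # Hypothetical normalizer for the two schemes seen above:
    # "cc-compute16-kna1" (RDO) vs "compute16.openstack.local" (OSA).
    host = name.split(".")[0]            # drop any domain suffix
    if host.startswith("cc-"):
        host = host[len("cc-"):]         # drop the "cc-" prefix
    return host.rsplit("-kna1", 1)[0]    # drop the site suffix

rows = [
    ("cc-compute16-kna1", "9e082274-568d-49a2-9801-05b2390f7dfa"),
    ("compute16.openstack.local", "6db0004d-7bcb-4758-accd-52ef580d967b"),
    ("cc-compute02-kna1", "cbb2c235-ed5f-4f63-9015-1edfe91d63c8"),
]
dups = find_duplicate_providers(rows, normalize)
```

Running this over the full table would flag every host that appears under both names, e.g. compute16 above.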
Placement returns data for both UUIDs; for example, compute18:
curl -g -X GET http://*****:8780/resource_providers/dfb35aab-2af9-4d86-bccb-76959c7f68ed/usages -H "Accept: application/json" -H "OpenStack-API-Version: placement 1.2" -H "User-Agent: openstacksdk/0.31.0 keystoneauth1/3.14.0 python-requests/2.22.0 CPython/2.7.12" -H "X-Auth-Token:******" | python -m json.tool
{
"resource_provider_generation": 435686,
"usages": {
"DISK_GB": 150,
"MEMORY_MB": 9728,
"VCPU": 7
}
}
curl -g -X GET http://*****:8780/resource_providers/68f6b408-ab9f-4fe7-be9c-7e690086f631/usages -H "Accept: application/json" -H "OpenStack-API-Version: placement 1.2" -H "User-Agent: openstacksdk/0.31.0 keystoneauth1/3.14.0 python-requests/2.22.0 CPython/2.7.12" -H "X-Auth-Token:*****" | python -m json.tool
{
"resource_provider_generation": 664,
"usages": {
"DISK_GB": 680,
"MEMORY_MB": 59392,
"VCPU": 32
}
}
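If the allocations for one host end up split between the stale and the new provider, neither provider's /usages response reflects the host's real load on its own. Summing them gives an estimate; this is a minimal sketch using the two compute18 responses above:

```python
def merged_usage(usage_dicts):
    """Sum per-resource-class usages across several resource providers
    that actually refer to the same physical host."""
    total = {}
    for usages in usage_dicts:
        for resource_class, amount in usages.items():
            total[resource_class] = total.get(resource_class, 0) + amount
    return total

stale = {"DISK_GB": 150, "MEMORY_MB": 9728, "VCPU": 7}    # old UUID
fresh = {"DISK_GB": 680, "MEMORY_MB": 59392, "VCPU": 32}  # new UUID
total = merged_usage([stale, fresh])
```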
I am speculating heavily on the cause of the issue; however, other
symptoms we have seen include:
- live migration fails as no suitable host is found (despite near-empty nodes)
- new VMs fail to spawn as no suitable host is found (despite near-empty nodes)
These issues force us to continually live migrate VMs to achieve some
load balancing.
Other potentially useful input (or separate bugs):
nova-compute.log often contains:
2019-02-07 13:37:59.362 2632 INFO nova.compute.resource_tracker [req-e0f53ec7-7668-4a64-8ba6-ead35f168e82 - - - - -] Instance 4fba72d0-2e95-4b92-b0f6-a7853dc3e8bd has allocations against this compute host but is not found in the database.
We see this in normal running, but have also found it in relation to
live migrations that failed and were not rolled back (for example, as
a result of the port_binding error).
It is also possible to get multiple entries in the services table,
though I don't believe this is related; it will be reported as a
separate bug.
MariaDB [nova]> select host, services.binary, version from services where host="cc-compute01-kna1";
+-------------------+--------------+---------+
| host | binary | version |
+-------------------+--------------+---------+
| cc-compute01-kna1 | nova-compute | 35 |
| cc-compute01-kna1 | nova-compute | 0 |
+-------------------+--------------+---------+
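The duplicate service rows can be flagged the same way; a minimal sketch over (host, binary, version) tuples like the ones above:

```python
def duplicate_services(rows):
    """Return the (host, binary) pairs that occur more than once in
    the services table."""
    seen, dups = set(), set()
    for host, binary, version in rows:
        key = (host, binary)
        if key in seen:
            dups.add(key)
        seen.add(key)
    return dups

rows = [
    ("cc-compute01-kna1", "nova-compute", 35),
    ("cc-compute01-kna1", "nova-compute", 0),
    ("cc-compute02-kna1", "nova-compute", 35),
]
dup_services = duplicate_services(rows)
```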
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1832814/+subscriptions