[Bug 1861067] [NEW] [Ocata] resource tracker does not validate placement allocation
Public bug reported:
For stable/ocata, we hit a serious scheduler problem that forced us to upgrade to a later release. I could not find any existing report of it, so I am leaving this here for whoever meets the issue later.

The problem we encountered is as follows:
- the conductor tries to schedule 2 instances onto one compute node
- nova-compute at that time has enough resources in compute_nodes, so the scheduler chooses that nova-compute
- the resource tracker in nova-compute claims the resources against placement
- placement answers one of the requests with 409, since there were several concurrent requests
- [BUG here] the resource tracker in nova-compute does not check the return code from placement, so the allocation is only increased by one instance's share (see the sketch after this list)
- After that, compute_nodes on the scheduler side is full, but the allocation in placement still has a free slot.
- [User meets the weirdness here] since there is still a free slot on the scheduler side, an instance can be created on a compute node that is actually full; the result is that the compute node is overprovisioned.
- OOM occurs. (Our memory was tight; an admin with a different resource policy would see a different side effect.)
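To make the race concrete, here is a minimal sketch of what the Ocata-era claim effectively does; this is not the real nova code, the client object and function name are illustrative, and only the PUT /allocations/{consumer_uuid} payload shape follows the placement API:

    # Illustrative sketch only, not the real Ocata code. It shows the
    # failure mode: two concurrent claims race on the same resource
    # provider, placement answers one of them with 409, and ignoring
    # the status code silently drops that instance's allocation.
    def claim_resources(client, rp_uuid, consumer_uuid, resources):
        payload = {
            'allocations': [{
                'resource_provider': {'uuid': rp_uuid},
                'resources': resources,  # e.g. {'VCPU': 2, 'MEMORY_MB': 4096}
            }],
        }
        resp = client.put('/allocations/%s' % consumer_uuid, payload)
        # BUG: resp.status_code is never checked. A 409 from a
        # concurrent update means the allocation was NOT recorded,
        # yet the claim proceeds as if it had been.
        return resp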
I found that this is already fixed from Pike onward, where the scheduler makes the allocation first and nova-compute just checks compute_nodes. Even so, it was very hard to find the root cause and it took a lot of digging through the scheduler history, so I hope this report is helpful to anyone who meets the problem.
I am not sure it should be fixed upstream since Ocata is quite old, but we were able to fix it by changing the function (_allocate_for_instance() in nova/scheduler/client/report.py) to catch the 409 conflict, similar to the later-added put_allocations(). A sketch of that change follows.
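As a rough sketch of the change, slotted into the report client class (assumptions: the helper _instance_to_allocations_dict() and the retry count are illustrative, and the 'concurrent update' substring check mirrors the retry condition of the later put_allocations()):

    # Sketch of the local fix, assuming the Ocata-era client shape.
    # Retry the PUT when placement answers 409 because of a concurrent
    # update, mirroring what the later put_allocations() does upstream.
    def _allocate_for_instance(self, rp_uuid, instance, retries=3):
        url = '/allocations/%s' % instance.uuid
        payload = {
            'allocations': [{
                'resource_provider': {'uuid': rp_uuid},
                'resources': _instance_to_allocations_dict(instance),
            }],
        }
        for _ in range(retries):
            resp = self.put(url, payload)
            if resp.status_code == 204:
                return True
            if resp.status_code == 409 and 'concurrent update' in resp.text:
                # Another request bumped the provider generation first;
                # try again so this instance's share is not lost.
                continue
            break
        LOG.warning('Unable to save allocation for instance %s: %s',
                    instance.uuid, resp.text)
        return False

If the retries are exhausted, the claim is treated as failed, so the build can be aborted instead of silently overprovisioning the node.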
Thanks.
** Affects: nova
Importance: Undecided
Status: New