
yahoo-eng-team team mailing list archive

[Bug 1861067] [NEW] [Ocata]resource tracker does not validate placement allocation

Public bug reported:

For stable/ocata, we hit a serious scheduler problem that forced us to
upgrade to a newer release. I could not find any existing report for it,
so I am leaving this here for anyone who meets the same issue later.

The problem we encountered goes like this:
- The conductor schedules 2 instances to one compute node.
- At that moment nova-compute has enough free resources in compute_nodes, so the scheduler chooses it.
- The resource tracker in nova-compute claims resources against placement.
- Placement answers one of the requests with 409 Conflict, since several requests ran concurrently.
- [BUG here] The resource tracker in nova-compute does not check the return code from placement, so the allocation is only increased by one instance's share.
- After that, compute_nodes on the scheduler side is full, but the allocation in placement still has room.
- [User meets the weirdness here] Since the scheduler still saw free capacity (via placement), an instance could be placed on a compute node that was actually full. The result is an over-provisioned compute node.
- OOM occurs. (We were tight on memory; with a different resource policy an admin would see a different side effect.)
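The divergence between the two bookkeeping views above can be reproduced
with a minimal, self-contained sketch (plain Python, not actual nova code;
FakePlacement and the generation check are only stand-ins for placement's
real generation-guarded writes):

```python
# Sketch of the failure mode: a fake placement that rejects one of two
# racing allocation PUTs with 409, and a resource tracker that drops the
# status code on the floor.

class FakePlacement:
    def __init__(self):
        self.generation = 0
        self.allocated_mb = 0

    def put_allocation(self, mb, generation):
        # Placement rejects writes made against a stale provider generation.
        if generation != self.generation:
            return 409
        self.generation += 1
        self.allocated_mb += mb
        return 204


def claim_ignoring_status(placement, tracker, mb, generation):
    placement.put_allocation(mb, generation)   # BUG: status code ignored
    tracker["used_mb"] += mb                   # local usage grows anyway


placement = FakePlacement()
tracker = {"used_mb": 0}

# Two claims race: both read generation 0 before either one writes.
gen = placement.generation
claim_ignoring_status(placement, tracker, 1024, gen)  # succeeds (204)
claim_ignoring_status(placement, tracker, 1024, gen)  # 409, silently lost

print(tracker["used_mb"])        # 2048: the compute_nodes view is full
print(placement.allocated_mb)    # 1024: placement still shows free room
```

From here on the two views never reconcile, which is exactly the
over-provisioning described above.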

I found that this is already fixed in Pike and later, where the scheduler
makes the allocation first and nova-compute only checks compute_nodes. But
it was very hard to find the root cause, and it took a lot of digging
through the scheduler's history, so I hope this report helps anyone who
hits the same problem.

I am not sure this should be fixed upstream since Ocata is quite old, but
it can be fixed by changing _allocate_for_instance() in
nova/scheduler/client/report.py to catch the 409 Conflict, similar to the
later-added put_allocations().
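A hedged sketch of what such a fix could look like (illustrative names
only, not the real nova signatures): check the status code placement
returns, and retry with a freshly read generation on 409 instead of
silently counting the claim:

```python
# Sketch of a status-checking claim path: only count the claim locally
# once placement has confirmed it, and retry on 409 Conflict.

class Placement:
    """Tiny stand-in for the placement API: generation-guarded writes."""

    def __init__(self):
        self.generation = 0
        self.allocated_mb = 0

    def put_allocation(self, mb, generation):
        if generation != self.generation:
            return 409          # a concurrent writer won; caller must retry
        self.generation += 1
        self.allocated_mb += mb
        return 204


def claim(placement, tracker, mb, max_retries=3):
    for _ in range(max_retries):
        gen = placement.generation           # fresh generation each attempt
        status = placement.put_allocation(mb, gen)
        if status == 204:
            tracker["used_mb"] += mb         # count only confirmed claims
            return True
        if status != 409:
            break                            # unexpected error: stop
    return False                             # let the caller fail the claim


placement = Placement()
tracker = {"used_mb": 0}
assert claim(placement, tracker, 1024)
assert claim(placement, tracker, 1024)
print(tracker["used_mb"], placement.allocated_mb)  # 2048 2048: views agree
```

The key difference from the buggy path is that the local counter only
moves after a confirmed 204, so compute_nodes and placement can no
longer drift apart.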

Thanks.

** Affects: nova
     Importance: Undecided
         Status: New


-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1861067

Title:
  [Ocata]resource tracker does not validate placement allocation

Status in OpenStack Compute (nova):
  New


To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1861067/+subscriptions

