yahoo-eng-team team mailing list archive

[Bug 1787606] [NEW] Multi instance creation rescheduling fails due to a lack of alternates

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Lee Yarwood <lyarwood@xxxxxxxxxx>
Date: Fri, 17 Aug 2018 14:57:24 -0000
Reply-to: Bug 1787606 <1787606@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

Public bug reported:

Description
===========

When creating more than one instance in a single request, the filter
scheduler skips any host that has already been selected when attempting
to find alternates. Without alternates, instances cannot be rescheduled
and enter an ERROR state if issues are encountered when spawning on
their selected host.

For example, given a simple two-node environment and a request to create
5 instances, the following nested list of selections is returned:

[
[Selection(allocation_request='{"allocations": {"3f4bda1d-13ab-492b-9100-bf585c361170": {"resources": {"VCPU": 1, "MEMORY_MB": 512}}}}',allocation_request_version='1.25',cell_uuid=0e0078e9-420b-4c90-a01f-680477646b84,compute_node_uuid=3f4bda1d-13ab-492b-9100-bf585c361170,limits=SchedulerLimits,nodename='host1.example.com',service_host='host1.example.com')], 

[Selection(allocation_request='{"allocations": {"9fa912ad-4b6a-478f-b2dc-aa305b552d64": {"resources": {"VCPU": 1, "MEMORY_MB": 512}}}}',allocation_request_version='1.25',cell_uuid=0e0078e9-420b-4c90-a01f-680477646b84,compute_node_uuid=9fa912ad-4b6a-478f-b2dc-aa305b552d64,limits=SchedulerLimits,nodename='host2.example.com',service_host='host2.example.com')],

[Selection(allocation_request='{"allocations": {"3f4bda1d-13ab-492b-9100-bf585c361170": {"resources": {"VCPU": 1, "MEMORY_MB": 512}}}}',allocation_request_version='1.25',cell_uuid=0e0078e9-420b-4c90-a01f-680477646b84,compute_node_uuid=3f4bda1d-13ab-492b-9100-bf585c361170,limits=SchedulerLimits,nodename='host1.example.com',service_host='host1.example.com')],

[Selection(allocation_request='{"allocations": {"9fa912ad-4b6a-478f-b2dc-aa305b552d64": {"resources": {"VCPU": 1, "MEMORY_MB": 512}}}}',allocation_request_version='1.25',cell_uuid=0e0078e9-420b-4c90-a01f-680477646b84,compute_node_uuid=9fa912ad-4b6a-478f-b2dc-aa305b552d64,limits=SchedulerLimits,nodename='host2.example.com',service_host='host2.example.com')],

[Selection(allocation_request='{"allocations": {"3f4bda1d-13ab-492b-9100-bf585c361170": {"resources": {"VCPU": 1, "MEMORY_MB": 512}}}}',allocation_request_version='1.25',cell_uuid=0e0078e9-420b-4c90-a01f-680477646b84,compute_node_uuid=3f4bda1d-13ab-492b-9100-bf585c361170,limits=SchedulerLimits,nodename='host1.example.com',service_host='host1.example.com')]
]

The above lists a single selection for each instance being created with
no alternates present. Compare that to the following list from a request
to create a single instance:

[
[Selection(allocation_request='{"allocations": {"3f4bda1d-13ab-492b-9100-bf585c361170": {"resources": {"VCPU": 1, "MEMORY_MB": 512}}}}',allocation_request_version='1.25',cell_uuid=0e0078e9-420b-4c90-a01f-680477646b84,compute_node_uuid=3f4bda1d-13ab-492b-9100-bf585c361170,limits=SchedulerLimits,nodename='host1.example.com',service_host='host1.example.com'),
Selection(allocation_request='{"allocations": {"9fa912ad-4b6a-478f-b2dc-aa305b552d64": {"resources": {"VCPU": 1, "MEMORY_MB": 512}}}}',allocation_request_version='1.25',cell_uuid=0e0078e9-420b-4c90-a01f-680477646b84,compute_node_uuid=9fa912ad-4b6a-478f-b2dc-aa305b552d64,limits=SchedulerLimits,nodename='host2.example.com',service_host='host2.example.com')]
]

Here we have two selections: the originally selected host and an
alternate. AFAICT the following conditional is at fault, as it rejects a
potential alternate if that host has been selected for any other
instance within the request:

https://github.com/openstack/nova/blob/83574f7e07f6a67b09226971dd8fb0ed5436f86e/nova/scheduler/filter_scheduler.py#L400
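
To make the effect of that check concrete, here is a minimal sketch of the alternate-selection loop. It is a simplified model, not nova's actual code: hosts are plain strings, and the function name, the skip_any_selected flag, and the relaxed branch are all hypothetical, introduced only for illustration. The relaxed branch is one possible change under that assumption, not a confirmed fix.

```python
# Simplified model of alternate selection; NOT nova's actual code.
# Hosts are plain strings and every name here is hypothetical.

def build_selections(selected_hosts, all_hosts, num_alts, skip_any_selected=True):
    """Return one list per instance: the selected host followed by alternates."""
    selections = []
    for selected in selected_hosts:
        selected_plus_alts = [selected]
        for host in all_hosts:
            if len(selected_plus_alts) >= num_alts + 1:
                break
            if skip_any_selected:
                # Behaviour described in this report: a host chosen for any
                # instance in the request is never offered as an alternate.
                eligible = host not in selected_hosts
            else:
                # One possible relaxation (an assumption, not a confirmed
                # fix): only exclude the host chosen for this instance.
                eligible = host != selected
            if eligible:
                selected_plus_alts.append(host)
        selections.append(selected_plus_alts)
    return selections


hosts = ["host1.example.com", "host2.example.com"]
# Five instances spread across two hosts, as in the example above.
selected = [hosts[i % len(hosts)] for i in range(5)]

for row in build_selections(selected, hosts, num_alts=1):
    print(row)
```

With two hosts and five instances, both hosts end up in the selected set, so every row contains only the selected host. Calling the same function with a single selected host yields that host plus the other one as an alternate, matching the single-instance output above.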

Steps to reproduce
==================

* Launch more than one instance in a single request using min_count/max_count.
* Ensure instances are unable to spawn on at least one compute host.

Expected result
===============

Instances that are unable to spawn on one compute host are rescheduled
elsewhere.

Actual result
=============

Instances that are unable to spawn on one compute host are not
rescheduled and end up in an ERROR state.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

   master
   
2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   Libvirt + KVM

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   N/A

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   N/A

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1787606

Title:
  Multi instance creation rescheduling fails due to a lack of alternates

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1787606/+subscriptions
Follow ups

[Bug 1787606] Re: Multi instance creation rescheduling fails due to a lack of alternates
From: Lee Yarwood, 2020-03-16