
yahoo-eng-team team mailing list archive

[Bug 1671648] Re: Instances are not rescheduled after deploy fails

 

Reviewed:  https://review.openstack.org/444106
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=cb4ce72f5f092644aa9b84fa58bcb9fd89b6bedc
Submitter: Jenkins
Branch:    master

commit cb4ce72f5f092644aa9b84fa58bcb9fd89b6bedc
Author: ShunliZhou <slzhou@xxxxxxxxxxxxx>
Date:   Fri Mar 10 14:05:57 2017 +0800

    Add populate_retry to schedule_and_build_instances
    
    When booting an instance fails on the compute node, nova does
    not retry the boot on another host.
    
    Since https://review.openstack.org/#/c/319379/ changed the create
    instance workflow to call schedule_and_build_instances, which does
    not populate the retry into the filter properties, nova does not
    retry when the boot fails on the compute node. This patch populates
    the retry into the filter properties when calling
    schedule_and_build_instances.
    
    Change-Id: Ifdaddcd265a7fe8282499e27043936f8212610ad
    Closes-Bug: #1671648


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1671648

Title:
  Instances are not rescheduled after deploy fails

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) ocata series:
  In Progress

Bug description:
  Steps to reproduce:
  Pre-step: force the deploy to fail in a way that allows it to be rescheduled. For testing, I forced the failure by adding raise nova.exception.ComputeResourcesUnavailable('forced failure') during the instance spawn on the host.
  1. Make sure the environment is configured to retry failed deploys.
  2. Attempt to deploy a VM and wait for it to fail.

  Expected result:
  Failed instance is rescheduled and attempted on another host.

  Actual result:
  Deploy fails but is not rescheduled.


  I am just beginning to experiment with an Ocata build from early March. I
  found that when an instance fails to deploy and throws a
  RescheduledException, it is not rescheduled as expected. The
  problem appears to be that filter_properties['retry'] is not
  set during the initial deploy.

  On the initial deploy,
  nova.conductor.manager.schedule_and_build_instances() schedules the
  build_request and creates the instance object. That method also
  creates the filter properties (filter_props) that are passed on to
  compute_rpcapi.build_and_run_instance(). The problem is that
  scheduler_utils.populate_retry() is not called before filter_props
  is passed on to the build call. When the deploy later fails on the
  host, nova.compute.manager._do_build_and_run_instance() catches the
  RescheduledException but does not try to reschedule it because
  filter_properties.get('retry') returns None.
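A minimal, self-contained model of this failure path follows. It is not nova's code: populate_retry_sketch(), should_reschedule() and MAX_ATTEMPTS are hypothetical names standing in for scheduler_utils.populate_retry(), the retry check in _do_build_and_run_instance() and the scheduler's max-attempts option. It only illustrates why a missing 'retry' key ends the build instead of rescheduling it.

```python
# Simplified model of nova's retry bookkeeping (NOT nova code).
# MAX_ATTEMPTS is a hypothetical stand-in for the scheduler's
# max-attempts configuration option.
MAX_ATTEMPTS = 3


class MaxRetriesExceeded(Exception):
    """Stand-in for the error raised when scheduling attempts run out."""


def populate_retry_sketch(filter_props, instance_uuid):
    """Record one more scheduling attempt in filter_props['retry']."""
    if MAX_ATTEMPTS == 1:
        # Rescheduling disabled: leave 'retry' unset.
        return
    retry = filter_props.setdefault('retry', {'num_attempts': 0, 'hosts': []})
    retry['num_attempts'] += 1
    if retry['num_attempts'] > MAX_ATTEMPTS:
        raise MaxRetriesExceeded(
            'Exceeded %d attempts for %s' % (MAX_ATTEMPTS, instance_uuid))


def should_reschedule(filter_props):
    """Compute-side decision after a RescheduledException-style failure."""
    # Without retry info the compute host gives up on the instance.
    return filter_props.get('retry') is not None


# Behavior described in this bug: populate_retry is never called.
before_fix = {}
print(should_reschedule(before_fix))  # False -> build is not rescheduled

# With retry populated before the build call.
after_fix = {}
populate_retry_sketch(after_fix, 'fake-uuid')
print(should_reschedule(after_fix))   # True
print(after_fix['retry'])             # {'num_attempts': 1, 'hosts': []}
```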

  In the past, it looks like populate_retry() was called by
  nova.conductor.manager.build_instances() during the initial deploy.
  I'm not seeing build_instances() get called during the initial deploy
  after switching to Ocata. As an experiment, I added
  scheduler_utils.populate_retry(filter_props,
  build_request.instance_uuid) immediately after filter_props is set in
  schedule_and_build_instances(). Afterward, I do see the instances get
  rescheduled. I also noticed that nova.conductor.manager.build_instances()
  gets called for each attempt after the first.
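The observation that the first attempt and every later attempt go through different conductor methods, while sharing one retry counter, can be illustrated with a small simulation. It assumes a simplified model rather than nova's real RPC round trip (compute host back to conductor to scheduler): record_attempt(), boot(), NoValidHost and MAX_ATTEMPTS are hypothetical. The point is only that once the first attempt records retry info, each later failure adds to the same counter until the limit is reached.

```python
# Simplified simulation of the rescheduling flow (NOT nova code).
# The first loop iteration plays the role of schedule_and_build_instances();
# later iterations stand in for the reschedule through build_instances().
MAX_ATTEMPTS = 3  # hypothetical stand-in for the max-attempts option


class MaxRetriesExceeded(Exception):
    """Stand-in: too many scheduling attempts."""


class NoValidHost(Exception):
    """Stand-in: no candidate hosts left to try."""


def record_attempt(filter_props, host):
    """Hypothetical helper: count the attempt and remember the host."""
    retry = filter_props.setdefault('retry', {'num_attempts': 0, 'hosts': []})
    retry['num_attempts'] += 1
    if retry['num_attempts'] > MAX_ATTEMPTS:
        raise MaxRetriesExceeded('tried hosts: %s' % retry['hosts'])
    retry['hosts'].append(host)


def boot(hosts, failing_hosts):
    """Try hosts in order, rescheduling while the retry info allows it."""
    filter_props = {}
    for host in hosts:
        record_attempt(filter_props, host)
        if host not in failing_hosts:
            return host, filter_props['retry']
        # Spawn failed on this host: loop around to "reschedule".
    raise NoValidHost('all candidate hosts failed')


host, retry = boot(['cmp1', 'cmp2', 'cmp3'], failing_hosts={'cmp1'})
print(host)   # cmp2
print(retry)  # {'num_attempts': 2, 'hosts': ['cmp1', 'cmp2']}
```

Because the counter lives in filter_props, a fourth host would be rejected by record_attempt() even if more candidates remained.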


