yahoo-eng-team team mailing list archive

[Bug 1276214] Re: Live migration failure in API doesn't revert task_state to None

 

Reviewed:  https://review.openstack.org/168916
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f2a1f00829e849e78f850a73489864e57cbd86b3
Submitter: Jenkins
Branch:    master

commit f2a1f00829e849e78f850a73489864e57cbd86b3
Author: Maciej Szankin <maciej.szankin@xxxxxxxxx>
Date:   Fri Feb 26 11:18:51 2016 +0100

    Live migration failure in API leaves VM in MIGRATING state
    
    When nova-api calls nova-conductor, an RPC MessagingTimeout might
    occur. In that case we shouldn't leave the VM in the MIGRATING state.
    Possible scenarios are:
    
    * nova-conductor received the message but failed to respond, and no
      additional exception was raised - live migration will start and the
      VM will be moved to the destination host
    * nova-conductor received the message but failed to respond, and an
      additional exception was raised (e.g., LibvirtError) - live
      migration will not start
    * nova-api couldn't reach nova-conductor - live migration will not
      start
    
    Because the API layer can't tell which of these scenarios occurred,
    this patch writes an instance fault to the database when
    MessagingTimeout is caught.
    
    Co-Authored-By: Pawel Koniszewski <pawel.koniszewski@xxxxxxxxx>
                    Bartosz Fic <bartosz.fic@xxxxxxxxx>
    Closes-Bug: #1276214
    Change-Id: Id800e925fbb689d20e7907b698b67c92fd3da979
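
The handling described in the commit message above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the patch: the MessagingTimeout class, Instance class, hanging_conductor_call and live_migrate are all hypothetical stand-ins defined here so the example runs on its own, and a plain list stands in for the instance-fault table. It assumes the API only records the fault and re-raises, since it cannot know whether the migration actually started.

```python
class MessagingTimeout(Exception):
    """Hypothetical stand-in for the RPC timeout exception."""


class Instance:
    """Hypothetical stand-in for an instance record."""

    def __init__(self, uuid):
        self.uuid = uuid
        self.task_state = None
        self.faults = []  # stands in for the instance-fault table


def hanging_conductor_call(instance):
    # Hypothetical conductor call that never answers in time.
    raise MessagingTimeout("timed out waiting for nova-conductor")


def live_migrate(instance, conductor_call):
    # The API marks the instance as migrating before the RPC call.
    instance.task_state = "migrating"
    try:
        conductor_call(instance)
    except MessagingTimeout as exc:
        # The API layer cannot tell which scenario occurred, so it records
        # a fault for the operator and re-raises instead of guessing.
        instance.faults.append(str(exc))
        raise


vm = Instance("vm-1")
try:
    live_migrate(vm, hanging_conductor_call)
except MessagingTimeout:
    print("fault recorded:", vm.faults)
```

In this sketch the task_state is deliberately left untouched after the timeout, because in the first scenario the migration may still be running.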


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1276214

Title:
  Live migration failure in API doesn't revert task_state to None

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  If the API times out on an RPC while processing a migrate_server
  request, it does not revert the task_state to NULL before or after
  sending the error response to the user. This can block further API
  operations on the VM and leave a healthy VM in a non-operable state,
  with the possible exception of delete.

  This is one possible reproducer. I'm not sure it always holds, and
  I'd appreciate it if someone else could confirm it.

  1. Somehow make RPC requests hang
  2. Issue a live migration request
  3. The call should return an HTTP error (409 perhaps)
  4. Check the VM. It should be in a good state, but with task_state stuck in 'migrating'

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1276214/+subscriptions

