openstack team mailing list archive

Re: New nova service proposal

 

On Aug 26, 2011, at 2:22 PM, Ed Leafe wrote:

> 	Sorry I haven't come up with a snazzy name for it yet, but what I have in mind is a new service that is essential for my employer (Rackspace), and might be important for other OpenStack deployments. This new service would be completely optional, of course - only those for whom it is relevant would run it.
> 
> 	Let me start by stating the problem: when a customer requests that we create instances for them, nova casts those requests into the queue, where they are eventually acted upon. That usually works great, but in cases where the instance creation fails, we need to detect that failure and re-issue the create request with a different host. This is currently not possible with the asynchronous design of the compute-scheduler interactions.
> 
> 	So what I envision is a service that scans a list of recent requests' reservation IDs, and follows up to see if the request was successful or not, and takes action if needed. The blueprint for this can be found at https://blueprints.launchpad.net/nova/+spec/instance-creation-assurance, with an Etherpad created for ongoing idea exchange at http://etherpad.openstack.org/instance-creation-assurance

Hmmm... Having looked this over, I agree that we need a way to retry failed builds. However, I do not think that having another service essentially poll the builds to find failures is the right way to go.

First off, I think it would be better if whatever had the failure responded by sending a request somewhere (a cast) to say "Hey, this bombed. Retry it." I wouldn't try doing calls instead of casts, as some have suggested, for performance reasons (and I could see deadlocking issues).

If we step back and look at this, these requests/orders/whatever you call them amount to multi-step workflows. Even for building a single server you have things like "allocate this instance on a hypervisor", "assign IPs", and "attach these volumes", any of which could fail for some reason. And if they do fail, there may be steps needed to back out already completed work.

The proper way, IMHO, for this to work is that a request generates a workorder with a set of tasks.  
This gets sent to something (scheduler, probably) which looks at the first uncompleted task on the workorder, makes the decision on where to send it, and routes the whole workorder there.  
The service that gets it performs the task (i.e. executes the method), possibly attaching info (like the ID of the newly created instance) to the workorder, and possibly pushing an 'undo' task to the top of a list of tasks to perform if things fail somewhere.
Then the whole workorder gets sent back to the origin (again, the scheduler?), which looks at the next uncompleted task and starts the cycle again.
Repeat until done. 

If there is a failure, the scheduler works through the 'undo' list on the workorder, and then makes whatever decisions are needed to redo the tasks elsewhere.  The workorder contains the record of the failed attempt, so it doesn't, for example, try to send the server build back to the same hosts that just failed. 
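
To make the cycle above concrete, here is a rough, in-process sketch. None of this is existing nova code: WorkOrder, Task, TaskFailed, pick_host and process are names made up for illustration, the "routing" is a plain function call rather than a cast, and a single host is picked per task just to keep it short.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


class TaskFailed(Exception):
    """Hypothetical signal that a task failed and the attempt must be undone."""


@dataclass
class Task:
    name: str
    run: Callable[["WorkOrder", str], None]  # called as run(workorder, host)
    done: bool = False


@dataclass
class WorkOrder:
    tasks: List[Task]
    info: Dict[str, str] = field(default_factory=dict)
    undo: List[Callable[["WorkOrder"], None]] = field(default_factory=list)
    failed_hosts: List[str] = field(default_factory=list)

    def next_task(self) -> Optional[Task]:
        # First uncompleted task, or None when the workorder is finished.
        return next((t for t in self.tasks if not t.done), None)


def pick_host(order: WorkOrder, hosts: List[str]) -> Optional[str]:
    # Skip hosts recorded on the workorder as having failed already.
    return next((h for h in hosts if h not in order.failed_hosts), None)


def process(order: WorkOrder, hosts: List[str]) -> bool:
    """Stand-in for the scheduler loop; returns False if no host is left."""
    while (task := order.next_task()) is not None:
        host = pick_host(order, hosts)
        if host is None:
            return False
        try:
            task.run(order, host)
            task.done = True
        except TaskFailed:
            order.failed_hosts.append(host)
            # Work through the undo list, most recently pushed first.
            while order.undo:
                order.undo.pop()(order)
            # Start the tasks over so they can be redone elsewhere.
            for t in order.tasks:
                t.done = False
    return True


# Usage: the IP step fails on host-a, so the allocation is undone
# and the whole build is retried on host-b.
def allocate(order: WorkOrder, host: str) -> None:
    order.info["instance_host"] = host
    order.undo.append(lambda o: o.info.pop("instance_host", None))


def assign_ip(order: WorkOrder, host: str) -> None:
    if host == "host-a":
        raise TaskFailed("simulated failure on host-a")
    order.info["ip"] = "10.0.0.5"
    order.undo.append(lambda o: o.info.pop("ip", None))


order = WorkOrder(tasks=[Task("allocate", allocate), Task("assign_ip", assign_ip)])
ok = process(order, ["host-a", "host-b"])
```

In a real deployment the workorder would travel over the queue between services instead of being mutated in place, but the bookkeeping (task list, undo list, record of failed hosts) would look much the same.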

The workorder acts as an environment for the tasks, and could be passed to tasks (rpc methods) as an attribute of the context object. 
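
As a rough illustration of that last point, assuming a simplified stand-in for the context object (the Context and WorkOrder classes and the attach_volume method below are hypothetical, not nova's actual classes or RPC API):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class WorkOrder:
    # Shared environment that tasks read from and write to.
    info: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Context:
    user_id: str
    workorder: Optional[WorkOrder] = None  # carried along with the request


def attach_volume(context: Context, volume_id: str) -> str:
    # Read what an earlier task recorded, then record this task's result.
    instance_id = context.workorder.info["instance_id"]
    context.workorder.info.setdefault("volumes", []).append(volume_id)
    return instance_id


ctx = Context(user_id="demo", workorder=WorkOrder(info={"instance_id": "i-1"}))
attached_to = attach_volume(ctx, "vol-1")
```

The method signature stays the same as any other context-taking method; the workorder just rides along.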


Anyway, that is my notion.  Flame away. 



--
	Monsyne M. Dragon
	OpenStack/Nova 
	cell 210-441-0965
	work x 5014190

This email may include confidential information. If you received it in error, please delete it.


