openstack team mailing list archive

Re: DIstributed Scheduler blueprint update

To: Ed Leafe <ed@xxxxxxxxx>
From: Justin Santa Barbara <justin@xxxxxxxxxxxx>
Date: Thu, 10 Mar 2011 10:25:50 -0800
Cc: Openstack <openstack@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <847E2F37-64A5-4436-B90E-78CBC3541FE1@leafe.com>
I have implemented a (single-node) constraint-based / rules-based scheduler
that attempts to find a "good" solution to potentially conflicting rules.  I
used it to implement eday's "openstack:location=machine1.rack1.room1.dfw"
type pragma that we discussed in the past.  I think this could be helpful for
what you're describing here, so I encourage you to check it out:
https://code.launchpad.net/~justin-fathomdb/nova/constraint-scheduler

(I broke the unit tests while implementing the directed-location constraint
in the derived branch; I'm going to fix that today.)

As for the distributed scheduling approach, I like it.  I'd like to focus
first on the conceptual approach:

   - A scheduler receives an "allocation" request.
   - It evaluates it against all local providers, giving each one a "score"
   - It collects the responses from recursively sending the request to any
   child schedulers
   - It aggregates all these responses and selects the highest scoring node
   - It sends a "go-with-allocate" to the appropriate child scheduler (or
   does it locally)
   - If the selected node is no longer available, we start again
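
To make the steps above concrete, here is a minimal Python sketch of the
"brute force" flow.  It is not code from my branch or from nova: the
Scheduler class, the slot-based score and the "path" used to route the
go-with-allocate are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Scheduler:
    """Hypothetical scheduler node: local providers plus child schedulers."""
    name: str
    providers: Dict[str, int] = field(default_factory=dict)  # provider -> free slots
    children: List["Scheduler"] = field(default_factory=list)

    def score(self, provider: str, request: dict) -> Optional[int]:
        # Assumed scoring rule: spare capacity left after the allocation.
        free = self.providers[provider]
        return free - request["slots"] if free >= request["slots"] else None

    def evaluate(self, request: dict) -> List[Tuple[int, List[str]]]:
        """Score every local provider, then recurse into every child."""
        results = []
        for provider in self.providers:
            s = self.score(provider, request)
            if s is not None:
                results.append((s, [self.name, provider]))
        for child in self.children:
            for s, path in child.evaluate(request):
                results.append((s, [self.name] + path))
        return results

    def commit(self, path: List[str], request: dict) -> bool:
        """The "go-with-allocate" step; False if the node is no longer available."""
        rest = path[1:]
        if len(rest) == 1:  # the winner is one of our own providers
            provider = rest[0]
            if self.providers.get(provider, 0) < request["slots"]:
                return False
            self.providers[provider] -= request["slots"]
            return True
        child = next(c for c in self.children if c.name == rest[0])
        return child.commit(rest, request)

    def allocate(self, request: dict, max_attempts: int = 3) -> Optional[List[str]]:
        for _ in range(max_attempts):  # "we start again" if the commit fails
            candidates = self.evaluate(request)
            if not candidates:
                return None
            _, best_path = max(candidates, key=lambda c: c[0])
            if self.commit(best_path, request):
                return best_path
        return None


rack = Scheduler("rack1", {"h1": 2, "h2": 8})
root = Scheduler("root", {"h0": 4}, [rack])
chosen = root.allocate({"slots": 2})
```

In this sketch the parent only needs the list of its children; everything
else is discovered by forwarding the request down the tree.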


Now, there are several optimizations:

   - We can use a "clever" solver such that we don't need to evaluate
   against every local node (this is in my branch)
   - We may only return the top N solutions to minimize the data we pass
   around
   - We can try to limit the number of child schedulers to whom we forward
   the request:
      - We may use zones information to rule out a child entirely
      - We may use other static information to rule out a child (e.g. a
      particular child might not be "HIPAA compliant")
      - We may choose to do this heuristically, for example, sending to a
      first child, and then considering whether we have found a 'good enough'
      response to stop polling children
      - We may choose to forward to the "most likely candidate" child
      schedulers first, to make the previous optimization more valid
   - We may use a threshold search, i.e. send a first request with a "perfect
   matches only" criterion, and then gradually repeat with more and more
   relaxed criteria
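
A rough Python sketch of three of these optimizations (top-N results,
stopping once a response is 'good enough', and the threshold search) might
look like the following.  The function names, the numeric scores and the way
a child is modelled as a plain callable are all hypothetical; a real child
would be another scheduler answering a request.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Candidate = Tuple[float, str]                  # (score, node name)
Child = Callable[[float], List[Candidate]]     # takes a minimum score


def make_child(nodes: List[Candidate]) -> Child:
    """Stand-in for a child scheduler: returns its nodes scoring >= min_score."""
    return lambda min_score: [c for c in nodes if c[0] >= min_score]


def top_n(candidates: List[Candidate], n: int) -> List[Candidate]:
    # Only pass the best n solutions up the tree, highest score first.
    return heapq.nlargest(n, candidates, key=lambda c: c[0])


def poll_children(children: Sequence[Child], min_score: float,
                  good_enough: float, n: int = 3) -> Tuple[List[Candidate], int]:
    """Poll children in order and stop once the best result is good enough."""
    found: List[Candidate] = []
    polled = 0
    for child in children:
        polled += 1
        found = top_n(found + child(min_score), n)
        if found and found[0][0] >= good_enough:
            break
    return found, polled


def threshold_search(children: Sequence[Child], thresholds: Sequence[float],
                     n: int = 3) -> List[Candidate]:
    """Start with the strictest criterion and relax it until something matches."""
    for min_score in thresholds:
        found, _ = poll_children(children, min_score, good_enough=min_score, n=n)
        if found:
            return found
    return []


a = make_child([(0.4, "a1"), (0.9, "a2")])
b = make_child([(0.7, "b1")])
early, polled = poll_children([a, b], min_score=0.0, good_enough=0.8)
relaxed = threshold_search([b, a], thresholds=[1.0, 0.8, 0.5])
```

Ordering the children so that the "most likely candidate" comes first is
what makes the early exit in poll_children pay off.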

All standard CS stuff.  However, we should not let the existence of the
optimizations get in the way of implementing a "brute force" implementation
first, where every request is fully evaluated on every provider in the
entire scheduler tree.  For Cactus-size deployments, this will still be more
than fast enough, and we should be able to get it merged in time.  When we
launch an instance we're probably going to be copying a gigabyte of data
around, so these optimizations really aren't too important in that light.
This also structures our development - the optimizations can be implemented
in separate manageable patches.

This also helps me think about what data the parent schedulers need about
their children.  In the "brute force" implementation schedulers only need
the list of their children.  If we want to start to do filtering of
requests, schedulers need appropriate static metadata (child zone
information, HIPAA compliance).  Dynamic information (e.g. real-time
availability) may be used to intelligently order the child requests, but it
shouldn't matter if this dynamic information is out of date, because we're
already bailing out when something is 'good enough' and not really looking
for the 'optimal node'.  Out-of-date dynamic information will reduce
efficiency (I may poll the wrong child first) but should not affect
correctness.  But for Cactus, I just need the list of my children.
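
As an illustration of the static/dynamic split, here is a small Python
sketch.  The ChildInfo fields, the "hipaa" tag and the actually_free
dictionary (which stands in for really asking each child) are assumptions
for the example, not an existing nova data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class ChildInfo:
    """What a parent might know about one child scheduler (hypothetical)."""
    name: str
    zones: Set[str]            # static metadata
    tags: Set[str]             # static metadata, e.g. {"hipaa"}
    last_known_free: int = 0   # dynamic hint; may be out of date


def eligible(children: List[ChildInfo], zone: Optional[str],
             required_tags: Set[str]) -> List[ChildInfo]:
    # Static information rules a child out entirely.
    return [c for c in children
            if (zone is None or zone in c.zones) and required_tags <= c.tags]


def polling_order(children: List[ChildInfo]) -> List[ChildInfo]:
    # Dynamic information only decides who gets asked first.
    return sorted(children, key=lambda c: c.last_known_free, reverse=True)


def find_child(children: List[ChildInfo], zone: Optional[str],
               required_tags: Set[str], actually_free: Dict[str, int],
               needed: int) -> Tuple[Optional[str], List[str]]:
    """Return the first child that can really satisfy the request."""
    polled = []
    for child in polling_order(eligible(children, zone, required_tags)):
        polled.append(child.name)
        if actually_free[child.name] >= needed:  # the child's real answer
            return child.name, polled
    return None, polled


children = [
    ChildInfo("east", {"dfw"}, {"hipaa"}, last_known_free=10),  # stale hint
    ChildInfo("west", {"dfw"}, {"hipaa"}, last_known_free=2),
    ChildInfo("public", {"dfw"}, set(), last_known_free=50),
]
real = {"east": 0, "west": 5, "public": 50}
winner, polled = find_child(children, "dfw", {"hipaa"}, real, needed=3)
```

Here the stale hint for "east" costs one wasted poll, but the request still
lands on a child that can serve it, and "public" is never asked because the
static metadata excludes it.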

What I would really like to see is the ability to use the scheduler to
combine clouds not under the same control.  For example, a private cloud
could burst onto one or more public clouds; all under the control of a local
scheduler.  This needs a few things:

   1. The schedulers should communicate with each other over HTTP, and can't
   really use the message queues because of the tight coupling needed
   2. The public API interface should expose the same HTTP interface, so
   that it can be used as a child scheduler
   3. We obviously can't rely on a centralized database

(I don't really understand where the need (or desire) for a centralized
database comes from?)

Of course, we won't get to my multi-cloud dream in Cactus, because we have
to discuss it and not just implement it.  Nonetheless, I see this approach
as (1) similar to what you're suggesting, (2) simplifying the coding work,
and (3) taking us to a great place.


Justin