Scaling thoughts

Hi All,

Sorry for the delay in sending anything here.  Things have been quite busy. :)  I've been having some internal discussions about this... but it's time to get the ball rolling here!

Sandy, thanks for sending the note a couple of days ago about ETL.  I'm going to talk a bit about that and some other thoughts below.  I appreciate any feedback people may have on these thoughts.

Zone communication:
-----------------------------
Today, all zone communication is done via the OS API, and it is all parent to child.  This has worked fine so far, but I feel we need a bit more flexibility, and I think we can avoid some of the headaches the API has caused by moving away from it.  For a lot of people, zone communication is going to happen in a 'trusted' environment.  I know the federated/bursting model is important to the community, but I'm going to set it aside for a moment.  I feel we need to focus on getting the 'trusted' environment working well first.

First, the headaches I mentioned with using the OS API:

1) Unneeded re-authentication in child zones
     a) A pain because of keystone changes
     b) Extra latency and a waste of cycles to re-authenticate in a child when the parent has already done so.
2) Using novaclient within nova
3) Extensions are needed to support methods that don't exist in the main API
4) Responses to queries are already formatted for an API user, not for internal use.
     a) We have a few '_is_precooked' hacks because of this, so the parent API doesn't re-format things.
     b) This also means if we do a 'get' on an instance, we don't get the full Instance model.  We're limited by what the OS API spec says to return.
     c) Error handling is limited by HTTP error codes and strings.
5) Pushing information from child zones to parent zones means child zones will need to know parent API endpoints and credentials.
6) Scaling child zones requires a pair of load balancers in front of a number of API nodes.

A logical alternative in a trusted environment is to use AMQP for inter-zone communication like we do for intra-zone communication.  Most of the above problems go away:

1) Re-authentication of nova credentials is no longer needed.  (However, AMQP does have its own authentication.)
2) No need for novaclient.  We will need to modify our RPC code to support multiple AMQP servers, though (see the sketch after this list).
3) New functionality is as simple as defining a new method on the RPC proxy object, just like with intra-zone communication.
4) Data can be returned natively and we can define what we need to return.
5) Children can push events into queues for their parents.
6) Scaling child zones is as simple as setting up more queue workers.

Additionally, queues in general would come in handy for asynchronous events.
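
To make #2 a bit more concrete, below is a minimal sketch of what a 'cast' to a specific child zone's AMQP server could look like.  To be clear, this is not nova's rpc module; the broker URLs, zone names, topic, and the cast_to_zone helper are all invented for illustration, and I'm using kombu only to keep the example self-contained:

```python
# Hypothetical sketch only -- nova's rpc code would need analogous changes.
from kombu import Connection, Exchange, Producer

# One AMQP server per child zone; the servers stay completely
# independent (no rabbit clustering).
CHILD_ZONE_BROKERS = {
    'zone-a': 'amqp://guest:guest@zone-a-rabbit:5672//',
    'zone-b': 'amqp://guest:guest@zone-b-rabbit:5672//',
}


def cast_to_zone(zone, topic, method, **kwargs):
    """Fire-and-forget a message onto a specific child zone's broker."""
    with Connection(CHILD_ZONE_BROKERS[zone]) as conn:
        producer = Producer(conn.channel(),
                            exchange=Exchange('nova', type='topic'),
                            routing_key=topic)
        producer.publish({'method': method, 'args': kwargs},
                         serializer='json')


# e.g.: cast_to_zone('zone-a', 'scheduler', 'run_instance',
#                    instance_uuid='fake-uuid')
```

A child could push events up to its parent the same way, just pointed at the parent's broker instead (#5 above).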

I'll wait for some feedback on this before I post some thoughts on what the queues look like, which AMQP servers I think the queues should live in, what is listening, etc.  I would like to move on this stuff pretty quickly, though, and get a blueprint created for the work.  But I do want to make it clear I'm not talking about clustering rabbit servers or anything here.  Rabbit servers would still remain completely independent.

Instance caching:
------------------------
As Sandy mentioned in his email a couple of days ago, we currently have a shared-nothing architecture.  This doesn't scale, at least not in a very large, completely trusted environment.  The 'nova list' problem is a big example.  We need to have a cache of instances from all zones at the API endpoint.

Just listing a few important requirements that come to mind:

1) 'nova list' should be fairly fast.  (I realize this is subjective, but querying all zones in parallel for instances is not 'fairly fast')
2) 'nova list' should always return all instances consistently, even if a child zone is down.  Stale 'status' is okay, however.
3) Doing a GET on an instance should function properly immediately after creating it.
4) We need to quickly be able to know which zone an instance is in.

ETL as a solution is interesting, but I have a lot of reservations about it.  Sandy raised a number of the concerns in his email, so I won't recount them here.  In addition to those, it also feels like slight overkill and 'beastly'.  It alone doesn't give us a complete solution, either; e.g., requirement #3 above will fail because of the 'lag time'.

We have some stuff to figure out here.  Random thoughts:

1) The OS API says we have to return a UUID and root password for a build.  Today, the API blocks while waiting for the scheduler to pick a child zone and create the Instance record.  We need to remove this blocking.  That means we need to generate the UUID and root password up at the top and pass them to the child zones.  The top level also needs to create a cache entry for the instance... but then it can return immediately after casting the build request to the scheduler.  This satisfies requirements #1 through #3 above (see the sketch after this list).
2) Most other things can be pushed up from child zones to parents.  Events can be generated on:
    a) Zone and host being picked for an instance
    b) State changes for an instance
3) Need to think about out-of-sync issues.
     a) We may want hosts to periodically push up information about instances, even if their state hasn't changed.
     b) A "GET" on an instance could use the cache, but fall back to querying the zone it's in if the last known state is too old.  The cache can be updated "on the way back out".
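
Here's a toy sketch tying 1) and 3b) together.  Everything in it (the in-memory dict, the stub helpers, the 60-second staleness threshold) is made up to illustrate the flow, not proposed code:

```python
import time
import uuid

STALE_SECONDS = 60   # made-up TTL before we re-query the owning zone
cache = {}           # instance uuid -> {'status', 'zone', 'updated_at'}


def generate_password():
    return uuid.uuid4().hex[:12]               # stand-in password generator


def cast_to_scheduler(method, **kwargs):
    pass                                       # stand-in for an async AMQP cast


def query_zone_for_instance(zone, instance_uuid):
    return {'status': 'active', 'zone': zone}  # stand-in direct zone query


def create_instance(build_request):
    """Non-blocking build: generate the uuid/password up top, create a
    cache entry, then return right after casting to the scheduler."""
    instance_uuid = str(uuid.uuid4())
    root_password = generate_password()
    cache[instance_uuid] = {'status': 'building', 'zone': None,
                            'updated_at': time.time()}
    cast_to_scheduler('run_instance', instance_uuid=instance_uuid,
                      root_password=root_password, request=build_request)
    return instance_uuid, root_password        # a GET works immediately (#3)


def get_instance(instance_uuid):
    """GET hits the cache; if the entry is too old and we know the zone,
    fall back to the zone and refresh the cache 'on the way back out'."""
    entry = cache[instance_uuid]
    stale = time.time() - entry['updated_at'] > STALE_SECONDS
    if stale and entry['zone']:
        entry = query_zone_for_instance(entry['zone'], instance_uuid)
        entry['updated_at'] = time.time()
        cache[instance_uuid] = entry
    return entry
```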

I think we can come to a relatively simple solution that works by tweaking the above a bit, avoiding the complexities of ETL.  Even with ETL, we'd need part of the above, anyway (to solve requirement #3).

Sandy, Jesse, and I are meeting in person in a couple of weeks to try to nail down a real plan to present.  I would love some feedback before then.

Scheduling:
----------------
This relates to caching above.  Right now for a build, we query all zones in parallel and get costs for all hosts, and the top-level scheduler ends up picking the host.  We need to eliminate these parallel queries to all zones.  I don't feel that the top level should actually pick the host in a child zone; the top level should have enough information to route the request to a specific zone only.  That zone can then determine which host to put the instance on.  A lot of this can probably be solved by pushing events up from child zones, like I mention above under caching.  I think the minimum the top level needs to know is some general idea of capacity in each zone.  Rackspace does have additional requirements based on where an instance for a customer was last placed and so forth, however.  This is another topic up for discussion, and I still think ETL is overkill here.

With or without ETL, there are going to be race conditions.  The top level will never have a consistent view of its children without locking around everything... and that's not an option for a scalable solution.  Orchestration has to get tied in here when that's resolved; we're going to need retries on some level.  I think we can ignore the race conditions for now, as most should really only surface when a zone gets low on capacity.  I don't have a lot more detail to add right now... but it's another thing I'd like to get nailed down in a couple of weeks.
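
Here's a toy sketch of that split, reusing the made-up cast_to_zone helper and zone names from the zone-communication sketch above; the capacity numbers and the 'free slots' metric are equally invented:

```python
# zone_capacity would be maintained by events pushed up from child zones.
zone_capacity = {'zone-a': 120, 'zone-b': 40}


def pick_zone(needed=1):
    """The top level only routes to a zone; it never picks a host."""
    candidates = [z for z, free in zone_capacity.items() if free >= needed]
    if not candidates:
        raise RuntimeError('no child zone with capacity')
    # This view is only eventually consistent, so races are possible when
    # capacity runs low -- orchestration would drive retries.
    return max(candidates, key=zone_capacity.get)


def schedule_build(build_request):
    zone = pick_zone()
    # Async cast to that zone's scheduler, which picks the actual host.
    cast_to_zone(zone, 'scheduler', 'run_instance', request=build_request)
    return zone
```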

Federation/Bursting:
----------------------------
So, I'll come back to this briefly.  AMQP for communication will not be an option here, as one wouldn't want to open up their rabbit server(s) to outsiders.  We could still continue to use the OS API for communication, but I don't really like it due to the 'headaches' I list above.  My initial thought is that it might be better to create a separate REST service that has an 'internal' API.  When thinking about AMQP, I've been debating whether or not a zone-manager service makes sense, instead of communicating directly from a parent zone to a child zone's scheduler.  If we were to create a zone-manager type service, it could listen on AMQP as well as have a REST interface.  (I suppose the scheduler could have a REST interface, too, but it feels slightly cleaner to have this as its own service.)   Thoughts on the OS API vs. these other ideas would be welcomed. :)
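
Just to make the zone-manager idea slightly more concrete, here's a bare-bones sketch of what one 'internal' REST endpoint on such a service might look like.  The route, payload, and port are pure invention:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class ZoneManagerHandler(BaseHTTPRequestHandler):
    """Hypothetical 'internal' API: the same capacity info a trusted
    deployment would push over AMQP, pulled over REST by the parent."""

    def do_GET(self):
        if self.path == '/internal/capacity':
            body = json.dumps({'free_slots': 120}).encode()
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()


if __name__ == '__main__':
    HTTPServer(('0.0.0.0', 8080), ZoneManagerHandler).serve_forever()
```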

Caching becomes a problem with federation, as I'm not sure how we could push events up.  ETL isn't a solution, either.  I think we'll have to rely on "on the way back out" type caching... and TTLs on the cache to force a query periodically.  A 'nova list' in this situation is kinda scary.  I don't think it can ever be as efficient as in a trusted environment.

But as I said above... I think we should focus on the trusted environment first.  I don't want to completely forget about federation, but if we can't get trusted working efficiently, federation becomes kind of pointless. :)

Other:
--------
We have a requirement that instance_type IDs are the same in all zones.  We need to figure out something here (a quick sanity check is sketched below).  Nova scaling also has to tie in with glance scaling.  I think glance scaling is beyond our scope here, but... every zone does need to talk to glance servers that have the same image hrefs for images.
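As a trivial illustration of the instance_type requirement, a deploy-time sanity check could be as simple as the following; fetch_instance_types is a made-up helper that would ask each zone for its id-to-name map:

```python
def fetch_instance_types(zone):
    # Stand-in: would really query the zone (e.g. over AMQP).
    return {1: 'm1.tiny', 2: 'm1.small'}


def check_instance_type_ids(zones):
    """Fail loudly if any zone's instance_type IDs disagree."""
    reference = None
    for zone in zones:
        types = fetch_instance_types(zone)
        if reference is None:
            reference = types
        elif types != reference:
            raise RuntimeError('instance_type IDs differ in %s' % zone)
```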

Anyway, that's it for now.  I probably missed some things, and more of my own thoughts will come out later.  If you made it to the end, congratulations!  :)  Let's hear your thoughts...

- Chris