Re: Data architecture


 

We started with an architecture similar to this.  I think it definitely has
merit, but I don't see any real advantages over a centralized data store.
 Can you be more specific about your concerns with response time and high
availability?  There is no reason you can't cache data locally in either
scenario.  It seems this method just shifts availability concerns from the
datastore to the queue and the host machines.

What happens when a host dies?  There is no way to get any information about
the instances that were on that host unless you have it in cache.  You
clearly need some kind of out-of-process storage for data, so why have 1
million tiny databases connected by a complex message passing system, when
you can put them together and have the advantage of querying across
relations?
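
For example, recovering what was running on a dead host is a single
join when everything lives in one database (the schema here is just
for illustration, not a proposal for the real tables):

import sqlite3

db = sqlite3.connect(':memory:')
db.executescript("""
    CREATE TABLE hosts (id INTEGER PRIMARY KEY, state TEXT);
    CREATE TABLE instances (id INTEGER PRIMARY KEY, host_id INTEGER,
                            account TEXT, ip TEXT);
""")

# One join answers "what was on the host that just died?" with no
# per-host caches or message passing needed to reconstruct it.
dead = db.execute("""
    SELECT instances.id, instances.account, instances.ip
      FROM instances JOIN hosts ON instances.host_id = hosts.id
     WHERE hosts.state = 'dead'
""").fetchall()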

I like the centralized datastore because it is really simple.  It seems like
a relational database is a good fit for the amount of data we are expecting,
even with the 1,000,000 host machines and 60,000,000 instances we are
hoping to
support.  Redis is really fun to use, and the atomic operations with sets
are beautiful, but I think relational is the way to go for this problem set.

Vish

On Tue, Aug 3, 2010 at 6:31 PM, Eric Day <eday@xxxxxxxxxxxx> wrote:

> Hi everyone,
>
> There was quite a bit of discussion on IRC today about switching from
> Redis to a SQL based database, and which ORM to choose. Before
> going down that path, I think we need to take a look at the overall
> architecture we're moving towards, identify our data storage needs,
> and then choose based on those requirements.
>
> I'm not convinced that the shared data storage solution that exists
> today with Redis (or any other back-end) is optimal. Here is the
> architecture as it exists today, minus a few details:
>
> http://oddments.org/tmp/architecture_old.png
>
> The API servers poke into the data store to help answer some queries,
> along with the compute workers and volume workers. I know Vish is
> reworking the network services, and I'm guessing that will be using
> the data store as well.
>
> My concern here is that while Redis (or other database solutions)
> can scale quite easily for our data set size, there are still response
> time and HA concerns. You can use local caching to help response time
> and run Redis slaves with promotion for HA, but this still seems a
> bit clunky. One of our design tenets is shared-nothing, but right
> now the data store is shared everywhere.
>
> Below is a proposal that came after many discussions with various
> folks starting in Austin a few weeks ago. Paul Voccio and I drafted
> up a few of these ideas on paper during OSCON. The idea is to let the
> host machine be the canonical store for all data, and any information
> the API layer (or new scheduler layer) needs can be pushed up when
> changed. The new architecture would look something like this:
>
> http://oddments.org/tmp/architecture_new.png
>
> Beyond the data changes, this diagram also introduces the scheduler,
> pluggable image service, and a concept of a cluster. None of these
> are required for the data model change but are included for now.
>
> A cluster is just a grouping mainly for networking purposes. For
> example, all hosts in a cluster may be on the same switch or
> in the same subnet. Large installations under some network
> configurations need this level of grouping. While the message
> queue for each cluster is displayed as being separate, this is not
> required. For small installations you could use a single message
> queue for everything. The separation is there if needed because you
> may have enough hosts (each of which translates to at least one
> queue topic) to overload a single message queue.
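>
> To make that concrete, the per-host topic could be as simple as
> something like this (names are made up, not a real proposal):
>
> def host_topic(cluster, host):
>     # One topic per host; splitting queues per cluster keeps a
>     # single broker from carrying every host topic in a large
>     # installation.
>     return 'cluster.%s.host.%s' % (cluster, host)
>
> # Small installs can ignore the cluster part and run one queue.
> print(host_topic('dfw1', 'compute-0042'))
> # prints: cluster.dfw1.host.compute-0042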
>
> The scheduler is introduced here and can coordinate requests between
> the API servers and the different clusters. This is especially
> important for build requests where you need to coordinate a good fit
> for new images based on a number of parameters. The image worker is
> pluggable and could be provided by the local filesystem, an object
> store (like swift), or anything else. The host, volume, and network
> workers are almost identical to what exists in the current system.
>
> Getting back to the data storage model, note that there is no
> shared storage between any of the services. Each host worker would
> maintain all information needed for each guest it is running, either
> in the hypervisor or in another local data store. This data can
> be backed up to another service (like redis, postgresql, mysql,
> ...) but the host will be the only service accessing it. The
> backup data store could of course be shared between many hosts as well.
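>
> A rough sketch of what a host worker might look like (the class and
> the queue interface here are invented for illustration, this is not
> existing Nova code):
>
> import json
> import time
>
> class HostWorker(object):
>     """Canonical store for the guests running on this host."""
>
>     def __init__(self, host_id, queue):
>         self.host_id = host_id
>         self.queue = queue   # assumed publish(topic, message) API
>         self.guests = {}     # guest_id -> attributes, host-local
>
>     def update_guest(self, guest_id, **attrs):
>         self.guests.setdefault(guest_id, {}).update(attrs)
>         # Push the change upstream so the scheduler/API layer can
>         # refresh its in-memory copy; no shared database involved.
>         event = {'guest_id': guest_id, 'changes': attrs,
>                  'timestamp': time.time()}
>         self.queue.publish('host.%s.events' % self.host_id,
>                            json.dumps(event))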
>
> The scheduler (and possibly the API servers) keep a subset of the host
> data stored in memory so they can answer requests or route requests to
> the appropriate host quickly. These processes can keep snapshots of
> this data so it doesn't need to pull the entire data set on restart,
> and for any new changes it can requests events since the last known
> timestamp. This does add some complexity to the data model since
> we'll need to push events up from the hosts based on timestamps,
> but it doesn't rely on any shared storage.
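>
> The catch-up could look something like this (again just a sketch,
> the snapshot and event interfaces are hypothetical):
>
> class SchedulerCache(object):
>     """In-memory subset of host data: snapshot plus newer events."""
>
>     def __init__(self, snapshot_store):
>         self.snapshot_store = snapshot_store
>         self.hosts = {}       # host_id -> last known attributes
>         self.last_seen = 0.0  # timestamp of newest event applied
>
>     def restore(self):
>         # Load the last snapshot instead of the full data set.
>         self.hosts, self.last_seen = self.snapshot_store.load()
>
>     def apply(self, event):
>         host = self.hosts.setdefault(event['host_id'], {})
>         host.update(event['changes'])
>         self.last_seen = max(self.last_seen, event['timestamp'])
>
>     def catch_up(self, event_source):
>         # Only ask for what changed since the last known timestamp.
>         for event in event_source.events_since(self.last_seen):
>             self.apply(event)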
>
> Other services like network and volume can also maintain their own
> data stores and process requests much like the hosts do. For example,
> when a new guest instance is being provisioned, it will hit the network
> workers for IP assignment before reaching the host to be provisioned.
>
> Here is the flow of a guest creation request (a sketch of the
> message it carries follows the list):
>
> * The HTTP request is verified and converted to an internal message
>  format. This is just a set of key/value pairs representing options
>  for the request (image, size, ...). Once the message is constructed
>  it is sent to the image worker.
>
> * The image worker looks at image options within the request and
>  verifies the image exists and can be sent to a host when needed. It
>  inserts any image options into the request that will be needed
>  later (streaming URL, ...). The modified message is then sent to
>  the scheduler worker.
>
> * The scheduler finds a cluster and host for the request based on
>  any options provided (locality to other hosts, network requirements,
>  ...) and the current state of the system (like which hosts have
>  capacity). Once a host is chosen, the message is put on the queue
>  for the network service worker.
>
> * The network worker will assign an IP address for the new instance,
>  put this into the message, and insert the message back into the
>  queue for the host that was selected in the scheduler.
>
> * The host machine picks up the message and initiates the build. This
>  involves streaming the image from the image service and then getting
>  the image running under the hypervisor. During each step the host
>  worker can send status updates upstream through the message queue
>  so the other layers (like the scheduler) are aware of the progress.
>
> * The HTTP response can happen at any layer depending on what the
>  API wants. If it wants to block until the IP is assigned, a message
>  can be sent back to the API server once the network worker completes
>  this step. It is also possible for the API server or scheduler
>  to assign a URN right away for a non-blocking request, and this
>  URN could be queried later to find out things like the IP. Once we
>  support webhooks or PuSH, notifications could also be sent for this
>  URN to those listening for it.
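>
> To tie the steps above together, the message might grow like this
> as it moves through the workers (key names are invented for the
> example):
>
> # After the API server validates the HTTP request:
> request = {'account': 'acct-123', 'image': 'ubuntu-10.04',
>            'size': '2GB'}
>
> # The image worker adds what the host will need later:
> request['image_url'] = 'http://images.example/ubuntu-10.04'
>
> # The scheduler picks a destination from its view of host state:
> request['cluster'] = 'dfw1'
> request['host'] = 'compute-0042'
>
> # The network worker assigns the address before the host builds:
> request['ip'] = '10.0.5.17'
>
> # The chosen host picks the message up from its queue topic,
> # streams the image, boots the guest, and sends status updates
> # back upstream as it goes.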
>
> Requests such as listing all guests for an account could be answered at
> the API or scheduler layer using the data that is kept there. Requests
> such as reboot would make it to the host via the scheduler by using
> this same subset of information.
>
> This architecture still needs some thought put into it, but it
> represents a shift away from using shared data storage. It does require
> a bit of an architecture change from what currently exists in Nova, but
> like other discussions going on right now, we need to think about what
> the long term will require with both scalability and reliability. We
> can come up with a plan to gradually move things in this direction,
> but we do need to figure out what type of architecture we are shooting
> for in the end.
>
> Wow, this is way longer than I thought it would be. Thoughts and
> feedback much appreciated. :)
>
> -Eric
>
> _______________________________________________
> Mailing list: https://launchpad.net/~nova
> Post to     : nova@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~nova
> More help   : https://help.launchpad.net/ListHelp
>
