nova team mailing list archive

Re: Data architecture

To: Vishvananda Ishaya <vishvananda@xxxxxxxxx>
From: Eric Day <eday@xxxxxxxxxxxx>
Date: Tue, 3 Aug 2010 20:55:17 -0700
Cc: nova@xxxxxxxxxxxxxxxxxxx
In-reply-to: <AANLkTi=MvP2DHN94x0zqsNvxuO=1SuuyMoz6+PMKtMdH@mail.gmail.com>
User-agent: Mutt/1.5.20 (2009-06-14)

Hi Vish!

On Tue, Aug 03, 2010 at 08:11:23PM -0700, Vishvananda Ishaya wrote:
>    We started with an architecture similar to this.  I think it definitely
>    has merit, but I don't see any real advantages over a centralized data
>    store.  Can you be more specific about your concerns with response time
>    and high availability?  There is no reason you can't cache data locally in
>    either scenario.  It seems this method just shifts availability concerns
>    from the datastore to the queue and the host machines.

Well, we are always going to have availability concerns around the
message queue; we won't get away from those. A centralized database
is another point of failure: if it goes down, the entire system is
down. On the other hand, if a host worker goes down, only the ops on
that host go down (which would not matter anyway since the VMs would
probably be down too). If we have 1M hosts + other APIs/workers all
hitting the data store, it's going to get quite busy. Assuming there
are cached/persistent connections that are read-write, that's a lot
for one server to handle. We would need to split this up with sharding,
but this introduces more maintenance overhead.

>    What happens when a host dies?  There is no way to get any information
>    about the instances that were on that host unless you have it in cache.
>    You clearly need some kind of out-of-process storage for data, so why
>    have 1 million tiny databases connected by a complex message passing
>    system, when you can put them together and have the advantage of querying
>    across relations?

A host can optionally back up data to an external store as I mentioned
in the original email for disaster recovery. Of course if a host
crashes hard with all data, we've lost the VMs too, so probably not
very useful to get the metadata back. If folks use VM backups, this
should contain image + metadata so it could be rebuilt without the
host if needed.

As far as querying over relations of the entire dataset, I don't think
this should be part of the system since these queries will be very
heavy. :)  I think it would be better to query the replicated subsets
of data, or if a full analysis is needed, support query/aggregation
to all hosts in a map/reduce-like fashion.
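
A rough sketch of what that map/reduce-style aggregation could look
like, in Python. Everything here is hypothetical (the HOSTS data, the
field names, and the function names are made up for illustration), and
the "map" step is called in-process where a real system would send it
to each host over the message queue:

```python
from collections import Counter
from functools import reduce

# Hypothetical per-host data: each host only knows about its own guests.
HOSTS = {
    "host-a": [
        {"id": "i-1", "account": "alice", "state": "running"},
        {"id": "i-2", "account": "bob", "state": "stopped"},
    ],
    "host-b": [
        {"id": "i-3", "account": "alice", "state": "running"},
    ],
    "host-c": [],
}


def map_host(instances):
    """Map step, run on one host: count running instances per account."""
    return Counter(i["account"] for i in instances if i["state"] == "running")


def reduce_counts(left, right):
    """Reduce step, run at the aggregator: merge two partial results."""
    return left + right


def running_per_account(hosts):
    """Fan the query out to every host and fold the partial counts together."""
    partials = (map_host(instances) for instances in hosts.values())
    return reduce(reduce_counts, partials, Counter())


result = running_per_account(HOSTS)
print(dict(result))
```

Each host only scans its own small data set, and the aggregator only
sees the partial counts, never the full records.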

>    I like the centralized datastore because it is really simple.  It seems
>    like a relational database is a good fit for the amount of data we are
>    expecting, even with the 1,000,000 host machine 60,000,000 instances we
>    are hoping to support.  Redis is really fun to use, and the atomic
>    operations with sets are beautiful, but I think relational is the way to
>    go for this problem set.

I agree, the centralized data store is simple, and I keep going back
and forth. It removes the need for some of this complexity, but it's
a big point of failure and congestion. We could still use relational
for the architecture I'm talking about; it would just be one shard per
host, with no access except from the host (I realize this isn't as
useful as the centralized or larger shard route).

Thinking about it, if we did have 1M hosts and sharded at the cluster level,
the scheduler workers would need to be aware of the message queue and
database for each cluster. You probably wouldn't want the scheduler
querying the databases on every request, you would want to store
system information in a cache somewhere. We wouldn't want to poll for
changes from the DB, which means we'd want the network/volume/host
workers notifying the schedulers of changes. This starts to look a lot
like the architecture in my original email (just one data store per
cluster rather than one data store per host).
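
To make the "workers notify the schedulers" idea a bit more concrete,
here is a minimal sketch in Python. The SchedulerCache class, the
event fields (host, free_slots, timestamp), and the "most free slots
wins" placement rule are all assumptions for illustration, not
anything that exists in Nova today:

```python
from dataclasses import dataclass, field


@dataclass
class SchedulerCache:
    """In-memory view of host capacity, kept fresh by pushed events."""

    free_slots: dict = field(default_factory=dict)  # host name -> free slots
    updated_at: dict = field(default_factory=dict)  # host name -> last timestamp

    def apply(self, event):
        """Apply one change notification; skip stale or duplicate events."""
        host = event["host"]
        if event["timestamp"] <= self.updated_at.get(host, float("-inf")):
            return False
        self.free_slots[host] = event["free_slots"]
        self.updated_at[host] = event["timestamp"]
        return True

    def pick_host(self):
        """Answer a placement request from the cache, without touching a DB."""
        candidates = {h: n for h, n in self.free_slots.items() if n > 0}
        if not candidates:
            return None
        return max(candidates, key=candidates.get)


cache = SchedulerCache()
cache.apply({"host": "host-a", "free_slots": 2, "timestamp": 10.0})
cache.apply({"host": "host-b", "free_slots": 5, "timestamp": 11.0})
# An older event for host-b arrives late and is ignored.
cache.apply({"host": "host-b", "free_slots": 9, "timestamp": 9.0})
print(cache.pick_host())
```

Timestamps are tracked per host so that a late event from one host
cannot cause a newer event from another host to be dropped.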

What are you thinking the architecture would look like at the 1M
host case?

-Eric

>    Vish
> 
>    On Tue, Aug 3, 2010 at 6:31 PM, Eric Day <eday@xxxxxxxxxxxx> wrote:
> 
>      Hi everyone,
> 
>      There was quite a bit of discussion on IRC today about switching from
>      Redis to a SQL based database, and which ORM to choose. Before
>      going down that path, I think we need to take a look at the overall
>      architecture we're moving towards, identify our data storage needs,
>      and then choose based on those requirements.
> 
>      I'm not convinced that the shared data storage solution that exists
>      today with Redis (or any other back-end) is optimal. Here is the
>      architecture as it exists today, minus a few details:
> 
>      http://oddments.org/tmp/architecture_old.png
> 
>      The API servers poke into the data store to help answer some queries,
>      along with the compute workers and volume workers. I know Vish is
>      reworking the network services, and I'm guessing that will be using
>      the data store as well.
> 
>      My concern here is that while Redis (or other database solutions)
>      can scale quite easily for our data set size, there are still response
>      time and HA concerns. You can use local caching to help response time
>      and run Redis slaves with promotion for HA, but this still seems a
>      bit clunky. One of our design tenets is shared-nothing, but right
>      now the data store is shared everywhere.
> 
>      Below is a proposal that came after many discussions with various
>      folks starting in Austin a few weeks ago. Paul Voccio and I drafted
>      up a few of these ideas on paper during OSCON. The idea is to let the
>      host machine be the canonical store for all data, and any information
>      the API layer (or new scheduler layer) needs can be pushed up when
>      changed. The new architecture would look something like this:
> 
>      http://oddments.org/tmp/architecture_new.png
> 
>      Beyond the data changes, this diagram also introduces the scheduler,
>      pluggable image service, and a concept of a cluster. None of these
>      are required for the data model change but are included for now.
> 
>      A cluster is just a grouping mainly for networking purposes. For
>      example, all hosts in a cluster may be on the same switch or
>      in the same subnet. Large installations under some network
>      configurations need this level of grouping. While the message
>      queue for each cluster is displayed as being separate, this is not
>      required. For small installations you could use a single message
>      queue for everything. The separation is there if needed because you
>      may have enough hosts (which translate to at least one queue topic)
>      to overload a single message queue.
> 
>      The scheduler is introduced here and can coordinate requests between
>      the API servers and the different clusters. This is especially
>      important for build requests where you need to coordinate a good fit
>      for new images based on a number of parameters. The image worker is
>      pluggable and could be provided by the local filesystem, an object
>      store (like swift), or anything else. The host, volume, and network
>      workers are almost identical to what exists in the current system.
> 
>      Getting back to the data storage model, note that there is no
>      shared storage between any of the services. Each host worker would
>      maintain all information needed for each guest it is running either
>      in the hypervisor or through another local data store. This data can
>      be backed up to another service (like redis, postgresql, mysql,
>      ...) but the host will be the only service accessing it. The
>      backup data store could of course be shared between many hosts as well.
> 
>      The scheduler (and possibly the API servers) keep a subset of the host
>      data stored in memory so they can answer requests or route requests to
>      the appropriate host quickly. These processes can keep snapshots of
>      this data so they don't need to pull the entire data set on restart,
>      and for any new changes they can request events since the last known
>      timestamp. This does add some complexity to the data model since
>      we'll need to push events up from the hosts based on timestamps,
>      but it doesn't rely on any shared storage.
> 
>      Other services like network and volume can also maintain their own
>      data stores and process requests much like the hosts do. For example,
>      when a new guest instance is being provisioned, it will hit the network
>      workers for IP assignment before reaching the host to be provisioned.
> 
>      Here is the flow of a guest creation request:
> 
>      * The HTTP request is verified and converted to an internal message
>       format. This is just a set of key/value pairs representing options
>       for the request (image, size, ...). Once the message is constructed
>       it is sent to the image worker.
> 
>      * The image worker looks at image options within the request and
>       verifies the image exists and can be sent to a host when needed. It
>       inserts any image options into the request that will be needed
>       later (streaming URL, ...). The modified message is then sent to
>       the scheduler worker.
> 
>      * The scheduler finds a cluster and host for the request based on
>       any options provided (locality to other hosts, network requirements,
>       ...) and the current state of the system (like which hosts have
>       capacity). Once a host is chosen, the message will be put on the
>       queue for the network service worker.
> 
>      * The network worker will assign an IP address for the new instance,
>       put this into the message, and insert the message back into the
>       queue for the host that was selected in the scheduler.
> 
>      * The host machine picks up the message and initiates the build. This
>       involves streaming the image from the image service and then getting
>       the image running under the hypervisor. During each step the host
>       worker can send status updates upstream through the message queue
>       so the other layers (like the scheduler) are aware of the progress.
> 
>      * The HTTP response can happen at any layer depending on what the
>       API wants. If it wants to block until the IP assignment, a request
>       can be sent back to the API server once the IP worker has completed
>       this step. It is also possible for the API server or scheduler
>       to assign a URN right away for a non-blocking request, and this
>       URN could be queried later to find out things like the IP. Once we
>       support webhooks or PuSH, notifications could also be sent for this
>       URN to those listening for it.
> 
>      Requests such as listing all guests for an account could be answered at
>      the API or scheduler layer using the data that is kept there. Requests
>      such as reboot would make it to the host via the scheduler by using
>      this same subset of information.
> 
>      This architecture still needs some thought put into it, but it
>      represents a shift away from using shared data storage. It does require
>      a bit of an architecture change from what currently exists in Nova, but
>      like other discussions going on right now, we need to think about what
>      the long term will require with both scalability and reliability. We
>      can come up with a plan to gradually move things in this direction,
>      but we do need to figure out what type of architecture we are shooting
>      for in the end.
> 
>      Wow, this is way longer than I thought it would be. Thoughts and
>      feedback much appreciated. :)
> 
>      -Eric