
nova team mailing list archive

Re: Data architecture
Hi Paul,

On Wed, Aug 04, 2010 at 12:51:36PM -0500, Paul Voccio wrote:
>    Eric,
> 
>    In your new arch diagram ( http://oddments.org/tmp/architecture_new.png)
>    isn't the host/guest cache the same thing as the datastore? The scheduler
>    will need to know which cluster queue to send the request to. Is the
>    question then one of datastore/cache performance and availability?

The host/guest cache is the in-memory table holding the data the
scheduler needs about hosts, clusters, and guests so it can answer
some queries itself and route the others. The diagram keeps this
separate from the datastore each host uses, so the scheduler never
needs to touch the hosts' data stores.
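
To make that concrete, here is a rough sketch of the kind of in-memory
table I mean. This is illustrative only; the names and structure are
invented for this email, not taken from the diagram or the Nova tree:

    # Illustrative sketch: the in-memory host/guest cache a scheduler
    # worker might keep. All names here are invented for illustration.
    import time

    class HostGuestCache(object):
        def __init__(self):
            self.hosts = {}   # host_id -> {'cluster': ..., 'capacity': ...}
            self.guests = {}  # guest_id -> {'host_id': ..., 'state': ...}

        def apply_event(self, event):
            """Apply a change event pushed up from a host worker."""
            host = self.hosts.setdefault(event['host_id'], {})
            host.update(event.get('host_data', {}))
            host['last_seen'] = time.time()
            for guest_id, data in event.get('guests', {}).items():
                self.guests.setdefault(guest_id, {}).update(data)

        def route(self, guest_id):
            """Answer: which host's queue does this request go to?"""
            return self.guests[guest_id]['host_id']

The point is that the scheduler answers list/route queries from this
table and only talks to hosts through the queue; it never opens a
connection to a host's local datastore.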

>    I know we want to go as close to 1MM nodes as we can get, but for the .001
>    (or whatever) release should we just solidify on the messaging arch then
>    worry about performance? I know this is your driver, but I wonder if it is
>    a bit early to optimize.

Good points, and I understand we are not going to get there
overnight. I wanted to put this idea out there now and get folks
thinking about how the datastore should be structured for these
target numbers, because decisions made now may be very difficult to
change later. If we're putting work into changing our data model and
switching to an ORM now, it may be better to put that time and effort
into a data model we know we'll need at some point in the future.
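
As a purely illustrative sketch of the direction I mean (the tables
and names are invented here, not a schema proposal), each host would
own a small canonical store plus a timestamped event log:

    # Illustrative sketch: each host is the canonical store for its
    # own guests, and every change is recorded as a timestamped event
    # so schedulers can ask for "events since X" instead of re-pulling
    # everything. All names are invented for this example.
    import sqlite3

    conn = sqlite3.connect('host_local.db')
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS guests (
        guest_id   TEXT PRIMARY KEY,
        state      TEXT NOT NULL,
        ip_address TEXT
    );
    CREATE TABLE IF NOT EXISTS events (
        event_id INTEGER PRIMARY KEY AUTOINCREMENT,
        ts       REAL NOT NULL,  -- for "events since X" queries
        guest_id TEXT NOT NULL,
        change   TEXT NOT NULL   -- serialized key/value delta
    );
    """)

    def events_since(ts):
        """What a scheduler would request after a restart."""
        return conn.execute(
            'SELECT ts, guest_id, change FROM events WHERE ts > ?',
            (ts,)).fetchall()

Whichever backend a host uses, the property I care about is the same:
the host owns its data, and everything upstream is rebuilt from pushed
events rather than shared reads.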

-Eric

> 
>    Pvo
> 
>    On 8/3/10 10:55 PM, "Eric Day" <eday@xxxxxxxxxxxx> wrote:
> 
>      Hi Vish!
> 
>      On Tue, Aug 03, 2010 at 08:11:23PM -0700, Vishvananda Ishaya wrote:
>      >    We started with an architecture similar to this.  I think it
>      >    definitely has merit, but I don't see any real advantages over
>      >    a centralized data store.  Can you be more specific about your
>      >    concerns with response time and high availability?  There is no
>      >    reason you can't cache data locally in either scenario.  It
>      >    seems this method just shifts availability concerns from the
>      >    datastore to the queue and the host machines.
> 
>      Well, we are always going to have availability concerns around the
>      message queue; we won't get away from those. A centralized database
>      is another point of failure: if it goes down, the entire system is
>      down. On the other hand, if a host worker goes down, just some ops on
>      that host go down (which would not matter anyway since the VMs would
>      probably be down too). If we have 1M hosts + other APIs/workers all
>      hitting the data store it's going to get quite busy. Assuming there
>      are cached/persistent connections that are read-write, that's a lot
>      for one server to handle. We would need to split this up with sharding,
>      but this introduces more maintenance overhead.
> 
>      >    What happens when a host dies?  There is no way to get any
>      >    information about the instances that were on that host unless
>      >    you have it in cache.  You clearly need some kind of
>      >    out-of-process storage for data, so why have 1 million tiny
>      >    databases connected by a complex message passing system, when
>      >    you can put them together and have the advantage of querying
>      >    across relations?
> 
>      A host can optionally back up data to an external store, as I
>      mentioned in the original email, for disaster recovery. Of course if
>      a host crashes hard with all data, we've lost the VMs too, so it's
>      probably not very useful to get the metadata back. If folks use VM
>      backups, these should contain image + metadata so a VM could be
>      rebuilt without the host if needed.
> 
>      As far as querying over relations of the entire dataset, I don't think
>      this should be part of the system since these queries will be very
>      heavy. :)  I think it would be better to query the replicated subsets
>      of data, or if a full analysis is needed, support query/aggregation
>      to all hosts in a map/reduce like fashion.
> 
>      >    I like the centralized datastore because it is really simple.
>      >    It seems like a relational database is a good fit for the
>      >    amount of data we are expecting, even with the 1,000,000 host
>      >    machines and 60,000,000 instances we are hoping to support.
>      >    Redis is really fun to use, and the atomic operations with sets
>      >    are beautiful, but I think relational is the way to go for this
>      >    problem set.
> 
>      I agree, the centralized data store is simple and I keep going back
>      and forth. It removes the need for some of this complexity, but it's
>      a big point of failure and congestion. We could still use relational
>      for the architecture I'm talking about, it's just one shard/host and
>      no access except from the host (I realize this isn't as useful as the
>      centralized or larger shard route).
> 
>      Thinking if we did have 1M hosts, and we sharded at the cluster level,
>      the scheduler workers would need to be aware of the message queue and
>      database for each cluster. You probably wouldn't want the scheduler
>      querying the databases on every request; you would want to store
>      system information in a cache somewhere. We wouldn't want to poll for
>      changes from the DB, which means we'd want the network/volume/host
>      workers notifying the schedulers of changes. This starts to look a lot
>      like the architecture in my original email (just data store/cluster
>      rather than data store/host).
> 
>      What are you thinking the architecture would look like at the 1M
>      host case?
> 
>      -Eric
> 
>      >    Vish
>      >
>      >    On Tue, Aug 3, 2010 at 6:31 PM, Eric Day <eday@xxxxxxxxxxxx> wrote:
>      >
>      >      Hi everyone,
>      >
>      >      There was quite a bit of discussion on IRC today about
>      >      switching from Redis to a SQL based database, and which ORM
>      >      to choose. Before going down that path, I think we need to
>      >      take a look at the overall architecture we're moving towards,
>      >      identify our data storage needs, and then choose based on
>      >      those requirements.
>      >
>      >      I'm not convinced that the shared data storage solution that
>      >      exists today with Redis (or any other back-end) is optimal.
>      >      Here is the architecture as it exists today, minus a few
>      >      details:
>      >
>      >      http://oddments.org/tmp/architecture_old.png
>      >
>      >      The API servers poke into the data store to help answer some
>      >      queries, along with the compute workers and volume workers.
>      >      I know Vish is reworking the network services, and I'm
>      >      guessing that will be using the data store as well.
>      >
>      >      My concern here is that while Redis (or other database
>      >      solutions) can scale quite easily for our data set size,
>      >      there are still response time and HA concerns. You can use
>      >      local caching to help response time and run Redis slaves with
>      >      promotion for HA, but this still seems a bit clunky. One of
>      >      our design tenets is shared-nothing, but right now the data
>      >      store is shared everywhere.
>      >
>      >      Below is a proposal that came after many discussions with
>      >      various folks starting in Austin a few weeks ago. Paul Voccio
>      >      and I drafted up a few of these ideas on paper during OSCON.
>      >      The idea is to let the host machine be the canonical store
>      >      for all data, and any information the API layer (or new
>      >      scheduler layer) needs can be pushed up when changed. The new
>      >      architecture would look something like this:
>      >
>      >      http://oddments.org/tmp/architecture_new.png
>      >
>      >      Beyond the data changes, this diagram also introduces the
>      >      scheduler, pluggable image service, and a concept of a
>      >      cluster. None of these are required for the data model change
>      >      but are included for now.
>      >
>      >      A cluster is just a grouping, mainly for networking purposes.
>      >      For example, all hosts in a cluster may be on the same switch
>      >      or in the same subnet. Large installations under some network
>      >      configurations need this level of grouping. While the message
>      >      queue for each cluster is displayed as being separate, this
>      >      is not required. For small installations you could use a
>      >      single message queue for everything. The separation is there
>      >      if needed because you may have enough hosts (each of which
>      >      translates to at least one queue topic) to overload a single
>      >      message queue.
>      >
>      >      The scheduler is introduced here and can coordinate requests
>      >      between the API servers and the different clusters. This is
>      >      especially important for build requests, where you need to
>      >      coordinate a good fit for new images based on a number of
>      >      parameters. The image worker is pluggable and could be
>      >      provided by the local filesystem, an object store (like
>      >      swift), or anything else. The host, volume, and network
>      >      workers are almost identical to what exists in the current
>      >      system.
>      >
>      >      Getting back to the data storage model, note that there is
>      >      no shared storage between any of the services. Each host
>      >      worker would maintain all information needed for each guest
>      >      it is running, either in the hypervisor or through another
>      >      local data store. This data can be backed up to another
>      >      service (like redis, postgresql, mysql, ...) but the host
>      >      will be the only service accessing it. The backup data store
>      >      could of course be shared between many hosts as well.
>      >
>      >      The scheduler (and possibly the API servers) keep a subset
>      >      of the host data stored in memory so they can answer requests
>      >      or route requests to the appropriate host quickly. These
>      >      processes can keep snapshots of this data so they don't need
>      >      to pull the entire data set on restart, and for any new
>      >      changes they can request events since the last known
>      >      timestamp. This does add some complexity to the data model,
>      >      since we'll need to push events up from the hosts based on
>      >      timestamps, but it doesn't rely on any shared storage.
>      >
>      >      Other services like network and volume can also maintain
>      >      their own data stores and process requests much like the
>      >      hosts do. For example, when a new guest instance is being
>      >      provisioned, it will hit the network workers for IP
>      >      assignment before reaching the host to be provisioned.
>      >
>      >      Here is the flow of a guest creation request:
>      >
>      >      * The HTTP request is verified and converted to an internal
>      >       message format. This is just a set of key/value pairs
>      >       representing options for the request (image, size, ...).
>      >       Once the message is constructed it is sent to the image
>      >       worker.
>      >
>      >      * The image worker looks at image options within the request
>      >       and verifies the image exists and can be sent to a host when
>      >       needed. It inserts any image options into the request that
>      >       will be needed later (streaming URL, ...). The modified
>      >       message is then sent to the scheduler worker.
>      >
>      >      * The scheduler finds a cluster and host for the request
>      >       based on any options provided (locality to other hosts,
>      >       network requirements, ...) and the current state of the
>      >       system (like which hosts have capacity). Once a host is
>      >       chosen, the message is put on the queue for the network
>      >       service worker.
>      >
>      >      * The network worker will assign an IP address for the new
>      >       instance, put this into the message, and insert the message
>      >       back into the queue for the host that was selected by the
>      >       scheduler.
>      >
>      >      * The host machine picks up the message and initiates the
>      >       build. This involves streaming the image from the image
>      >       service and then getting the image running under the
>      >       hypervisor. During each step the host worker can send status
>      >       updates upstream through the message queue so the other
>      >       layers (like the scheduler) are aware of the progress.
>      >
>      >      * The HTTP response can happen at any layer depending on
>      >       what the API wants. If it wants to block until the IP
>      >       assignment, a request can be sent back to the API server
>      >       once the IP worker has completed this step. It is also
>      >       possible for the API server or scheduler to assign a URN
>      >       right away for a non-blocking request, and this URN could be
>      >       queried later to find out things like the IP. Once we
>      >       support webhooks or PuSH, notifications could also be sent
>      >       for this URN to those listening for it.
>      >
>      >      Requests such as listing all guests for an account could be
>      >      answered at the API or scheduler layer using the data that is
>      >      kept there. Requests such as reboot would make it to the host
>      >      via the scheduler by using this same subset of information.
>      >
>      >      This architecture still needs some thought put into it, but
>      >      it represents a shift away from using shared data storage.
>      >      It does require a bit of an architecture change from what
>      >      currently exists in Nova, but like other discussions going on
>      >      right now, we need to think about what the long term will
>      >      require for both scalability and reliability. We can come up
>      >      with a plan to gradually move things in this direction, but
>      >      we do need to figure out what type of architecture we are
>      >      shooting for in the end.
>      >
>      >      Wow, this is way longer than I thought it would be. Thoughts and
>      >      feedback much appreciated. :)
>      >
>      >      -Eric