nova team mailing list archive

Data architecture


Hi everyone,

There was quite a bit of discussion on IRC today about switching from
Redis to a SQL-based database, and which ORM to choose. Before
going down that path, I think we need to take a look at the overall
architecture we're moving towards, identify our data storage needs,
and then choose based on those requirements.

I'm not convinced that the shared data storage solution that exists
today with Redis (or any other back-end) is optimal. Here is the
architecture as it exists today, minus a few details:

[diagram: API servers, compute workers, and volume workers all accessing a shared data store]

The API servers poke into the data store to help answer some queries,
along with the compute workers and volume workers. I know Vish is
reworking the network services, and I'm guessing that will be using
the data store as well.

My concern here is that while Redis (or other database solutions)
can scale quite easily for our data set size, there are still response
time and HA concerns. You can use local caching to help response time
and run Redis slaves with promotion for HA, but this still seems a
bit clunky. One of our design tenets is shared-nothing, but right
now the data store is shared everywhere.

Below is a proposal that came after many discussions with various
folks starting in Austin a few weeks ago. Paul Voccio and I drafted
up a few of these ideas on paper during OSCON. The idea is to let the
host machine be the canonical store for all data, and any information
the API layer (or new scheduler layer) needs can be pushed up when
changed. The new architecture would look something like this:

[diagram: proposed architecture with API servers, scheduler, pluggable image service, and per-cluster groups of host/volume/network workers, each holding its own data]

Beyond the data changes, this diagram also introduces the scheduler,
pluggable image service, and a concept of a cluster. None of these
are required for the data model change but are included for now.

A cluster is just a grouping mainly for networking purposes. For
example, all hosts in a cluster may be on the same switch or in
the same subnet. Large installations under some network
configurations need this level of grouping. While the message
queue for each cluster is displayed as being separate, this is not
required. For small installations you could use a single message
queue for everything. The separation is there if needed because you
may have enough hosts (each of which translates to at least one
queue topic) to overload a single message queue.
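
To make the topic-per-host routing concrete, here is a minimal sketch. All names here (`Broker`, `send_to_host`, the host and cluster IDs) are hypothetical stand-ins, not existing Nova code; a real deployment would use an actual message queue service rather than an in-memory dict.

```python
class Broker:
    """Toy in-memory stand-in for one cluster's message queue service."""
    def __init__(self):
        self.topics = {}  # topic name -> list of pending messages

    def publish(self, topic, message):
        self.topics.setdefault(topic, []).append(message)

# One broker per cluster; a small installation could point every
# cluster at the same Broker instance instead.
cluster_brokers = {
    "cluster-a": Broker(),
    "cluster-b": Broker(),
}

host_to_cluster = {
    "host-001": "cluster-a",
    "host-002": "cluster-a",
    "host-101": "cluster-b",
}

def send_to_host(host, message):
    """Route a message to the topic for `host` on its cluster's queue."""
    broker = cluster_brokers[host_to_cluster[host]]
    broker.publish("host.%s" % host, message)

send_to_host("host-101", {"method": "reboot", "instance": "i-42"})
```

Since each host only ever consumes its own topic, splitting the brokers per cluster changes nothing for the workers; it only spreads the topic load.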

The scheduler is introduced here and can coordinate requests between
the API servers and the different clusters. This is especially
important for build requests where you need to coordinate a good fit
for new images based on a number of parameters. The image worker is
pluggable and could be provided by the local filesystem, an object
store (like swift), or anything else. The host, volume, and network
workers are almost identical to what exists in the current system.

Getting back to the data storage model, note that there is no
shared storage between any of the services. Each host worker would
maintain all information needed for each guest it is running either
in the hypervisor or through another local data store. This data can
be backed up to another service (like redis, postgresql, mysql,
...) but the host will be the only service accessing it. The
backup data store could of course be shared between many hosts as well.

The scheduler (and possibly the API servers) keep a subset of the host
data stored in memory so they can answer requests or route requests to
the appropriate host quickly. These processes can keep snapshots of
this data so they don't need to pull the entire data set on restart,
and for any new changes they can request events since the last known
timestamp. This does add some complexity to the data model since
we'll need to push events up from the hosts based on timestamps,
but it doesn't rely on any shared storage.
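
A minimal sketch of the events-since-timestamp idea, assuming a hypothetical `HostStore` on each host (here a simple counter stands in for wall-clock timestamps):

```python
import itertools

class HostStore:
    """Sketch of a host's canonical data store that can replay changes."""
    def __init__(self):
        self.guests = {}
        self.events = []   # list of (timestamp, guest_id, state)
        self._clock = itertools.count(1)  # stand-in for real timestamps

    def update(self, guest_id, state):
        """Record a state change locally and remember it as an event."""
        ts = next(self._clock)
        self.guests[guest_id] = state
        self.events.append((ts, guest_id, state))
        return ts

    def events_since(self, last_seen):
        """What a restarting scheduler asks for: only the new changes."""
        return [e for e in self.events if e[0] > last_seen]

host = HostStore()
host.update("i-1", "building")
ts = host.update("i-1", "running")
host.update("i-2", "building")

# A scheduler whose snapshot ends at `ts` only needs the later events,
# not the full data set.
missed = host.events_since(ts)
```

The snapshot-plus-replay pattern is what keeps restarts cheap: the cost of catching up is proportional to what changed, not to the total number of guests.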

Other services like network and volume can also maintain their own
data stores and process requests much like the hosts do. For example,
when a new guest instance is being provisioned, it will hit the network
workers for IP assignment before reaching the host to be provisioned.

Here is the flow of a guest creation request:

* The HTTP request is verified and converted to an internal message
  format. This is just a set of key/value pairs representing options
  for the request (image, size, ...). Once the message is constructed
  it is sent to the image worker.

* The image worker looks at image options within the request and
  verifies the image exists and can be sent to a host when needed. It
  inserts any image options into the request that will be needed
  later (streaming URL, ...). The modified message is then sent to
  the scheduler worker.

* The scheduler finds a cluster and host for the request based on
  any options provided (locality to other hosts, network requirements,
  ...) and the current state of the system (like which hosts have
  capacity). Once a host is chosen it will be put on the queue for
  the network service worker.

* The network worker will assign an IP address for the new instance,
  put this into the message, and insert the message back into the
  queue for the host that was selected in the scheduler.

* The host machine picks up the message and initiates the build. This
  involves streaming the image from the image service and then getting
  the image running under the hypervisor. During each step the host
  worker can send status updates upstream through the message queue
  so the other layers (like the scheduler) are aware of the progress.

* The HTTP response can happen at any layer depending on what the
  API wants. If it wants to block until IP assignment, a response
  can be sent back to the API server once the network worker has
  completed this step. It is also possible for the API server or scheduler
  to assign a URN right away for a non-blocking request, and this
  URN could be queried later to find out things like the IP. Once we
  support webhooks or PuSH, notifications could also be sent for this
  URN to those listening for it.
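
The flow above can be sketched as a chain of workers, each enriching the same key/value message before forwarding it. Every function name and value below is a hypothetical placeholder; each hand-off would really be a message-queue hop, not a function call:

```python
def api(request):
    # Verify HTTP input and convert to an internal key/value message.
    return {"image": request["image"], "size": request["size"]}

def image_worker(msg):
    # Verify the image exists and insert options needed later on.
    msg["image_url"] = "http://images.example.com/%s" % msg["image"]
    return msg

def scheduler(msg):
    # Pick a cluster and host based on options and current capacity.
    msg["host"] = "host-001"
    return msg

def network_worker(msg):
    # Assign an IP, then hand the message to the chosen host's queue.
    msg["ip"] = "10.0.0.5"
    return msg

def host_worker(msg):
    # Stream the image and boot the guest, reporting status upstream.
    msg["status"] = "building"
    return msg

msg = api({"image": "ubuntu", "size": "m1.small"})
for worker in (image_worker, scheduler, network_worker, host_worker):
    msg = worker(msg)
```

Note that no step reads from a shared store: everything a downstream worker needs travels inside the message itself.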

Requests such as listing all guests for an account could be answered at
the API or scheduler layer using the data that is kept there. Requests
such as reboot would make it to the host via the scheduler by using
this same subset of information.
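
As a sketch of that read path, assume the API/scheduler layer keeps a small in-memory index per guest (the index contents and names here are hypothetical):

```python
# Hypothetical in-memory subset kept at the API/scheduler layer: just
# enough per-guest data to answer reads and route commands.
guest_index = {
    "i-1": {"account": "acct-a", "host": "host-001", "ip": "10.0.0.5"},
    "i-2": {"account": "acct-b", "host": "host-002", "ip": "10.0.0.6"},
    "i-3": {"account": "acct-a", "host": "host-002", "ip": "10.0.0.7"},
}

def list_guests(account):
    """A list request, answered locally without touching any host."""
    return sorted(g for g, info in guest_index.items()
                  if info["account"] == account)

def route_reboot(guest_id):
    """A reboot only needs the cached host mapping to pick a queue topic."""
    return "host.%s" % guest_index[guest_id]["host"]
```

The hosts remain canonical; this index is just a disposable cache rebuilt from snapshots and replayed events.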

This architecture still needs some thought put into it, but it
represents a shift away from using shared data storage. It does require
a bit of an architecture change from what currently exists in Nova, but
like other discussions going on right now, we need to think about what
the long term will require with both scalability and reliability. We
can come up with a plan to gradually move things in this direction,
but we do need to figure out what type of architecture we are shooting
for in the end.

Wow, this is way longer than I thought it would be. Thoughts and
feedback much appreciated. :)

