nova-scaling team mailing list archive

ETL tool usage ...

 

At the summit we discussed using an ETL (extract, transform, load) tool to roll up data from child zones and improve performance at the top level. Some people asked for more information on this, so here's the response I sent:

---

Hi guys, 

The use case we have for the ETL tool is two-fold:

1. The 'nova list' problem. 

When a user issues a 'nova list' command to show all of their instances, we have to run a potentially very large query against the database and across all zones. This is computationally expensive and slow. We would prefer to keep a cache of all the instances that we can return quickly and page through efficiently. We realize the cached instance state may lag behind the actual state with this method, but that's acceptable.

This "aggregate table" would be a fully denormalized view of the Instance table joined with some related tables (like InstanceType), so that the caller gets something meaningful back.
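As a rough illustration of the aggregate-table idea, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (instances, instance_types, instance_cache, zone, state and so on) are assumptions made up for this example and are not the actual nova schema; a real rollup would pull rows from each child zone instead of a single local database.

```python
import sqlite3


def build_instance_cache(conn):
    """Rebuild the denormalized cache from the normalized source tables."""
    conn.executescript("""
        DROP TABLE IF EXISTS instance_cache;
        CREATE TABLE instance_cache AS
        SELECT i.id AS instance_id,
               i.user_id,
               i.zone,
               i.state,
               t.name AS type_name,
               t.memory_mb,
               t.vcpus
        FROM instances i
        JOIN instance_types t ON t.id = i.instance_type_id;
        CREATE INDEX instance_cache_user
            ON instance_cache (user_id, instance_id);
    """)


def list_instances(conn, user_id, after_id=0, limit=2):
    """Return one page of a user's instances (keyset pagination, no joins)."""
    return conn.execute(
        "SELECT instance_id, zone, state, type_name FROM instance_cache "
        "WHERE user_id = ? AND instance_id > ? "
        "ORDER BY instance_id LIMIT ?",
        (user_id, after_id, limit),
    ).fetchall()


# Hypothetical sample data standing in for rows gathered from child zones.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE instance_types (
        id INTEGER PRIMARY KEY, name TEXT, memory_mb INTEGER, vcpus INTEGER);
    CREATE TABLE instances (
        id INTEGER PRIMARY KEY, user_id TEXT, zone TEXT, state TEXT,
        instance_type_id INTEGER);
    INSERT INTO instance_types VALUES
        (1, 'm1.tiny', 512, 1), (2, 'm1.small', 2048, 1);
    INSERT INTO instances VALUES
        (1, 'alice', 'zone-a', 'running', 1),
        (2, 'bob', 'zone-a', 'running', 2),
        (3, 'alice', 'zone-b', 'building', 2),
        (4, 'alice', 'zone-b', 'running', 1);
""")
build_instance_cache(conn)
first_page = list_instances(conn, "alice")
next_page = list_instances(conn, "alice", after_id=first_page[-1][0])
```

The listing query touches only the cache table, so paging never needs a join or a cross-zone call. The state column is only as fresh as the last rebuild, which is the disconnect mentioned above.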

2. Trusted Zones

We have also been discussing the possibility of "trusted zones". Currently child zones are share-nothing with respect to their parents. In a trusted deployment, there are optimizations we can make to speed up decisions about where to place an instance.

This means sharing current cross-zone host status in a top-level cache. It would not apply in a public/private scenario, only in deployments where the child zones are trusted.

The ETL requirements here are stricter. We need to get DB transaction-level updates from the child zones to keep the cache fresh. We're still debating the merits of using ETL for this vs. RabbitMQ and custom code, because OSS ETL tools are rough at best and MySQL isn't the best choice for the task (Postgres would be better). Commercial ETL tools like Informatica sort of defeat the purpose. Additionally, ETL is outside the comfort zone of many administrators.
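To make the "custom code" alternative a bit more concrete, here is a minimal sketch of a top-level host status cache that applies updates pushed up from trusted child zones. Everything in it is hypothetical: the HostStatus fields, the per-host sequence number used to discard stale or out-of-order updates, and the "most free RAM" placement rule are assumptions for illustration, and the transport (message queue or otherwise) is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class HostStatus:
    zone: str
    host: str
    free_ram_mb: int
    seq: int  # hypothetical, monotonically increasing per host


class HostStatusCache:
    """Top-level view of host status across trusted child zones."""

    def __init__(self):
        self._hosts = {}

    def apply(self, update):
        """Store an update unless a newer one is already cached."""
        key = (update.zone, update.host)
        current = self._hosts.get(key)
        if current is not None and current.seq >= update.seq:
            return False  # stale or duplicate update, ignore it
        self._hosts[key] = update
        return True

    def pick_host(self, ram_mb):
        """Pick the host with the most free RAM that fits the request."""
        candidates = [
            h for h in self._hosts.values() if h.free_ram_mb >= ram_mb
        ]
        return max(candidates, key=lambda h: h.free_ram_mb, default=None)


cache = HostStatusCache()
cache.apply(HostStatus("zone-a", "host1", 4096, seq=1))
cache.apply(HostStatus("zone-b", "host7", 8192, seq=1))
cache.apply(HostStatus("zone-b", "host7", 1024, seq=2))
# A delayed copy of the older update must not overwrite the newer one.
stale_applied = cache.apply(HostStatus("zone-b", "host7", 8192, seq=1))
chosen = cache.pick_host(2048)
```

With a cache like this, the top level can make a placement decision without querying each child zone, but only if updates arrive close to transaction time, which is why the freshness requirement is stricter than for the 'nova list' case.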

Hope this provides a little more background!

-S