openstack team mailing list archive

Thread
Date
Re: Database stuff

To: "openstack@xxxxxxxxxxxxxxxxxxx" <openstack@xxxxxxxxxxxxxxxxxxx>
From: Aaron Lee <aaron.lee@xxxxxxxxxxxxx>
Date: Tue, 29 Nov 2011 17:54:29 +0000
Accept-language: en-US
In-reply-to: <3FC0EBA3-64C7-48FE-B206-95F4C99766C1@rackspace.com>
Thread-index: AQHMrqqzcS2bV6aKX0mSKdxKFolHipXEcA+AgAAXEQA=
Thread-topic: [Openstack] Database stuff
For this I think we need to separate the models from the queries. A query method should be able to populate as many of the models as needed to return the data collected. I also think separating the queries from the models themselves will help us make the storage engine replaceable, and will allow us to extract common behavior into the models themselves instead of peppered throughout the codebase.

Aaron

On Nov 29, 2011, at 10:31 AM, Chris Behrens wrote:

> e) sounds good as long as we don't remove the ability to joinload up front.  Sometimes we need to join.  Sometimes we can be lazy.  With 'list instances with details', we need to get all instances from the DB and join network information in a single DB query.  Doing 1 + (n*x) DB queries to support a 'nova list' will not be acceptable when it can be done in 1 query today.  That means we'll need instance['virtual_interfaces'] (or similar) to work in cases where we 'join up front'.  Most cases when we're pulling instances from the DB, we don't need to join anything.  It'd be more efficient to not join network information for those calls.  This is where e) is fine.
> 
> - Chris
> 
> On Nov 29, 2011, at 9:20 AM, Soren Hansen wrote:
> 
>> Hi, guys.
>> 
>> Gosh, this turned out to be long. Sorry about that.
>> 
>> I'm adding some tests for the DB api, and I've stumbled across something
>> that we should probably discuss.
>> 
>> First of all, most (if not all) of the various *_create methods we have,
>> return quite amputated objects. Any attempt to access related objects
>> will fail with the much too familiar:
>> 
>>  DetachedInstanceError: Parent instance <Whatever at 0x4f5c8d0> is not
>>  bound to a Session; lazy load operation of attribute 'other_things'
>>  cannot proceed
>> 
>> Also, with the SQLAlchemy driver, this test would pass:
>> 
>>   network = db.network_create(ctxt, {})
>>   network = db.network_get(ctxt, network['id'])
>> 
>>   instance = db.instance_create(ctxt, {})
>>   self.assertEquals(len(network['virtual_interfaces']), 0)
>>   db.virtual_interface_create(ctxt, {'network_id': network['id'],
>>                                      'instance_id': instance['id']}
>> 
>>   self.assertEquals(len(network['virtual_interfaces']), 0)
>>   network = db.network_get(ctxt, network['id'])
>>   self.assertEquals(len(network['virtual_interfaces']), 1)
>> 
>> I create a network, pull it out again (as per my comment above), verify
>> that it has no virtual_interfaces related to it, create a virtual
>> interface in this network, and check the network's virtual_interfaces
>> key and finds that it still has length 0. Reloading the network now
>> reveals the new virtual interface.
>> 
>> SQLAlchemy does support looking these things up on the fly. In fact,
>> AFAIK, this is its default behaviour. We just override it with
>> joinedload options, because we don't use scoped sessions.
>> 
>> My fake db driver looks stuff like this up on the fly (so the
>> assertEquals after the virtual_interface_create will fail with that db
>> driver).
>> 
>> So my question is this: Should this be
>> 
>> a) looked up on the fly,
>> b) looked up on first key access and then cached,
>> c) looked up when the parent object is loaded and then never again,
>> d) or up to the driver author?
>> 
>> Or should we do away with this stuff altogether? I.e. no more looking up
>> related objects by way of __getitem__ lookups, and instead only allow
>> lookups through db methods. So, instead of
>> network['virtual_interfaces'], you'd always do
>> db.virtual_interfaces_get_by_network(ctxt, network['id']).  Let's call
>> this option e).
>> 
>> I'm pretty undecided myself. If we go with option e) it becomes clear to
>> consumers of the DB api when they're pulling out fresh stuff from the DB
>> and when they're reusing potentially old results.  Explicit is better
>> than implicit, but it'll take quite a bit of work to change this.
>> 
>> If we go with one of options a) through d), my order of preference is
>> (from most to least preferred): a), d), c), b).
>> 
>> There's value in having a right way and a wrong way to do this. If it's
>> either-or, it makes testing (as in validation) more awkward. I'd say
>> it's always possible to do on-the-fly lookups. Overriding __getitem__
>> and fetching fresh results is pretty simple, and, as mentioned earlier,
>> I believe this is SQLAlchemy's default behaviour (somebody please
>> correct me if I'm wrong). Forcing an arbitrary ORM to replicate the
>> behaviour of b) and c) could be incredibly awkward, and c) is also
>> complicated because there might be reference loops involved. Also,
>> reviewing correct use of something where the need for reloads depends on
>> previous use of your db objects (which might itself be conditional or
>> happen in called methods) sounds like no fun at all.  With d) it's
>> pretty straightforward: "Do you want to be sure to have fresh responses?
>> Then reload the object.  Otherwise, behaviour is undefined." It's
>> straightforward to explain and review.  Option e) is also easy to
>> explain and do reviews for, btw.
>> 
>> DB objects with options a) through d) will have trouble making it
>> through a serializer. After dict serialisation, the object isn't going
>> to change, unresolved related objects will not be resolved, and
>> prepopulating everything prior to serialisation is out of the question
>> due to the potential for loops and very big sets of related objects (and
>> their related objects and their related objects's related objects,
>> etc.). I think it would be valuable to not have to think a whole lot
>> about whether a particular db-like object is a "real" db object or
>> whether it came in as a JSON object or over the message bus or whatever.
>> Only option e) will give us that, AFAICS.
>> 
>> It seems I've talked myself into preferring option e). It's too much
>> work to do on my own, though, and it's going to be disruptive, so we
>> need to do it real soon. I think it'll be worth it, though.
>> 
>> -- 
>> Soren Hansen        | http://linux2go.dk/
>> Ubuntu Developer    | http://www.ubuntu.com/
>> OpenStack Developer | http://www.openstack.org/
>> 
>> _______________________________________________
>> Mailing list: https://launchpad.net/~openstack
>> Post to     : openstack@xxxxxxxxxxxxxxxxxxx
>> Unsubscribe : https://launchpad.net/~openstack
>> More help   : https://help.launchpad.net/ListHelp
> 
> 
> _______________________________________________
> Mailing list: https://launchpad.net/~openstack
> Post to     : openstack@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~openstack
> More help   : https://help.launchpad.net/ListHelp
Follow ups

Re: Database stuff
From: Duncan McGreggor, 2011-11-29
References

Database stuff
From: Soren Hansen, 2011-11-29
Re: Database stuff
From: Chris Behrens, 2011-11-29