← Back to team overview

launchpad-dev team mailing list archive

Re: Lower query counts via different code structure

 

On Tue, Nov 2, 2010 at 4:07 AM, Aaron Bentley <aaron@xxxxxxxxxxxxx> wrote:
> On 10/31/2010 08:47 PM, Robert Collins wrote:
>> I've a new structure for our code that I'd like to propose.
>
> I'm very pleased to see this proposal.  It fits my own thinking quite well.

\o/ - its nice when an idea fits what other folk have been thinking
straight off the bat.

>> So the basic idea is to extend the idiom of working on sets of data -
>> groups - to the very top of our code stack. Rather than
>> Person.is_valid_team, we'd have Persons.are_valid_teams() or
>> Persons.filter(valid_team=True).
>
> I hope that would include extending security checks to working on sets
> of data.

Certainly. I'd expect this to be pretty pervasive - we'll want to take
it up through the template layer, APIs and so on.

>>  - we'd want to convert individual attributes at a time I think, in a
>> process like:
>>   - add a Groups method to specify needing the attribute
>>   - delegate to the Groups method from the current property/function
>>   - migrate callers one at a time
>>   - callers which are using Groups and don't specify needing a
>> converted attribute could hit a cross-check which would blow up. This
>> is perhaps the wrong approach and ugly though :).
>>   - once all callers are migrated remove the delegation-to-Groups so
>> that we can't ever be bitten on that attribute again. Ever.
>
> I don't really follow this.  By attribute, you're referring to
> attributes that are mapped to table columns?  Are only retrieving
> selected columns to make queries less expensive?  Aren't such queries
> already cheap?

There are a few costs. One major one is that if a query asks for
columns outside the used index, the row has to be retrieved. That said
its rare that we'll be able to exploit this - it means that we want
data that is fully indexed. Less significant but still present
overheads - columns that aren't retrieved don't have to cross the
wire, be deserialised, be present in the SQL and thus parsed and
planned, take up memory in the appserver.

> I can see a case for not traversing foreign key references except when
> explicitly requested.

That too, though we already are pretty good at this - the current
failure mode is doing it row by row, which we do all to often.

>> Changes to how we do things:
>>  - methods on 'domain objects' would never perform DB access.
>
> These are the 'Plain Old Python' objects?  How would we deal with
> situations where we want to read and then write?

So, this is a more complex question to get good answers to. There are
some existing patterns like UnitOfWork which we could layer on top of
Storm or perhaps alter Storm to do as well. I haven't done a deep
analysis of the implications of the known patterns on our current
environment. We can either stay as we are for writes - which this
proposal suggested as a temporary measure, or we can take a stab at
what will fit best for us for writes now. I'm inclined to wait because
we're so light on writes - I think we'll get much less benefit from
improving the write side of our code for now, and we can come back
when we have restructured. Coming back later will also give us a
clearer picture of how the write story needs to fit into the read
story, because the read story will be done.

>>  - We'd probably one one (or more) Group types per db table,
>> eventually.
>
> Can you give an example where we'd want multiple Groups per db table?

Persons, People, Teams - we have both Team and Person in one db table.
We use a [fugly] hack to differentiate them at the moment. Sometimes
we know we're talking about Teams specifically, and it would be nice
to not have the non-team methods present.


>> So - what do you guys think? Does it have legs?
>
> Definitely a direction I think we should pursue.  The impedance mismatch
> between efficient SQL and the potato programming that ORMs traditionally
> encourage is hurting us.
>
> I've been listening to Ian Booth talking about the importance of
> separating business logic from the ORM objects, and it doesn't entirely
> make sense to me.  Our business logic is represented by our ORM objects,
> but the fact that they are database-backed is hardly ever important.

Right.

> Following his advice seems to mean stripping the ORM objects down until
> they are just bags of data, and then having a parallel hierarchy of
> domain objects that would apply our business logic on such bags of data.
>  Presumably, that would include specifying how to look up attributes on
> the ORM objects, and so the domain objects would wind up looking pretty
> much the same as our current ORM objects.

This is one pattern for doing things; I don't particularly like it
because it forces rearrangement when moving things from 'in appserver'
to 'in-query' : something that I'm proposing to avoid.

> I don't disagree with the argument that this would permit faster
> testing, but instead, I believe we could provide an in-memory Store
> implementation that would provide the same advantage without
> restructuring our code.  I have no appetite for maintaining yet another
> hierarchy of classes, especially if the ORM objects degenerate into bags
> of data.

Me neither. I think the key thing to address is consistent behaviour:
whatever layer we replace needs to be replaced with a consistently
behaving test double. E.g. DB constraints need to be honoured and so
forth. I can imagine a test db that is pure python and data driven
from our schema - it would be a lot of work but pretty effective.
Alternatively phrased; the test layer and real layer must be mutually
liskov substitutable.

> Perhaps we don't need that.  Maybe we can just work smarter.  If the ORM
> objects are just bags of data, why can't they be dicts?  Our database
> already encodes much of the information we care about-- the types of
> columns, foreign-key references and such.  Maybe we can design an
> interface such that DB access is done in a general way, and there are no
> ORM classes.

Some very fast web apps (usually written in PHP) do roughly this.

However one of the constraints we have is that our url routing and
publishing logic are heavily dependent on type inspection (via
Interfaces). We may want to change that, but I'd be wary of biting off
too much at once.

> # We initialize the Context with the particular database it will
> # retrieve data from
> context = Context(LAUNCHPAD_DB)
> # 'Distribution' is a table name and distro_name is a column name.
> distros = Context.search('Distribution', distro_name='ubuntu')
> # require is a generic method to replace get_foo in your example.
> # I'm not sure what the underlying representation of 'branding' in your
> # example is.
> branding = distros.require('branding')
> # distros is a generic Result class, so hot_bugs cannot be a method on
> # it.  I've made it a function, but it could be a static method or we
> # could instantiate a domain object and invoke hot_bugs on it.
> bugs = hot_bugs(distros, limit=20)
> # We're relying on the foreign key constraints that our DB knows about #
> here-- we know how a BugSubscription is related to a Bug through
> # BugSubscription.bug, and we know that 'assignee' and 'reporter' refer
> # to Person
> people = bugs.require('BugSubscription.person', 'assignee', 'reporter')
> branding.update(people.require('branding'))
> context.realise()
> distro = None
> for bug in bugs:
>    if bug['distro'] != distro:
>        distro = bug['distro']:
>            show_branding(distro)
>
> Anyhow, that's some food for thought.

Thats an nice example too. I'm very glad you took the time to craft an
example like that - I hope other folk do too :)

-Rob



Follow ups

References