Re: Lower query counts via different code structure

Hi Robert

Thanks for starting the discussion. I'll add in my 2c worth.
First I'll reiterate some core concepts we're probably all familiar
with, just to provide some context for subsequent discussion. Please
take what I say as IMHO rather than as a concrete statement of fact.
My views, though, have been formed by dealing with these sorts of
issues on other large projects with OR-mapped data models.

At a very high level, ORM layers provide at least 2 (somewhat
orthogonal) capabilities:
- define the rules used to map a domain model object graph onto a
relational schema
- provide mechanisms to load and save pieces of the object graph

There are different types of application that can be built on top of
OR-mapped data models. The style of application determines the types
of performance bottleneck likely to be encountered and the approach to
solving them. In the case of Launchpad, the system runs pretty much
stateless. This makes things way simpler. One of the major sources of
performance issues comes down to the loading of the data necessary to
satisfy a particular use case, e.g. populating the data model for a
given view. Major performance problems arise due to:
- unnecessary loading of data via eagerly navigating relationships in
the domain model
- inefficient loading of data, often manifesting as the N+1 select problem
- unnecessary query execution caused by lack of suitable caching
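
To make the N+1 point concrete, here's the sort of thing I mean. This
is a toy sketch using plain sqlite3 rather than Storm, with made-up
bug/bugtask tables, purely to show the shape of the two access
patterns:

import sqlite3

# Toy schema standing in for the real model; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bug (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE bugtask (id INTEGER PRIMARY KEY, bug INTEGER, assignee TEXT);
    INSERT INTO bug VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO bugtask VALUES (1, 1, 'x'), (2, 2, 'y'), (3, 3, 'z');
""")

def assignees_n_plus_one(conn):
    # One query for the bugs, then one more per bug: N+1 round trips.
    bugs = conn.execute("SELECT id FROM bug").fetchall()
    return dict(
        (bug_id,
         [row[0] for row in conn.execute(
             "SELECT assignee FROM bugtask WHERE bug = ?", (bug_id,))])
        for (bug_id,) in bugs)

def assignees_single_query(conn):
    # The same data in one query; the grouping happens in memory.
    rows = conn.execute(
        "SELECT bug, assignee FROM bugtask ORDER BY bug").fetchall()
    result = {}
    for bug_id, assignee in rows:
        result.setdefault(bug_id, []).append(assignee)
    return result

print(assignees_n_plus_one(conn) == assignees_single_query(conn))  # True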

A core design goal in any system should be: with as few queries as
practical ("few" can become "a larger number of efficiently
performing" depending on other factors like db characteristics,
available caching etc), load only the data required for the particular
use case being executed, and no more. Try to avoid the requirements of
one use case "contaminating" another. The last bit allows per-use-case
tuning to be done without inadvertently degrading the performance of
other use cases.

To achieve the above goals, my view is that a system needs to have the
following (a rough sketch of the last two points follows this list):
- all/most inter-object relationships, especially collections or
one-many, should be lazy, i.e. don't load collections when loading a
parent object unless the caller asks for them
- a clean separation between the business data model, the persistent
data model, and object lifecycle services
- data loading patterns and idioms which can be specified per use
case, including but not limited to projections, query iterators, batch
load sizing, and, perhaps most importantly, collection fetch strategy
(subselect vs join vs separate query etc)
- support for the above via an object-based query language and
possibly also a query criteria API
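
Here is a rough sketch of what a per-use-case loading specification
plus a finder-style criteria API could look like. All of the names
here (FetchSpec, BugFinder, load_bugs) are hypothetical - this isn't
Storm or existing Launchpad code, just the shape I have in mind:

class FetchSpec:
    """What a use case actually needs, stated up front."""

    def __init__(self, eager=None, projection=(), batch_size=100):
        self.eager = eager or {}      # e.g. {"bugtasks": "join"}
        self.projection = projection  # the columns to load, not "*"
        self.batch_size = batch_size

class BugFinder:
    """Criteria-style entry point; one place turns a spec into SQL."""

    def __init__(self, store):
        self._store = store

    def hot_bugs(self, limit, spec):
        # The finder decides join vs subselect vs separate query from
        # spec.eager; callers never see SQL and never trigger lazy loads.
        return self._store.load_bugs(limit=limit,
                                     columns=spec.projection,
                                     eager=spec.eager,
                                     batch_size=spec.batch_size)

# A bug listing view asks only for what it renders:
listing_spec = FetchSpec(eager={"bugtasks": "separate-query"},
                         projection=("id", "title", "importance"))

The point is that the tuning knobs live with the use case (the view),
not with the domain objects, so changing one view's strategy can't
contaminate another's.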

> 
> We face many problems in our code with optimising pages. Some specific ones are:
>  - impedance mismatch between sql-optimised helpers and code on a 'model' object
>  - small changes can cause death-by-queries when a previously untuned
> object is added to a template and triggers per-row queries.
>

My opinion is that this is caused in part by the lack of separation of
concerns between data model and ORM layer.

> I and others have been talking round some ideas about this for a
> while. One idea is a High Level Query Language and introducing an
> explicit mapping layer. Another is pushing much more eager loading
> facilities into and through our ORM - Storm.
>

+1 for an explicit mapping layer. Business domain model objects should
be POPOs (plain old Python objects).

If by "more eager loading facilities" you mean providing the capability
to specify as part of the data load operation what related parts of the
object model should be loaded at the same time, and the fetch strategy,
I agree.
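
By fetch strategy I mean, roughly, choosing between shapes like these
for the same eager load (the table names are the toy ones from the
earlier sketch, and the hard-coded id lists stand in for ids taken
from the parent query):

# Join fetch: parents and children in one result set (wide rows,
# parent columns repeated per child).
JOIN_FETCH = """
    SELECT bug.id, bug.title, bugtask.assignee
    FROM bug JOIN bugtask ON bugtask.bug = bug.id
    WHERE bug.id IN (1, 2, 3)
"""

# Subselect fetch: children located via a subquery over the same
# parent criteria.
SUBSELECT_FETCH = """
    SELECT bugtask.bug, bugtask.assignee
    FROM bugtask
    WHERE bugtask.bug IN (SELECT id FROM bug WHERE title LIKE 'a%')
"""

# Separate query: load the parents first, then all their children in
# one follow-up query keyed by the ids already in memory.
SEPARATE_FETCH = """
    SELECT bugtask.bug, bugtask.assignee
    FROM bugtask
    WHERE bugtask.bug IN (1, 2, 3)
"""

Which of the three wins depends on row counts, indexes and what the
view actually renders, which is exactly why it wants to be a
per-use-case decision rather than something baked into the model.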


> Both of those things are good things to do but are either hard,
> incomplete or potentially high risk. That doesn't mean we shouldn't
> work on them!
> 

+1. My view is that if/when agreement is reached on the preferred
approach, we should consider implementing some new functionality,
ideally a complete vertical slice, using the new approach. This could
serve as a reference implementation that can be documented and benchmarked.

> However, I think a more important and fundamental aspect of the
> problem is that the pattern we use - and which many ORMs use, is
> inclined *by design* to grow complexity and trigger DB queries at just
> the wrong time. That pattern is the "Active Record" pattern, if you
> have the PoEAA book (my copy is in a box...somewhere). Basically it
> says that a single object represents a single row derived from the
> database, and that changes to it are pushed to the database, and
> attribute access on it traverses the object graph from the database.
> 

This pattern dates back to the early EJB days in the 90s I think;
well, that's true from a Java perspective, I'm not sure about other
languages. One of the issues with the approach is that it forces the
domain model to be twisted to fit the limitations of the underlying
ORM layer, and the (undesirable) coupling it introduces hampers the
ability to develop a domain model that is always fit for purpose.

> This pattern affects us in several ways:
>  - its the root cause of our death-by-queries behaviour...
> particularly the way that previously tuned pages regress.
>  - its the root cause for many performance problems in our APIs
> 

I'm not sure I agree it is the sole root cause - we also seem to
suffer from repeated queries and a lack of caching in the right
places. The former is often caused by plain dumb business logic, and
moving to a new architecture will not be a silver bullet, since badly
written code will do dumb things no matter how good the underlying
infrastructure is. That's where we need to rely on good coding skills
and reviews :-)
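
On the repeated-query point, a lot of it is as simple as the same
object being looked up once per row inside a loop. A request-scoped
cache - sketched below, with a hypothetical load_person() on the store
- deals with that without any architectural change, provided it is
dropped at the end of the request so it can never go stale:

class RequestCache:
    """Per-request memo for objects looked up repeatedly while rendering."""

    def __init__(self, store):
        self._store = store
        self._people = {}

    def get_person(self, person_id):
        # One real query per distinct id, however many call sites ask.
        if person_id not in self._people:
            self._people[person_id] = self._store.load_person(person_id)
        return self._people[person_id]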

<snip>

> 
> context = Context()
> distros = Launchpad.search(context, distro_name='ubuntu')
> branding = distros.get_branding()
> bugs = distros.hot_bugs(limit=20)
> people = bugs.get_subscribers()
> people.update(bugs.get_assignees())
> people.update(bugs.get_reporters())
> branding.update(people.get_branding())
> context.realise()
> distro = None
> for bug in bugs:
>     if bug.distro != distro:
>         distro = bug.distro
>         show_branding(distro)
>     ...
>

<snip>

Here's where I have a slight problem, maybe. It depends on whether
I've interpreted the above code snippet properly. Does the
people.update() call above write to the database? I don't think it
should. I think all operations on domain model artefacts - single
entities or collections - should purely update the internal state of
those objects. The act of writing that data to the underlying data
store is a separate concern and should be handled by the relevant
lifecycle service for those objects. Such services will use the
separately implemented ORM layer to do the job. There are a number of
orthogonal concerns that should be managed and implemented separately.
This allows the necessary control and optimisations to occur at the
right places in the architecture, so that the end performance goals
and system robustness can be achieved through per-use-case tuning etc.
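
To illustrate the separation I'm describing (all names below are
hypothetical):

class Person:
    """Plain domain object: in-memory state only, no database access."""

    def __init__(self, person_id, display_name):
        self.id = person_id
        self.display_name = display_name
        self.subscriptions = set()

    def subscribe_to(self, bug_id):
        # Pure state change; nothing is written anywhere yet.
        self.subscriptions.add(bug_id)

class PersonLifecycleService:
    """Owns persistence; talks to the ORM/store so the domain object
    never has to."""

    def __init__(self, store):
        self._store = store

    def save(self, person):
        # The single place that decides how and when state reaches the
        # database, so it can be batched, tuned or cached per use case.
        self._store.write_person(
            person.id, person.display_name, sorted(person.subscriptions))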

<snip>

> 
> 
> So the basic idea is to extend the idiom of working on sets of data -
> groups - to the very top of our code stack. Rather than
> Person.is_valid_team, we'd have Persons.are_valid_teams() or
> Persons.filter(valid_team=True).
> 

To me, a basic goal should be to limit the amount of data we load in
the first place, and the above ideas seem consistent with that. Any
pattern that avoids iterating over an entire collection, or indirectly
triggering situations like the N+1 select problem, is a good thing.
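
As a toy contrast of the two styles, with Persons standing in as a
hypothetical set-level object and valid_team_ids() as an imaginary
one-query helper on the store:

class Persons:
    """Hypothetical set-level object wrapping a collection of person ids."""

    def __init__(self, store, ids):
        self._store = store
        self._ids = list(ids)

    def are_valid_teams(self):
        # One set-level query answers the question for every member at
        # once, instead of a Person.is_valid_team lookup per row.
        valid = self._store.valid_team_ids(self._ids)
        return dict((pid, pid in valid) for pid in self._ids)

    def filter(self, valid_team=None):
        if valid_team is None:
            return self
        flags = self.are_valid_teams()
        return Persons(self._store,
                       [pid for pid, ok in flags.items() if ok == valid_team])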

<snip>

> 
> Evaluation:
>  - lazy as we build out the things we need, then eager on realise().
> We could add a trigger for realize to __iter__ on groups but I think
> it would reduce safety.

Agreed. I think any mechanisms introduced need to be explicit.
Developers should consciously use the constructs available as
required, rather than relying on implicit behaviour which is harder to
tune without causing unintended consequences. A key design goal should
be to provide infrastructure which offers the capability to do certain
things, and guidance on when to use them, but which never pulls the
trigger on injecting that behaviour into a use case without the
developer asking for it, so to speak.
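
In code terms, something like this is what I'd hope for: iteration
before realise() fails loudly rather than quietly issuing queries
(again, the names are hypothetical):

class BugGroup:
    """Collects what callers will need; loads it all on an explicit
    realise()."""

    def __init__(self, store):
        self._store = store
        self._wanted = []
        self._rows = None

    def need(self, attribute):
        # Record the requirement; no query is issued here.
        self._wanted.append(attribute)
        return self

    def realise(self):
        # The single, explicit point at which queries run.
        self._rows = self._store.load_bugs(self._wanted)
        return self

    def __iter__(self):
        if self._rows is None:
            # Refuse to query implicitly; make the developer ask first.
            raise RuntimeError("realise() must be called before iterating")
        return iter(self._rows)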

>  - realise() could potentially perform multiple queries per group - or
> could use DecoratedResultSet (or similar) to perform a single query. A
> primary point of this interface is to let folk working on performance
> change from single query to multiple query *without changing the
> callers*.
> 
> Interoperation/Migration:
>  - For mutating queries, we need to return existing classes (because
> I'm not proposing a mutating interface [yet]), so our existing Table
> based classes stay, just as they are.
>  - we're read heavy: we can start moving methods off of the Table
> classes and onto Groups incrementally. Where a mutating operation
> needs a method, a transient Group can be created containing just
> [self] to permit forwarding the method but not duplicating the code.
>  - things which have no mutating operation can be converted to
> non-Table class - to Plain Old Python Objects, which can be unit
> tested to our hearts content.
>  - the machinery needed to bootstrap should be minimal, so if we as a
> team like the sound of this, we could sensibly say 'all new code
> should be like this' and start a mass migration.

My own view is that this needs to be planned a little more first, to
understand the full impact across the entire product, including
external APIs, jobs, cron scripts etc. Let's understand the impacts a
little more before we dive in and find out we have to refactor due to
some unforeseen problem.

>  - we'd want to convert individual attributes at a time I think, in a
> process like:
>   - add a Groups method to specify needing the attribute
>   - delegate to the Groups method from the current property/function
>   - migrate callers one at a time
>   - callers which are using Groups and don't specify needing a
> converted attribute could hit a cross-check which would blow up. This
> is perhaps the wrong approach and ugly though :).
>   - once all callers are migrated remove the delegation-to-Groups so
> that we can't ever be bitten on that attribute again. Ever.
> 
> Changes to how we do things:
>  - methods on 'domain objects' would never perform DB access. They
> must be only using non-lazy attributes - the builder interface would
> populate appropriate attributes (e.g., the columns we use (not all),
> and the related data we obtained (such as self.branding))

+1. Business domain objects should be POPOs, and their methods should be
about managing their internal runtime state.

>  - we'd need a variant of cachedproperty which can answer from cache
> but cannot populate it.

We should also consider the concept of query caches, whereby whole
result sets are cached, not just individual properties, and we should
at least discuss whether a cache's life cycle should be allowed to
extend beyond the boundaries of a single web request.
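
For what that variant of cachedproperty might look like, together with
a whole-result query cache - this is a rough sketch only, not the
existing cachedproperty implementation:

class readonly_cachedproperty:
    """Answers from a cache the loader has populated; never populates it."""

    def __init__(self, name):
        self.name = name

    def __get__(self, obj, owner=None):
        if obj is None:
            return self
        try:
            return obj._cache[self.name]
        except KeyError:
            raise AttributeError(
                "%r was not populated by the loader" % self.name)

class QueryCache:
    """Caches whole result sets keyed by the query; whether it lives
    for a single request or longer (with explicit invalidation) is the
    policy question above."""

    def __init__(self):
        self._results = {}

    def get(self, key, run_query):
        if key not in self._results:
            self._results[key] = run_query()
        return self._results[key]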

<snip>

>  - Groups become responsible for injecting attributes we need into the
> individual represents-a-row objects
>

Personally, I'm having trouble with the term "Group" here. I think of
it more as a Manager, e.g. BranchManager, although maybe that's more
for the lifecycle (CRUD) type stuff. XXXCollection comes to mind for
read-only query type stuff, but we already have some of those, like
BranchCollection. Perhaps XXXFinder?



> This can also be looked at as a combination of HQL/query builder - but
> its really much less capable than I'd expect an arbitrary HQL to be.
> 
> So - what do you guys think? Does it have legs?
> 

For me, for sure - thanks for putting pen to paper, so to speak.

This type of fundamental change is often hard to get right using a
mailing list though. It really needs a combination of this discussion we
are having, some prototyping, and a sprint where some key people are
locked in a room together and aren't allowed out till they've got
something concrete to show.

Anyway, enough waffle.




