← Back to team overview

launchpad-dev team mailing list archive

Re: persistence layer sketch/strawman

 

On Mon, Nov 29, 2010 at 12:27 PM, Ian Booth <ian.booth@xxxxxxxxxxxxx> wrote:
>
>> Actual query code should go in/under the persistence layer. I imagine
>> we'll have some general code and some code specific to the backend
>> stores that we have (which today is the three pg stores - session,
>> launchpad, launchpad_slave). I include in 'actual query code'
>> collection size estimates. It would be nice to enable systematic use
>> of size estimates in this layer, though its not a deliberate scoped
>> task.
>>
>
> Just to check I understand what you are saying - in the past, I've
> augmented collection queries with a batch size to reflect the required
> number of elements to be loaded per query, often reflecting for example
> the pagination size of the view or processing batch size of some
> business logic operation; the idea being that there's a chance that not
> all of the collection will be required (eg if the user only views the
> first page of results) so why load what's not likely to be needed. Is
> this what you mean by "size estimates"?

There are two places where we need counts (vs actual data)
 - collection sizing for batching (both web UI (first/prev/next/last)
and API (472 items)
 - aggregates for side bars (50 open bugs)

In the former case, any figure we return is stale as soon as we return
it: more objects can be added or deleted.

In the latter case the figure is also stale, but unlike the former no
followup requests happen.

So, why does this matter? count(*) is rarely cheaper than executing
the full query in postgresql. If determining the ids for all open bugs
in Ubuntu takes 5 seconds, determining the count will also take 5
seconds. There are many open bugs in Ubuntu :)

What can we do about it?
 - tune queries to be 'fast enough' that count(*) is ok
 - denormalise counts rather than doing it by count(*)
 - use the table stats in postgresql to make educated guesses

I'm saying that this layer is the point at which we choose - code
above it doesn't keen to know /how/ (but perhaps may need to say
'accuracy is needed'); code below implements a chosen approach.

>> Relations that are not traversed are not queried; we can select down
>> to individual attributes in a similar fashion to the .filter attribute
>> - using a .get or .retrieve attribute.
>>
>
> As well as not querying relations that are not required, it's also key
> to minimise the query count to get the data (attributes and collections)
> that is required, and execute the most efficient queries possible
> according to the underlying database's capabilities and quirks.

A small note: less queries != more efficient. More efficient is more
efficient, and sometimes that means returning one table and then
querying directly a related table. Query count should be /constant/
once efficiency tradeoffs are established : but less is not always
more.

> One
> thing I don't think I have seen explicitly mentioned is the notion of an
> object query language (or maybe I missed it). While conceivably a
> separate problem and out of scope to what's being discussed here, the
> type of high level constructs available tend to make it easier for
> developers to specify what they want in terms closer aligned to the end
> representation of the data, and help constrain the ways in which data is
> accessed and hence improve the ability to optimise under the covers as
> part of the mapping from the object query language to sql.
>
> I think the pseudo code which I have snipped out reflects it, but in my
> view we also need to ensure where is a clear separation between the
> verbs/actions and the nouns/model. eg so the bugs collection class
> (whatever it is called - IBugCollection, IBugs, IBugManager) should have
> methods like findUnassignedBugs() or findBugAssignedTo(IPerson) rather
> than the apis just mentioned being on the IBug interface.

mmm, I'm not convinced either way here. I think we have room to play
it a couple of ways.

>
> One extra point I would like to make in relation to the LEP:
>
> "Not requiring a cache in the layer"
>
> In my view, we need to distinguish the type of cache we are talking
> about. If we are talking about a L2 type cache with an object
> lifecycle/ttl which spans individual system interactions with the
> persistence layer and which implies the need for replication in a
> clustered environment to maintain data consistency, then I agree that we
> should try and avoid the need for this. However, I think some form of
> caching within the bounds of a single interaction is useful and perhaps
> necessary to minimise unnecessary hits on the database. The cache is
> discarded when the interaction ends but allows objects already loaded
> (whether via a single getById type operation or as a result of a query)
> to be accessed from the cache if required. This is all done
> transparently by the implementation so no explicit user code is required
> to make it work. Hibernate uses this concept with its Session construct.

The hibernate Session lives at the ORM layer; this is above that layer
I think : there may well be an ORM layer, but lets not conflate
responsibilities.

-Rob



Follow ups

References