
launchpad-dev team mailing list archive

Re: persistence layer sketch/strawman


Hi Robert

I'm really happy you've done this :-) The principles you describe in the
email and LEP reflect the sort of thing I gave Tim an ear bashing about
when I first started, so I think it's great that we can look to progress
things in this area.

> One major principle I have is that on-demand loading is actively
> harmful in high performance software: while its not as convenience for
> adhoc scripts, its very hard to reliably avoid poor performance due to
> object traversal triggering expensive (e.g. 3-4ms) queries thousands
> of times in a single web query.

+1. I have also found that explicitly considering the data requirements
of the specific use case at hand, rather than wandering over the object
model resolving references as needed in an ad hoc way, forces more
thought to be put into how best to load/query the underlying data model
efficiently. This is especially true when it comes to loading
one-to-many references and avoiding the N+1 select problem.
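
To illustrate the N+1 select problem, here is a minimal sketch using an
in-memory SQLite database. The bug/bugtask tables, the run() helper and
the stats counter are made up for the example and are not Launchpad's
schema or API. The first function issues one query per bug; the second
loads the same one-to-many data in two queries regardless of row count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bug (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE bugtask (id INTEGER PRIMARY KEY, bug_id INTEGER, target TEXT);
    INSERT INTO bug VALUES (1, 'crash on start'), (2, 'typo in help'), (3, 'slow page');
    INSERT INTO bugtask VALUES (1, 1, 'ubuntu'), (2, 1, 'debian'), (3, 2, 'ubuntu');
""")

# Hypothetical counter so the number of round trips is visible.
stats = {"queries": 0}


def run(sql, params=()):
    """Execute one query and count it."""
    stats["queries"] += 1
    return conn.execute(sql, params).fetchall()


def tasks_n_plus_one():
    """One query for the bugs, then one more query per bug (N+1)."""
    result = {}
    for (bug_id,) in run("SELECT id FROM bug ORDER BY id"):
        rows = run(
            "SELECT target FROM bugtask WHERE bug_id = ? ORDER BY id",
            (bug_id,),
        )
        result[bug_id] = [target for (target,) in rows]
    return result


def tasks_batched():
    """One query for the bugs, one query for all of their tasks."""
    result = {bug_id: [] for (bug_id,) in run("SELECT id FROM bug ORDER BY id")}
    ids = list(result)
    if not ids:
        return result
    placeholders = ",".join("?" * len(ids))
    rows = run(
        "SELECT bug_id, target FROM bugtask "
        f"WHERE bug_id IN ({placeholders}) ORDER BY id",
        ids,
    )
    for bug_id, target in rows:
        result[bug_id].append(target)
    return result
```

Both functions return the same mapping; only the number of queries
differs, and the gap grows linearly with the number of bugs.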

One statement in the LEP really highlights to me a major cause of the
problem: "much of the Zope machinery we use is hostile to that
structure: it assumes individual Python object and attribute access is
cheap or free". So we need to put in place a solution which mitigates
this issue and I think the LEP is working towards a solution which does
that.

> Actual query code should go in/under the persistence layer. I imagine
> we'll have some general code and some code specific to the backend
> stores that we have (which today is the three pg stores - session,
> launchpad, launchpad_slave). I include in 'actual query code'
> collection size estimates. It would be nice to enable systematic use
> of size estimates in this layer, though its not a deliberate scoped
> task.

Just to check I understand what you are saying: in the past, I've
augmented collection queries with a batch size giving the number of
elements to load per query, typically matching the pagination size of
the view or the processing batch size of some business logic operation.
The idea is that not all of the collection may be required (e.g. if the
user only views the first page of results), so why load what's not
likely to be needed? Is this what you mean by "size estimates"?
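
For concreteness, the kind of batch-sized query I have in mind looks
roughly like the sketch below. The table and the fetch_batch() helper
are hypothetical, and LIMIT/OFFSET is just the simplest way to show the
idea, not necessarily what the persistence layer should use.

```python
import sqlite3


def fetch_batch(conn, batch_size, offset=0):
    """Load at most batch_size bugs, starting at offset, in id order."""
    return conn.execute(
        "SELECT id, title FROM bug ORDER BY id LIMIT ? OFFSET ?",
        (batch_size, offset),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bug (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO bug (title) VALUES (?)",
    [(f"bug {n}",) for n in range(1, 101)],
)

# A view that paginates 25 items at a time only ever asks for one page.
first_page = fetch_batch(conn, batch_size=25)
```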

> Code that *requests* a partial object graph should become a consumer
> of the persistence layer.
> Code that works on objects must live above the persistence layer.


[pseudo code snipped]

> Relations that are not traversed are not queried; we can select down
> to individual attributes in a similar fashion to the .filter attribute
> - using a .get or .retrieve attribute.

As well as not querying relations that are not required, it's also key
to minimise the number of queries needed to get the data (attributes and
collections) that is required, and to execute the most efficient queries
possible given the underlying database's capabilities and quirks. One
thing I don't think I have seen explicitly mentioned is the notion of an
object query language (or maybe I missed it). While arguably a separate
problem and out of scope for what's being discussed here, the high level
constructs such a language provides tend to make it easier for
developers to specify what they want in terms closer to the end
representation of the data. They also constrain the ways in which data
is accessed, and hence improve the ability to optimise under the covers
as part of the mapping from the object query language to SQL.

I think the pseudo code which I have snipped out reflects it, but in my
view we also need to ensure there is a clear separation between the
verbs/actions and the nouns/model. E.g. the bugs collection class
(whatever it is called: IBugCollection, IBugs, IBugManager) should have
methods like findUnassignedBugs() or findBugAssignedTo(IPerson), rather
than those APIs being on the IBug interface.
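
A rough sketch of that separation, using plain Python classes rather
than Zope interfaces to keep it short. Person, Bug and BugCollection are
illustrative stand-ins, and the in-memory list stands in for whatever
the persistence layer would really query.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Person:
    name: str


@dataclass
class Bug:
    """The noun: holds data about one bug, with no query methods."""
    id: int
    title: str
    assignee: Optional[Person] = None


class BugCollection:
    """The verbs: finder methods live here, not on Bug."""

    def __init__(self, bugs):
        self._bugs = list(bugs)

    def findUnassignedBugs(self):
        return [bug for bug in self._bugs if bug.assignee is None]

    def findBugAssignedTo(self, person):
        return [bug for bug in self._bugs if bug.assignee == person]


alice = Person("alice")
bugs = BugCollection([
    Bug(1, "crash on start", assignee=alice),
    Bug(2, "typo in help"),
    Bug(3, "slow page", assignee=alice),
])
```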

One extra point I would like to make in relation to the LEP:

"Not requiring a cache in the layer"

In my view, we need to distinguish the type of cache we are talking
about. If we are talking about an L2 type cache with an object
lifecycle/ttl which spans individual system interactions with the
persistence layer and which implies the need for replication in a
clustered environment to maintain data consistency, then I agree that we
should try to avoid the need for this. However, I think some form of
caching within the bounds of a single interaction is useful and perhaps
necessary to minimise unnecessary hits on the database. The cache is
discarded when the interaction ends but allows objects already loaded
(whether via a single getById type operation or as a result of a query)
to be accessed from the cache if required. This is all done
transparently by the implementation so no explicit user code is required
to make it work. Hibernate uses this concept with its Session construct.
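
Something like the following is what I mean by an interaction-scoped
cache. It is only a sketch: the Interaction class, its get() method and
the fake_loader function are invented for the example, and this is not
how Hibernate's Session is implemented, just the same idea in miniature.

```python
class Interaction:
    """Caches loaded objects for the lifetime of one interaction only."""

    def __init__(self, loader):
        self._loader = loader   # callable that actually hits the database
        self._cache = {}        # (kind, id) -> loaded object
        self.loads = 0          # number of times the loader was called

    def get(self, kind, obj_id):
        key = (kind, obj_id)
        if key not in self._cache:
            self.loads += 1
            self._cache[key] = self._loader(kind, obj_id)
        return self._cache[key]

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # The cache never outlives the interaction, so there is nothing
        # to replicate or invalidate across processes.
        self._cache.clear()
        return False


def fake_loader(kind, obj_id):
    """Hypothetical stand-in for a real database fetch."""
    return {"kind": kind, "id": obj_id}


with Interaction(fake_loader) as interaction:
    first = interaction.get("bug", 1)
    second = interaction.get("bug", 1)   # served from the cache
```

The calling code just asks for the object; whether it came from the
database or the cache is invisible to it.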

There, that's my 2c.

