
launchpad-dev team mailing list archive

Re: sql derived data, eager loading and cachedproperty

 

On Aug 11, 2010, at 8:34 PM, Robert Collins wrote:
> 
> All the times add up substantially, get 100 products in a productset
> and we're doing 300 queries just-like-that. (Launchpad-project, or
> zope?)
> 
> Thats roughly, in pseudo code - because I haven't looked at this
> particular cases python yet:
> for product in (some query):
>   for distro in product.getDistroSeries():
>      for milestone in distro.getMileStones()

What happens inside this innermost for loop is actually what is most interesting to me.

If this is what happens:

product_distro_milestones.append(x)

Then this can actually be parallelized quite a bit; whether through
async/event-driven methods or threads is largely a matter of preference.
Whether you actually want to bomb the DB server with 2, or 10, or 100
queries at once is another question entirely.
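As a minimal sketch of the threaded variant: the per-distro query below is a hypothetical stand-in (`get_milestones` is not Launchpad's API), and the pool size is capped for exactly the reason above, so that 100 products don't turn into 100 simultaneous connections.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the per-distro milestone query; in the real
# code each call would be one round trip to the DB server.
def get_milestones(distro_id):
    return [f"{distro_id}-m{i}" for i in range(2)]

def fetch_all_milestones(distro_ids, max_workers=10):
    # Cap the pool size so we don't bomb the DB with every query at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(get_milestones, distro_ids)
    # pool.map preserves input order, so the flattened result is stable.
    return [m for milestones in results for m in milestones]

milestones = fetch_all_milestones(["d1", "d2", "d3"])
```

This only hides the latency, of course; the DB still does 100 queries' worth of work.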

Essentially this is the classic case for cores/spindles/RAM for scaling,
in that if none of these steps are dependent on one another, it may be
trivial to get them to run all at once.

I'm partial to Gearman for farming work like this out btw. ;)

www.gearman.org

Of course, another question is why are these loops running queries 
instead of building criteria for selects/unions? (Disclaimer: I'm still
not very familiar with Launchpad's data model)
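To make the "build criteria" point concrete, here is a sketch against a throwaway SQLite schema (the table and column names are invented, not Launchpad's): instead of one SELECT per distro, collect the ids and issue a single set-based SELECT with an IN clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE milestone (id INTEGER PRIMARY KEY, distro_id INTEGER, name TEXT);
    INSERT INTO milestone VALUES (1, 10, 'lucid-1'), (2, 10, 'lucid-2'),
                                 (3, 20, 'maverick-1');
""")

def milestones_for_distros(conn, distro_ids):
    # One set-based SELECT replaces N per-distro queries.
    placeholders = ", ".join("?" for _ in distro_ids)
    rows = conn.execute(
        f"SELECT distro_id, name FROM milestone"
        f" WHERE distro_id IN ({placeholders}) ORDER BY id",
        distro_ids,
    ).fetchall()
    # Group the flat result back by distro for the caller.
    by_distro = {}
    for distro_id, name in rows:
        by_distro.setdefault(distro_id, []).append(name)
    return by_distro
```

The loop then iterates over an in-memory dict rather than re-querying per row.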

> 
> How can we cache things today? What are the options?
> 

I think caching at a low level, while a good idea, can often carry such
high technical debt that it's not really worth it.

In most scenarios with caching, the obvious things come first. Common
views that don't need to be up to date in real time, very expensive
operations where any bit of popularity can cause a site outage, etc.

But once you've done that, the more complex performance problems show up
on your radar, and you're left with a dilemma. You've gotten really good
at caching simple data access patterns, and it has garnered a huge gain
in performance. But doing it for complex data structures does not scale
at the same rate.

It always jumps to mind that you can just cache/invalidate in the data
model. Even if the ORM and associated tools are incredible, doing this
is, as you suggest, non-trivial, and the amount of code written vs. the
actual gain in performance is usually a disappointment. Meanwhile, miss
one little area for invalidation/recalculation and you get weird,
hard-to-reproduce issues.
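A toy illustration of that failure mode (the class and its methods are invented for this sketch, not Launchpad code): a cached derived value stays correct only as long as every write path remembers to invalidate it.

```python
class Product:
    def __init__(self, name):
        self.name = name
        self._cache = {}

    @property
    def display_name(self):
        # Cached derived value; cheap here, but it stands in for an
        # expensive query or computation.
        if "display_name" not in self._cache:
            self._cache["display_name"] = self.name.title()
        return self._cache["display_name"]

    def rename(self, new_name):
        self.name = new_name
        # Forget this line and display_name silently goes stale --
        # exactly the hard-to-reproduce bug described above.
        self._cache.pop("display_name", None)
```

The invalidation lives far from the read path, so a new write path added next year can miss it without any test noticing.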

Far more interesting, to me, is to move data like this into scale-out
de-normalized data caches that are simply more oriented around the queries
that give the most pain from a relational standpoint. Sometimes 
materialized views make sense for this, other times pushing into key/value
stores works. Sometimes, while seemingly not a "search", pushing complex
queries into a search engine like SOLR or Sphinx works wonderfully for
this.

But generally, if you're waiting for a user to ask for a complex view,
that is a lot of work (even with caching) that you could have done as
soon as the data was written (asynchronously w/ a queueing system).
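A minimal sketch of that write-time pattern, using a plain in-process queue and worker thread as a stand-in for a real queueing system such as Gearman (the cache structure and `rebuild_view` are assumptions for illustration):

```python
import queue
import threading

# Hypothetical denormalized cache, rebuilt as soon as data is written
# rather than when a user first asks for the view.
view_cache = {}
work = queue.Queue()

def rebuild_view(product):
    # Stand-in for the expensive aggregation the page would otherwise
    # compute on every request.
    view_cache[product] = f"summary of {product}"

def worker():
    while True:
        product = work.get()
        if product is None:  # sentinel to shut the worker down
            break
        rebuild_view(product)
        work.task_done()

t = threading.Thread(target=worker)
t.start()
work.put("launchpad")   # a write enqueues a rebuild immediately
work.join()             # in production the worker just runs continuously
work.put(None)
t.join()
```

By the time a user requests the view, the answer is already sitting in the cache.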



