launchpad-dev team mailing list archive

Thread
Date

Re: sql derived data, eager loading and cachedproperty

To: Clint Byrum <clint.byrum@xxxxxxxxxxxxx>
From: Robert Collins <robert.collins@xxxxxxxxxxxxx>
Date: Fri, 13 Aug 2010 20:38:26 +1200
Cc: Launchpad Community Development Team <launchpad-dev@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <94B93267-B340-4A9D-9FFD-A9BDFF6302CB@canonical.com>
Sender: robertc@xxxxxxxxxxxxxxxxx

On Fri, Aug 13, 2010 at 4:27 PM, Clint Byrum <clint.byrum@xxxxxxxxxxxxx> wrote:
>
> On Aug 11, 2010, at 8:34 PM, Robert Collins wrote:
>>
>> All the times add up substantially, get 100 products in a productset
>> and we're doing 300 queries just-like-that. (Launchpad-project, or
>> zope?)
>>
>> Thats roughly, in pseudo code - because I haven't looked at this
>> particular cases python yet:
>> for product in (some query):
>>   for distro in product.getDistroSeries():
>>      for milestone in distro.getMileStones()
>
> What happens inside this most inner for loop is actually what is most interesting to me.
>
> If this is what happens:
>
> product_distro_milestones.append(x)

Ah, so I was flattening out a tal template :)

The inner loop will be pulling attributes out of the milestone and
rendering them.

The whole thing in principle can be parallelised but...

> Then this can actually be parallelized quite a bit, whether through
> async/event driven methods or threads is really more of a preference.
> Whether you actually want to bomb the DB server with 2, or 10, or 100
> queries at once is another question entirely.

The DB can generate a data set for this sort of thing in 50-100ms, if
we get all the rows at once. Doing multiple queries to express a
single intent seems a bit unpleasant :) - I'm very interested in why
you seem to suggest this is a positive thing?

> Essentially this is the classic case for cores/spindles/RAM for scaling,
> in that if none of these steps are dependent on one another, it may be
> trivial to get them to run all at once.

Right - I would express the intent to the DB layer and expect it to
parallelise appropriate - e.g. like cassandra.

> I'm partial to Gearman for farming work like this out btw. ;)
>
> www.gearman.org

I had been meaning to ask - we have some gearman like needs at the
moment, and are installing rabbitMQ. Do you see gearman as a
replacement for rabbit, or something that might be complementary? I
don't think we'd really want the cognitive overhead of two
nearly-identical things.

> Of course, another question is why are these loops running queries
> instead of building criteria for selects/unions? (Disclaimer: I'm still
> not very familiar with Launchpad's data model)

That is actually the point of this thread: the ORM usage style we have
drives a its-like-python API for talking to the DB, (which we got from
sqlobject waay back, but storm also encourages/is designed to deliver
this sort of thing).

I think that we can get much better results by separating the concerns out.

...

Trimmed caching digression - I completely agree with what you're
saying, and there are in fact many caveats around what we can do now,
and should aim for in the future.

> Far more interesting, to me, is to move data like this into scale-out
> de-normalized data caches that are simply more oriented around the queries
> that give the most pain from a relational standpoint. Sometimes
> materialized views make sense for this, other times pushing into key/value
> stores works. Sometimes, while seemingly not a "search", pushing complex
> queries into a search engine like SOLR or Sphinx works wonderfully for
> this.

I want to put a group together to do a solr prototype for LP :- if you
can get some cycles to help us with that, that would be awesome. We
have a number of denormalised caches in LP's data model already - some
are pulling their weight, some perhaps not :- but our performance is
currently all over the shop, so its a bit hard to tell.

> But generally, if you're waiting for a user to ask for a complex view,
> that is a lot of work (even with caching) that you could have done as
> soon as the data was written (asynchronously w/ a queueing system).

Yep - thats really at a different level the performance analysis stack
- its something we do already in some places : as Danilo says, we've
solved some pretty hairy DB performance things in the past and
continue to do so.

-Rob

Follow ups

Re: sql derived data, eager loading and cachedproperty
From: Aaron Bentley, 2010-08-23
Re: sql derived data, eager loading and cachedproperty
From: Michael Hudson, 2010-08-15

References

sql derived data, eager loading and cachedproperty
From: Robert Collins, 2010-08-12
Re: sql derived data, eager loading and cachedproperty
From: Clint Byrum, 2010-08-13