launchpad-dev team mailing list archive

Re: brainstorm: cheaper API collection iteration

Hi, thanks for raising this.

It's very much on my mind because I've been working with/on lp:kanban,
which wants to poll bug collections from Launchpad regularly.  Even
running from another machine in the Canonical data centre (so mere
milliseconds from Launchpad) it's pretty slow.  If anyone is going to
work on this, I would recommend running it and looking at what network
traffic it actually generates.

A few thoughts, somewhat related to Robert's points:

Nearly any practical operation on bugs is going to take O(n) round
trips, because nearly anything will want both the bug and its
bugtask(s), and you cannot fetch them together in batches.  It would
be good to work out a way to send them all at once and so avoid potato
programming.  I don't know about the limitations of the mapping layer,
but conceptually, and in terms of network representation, it should
not be hard to say that tasks are value objects sent inline with the
enclosing bug.  Fixing this kind of thing would speed up many clients,
and I expect it would also reduce server load.  The same goes for
merge proposals.
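
To make the pattern concrete, here is a minimal sketch of what that
potato programming looks like with launchpadlib (anonymous access to a
public project; the project name is just illustrative):

  from launchpadlib.launchpad import Launchpad

  # Anonymous read-only session against production Launchpad.
  lp = Launchpad.login_anonymously('potato-demo', 'production')
  project = lp.projects['launchpad']

  # One round trip per batch of tasks...
  for task in project.searchTasks(status=['In Progress']):
      # ...plus one more round trip per task, because the bug is a
      # separate resource that cannot be fetched inline with its tasks.
      bug = task.bug
      print(bug.id, bug.title)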

Launchpad could either take the position that tasks should always be
sent inline with bugs, or provide a way for the client to express
"please send me all the comments too", which is the expand case.
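
In wire terms the expand case might look something like this
(hypothetical syntax, borrowing the existing ws.* parameter
convention):

  GET /1.0/bugs/123456?ws.expand=bug_tasks,messages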

http://pad.lv/712924 shows that searchTasks is much slower than
reading a collection.  I have not dug into why, but it may be that the
batch size is smaller.
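
One way to dig in would be a crude timing harness around each style of
read, something like this (the status filter is arbitrary):

  import time

  from launchpadlib.launchpad import Launchpad

  lp = Launchpad.login_anonymously('timing-demo', 'production')
  project = lp.projects['launchpad']

  start = time.time()
  count = sum(1 for _task in project.searchTasks(status=['Triaged']))
  print('searchTasks: %d results in %.1fs' % (count, time.time() - start))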

I find it kind of weird that Launchpad is in some ways quite pedantic
about HTTP and being RESTful, and in others has very RPC-like,
non-RESTful methods to get collections and do searches.  I would like
it if the URLs looked more like collections, e.g.
/~mbp/bugs/status=In+Progress/ or something.  That might make them
more cacheable, and more understandable for client developers.  I have
a sense that people do not look closely enough at what actually goes
across the wire, and that things would be better if they did.

If you look at almost any other web service with an API, the
documentation talks about the URLs that are generated and gives
examples of the XML/JSON that is sent back.  Launchpad acts like it
might as well be a black box (or black pipe), or CORBA.
<http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing>
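
For the record, what actually goes across the wire today is roughly
this (shape from memory, heavily abridged):

  GET https://api.launchpad.net/1.0/launchpad?ws.op=searchTasks&status=In+Progress

  {"total_size": 1234,
   "start": 0,
   "next_collection_link": "https://api.launchpad.net/1.0/launchpad?ws.op=searchTasks&status=In+Progress&ws.start=75",
   "entries": [{"self_link": "...", "web_link": "...",
                "status": "In Progress", ...}]}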

If an API client wants to read, for example, all the open bugs
assigned to ~mbp, there seems little point in making it read multiple
batches.  That just means more round trips and more overhead for both
the client and the server, and more risk of tearing between pages.
Unlimited-size collections would be very good.
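
To see the cost, this is essentially what every client has to do now
to read a whole collection (a sketch using requests, authentication
elided):

  import requests

  def iter_collection(url):
      """Yield every entry in a Launchpad collection, following
      next_collection_link page by page.  Each page is one round trip;
      an unlimited batch size would collapse this into one request.
      """
      while url:
          page = requests.get(url).json()
          for entry in page['entries']:
              yield entry
          url = page.get('next_collection_link')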

I think many clients are going to be interested in
most-recently-changed collections, either because they are drawing a
timeline-like view, because they are updating a local cache, or
because they _could_ have a local cache if it were easy to keep fresh.
This can suffer tearing, but of a very predictable and manageable
kind.

More generally, I agree that batching based on actual keys will be
just as good for clients as offset-based batching, or better, and
easier on the database.  (Again, I can't help thinking that if you
thought about the URLs, rather than about the client-specific
abstraction of a Python list, this might come out more naturally.)
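
To spell out the database point, the two styles boil down to queries
like these (schematic SQL, not Launchpad's actual schema):

  -- Offset batching: the server must walk and discard 7500 rows to
  -- produce page 101, so deep pages get progressively slower.
  SELECT ... FROM bugtask ORDER BY id LIMIT 75 OFFSET 7500;

  -- Key-based batching: a single index seek from the last key seen,
  -- the same cost at any depth; the key also fits naturally in a URL.
  SELECT ... FROM bugtask WHERE id > :last_seen_id ORDER BY id LIMIT 75;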

For example, lp:kanban really only wants to know about bugs that have
changed since the last time it was run; if lp and lplib supported that
easily and efficiently, it could be making about one HTTP request per
update, not thousands, with something like:

  /~mbp/bugs/assigned/open/modified_since=123123123

I could say a bunch about what a better client (Wrested) would look
like, especially with caching, but that's perhaps a separate topic.
You could have a smart cache that knew how to freshen itself with a
query like the one above, without rereading all bugs.
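
As a sketch of that smart cache (the modified_since URL is
hypothetical, as above; everything else is ordinary polling):

  import json
  import time

  import requests

  def freshen(cache_path='bugs-cache.json'):
      """Bring a local bug cache up to date with one request, assuming
      the hypothetical modified_since filter sketched above."""
      try:
          with open(cache_path) as f:
              cache = json.load(f)
      except FileNotFoundError:
          cache = {'last_sync': 0, 'bugs': {}}
      # Hypothetical URL shape; a server-supplied sync token would be
      # safer than local wall-clock time.
      url = ('https://api.launchpad.net/1.0/~mbp/bugs/assigned/open'
             '?modified_since=%d' % cache['last_sync'])
      for entry in requests.get(url).json()['entries']:
          cache['bugs'][entry['self_link']] = entry  # overwrite stale copies
      cache['last_sync'] = int(time.time())
      with open(cache_path, 'w') as f:
          json.dump(cache, f)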

I think web UI batching is nearly pointless: if there are hundreds of
bugs, actually paging through them is an act of desperation.  People
want a more selective search.  Seeing the total count is useful, and
being able to page through just to get a sense of what's there will
occasionally be used.  (We could get data on this by seeing what
fraction of requests for a bug listing have an offset other than 0.)
Spiders will walk through it; perhaps there should be a
spider-specific dump view, grouped by date, so the pages stay stable.
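
Getting that data would be a few lines over the access logs, e.g.
(the "start" parameter name is a guess at what the batching machinery
uses):

  import re
  import sys

  total = paged = 0
  for line in sys.stdin:          # Apache-style access log on stdin
      if '/+bugs' in line:        # bug listing pages
          total += 1
          m = re.search(r'[?&]start=(\d+)', line)
          if m and int(m.group(1)) > 0:
              paged += 1
  if total:
      print('%d of %d listing requests (%.1f%%) paged past the first batch'
            % (paged, total, 100.0 * paged / total))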

hth
Martin


