launchpad-dev team mailing list archive
Message #07358
exposed map/reduce API
One idea that came up talking to Robert about the Services design was
exposing an external map/reduce interface, mentioned under
<https://dev.launchpad.net/ArchitectureGuide/ServicesRoadmap#A
map/reduce facility>. It is pretty blue sky at the moment but I think
it is such an interesting idea it would be worth writing down.
Launchpad does a lot of API traffic, which could be (citation needed)
broadly grouped into many small users doing a few calls to eg file a
bug, plus a number of clients that do huge amounts of bulk traffic.
These bulk users are typically pulling data out of Launchpad to do
offline bulk digestion, for instance to draw
<http://people.canonical.com/~mbp/kanban/canonical-bazaar-kanban.html>
or many other different Ubuntu reports. Typically they want something
like "all bugs, with their tasks and mps, assigned to people in
~canonical-bazaar and either underway or finished in the last 30 days"
or "all in progress Ubuntu bugs" or "the top 1000 ubuntu bugs by
number of affected users." These tools take a long time to run, do
thousands of api requests, and probably thereby put a fair load on
Launchpad.
0- Some of these are things that could be done in Launchpad but are
not yet. It might be good to include some of them, such as a Kanban
view of bugs or some of the Ubuntu QA reports, within Launchpad
itself, and the efforts to make it
easier to change Launchpad and easier to get spontaneously contributed
changes landed help with that. But many have to a greater or lesser
degree some user specific policy that might be hard to generalize;
Launchpad doesn't necessarily want to get every useful feature within
the main ui; and many tools can be useful enough without being at the
level of quality that would justify being widely available.
1- Another way to tackle this is to provide more aggregated APIs, like
"give me all the bugs assigned and touched recently, with their tasks
and mps" in one go. The REST API enhancement LEP
<https://dev.launchpad.net/LEP/WebservicePerformance> goes towards
this by offering a generic expand-out feature, though I think it would
also be useful to have some APIs that just have hardcoded common sense
that, e.g., if you get a bug you want its tasks too. That would let the
kanban software make O(1) calls that get a moderately large response
to draw everything.
2- Another approach is to make it easier for the client to maintain an
offline cache by emphasizing "get me changes since date X" or "get me
objects ordered by last change" (key cases like bugs already exist);
and a client library that will make intelligent use of this abstracted
from the application code. I think Arsenal does this.
Getting better apis, and better handling of cached results, would let
API clients do totally general work with probably something like 10x
to 100x fewer API calls, correspondingly faster time, and nearly that
much less Launchpad server load.
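
As a rough sketch of the caching idea in point 2: the client keeps a
local copy and only asks for what changed since its last sync. The
names here (IncrementalCache, fetch_changed_since, make_fake_server,
and the record fields "id" and "date_last_updated") are made up for
illustration and are not the real Launchpad API or Arsenal's
implementation; the fake server just stands in for a "get me changes
since date X" call.

```python
from datetime import datetime, timezone


class IncrementalCache:
    """Keep a local copy of objects and refresh only what changed.

    `fetch_changed_since` is a hypothetical callable standing in for a
    "get me changes since date X" API call.
    """

    def __init__(self, fetch_changed_since):
        self._fetch = fetch_changed_since
        self._objects = {}       # id -> latest known record
        self._last_sync = None   # newest modification time seen so far

    def refresh(self):
        """Pull records changed since the last sync; return how many arrived."""
        changed = self._fetch(self._last_sync)
        for record in changed:
            self._objects[record["id"]] = record
            stamp = record["date_last_updated"]
            if self._last_sync is None or stamp > self._last_sync:
                self._last_sync = stamp
        return len(changed)

    def all(self):
        """Return every record currently held in the cache."""
        return list(self._objects.values())


def make_fake_server(records):
    """Build an in-memory stand-in for the server (assumption, for testing)."""
    def fetch_changed_since(since):
        # A strict ">" comparison can miss records that share the exact
        # timestamp of the last sync; a real client would need to handle that.
        return [r for r in records
                if since is None or r["date_last_updated"] > since]
    return fetch_changed_since
```

The first refresh() is a full fetch; later ones should only transfer
the records touched in between, which is where the saving in API
calls would come from.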
3- Robert pointed out that having every API user keep a replicated
copy of parts of the Launchpad database is perhaps not the most
elegant solution, compared to doing this work on the server. They
could instead send a kind of map/reduce expression to the server and
get back the results.
So things like the kanban that want to say "give me everything
assigned to mbp or jam or ... and either inprogress or (fixreleased
and fixed in the last 30 days)" could make a (say) javascript
expression of such and get back just the actually relevant bugs,
rather than fetching a lot more stuff and filtering client side.
Tools that want to count or summarize bugs in various states can
obviously reduce it on the server side too.
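
To make that a bit more concrete, here is a minimal sketch of what
such an expression might compute, written in Python rather than
javascript purely for illustration. Everything in it is an
assumption: run_map_reduce and make_kanban_map are hypothetical
helpers, and the record fields ("bug_id", "assignee", "status",
"date_fixed") are invented, not Launchpad's schema. The map step
emits (status, bug id) pairs for the tasks the kanban cares about,
and the reduce step either lists or counts them.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def run_map_reduce(records, map_fn, reduce_fn):
    """Map each record to (key, value) pairs, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def make_kanban_map(assignees, now, window=timedelta(days=30)):
    """Build a map function for: assigned to one of `assignees`, and either
    in progress or fix released within `window` of `now`."""
    def map_fn(task):
        if task["assignee"] not in assignees:
            return  # emits nothing for this task
        recent_fix = (task["status"] == "Fix Released"
                      and task["date_fixed"] is not None
                      and now - task["date_fixed"] <= window)
        if task["status"] == "In Progress" or recent_fix:
            yield task["status"], task["bug_id"]
    return map_fn


def list_ids(key, values):
    """Reduce step that returns the matching bug ids."""
    return sorted(values)


def count(key, values):
    """Reduce step that only counts the matches."""
    return len(values)
```

A kanban-style client would use list_ids to get just the relevant
bugs back, while a reporting tool could pass count and receive only a
handful of numbers.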
One issue in doing this would be designing/choosing an expression
language and deploying it.
Perhaps a larger issue is that some of these jobs may take a long
time; perhaps longer than is realistic for a single web request;
certainly longer than is permitted in a single call at the moment.
Badly designed calls might create a lot of load. So possibly this
should be done out of a separate data warehouse, which would also be a
chance to move it into a form that is more suited to mapreduce
queries.
This does seem like kind of a long path, so I wonder if there are
mapreducey things that can be done within the existing rest
synchronous real-data setup.
Martin