
launchpad-dev team mailing list archive

Re: exposed map/reduce API


On Wed, Jun 15, 2011 at 6:34 AM, Martin Pool <mbp@xxxxxxxxxxxxx> wrote:
> One idea that came up talking to Robert about the Services design was
> exposing an external map/reduce interface, mentioned under
> <https://dev.launchpad.net/ArchitectureGuide/ServicesRoadmap#A
> map/reduce facility>.  It is pretty blue sky at the moment but I think
> it is such an interesting idea it would be worth writing down.

Are you interested in implementing a map/reduce API? There are quite
a few things we could make a lot better by doing one, IMO.

> 2- Another approach is to make it easier for the client to maintain an
> offline cache by emphasising "get me changes since date X" or "get me
> objects ordered by last change" (key cases like bugs already exist);
> and a client library that will make intelligent use of this abstracted
> from the application code.  I think Arsenal does this.
>
> Getting better apis, and better handling of cached results, would let
> API clients do totally general work with probably something like 10x
> to 100x fewer API calls, correspondingly faster time, and nearly that
> much less Launchpad server load.
>
> 3- Robert pointed out that having every API user keep a replicated
> copy of parts of the Launchpad database is perhaps not the most
> elegant solution, compared to doing this work on the server.  They
> could instead send a kind of map/reduce expression to the server and
> get back the results.

My mental sketch for this service would have it call forward: in
Python terms the signature is something like:
Launchpad.mapreduce(objectsearch, mapfn, reducefn, resultendpoint)

resultendpoint would get a signed POST from LP with the results, when
they are ready. However, see below: we can be much more low-key
initially and just use email.
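To make those semantics concrete, here is a toy local driver standing in for what the datacentre side might do. Nothing here is a real Launchpad API: run_mapreduce, the field names, and the sample bugs are all invented for illustration.

```python
from collections import defaultdict


def map_fn(bug):
    """Emit (key, value) pairs for one object matched by the search."""
    return [(bug["status"], 1)]


def reduce_fn(key, values):
    """Combine all values emitted under one key."""
    return sum(values)


def run_mapreduce(objects, mapfn, reducefn):
    """Toy sequential driver; the real thing would run in the
    datacentre, in parallel, and POST the result to resultendpoint."""
    buckets = defaultdict(list)
    for obj in objects:
        for key, value in mapfn(obj):
            buckets[key].append(value)
    return {key: reducefn(key, values) for key, values in buckets.items()}


bugs = [
    {"id": 1, "status": "In Progress"},
    {"id": 2, "status": "Fix Released"},
    {"id": 3, "status": "In Progress"},
]
print(run_mapreduce(bugs, map_fn, reduce_fn))
# {'In Progress': 2, 'Fix Released': 1}
```

The client never sees the 800K bugs; it only ships the two small functions and gets back the aggregated answer.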

Having folk replicate LP in an ad hoc fashion isn't just inelegant: any
new bug analysis task requires someone new to pull all 800K bugs +
830K bugtasks + 9M messages out of the DB, store it locally, and then
process it. That makes running analysis a complex and time-consuming
task. It's great folk /can/ do it, but it's also hard to support - our
top timeout today is due to folk analysing hardware DB records - at
2.7M rows into the collection it starts timing out.

And yes, we can (and will) do things to make handling of such large
collections better, but letting the core analysis code run in the
datacentre, in parallel, on shared resources seems like a great way to
offer a better experience for folk.

We can combine restrictions that filter the incoming data with map
reduce - for instance, analysing the last month's bugs would only
process the last month's bugs, using our date-based index.
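In the sketched signature above, that restriction would live in the objectsearch argument, so the map function never even sees rows outside the window. The query-string form here is purely hypothetical; Launchpad has no such parameter today.

```python
from datetime import date, timedelta

# Hypothetical: a date-restricted objectsearch narrows the input set
# before any map function runs, so only one month of rows is touched.
since = date.today() - timedelta(days=30)
objectsearch = f"bugs?created_since={since.isoformat()}"
print(objectsearch)
```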

> So things like the kanban that want to say "give me everything
> assigned to mbp or jam or ... and either inprogress or (fixreleased
> and fixed in the last 30 days)" could make a (say) javascript
> expression of such and get back just the actually relevant bugs,
> rather than fetching a lot more stuff and filtering client side.

There are Python sandboxes around we could use too, though JavaScript
is perhaps easier to be confident in.
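Martin's kanban query would then be just a map function with the filter inlined. In Python terms it might look like the following; the field names, statuses, and assignee strings are illustrative, not the real bugtask schema.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical map function a kanban client might submit.
RECENT = datetime.now(timezone.utc) - timedelta(days=30)


def map_fn(task):
    wanted = task["assignee"] in ("mbp", "jam")
    in_progress = task["status"] == "In Progress"
    recently_fixed = (
        task["status"] == "Fix Released" and task["date_fixed"] >= RECENT
    )
    # Emit the task only if it passes the filter; everything else is
    # dropped server-side instead of being shipped to the client.
    if wanted and (in_progress or recently_fixed):
        return [(task["id"], task)]
    return []
```

Only the matching tasks come back over the wire, so the client-side filtering step disappears entirely.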

> Tools that want to count or summarize bugs in various states can
> obviously reduce it on the server side too.
>
> One issue in doing this would be designing/choosing an expression
> language and deploying it.
>
> Perhaps a larger issue is that some of these jobs may take a long
> time; perhaps longer than is realistic for a single web request;
> certainly longer than is permitted in a single call at the moment.
> Badly designed calls might create a lot of load.   So possibly this
> should be done out of a separate data warehouse, which would also be a
> chance to move it into a form that is more suited to mapreduce
> queries.

I wouldn't try to run map/reduce jobs in a webserver context
initially; it's certainly possible if we wanted to aim at it, but we'd
want, oh, 70%-80% utilisation on the mapreduce cluster - we'd need an
*awfully* large number of jobs coming through it to justify a parallel
cluster large enough to process every bug in Ubuntu in < 5 seconds.

That said, in terms of the database, we could handle maybe 10
concurrent mapreduce jobs today without blinking (where a mapreduce
job means 50% utilisation of a DB CPU). Moving stuff off of APIs onto
mapreduce would probably free a lot of resources too - we use 5 cores
on the master DB at the moment for webapp (including API) traffic,
and all API traffic is on the master today; in a mapreduce all the
traffic would be on the slave DBs, increasing our scalability.

> This does seem like kind of a long path, so I wonder if there are
> mapreducey things that can be done within the existing rest
> synchronous real-data setup.

I think a very simple map reduce can be done as follows:
 - use our existing api object representations
 - allow forwarding the output of a map reduce back into map reduce (chaining)
 - use the existing Job framework to dispatch and manage map reduce runs
 - start with a concurrency limit of 2
 - pick whatever language is most easily sandboxed
 - send results to the submitter's preferred email address.
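The chaining point above is the only subtle one: the output records of one run have to be valid input objects for the next. A toy sketch of what that could mean, reusing invented names (run_mapreduce, the field names, and the sample tasks are not real Launchpad code):

```python
from collections import defaultdict


def run_mapreduce(objects, mapfn, reducefn):
    """Toy driver; the real runs would be dispatched through the
    existing Job framework with a concurrency limit."""
    buckets = defaultdict(list)
    for obj in objects:
        for key, value in mapfn(obj):
            buckets[key].append(value)
    # Emit plain records so the output can be fed back in (chaining).
    return [
        {"key": key, "value": reducefn(key, values)}
        for key, values in buckets.items()
    ]


# Pass 1: count bugtasks per package (illustrative field names).
tasks = [
    {"package": "bzr", "status": "Triaged"},
    {"package": "bzr", "status": "Fix Released"},
    {"package": "launchpad", "status": "Triaged"},
]
per_package = run_mapreduce(
    tasks,
    lambda t: [(t["package"], 1)],
    lambda k, vs: sum(vs),
)

# Pass 2: chain pass 1's output back in for a grand total.
total = run_mapreduce(
    per_package,
    lambda rec: [("all", rec["value"])],
    lambda k, vs: sum(vs),
)
print(total)
# [{'key': 'all', 'value': 3}]
```

In the simple version sketched in the list, pass 2's input would arrive as the emailed (or forwarded) output of pass 1 rather than an in-process list.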

These points are picked to minimise development time: if we find that
the result is awesome, we can look at more complex but higher-return
approaches, such as getting http://discoproject.org to sandbox and
using it instead, or letting object searches return bug + tasks and so
forth.

There's a raft of other things, like dealing with too many users, which
we can deal with by observing what happens.

-Rob

