
launchpad-dev team mailing list archive

Re: exposed map/reduce API

 

On 14 June 2011 13:05, Robert Collins <robertc@xxxxxxxxxxxxxxxxx> wrote:
> On Wed, Jun 15, 2011 at 6:34 AM, Martin Pool <mbp@xxxxxxxxxxxxx> wrote:
>> One idea that came up talking to Robert about the Services design was
>> exposing an external map/reduce interface, mentioned under
>> <https://dev.launchpad.net/ArchitectureGuide/ServicesRoadmap#A
>> map/reduce facility>.  It is pretty blue sky at the moment but I think
>> it is such an interesting idea it would be worth writing down.
>
> Are you interested in implementing a map reduce API? There are quite
> a few things we'd make a lot better by doing one, IMO.

I am interested, though I don't know if I'll have time to do it.

Having seen how much good stuff people get out of the API, but also
how much using it can be like sucking a camel through the eye of a
needle, I think there is a large potential win here.

> My mental sketch for this service would have it call forward: in
> python terms the signature is something like:
> Launchpad.mapreduce(objectsearch, mapfn, reducefn, resultendpoint)

So the objectsearch parameter is something that can be compiled into a
Storm SQL expression, so that we don't have to map over everything in
the database?
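
For concreteness, I imagine the compilation step looking vaguely like
this (a sketch only: BugTask, store, and the payload shape are
stand-ins for whatever the real model provides):

    # Hypothetical: compile a small declarative objectsearch payload
    # into a Storm clause, so only matching rows are ever read.
    # BugTask and store are assumed from Launchpad's model code; the
    # column names are approximate.
    from storm.expr import And, In

    def compile_objectsearch(search):
        return And(In(BugTask.assigneeID, search['assignee_ids']),
                   In(BugTask.status, search['statuses']))

    matching = store.find(BugTask, compile_objectsearch(search))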

> resultendpoint would get a signed POST from LP with the results, when
> they are ready. However see below, we can be much more low-key
> initially and just use email.

I think it is quite desirable not to require the client to have an
SMTP or HTTP listener.  For instance, I do not have one on my NATted
laptop, but I might like to experiment with the API.  But perhaps it's
the pragmatic thing to start with.  Another alternative would be to
stick the results in the Librarian and let clients poll or long-GET.
(Hm, can we give a predictable URL in advance?)
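
Polling could be pretty dumb, e.g. (hypothetical sketch, assuming the
submission call hands back the eventual Librarian URL up front):

    import time
    import urllib2

    # Hypothetical client: 'url' is assumed to be the Librarian
    # location the results file will eventually appear at.
    def wait_for_results(url, interval=30):
        while True:
            try:
                return urllib2.urlopen(url).read()
            except urllib2.HTTPError as e:
                if e.code != 404:   # 404 just means "not ready yet"
                    raise
                time.sleep(interval)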

>
> Having folk replicate LP in an ad hoc fashion isn't just inelegant: any
> new bug analysis task requires someone new to pull all 800K bugs +
> 830K bugtasks + 9M messages out of the DB, store it locally, and then
> process it. It makes running analysis a complex and time-consuming
> task. It's great folk /can/ do it, but it's also hard to support - our
> top timeout today is due to folk analysing hardware DB records - at
> 2.7M rows into the collection it starts timing out.

Right, I agree: this is awful for all parties, and yet many people do
go to the trouble of doing it, which suggests a better way would be
worthwhile.

>> So things like the kanban that want to say "give me everything
>> assigned to mbp or jam or ... and either inprogress or (fixreleased
>> and fixed in the last 30 days)" could make a (say) javascript
>> expression of such and get back just the actually relevant bugs,
>> rather than fetching a lot more stuff and filtering client side.
>
> There are python sandboxes around we could use too, though javascript
> is perhaps easier to be confident in.

(Someone here at Velocity, I think BrowserMob, takes the fairly
creative approach of spinning up an EC2 instance holding the raw
results in a MySQL database.  The user can do anything they want and
can only hurt themselves.  It's not exactly a good fit for us but it
is quite clever.)

> I wouldn't try to run map reduce jobs in a webserver context
> initially; it's certainly possible if we wanted to aim at it, but
> we'd want, oh, 70%-80% use on the mapreduce cluster - we'd need an
> *awfully* large number of jobs coming through it to need a parallel
> cluster large enough to process every bug in Ubuntu in < 5 seconds.

I think mapreduce as such would not make sense.  Taking something that
can generate sensible database queries with non-enormous results, and
then doing some manipulation of the results, all capped by the
existing request timeout, could make sense.  The queries that
currently make up, say, a +bugs page all run in 2s (or whatever), and
an API call could reasonably be allowed to do a similar amount of
work.  Perhaps it is better to steer straight for mapreduce if it's
actually cheap, but I have some fear of introducing the new
infrastructure and concepts needed to run such pipelines.

> I think a very simple map reduce can be done as follows:
>  - use our existing api object representations
>  - allow forwarding the output of a map reduce back into map reduce (chaining)
>  - use the existing Job framework to dispatch and manage map reduce runs
>  - start with a concurrency limit of 2
>  - pick whatever language is most easily sandboxed
>  - send results to the submitter's preferred email address.
>
> These points are picked to minimise development time: if we find that
> the result is awesome, we can look at doing more complex but
> higher-return approaches such as getting http://discoproject.org to
> sandbox and using it instead; letting object searches return bug +
> tasks and so forth.

Well, that's pretty awesome to hear you think this could be simple.
Do you imagine this mapreduce would talk directly to the db like other
jobs?  I can imagine that could work...
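
To make the shape of such a job concrete, I picture something as small
as this (a toy example; the bug dicts would just be our existing API
representations):

    # Toy job: count open bugs per tag.  mapfn emits (key, value)
    # pairs for one input object; reducefn folds all the values that
    # arrived for a single key - the usual map/reduce contract.
    def mapfn(bug):
        for tag in bug['tags']:
            yield (tag, 1)

    def reducefn(tag, counts):
        return (tag, sum(counts))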

So, actually, we could start by only supporting the objectsearch
parameter, not any map or reduce function.  I think many of these
calls really can just be expressed in SQL.
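
That is, hypothetically, the kanban case above collapses to something
like this (every name here is invented, following Robert's sketch
signature):

    # Hypothetical: the kanban query as a pure objectsearch call,
    # with the map and reduce steps left out entirely; results
    # delivered by email to start with.
    lp.mapreduce(
        objectsearch={'collection': 'bugtasks',
                      'assignee': ['mbp', 'jam'],
                      'status': ['In Progress', 'Fix Released'],
                      'modified_within_days': 30},
        mapfn=None,
        reducefn=None,
        resultendpoint='mailto:submitter@example.com')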

Very interesting...

Martin

