
launchpad-dev team mailing list archive

Re: exposed map/reduce API

On Fri, Jun 17, 2011 at 1:21 PM, Martin Pool <mbp@xxxxxxxxxxxxx> wrote:


>> My mental sketch for this service would have it call forward: in
>> python terms the signature is something like:
>> Launchpad.mapreduce(objectsearch, mapfn, reducefn, resultendpoint)
>
> So the objectsearch parameter is something that can be compiled into a
> Storm SQL expression, so that we don't have to map over everything in
> the database?

Something like that. I think we'd want to create -one- search language
and let that evolve as needed.
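
For illustration only, here's a toy, in-memory sketch of the semantics
I have in mind. None of this is a real Launchpad API; the bug dicts,
the counting map/reduce pair, and the driver are all placeholders:

    # Toy illustration of the proposed semantics, not a real Launchpad API.
    # The real call would be roughly:
    #   Launchpad.mapreduce(objectsearch, mapfn, reducefn, resultendpoint)
    # where objectsearch is compiled server-side into a query.
    from collections import defaultdict

    def mapfn(bug):
        # emit (key, value) pairs for one object
        yield bug["status"], 1

    def reducefn(key, values):
        return key, sum(values)

    def run_locally(objects):
        buckets = defaultdict(list)
        for obj in objects:
            for key, value in mapfn(obj):
                buckets[key].append(value)
        return dict(reducefn(k, vs) for k, vs in buckets.items())

    print(run_locally([{"status": "New"}, {"status": "Fix Released"},
                       {"status": "New"}]))
    # -> {'New': 2, 'Fix Released': 1}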

>> resultendpoint would get a signed POST from LP with the results, when
>> they are ready. However, see below: we can be much more low-key
>> initially and just use email.
>
> I think it is quite desirable not to require the client to have an
> SMTP or HTTP listener.  For instance, I do not have one on my NATted
> laptop, but I might like to experiment with the API.  But perhaps it's
> the pragmatic thing to start with.  Another alternative would be to
> stick the results in the librarian and let them poll or long-get. (Hm,
> can we give a predictable URL in advance?)

Email is approximately zero development effort to make happen. We can
obviously iterate towards any degree of polish from there.

I think there's room to aim at different sorts of completion long
term, but POST - passing a message forward - is pretty standard for
this sort of thing. Long polling etc. can be built on top of that.
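
Just to sketch what the receiving end might look like: the header
name, the HMAC-over-the-body scheme, and the shared secret below are
assumptions for illustration, not anything LP actually does today.

    # Hypothetical receiver for the signed result POST.
    import hashlib
    import hmac
    from http.server import BaseHTTPRequestHandler, HTTPServer

    SECRET = b"shared-secret-agreed-out-of-band"   # assumed, not an LP mechanism

    class ResultEndpoint(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            sent = self.headers.get("X-Result-Signature", "")
            want = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
            if hmac.compare_digest(sent, want):
                self.send_response(200)   # accept and store the results
            else:
                self.send_response(403)   # reject a callback we can't verify
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), ResultEndpoint).serve_forever()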

>> I wouldn't try to run map reduce jobs in a webserver context
>> initially; it's certainly possible if we wanted to aim at it, but we'd
>> want, oh, 70%-80% utilisation of the mapreduce cluster - we'd need an
>> *awfully* large number of jobs coming through it to need a parallel
>> cluster large enough to process every bug in ubuntu in < 5 seconds.
>
> I think mapreduce as such would not make sense.  Taking something that
> can generate sensible database queries with non-enormous results, and
> then doing some manipulation of the results, all capped by the
> existing request timeout, could make sense.  The queries that
> currently make up, say, a +bugs page all run in 2s (or whatever), and
> an API call could reasonably be allowed to do a similar amount of
> work.  Perhaps it is better to steer straight for mapreduce if it's
> actually cheap.  I have some fear of the pipelines: introducing new
> infrastructure and concepts to run it.

Constraining this to run on at most 75 bugs or something would be
(IMO) useless. I don't think the two problems are close enough to
justify a single solution for both, and I don't think a system built
to browser-request timescales would scale to the million-row datasets
we have.
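
Rough arithmetic to show why; the per-worker rate below is a number
I've made up purely for the sake of the sum:

    # Back-of-envelope only; 1000 bugs/sec/worker is an invented figure.
    bugs = 1_000_000        # order of "every bug in ubuntu"
    deadline = 5            # seconds, the webserver-ish target above
    per_worker = 1_000      # bugs mapped per second per worker (assumed)
    print(bugs / (deadline * per_worker))   # -> 200.0 workers for one job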

> Well, that's pretty awesome to hear you think this could be simple.
> Do you imagine this mapreduce would talk directly to the db like other
> jobs?  I can imagine that could work...

I think the driver pulling data out of the database might; no other
part of it would, and the driver would need to manage its transactions
to avoid long-transaction issues.
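
Something like this batching pattern is what I mean by managing its
transactions; the store API and batch size are placeholders, not
actual Launchpad/Storm code:

    # Illustrative only: pull rows out in batches, committing between
    # batches so no single transaction stays open for the whole job.
    BATCH = 1000

    def drive(store, select_ids, fetch, emit):
        ids = select_ids(store)                 # cheap query for candidate ids
        for start in range(0, len(ids), BATCH):
            chunk = ids[start:start + BATCH]
            for obj in fetch(store, chunk):     # load one batch of objects
                emit(obj)                       # hand each off to the map step
            store.commit()                      # keep each transaction short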

-Rob

