← Back to team overview

launchpad-reviewers team mailing list archive

Re: [Merge] lp:~julian-edwards/launchpad/buildd-manager-parallel-scan into lp:launchpad/devel

 

On Fri, Jul 30, 2010 at 1:04 PM, Julian Edwards
<julian.edwards@xxxxxxxxxxxxx> wrote:
> On Tuesday 27 July 2010 18:25:58 Jonathan Lange wrote:
>> Hey Julian,
>>
>> Thanks for doing this work. I'm very sorry that one of your first
>> exposures to Twisted has been in this fashion.
>
> Thanks for taking on the review Jono.  We've already gone through this on the
> phone so I'll summarise stuff inline.

Thanks.

...
>>
>> > +        buildqueue = store.find(
>> > +            BuildQueue,
>> > +            BuildQueue.job == Job.id,
>> > +            BuildQueue.builder == self.builder.id,
>> > +            Job._status == JobStatus.RUNNING,
>> > +            Job.date_started != None).one()
>> > +
>>
>> The comment makes me think there should be a method called
>> IBuilder.getActiveBuild().
>
> I've made IBuildQueueSet.getByActiveBuilder()
>

Thanks. Why that and not IBuilder.getActiveBuild()?

>> > +            return None
>> > +
>> > +        # If the builder is marked unavailable, don't dispatch anything.
>> > +        # Additionally see if we think there's a job running on it.  If
>> > +        # there is then it means the builder was just taken out of the
>> > +        # pool and we need to put the job back in the queue.
>>
>> It took me a couple of tries to parse this, but I don't think it's
>> because of the prose. :\
>
> I've changed the comment in an attempt to make it clearer what's going on.

Thanks.

>>
>> I had written "Is there a particular reason for not scheduling a scan
>> cycle here, given that you do in the other cases?", but I see that
>> it's because scheduleNextScanCycle does something involving the clock
>> and you only want to call it when you know the scan is done. I guess
>> ideally there would be a single call to next scan cycle, given that
>> it's something that you always want to do.
>>
>> What's really missing here is a function you call to do a single scan
>> and that returns a Deferred that only fires when the scan is done.
>
> Yes, that would really help.  I might sort that out in a new branch.
>

Fair enough.

>>
>> > +        else:
>> > +            # The scan for this builder didn't want to dispatch
>> > +            # anything, so we can finish this cycle.
>> > +            self.scheduleNextScanCycle()
>> > +
>>
>> This function doesn't return a Deferred either, and the code that
>> waits on its Deferreds firing is in a totally different place.
>>
>> AFAICT, this means that startCycle could be written along the following
>> lines:
>>
>> try:
>>   slave = self.scan()
>>   if slave is None:
>>     self.scheduleNextScanCycle()
>>   else:
>>     self.resumeAndDispatch(slave)
>> except:
>>   self.logger.info(...)
>>   self.scheduleNextScanCycle()
>>
>> without changing any of the logic.
>>
>> Although it's definitely a matter of opinion, I think the version here
>> gives a better idea of the overall flow of a build scan.
>
> I've changed the code to do this and....WIN.  Great suggestion, thanks.
>

My pleasure. Reading through the code in your diff, I think the new
version is definitely easier to read.

>>
>> > +    def initiateDispatch(self, resume_result, slave):
>> > +        """Start dispatching a build to a slave.
>> > +
>> > +        If the previous task in chain (slave resuming) has failed it will
>> > +        receive a `ResetBuilderRequest` instance as 'resume_result' and
>> > +        will immediately return that so the subsequent callback can
> collect
>> > +        it.
>> > +
>>
>> This seems unnecessarily complex. As in, the reality, not the prose.
>
> It was Celso's way of putting another "special" callable object on the end of
> the callables in the deferred chain, so that reset/failure scenarios can be
> handled.  It's not really the job of this branch to change that so I decided
> to add that docstring to at least make it documented.
>

Fair enough.

>>
>> > +        """
>> > +        if resume_result is not None:
>> > +            self.waitOnDeferredList()
>>
>> That method name is a little misleading. The code is not going to sit
>> here and wait until the Deferred list fires. Instead, it's going to
>> add a callback that gets fired when all of the Deferreds in the list
>> have fired. What's the point of calling it here?
>
> As we agreed on the call, it's renamed to slaveDialogueEnded.
>

Thanks, although in code we use American spelling, which would make it
slaveDialogEnded.

>>
>> > +        proxy = self._getProxyForSlave(slave)
>> > +        method, args = slave.calls.pop(0)
>> > +        d = proxy.callRemote(method, *args)
>> > +        d.addBoth(self.checkDispatch, method, slave)
>> > +        self._deferred_list.append(d)
>> > +        self.logger.debug('%s -> %s(%s)' % (slave, method, args))
>> > +
>> > +    def waitOnDeferredList(self):
>> > +        """After all the Deferreds are set up, wait on them."""
>> > +        dl = defer.DeferredList(self._deferred_list, consumeErrors=True)
>>
>> Looking at some of the code in evaluateDispatchResult, you probably
>> want to use lp.services.twistedsupport.gatherResults instead of
>> DeferredList.
>
> On IRC you suggested this:
>
>  dl = gatherResults(self._deferred_list)
>  dl.addBoth(self.evaluateDispatchResult)
>  return dl
>
> and in the callback:
>
>  dispatch_results = [
>    x for x in deferred_list_results if isinstance(x, BaseDispatchResult]
>
> gatherResults is invoking DeferredList with fireOnOneErrback=1 which is
> different behaviour to the existing.  I'd like to leave this for now until I
> understand the ramifications of that.
>

Fair enough.

>> > @@ -395,9 +453,9 @@
>> >         If it failed and it compromises the slave then return a
>> > corresponding `FailDispatchResult`, if it was a communication failure,
>> > simply reset the slave by returning a `ResetDispatchResult`.
>> > -
>> > -        Otherwise dispatch the next call if there are any and return
>> > None. """
>> > +        # XXX these DispatchResult classes are badly named and do the
>> > +        # same thing.  We need to fix that.
>>
>> I recommend just fixing it. It will be less work than trying to
>> explain why they are badly named :)
>
> As discussed on the phone - next branch.
>

Fair enough.


>>
>> >                 status, info = response
>> >                 if status == expected_status:
>> > -                    self._mayDispatch(slave)
>> > +                    self.callSlave(slave)
>> >                     return None
>> >             else:
>> >                 info = 'Unknown slave method: %s' % method
>> > @@ -423,76 +481,98 @@
>> >         self.logger.error(
>> >             '%s failed to dispatch (%s)' % (slave, info))
>> >
>> > -        self.slaveDone(slave)
>> > +        self.waitOnDeferredList()
>>
>> I wonder whether each of these calls is actually needed. If they are
>> all needed, how can you tell?
>
> I can't remember what you're referring to here.
>

Basically, there's nothing at all to hint that this is an appropriate
place to call slaveDialogEnded, and looking over the code, it seems
that it would be very easy to accidentally miss one. Perhaps in a
future branch, they'll all go through one choke-point.

In essence, this is a variant of "have a function that does a scan and
returns a Deferred that fires when it's done".

>>
>> >         return self.fail_result(slave, info)
>>
>> ...
>>
>> > +
>> > +class NewBuildersScanner:
>> > +    """If new builders appear, create a scanner for them."""
>> > +
>> > +    # How often to check for new builders, in seconds.
>> > +    SCAN_INTERVAL = 300
>> > +
>> > +    def __init__(self, manager):
>> > +        self.manager = manager
>> > +        # Avoid circular import.
>> > +        from lp.buildmaster.interfaces.builder import IBuilderSet
>> > +        self.current_builders = [
>> > +            builder.name for builder in getUtility(IBuilderSet)]
>> > +
>> > +    def scheduleScan(self):
>> > +        """Schedule a callback SCAN_INTERVAL seconds later."""
>> > +        reactor.callLater(self.SCAN_INTERVAL, self.scan)
>> > +
>>
>> This reminds me: check out LoopingCall some time. It'll make this code
>> a bit simpler, I reckon.
>
> Yep - I'll look into that, thanks.

No worries.

In the diff, you've made NewBuildersScanner inherit from TimeoutMixin.
AFAICT, it isn't necessary, and only confuses things since there's no
actual timeout involved.

Instead of mixing in, you could do the following in the constructor:

    def __init__(self, manager, clock=None):
        self.manager = manager
        if clock is None:
            clock = reactor
        self._clock = clock

And then change scheduleScan to just do:
        return self._clock.callLater(self.SCAN_INTERVAL, self.scan)

>> > === modified file 'lib/lp/buildmaster/tests/test_manager.py'
>> > --- lib/lp/buildmaster/tests/test_manager.py    2010-07-18 17:43:23 +0000
>> > +++ lib/lp/buildmaster/tests/test_manager.py    2010-07-22 15:26:12 +0000
> [snip]
>> >     def test_checkResume(self):
>> > -        """`BuilddManager.checkResume` is chained after resume requests.
>> > +        """`SlaveScanner.checkResume` is chained after resume requests.
>> >
>> >         If the resume request succeed it returns None, otherwise it
>> > returns a `ResetBuildResult` (the one in the test context) that will be
>> > @@ -355,24 +311,26 @@
>> >         # Make a failing slave that is requesting a resume.
>> >         slave = RecordingSlave('foo', 'http://foo.buildd:8221/',
>> > 'foo.host') slave.resume_requested = True
>> > -        slave.resumeSlave = lambda: defer.fail(Failure(('out', 'err',
>> > 1))) +        slave.resumeSlave = lambda: deferLater(
>> > +            reactor, 0, defer.fail, Failure(('out', 'err', 1)))
>>
>> Why did you need to make this change?
>
> As explained on the call - the old defer.fail was getting exectuted
> immediately.  I've no idea why and wasted 2 days at the Epic trying to figure
> it out :/
>

That sucks. I guess it's ok to leave it as it stands for now, but this
code really needs to change.

...
>> What you should be doing instead is always, always, always returning
>> Deferreds from things that make Deferreds or call something that
>> returns a Deferred.
>
> There's no need to tell *me* that now :)
>

Heh heh.

...
>>
>> I understand a lot of effort went into getting these tests to work. I
>> appreciate that, but it's very hard to have confidence in them given
>> the amount of monkey-patching that's going on and given the size of
>> the tests.
>>
>> I don't really know what to do about that. Sorry.
>
> Same for me and mwh.  It needs a serious code re-org to start returning
> Deferreds in the right places, and a total test re-write.
>
> But now is not the right time :(
>

Yeah. Soon though.

>>
>> > +class TestBuilddManager(TrialTestCase):
>> > +
>> > +    layer = LaunchpadZopelessLayer
>> > +
>> > +    def _stub_out_scheduleNextScanCycle(self):
>> > +        # stub out the code that adds a callLater, so that later tests
>> > +        # don't get surprises.
>> > +        self._saved_schedule = SlaveScanner.scheduleNextScanCycle
>> > +        def cleanup():
>> > +            SlaveScanner.scheduleNextScanCycle = self._saved_schedule
>> > +        self.addCleanup(cleanup)
>> > +        SlaveScanner.scheduleNextScanCycle = FakeMethod
>> > +
>>
>> Really, the SlaveScanner object should take an optional 'clock'
>> parameter that it uses to do reactor gibberish. See task.py and
>> test_task.py in our tree for examples.
>
> So I've replaced this code with:
>  self.patch(SlaveScanner, 'scheduleNextScanCycle', FakeMethod)
>
> And changed the calls to reactor to use a TimeoutMixin so I can override the
> clock, which is now manipulated in the tests.
>

As mentioned above, the TimeoutMixin part isn't really relevant. Great
to see the new tests not using the reactor!

> *phew*
>

Indeed!

> Please check the last few commits on the branch to see the changes.
> bzr diff -r 10911..
>

Thanks. I've referred to it while replying.

In summary:
  * I don't understand why getByActiveBuilder is on IBuildQueueSet
rather than being IBuilder.getActiveBuild().
  * slaveDialogueEnded => slaveDialogEnded
  * Don't inherit from TimeoutMixin in NewBuildersScanner.

Almost there!

jml
-- 
https://code.launchpad.net/~julian-edwards/launchpad/buildd-manager-parallel-scan/+merge/30672
Your team Launchpad code reviewers is requested to review the proposed merge of lp:~julian-edwards/launchpad/buildd-manager-parallel-scan into lp:launchpad/devel.



Follow ups

References