launchpad-dev team mailing list archive
Message #05754
Re: Heads Up: Code Imports appear broken!
On Fri, 19 Nov 2010 15:17:15 +1300, Michael Hudson <michael.hudson@xxxxxxxxxxxxx> wrote:
> On Fri, 19 Nov 2010 01:16:26 +0000, Max Bowsher <maxb@xxxxxxx> wrote:
> > It would appear that, as far as I can see, no code imports have
> > successfully completed since some time on 2010-11-15.
> >
> > The code import machines are all full of running jobs, all of which
> > appear to sit around doing nothing much, until they get canceled an hour
> > after they start.
> >
> > Because every dispatched importd job is now taking 60 minutes, which is
> > much longer than the average when things work properly, there are now
> > *very many* imports which are queuing for an importd execution, hence
> > the web UI is displaying "The next import is scheduled to run as soon as
> > possible." for imports which won't actually be attempted for hours?
> > days? (and will then fail.)
>
> I've got a few more bits of information to add to this. The logs look
> like this:
>
> 2010-11-18 22:35:30 INFO [chan bzr SocketAsChannelAdapter] Opened sftp connection (server version 3)
> 2010-11-18 22:35:32 INFO [chan bzr SocketAsChannelAdapter] Opened sftp connection (server version 3)
> Exception KeyboardInterrupt: KeyboardInterrupt() in <function terminate at 0x9ba86bc> ignored
> 2010-11-18 23:35:35 INFO [chan bzr SocketAsChannelAdapter] Opened sftp connection (server version 3)
> 2010-11-18 23:35:35 INFO [chan bzr SocketAsChannelAdapter] Opened sftp connection (server version 3)
> Import failed:
> Traceback (most recent call last):
> Failure: twisted.internet.error.TimeoutError: User timeout caused connection failure.
>
> The "Opened sftp connection" log lines are new-ish, but they have been
> present for at least a few weeks, so they are not that closely related
> to the issue.
>
> The issue appears on staging too, which suggests to me that it is more
> likely to be a code change than an environmental one.
>
> The importd user on the importd slaves can still sftp to the central
> store, at least in a trivial way.
>
> Although the problem appeared soon after the dustup with the XML-RPC
> service over the weekend, it doesn't actually seem to be related: there
> were some successful imports after all that drama.
>
> There is no LOSA around today, which makes finding more information
> hard.
>
> Now some guesswork.
>
> There was a nodowntime rollout on the 15th. I bet it introduced the
> problem.
>
> My utter WAG is that it was the upgrade to bzr 2.2.1 that caused the
> problem.

My guesses were correct. Thanks to a friendly sysadmin, we're now
running bzr 2.2.0 on the code import slaves again, and the backlog is
being churned through.

https://bugs.launchpad.net/bzr/+bug/677305 seems to have been the
underlying problem.

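
For anyone wanting to spot the same symptom in their own import logs,
here is a minimal sketch (not part of the importd code; the helper names
are made up for illustration) that finds the largest gap between
timestamped lines. Run against the excerpt quoted above, it shows the
job sitting idle for about an hour before the timeout fires:

```python
from datetime import datetime

# Log excerpt copied from the quoted message above.
LOG_LINES = [
    "2010-11-18 22:35:30 INFO [chan bzr SocketAsChannelAdapter] Opened sftp connection (server version 3)",
    "2010-11-18 22:35:32 INFO [chan bzr SocketAsChannelAdapter] Opened sftp connection (server version 3)",
    "Exception KeyboardInterrupt: KeyboardInterrupt() in <function terminate at 0x9ba86bc> ignored",
    "2010-11-18 23:35:35 INFO [chan bzr SocketAsChannelAdapter] Opened sftp connection (server version 3)",
    "2010-11-18 23:35:35 INFO [chan bzr SocketAsChannelAdapter] Opened sftp connection (server version 3)",
]


def parse_timestamp(line):
    """Return the leading 'YYYY-MM-DD HH:MM:SS' timestamp, or None if absent."""
    try:
        return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None


def largest_gap(lines):
    """Return the longest interval between consecutive timestamped lines."""
    stamps = [ts for ts in map(parse_timestamp, lines) if ts is not None]
    return max((b - a for a, b in zip(stamps, stamps[1:])), default=None)


gap = largest_gap(LOG_LINES)
print(gap)  # 1:00:03
```

A gap that close to 60 minutes on every job matches the one-hour
cancellation Max described, rather than a slow but progressing import.
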
Cheers,
mwh