launchpad-dev team mailing list archive
Message #05171
Re: Failures in ec2
On Thursday 14 October 2010 12:42:27 Jonathan Lange wrote:
> On Thu, Oct 14, 2010 at 12:25 PM, Steve Kowalik
> <steve.kowalik@xxxxxxxxxxxxx> wrote:
> > Hi guys,
> >
> > I seem to constantly get thread-based failures when submitting a
> > branch to ec2, or when Hudson performs a build. I got sick enough of it
> > today to actually sit down and talk to Robert and Maris about it, and
> > did a little bit of debugging.
> >
> > It does seem like certain tests will leave a thread hanging
> > around, which then zope gets caught up in.
> >
> > test: lp.codehosting.puller.tests.test_worker.TestWorkerProgressReporting.test_network
> > Thread Name: MainThread
> > Is Daemon?: False
> > Thread target: None
> >
> > Thread Name: Thread-18
> > Is Daemon?: True
> > Thread target: <bound method HttpServer._http_start of HttpServer(127.0.0.1:37111)>
> >
> > Thread Name: Thread-20
> > Is Daemon?: 1
> > Thread target: <bound method TestingThreadingHTTPServer.process_request_thread of <bzrlib.tests.http_server.TestingThreadingHTTPServer instance at 0x6e78128>>
> >
> > time: 2010-10-14 10:53:44.596568Z
> > successful: lp.codehosting.puller.tests.test_worker.TestWorkerProgressReporting.test_network
> > test: lp.codehosting.puller.tests.test_worker.TestWorkerProgressReporting.test_network
> > tags: zope:threads
> > error: lp.codehosting.puller.tests.test_worker.TestWorkerProgressReporting.test_network [ multipart
> > Content-Type: text/plain;charset=utf8
> > garbage
> > 34
> > [<Thread(Thread-18, started daemon 47971215480592)>]0
> > ]
> > time: 2010-10-14 10:53:44.596847Z
> >
> > So it looks like the HttpServer instance needs to be killed in the test
> > or in the teardown? I'm at a little bit of a loss, personally, so
> > thought I'd throw it out there first.
>
> This seems a lot like https://bugs.edge.launchpad.net/bzr/+bug/193253,
> although there it's a socket leaking check rather than a thread
> leaking check. I don't know what's caused it to regress.
>
> Specifically, there's code hidden by bzrlib that isn't cleaning up
> after itself. Whether it should or not is an open question. From one
> point of view, our thread checker is being overzealous, catching a
> leak in something that's never going to affect production. From
> another point of view, HttpServer.stop_server() should darn well stop
> the server.
>
> Anyway, fixes are:
> * Fix bzrlib.tests.http_server to clean up its thread in stop_server
> * Find some way of getting the thread leaking checker to ignore the
> thread
>
> Perhaps there are more fundamental issues that could be addressed. Them,
> I leave to Rob.
>
> CCing vila because of the history.
This and other failures have been occurring on Steve's Hudson instance ever
since he started it:
https://hudson.wedontsleep.org/job/devel/104/
I guess we've all been too busy to notice/deal with this, but I do find it
disturbing that we can't get a consistent test run in different environments.
The failures above don't appear in buildbot but they are in ec2.
It seems like there's a pattern to do with threads and/or external processes;
I hope someone with more knowledge than I can diagnose them. For now I am
copying jml and looking at Rob. :-)
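For illustration only, here is a rough sketch of the two fixes jml lists
above. It is not code from Launchpad, zope or bzrlib: StoppableServer,
QuietHandler and leaked_threads are hypothetical names, and it uses the
Python 3 standard library rather than bzrlib.tests.http_server. The first
part shows a stop_server() that shuts the server down and joins its thread;
the second shows a leak check that can be told to ignore threads by name
prefix.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class QuietHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with 'ok' and logs nothing."""

    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, format, *args):
        pass  # keep test output clean


class StoppableServer:
    """Hypothetical stand-in for a test HTTP server (not bzrlib's HttpServer)."""

    def start_server(self):
        # Port 0 asks the OS for any free port.
        self._httpd = HTTPServer(("127.0.0.1", 0), QuietHandler)
        self._thread = threading.Thread(
            target=self._httpd.serve_forever, name="test-http-server")
        self._thread.daemon = True
        self._thread.start()

    def stop_server(self):
        self._httpd.shutdown()        # make serve_forever() return
        self._httpd.server_close()    # release the listening socket
        self._thread.join(timeout=5)  # do not leave the thread behind


def leaked_threads(before, ignore_prefixes=()):
    """Return live threads that were not in `before`.

    Threads whose name starts with one of `ignore_prefixes` are skipped,
    which is the "teach the checker to ignore the thread" option.
    """
    prefixes = tuple(ignore_prefixes)
    return [
        t for t in threading.enumerate()
        if t not in before
        and t.is_alive()
        and not t.name.startswith(prefixes)
    ]
```

A test would snapshot set(threading.enumerate()) in setUp, call
stop_server() in tearDown, and then assert that leaked_threads() returns
an empty list. Joining in stop_server() fixes the leak at its source;
the ignore list only hides it from the checker.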
Cheers.