launchpad-dev team mailing list archive

Re: Failures in ec2

 

On Thursday 14 October 2010 12:42:27 Jonathan Lange wrote:
> On Thu, Oct 14, 2010 at 12:25 PM, Steve Kowalik
> <steve.kowalik@xxxxxxxxxxxxx> wrote:
> > Hi guys,
> > 
> >        I seem to constantly get thread-based failures when submitting a
> > branch to ec2, or when Hudson performs a build. I got sick enough of it
> > today to actually sit down and talk to Robert and Maris about it, and
> > did a little bit of debugging.
> > 
> >        It does seem like certain tests will leave a thread hanging
> > around, which then zope gets caught up in.
> > 
> > test: lp.codehosting.puller.tests.test_worker.TestWorkerProgressReporting.test_network
> > Thread Name: MainThread
> > Is Daemon?: False
> > Thread target: None
> > 
> > Thread Name: Thread-18
> > Is Daemon?: True
> > Thread target: <bound method HttpServer._http_start of HttpServer(127.0.0.1:37111)>
> > 
> > Thread Name: Thread-20
> > Is Daemon?: 1
> > Thread target: <bound method TestingThreadingHTTPServer.process_request_thread of <bzrlib.tests.http_server.TestingThreadingHTTPServer instance at 0x6e78128>>
> > 
> > time: 2010-10-14 10:53:44.596568Z
> > successful: lp.codehosting.puller.tests.test_worker.TestWorkerProgressReporting.test_network
> > test: lp.codehosting.puller.tests.test_worker.TestWorkerProgressReporting.test_network
> > tags: zope:threads
> > error: lp.codehosting.puller.tests.test_worker.TestWorkerProgressReporting.test_network [ multipart
> > Content-Type: text/plain;charset=utf8
> > garbage
> > 34
> > [<Thread(Thread-18, started daemon 47971215480592)>]0
> > ]
> > time: 2010-10-14 10:53:44.596847Z
> > 
> > So it looks like the HttpServer instance needs to be killed in the test
> > or in the teardown? I'm at a little bit of a loss, personally, so
> > thought I'd throw it out there first.
> 
> This seems a lot like https://bugs.edge.launchpad.net/bzr/+bug/193253,
> although there it's a socket leaking check rather than a thread
> leaking check. I don't know what's caused it to regress.
> 
> Specifically, there's code hidden by bzrlib that isn't cleaning up
> after itself. Whether it should or not is an open question. From one
> point of view, our thread checker is being overzealous, catching a
> leak in something that's never going to affect production. From
> another point of view, HttpServer.stop_server() should darn well stop
> the server.
> 
> Anyway, fixes are:
>   * Fix bzrlib.tests.http_server to clean up its thread in stop_server
>   * Find some way of getting the thread leaking checker to ignore the
> thread
> 
> Perhaps there are more fundamental issues that could be addressed. Those
> I leave to Rob.
> 
> CCing vila because of the history.

This and other failures have been occurring on Steve's Hudson instance ever 
since he started it:

https://hudson.wedontsleep.org/job/devel/104/

I guess we've all been too busy to notice/deal with this, but I do find it 
disturbing that we can't get a consistent test run in different environments.  
The failures above don't appear in buildbot, but they do appear in ec2.

It seems like there's a pattern to do with threads and/or external processes; 
I hope someone with more knowledge than I can diagnose them.  For now I am 
copying jml and looking at Rob.  :-)
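
To make the two fixes jml lists a bit more concrete, here is a rough sketch of 
what each could look like. It is only an illustration under assumptions: it 
uses the standard library's http.server rather than bzrlib.tests.http_server, 
and StoppableServer, QuietHandler and leaked_threads are made-up names, not 
anything in bzrlib or our test runner.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class QuietHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers GET with 200 and logs nothing."""

    def do_GET(self):
        self.send_response(200)
        self.end_headers()

    def log_message(self, format, *args):
        pass


class StoppableServer:
    """Hypothetical stand-in for a test HTTP server (not bzrlib's class)."""

    def __init__(self):
        # Port 0 lets the OS pick a free port.
        self._httpd = HTTPServer(("127.0.0.1", 0), QuietHandler)
        self._thread = threading.Thread(
            target=self._httpd.serve_forever, daemon=True)

    def start_server(self):
        self._thread.start()

    def stop_server(self):
        # Fix 1: actually wait for the serving thread to exit.
        self._httpd.shutdown()        # make serve_forever() return
        self._httpd.server_close()    # release the listening socket
        self._thread.join(timeout=5)


def leaked_threads(before, ignore=lambda thread: False):
    """Return live threads that were not in `before` and are not ignored.

    Fix 2: `ignore` is a predicate the checker can use to skip threads
    it knows to be harmless.
    """
    return [
        t for t in threading.enumerate()
        if t not in before and t.is_alive() and not ignore(t)
    ]
```

In a test this would mean taking set(threading.enumerate()) in setUp, calling 
stop_server() in tearDown, and then checking that leaked_threads() is empty. 
Whether ignoring daemon threads is an acceptable policy for our checker is a 
separate question.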

Cheers.


