launchpad-dev team mailing list archive

Re: Need help and ideas for a weird Poppy problem

FYI I figured this out in the end.

It turns out that writing OOPSes is deferred to a thread pool.  However, the
AMQP port was not open on the firewall, so once 10 OOPSes were queued up,
Poppy hung.

D'oh ...
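
For anyone hitting something similar, here is a minimal sketch of that failure
mode. It is not Poppy's code: it uses Python's standard concurrent.futures
pool rather than Twisted's, the pool size of 10 is an assumption taken from
the "10 OOPSes" figure above, and publish_oops, unrelated_work and
pool_is_starved are hypothetical names. An Event that is never set stands in
for the blocked AMQP port.

```python
import concurrent.futures
import threading

POOL_SIZE = 10  # assumption: mirrors the "10 OOPSes" figure in this message


def pool_is_starved(blocked_jobs, pool_size=POOL_SIZE, wait=0.5):
    """Return True if unrelated work cannot run while blocked_jobs are stuck."""
    # Never set during the probe: stands in for the firewalled AMQP port.
    port_open = threading.Event()

    def publish_oops():
        # Hypothetical stand-in for an OOPS write that blocks on the broker.
        port_open.wait()

    def unrelated_work():
        return "done"

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=pool_size)
    try:
        for _ in range(blocked_jobs):
            pool.submit(publish_oops)
        probe = pool.submit(unrelated_work)
        try:
            probe.result(timeout=wait)
            return False  # a free worker picked up the probe
        except concurrent.futures.TimeoutError:
            return True   # every worker is stuck behind a blocked write
    finally:
        port_open.set()   # release the workers so the process can exit
        pool.shutdown(wait=True)


if __name__ == "__main__":
    print("9 blocked writes, starved:", pool_is_starved(9))
    print("10 blocked writes, starved:", pool_is_starved(10))
```

The point of the sketch is that nothing fails loudly: the pool simply stops
picking up new work once every worker is waiting on the unreachable port.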

On Tuesday 20 December 2011 15:30:28 Julian Edwards wrote:
> Hi folks
> 
> As you may know, Poppy is the Twisted-based FTP/SFTP server for uploading
> packages to Soyuz. I recently landed a change to fix its logging (along with
> a few other Twisted-based services such as the librarian, branch-puller
> etc) so that it uses the python-oops stuff correctly.
> 
> It was released last Friday and within 10 minutes the instance on the PPA
> machine (germanium) went into a weird state where it was unable to contact
> the xmlrpc-private auth service running on the appservers, and hence all
> SFTP requests failed.
> 
> Here is an example from the log of an unsuccessful XMLRPC request:
> 
> 2011-12-16 10:27:38+0000 [SSHService ssh-userauth on
> KeepAliveSettingSSHServerTransport (TimeoutProtocol)] Starting factory
> <twisted.web.xmlrpc._QueryFactory instance at 0x7b53248>
> ... wait ...
> 2011-12-16 10:28:08+0000 [-] [Failure instance: Traceback (failure with no
> frames): <class 'twisted.internet.error.TimeoutError'>: User timeout caused
> connection failure.
>         ]
> 2011-12-16 10:28:08+0000 [-] udienz failed auth publickey
> 2011-12-16 10:28:08+0000 [-] unauthorized login: unable to get avatar id
> 2011-12-16 10:28:08+0000 [-] Stopping factory
> <twisted.web.xmlrpc._QueryFactory instance at 0x7b53248>
> 
> And here is one that works:
> 
> 2011-12-16 10:15:11+0000 [SSHService ssh-userauth on
> KeepAliveSettingSSHServerTransport (TimeoutProtocol)] Starting factory
> <twisted.web.xmlrpc._QueryFactory instance at 0x6737128>
> 2011-12-16 10:15:11+0000 [QueryProtocol,client] Stopping factory
> <twisted.web.xmlrpc._QueryFactory instance at 0x6737128>
> 
> Because it's a 30-second timeout, in my experience this error message
> indicates that the TCP SYN packet is not being ACKed (timeouts for open
> connections are much, much longer).  However, restarting the Poppy instance
> will make things work again, so I'm not sure whether it's a code problem or
> an infrastructure problem.
> 
> We are currently running a very old revision of code on germanium so it's
> blocking further rollouts on there.  Oddly, this only affects the PPA
> machine, not the Poppy on cocoplum (the Ubuntu machine).  I've also blasted
> hundreds of connections at the dogfood box to try and make it fail, and it
> doesn't.  It's also worth noting that the instance on germanium also
> occasionally gets problems contacting the keyserver when it's trying to
> verify GPG signatures, which requires a restart to fix.
> 
> Since I'm at a total loss as to what to do next, I am going to put the
> latest code back on germanium tomorrow and run it in production again so I
> can gather more data when it goes wrong.
> 
> In the meantime, if anyone can come up with any ideas on how to figure out
> what's going on here I'd really appreciate it!
> 
> Cheers
> J
> 
> _______________________________________________
> Mailing list: https://launchpad.net/~launchpad-dev
> Post to     : launchpad-dev@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~launchpad-dev
> More help   : https://help.launchpad.net/ListHelp