
launchpad-dev team mailing list archive

Need help and ideas for a weird Poppy problem

 

Hi folks

As you may know, Poppy is the Twisted-based FTP/SFTP server for uploading 
packages to Soyuz. I recently landed a change to fix its logging (along with 
that of a few other Twisted-based services such as the librarian, 
branch-puller, etc.) so that it uses the python-oops stuff correctly.

It was released last Friday, and within 10 minutes the instance on the PPA 
machine (germanium) went into a weird state where it was unable to contact the 
xmlrpc-private auth service running on the appservers, so all SFTP requests 
failed.

Here is an example from the log of an unsuccessful XMLRPC request:

2011-12-16 10:27:38+0000 [SSHService ssh-userauth on 
KeepAliveSettingSSHServerTransport (TimeoutProtocol)] Starting factory 
<twisted.web.xmlrpc._QueryFactory instance at 0x7b53248>
... wait ...
2011-12-16 10:28:08+0000 [-] [Failure instance: Traceback (failure with no 
frames): <class 'twisted.internet.error.TimeoutError'>: User timeout caused 
connection failure.
        ]
2011-12-16 10:28:08+0000 [-] udienz failed auth publickey
2011-12-16 10:28:08+0000 [-] unauthorized login: unable to get avatar id
2011-12-16 10:28:08+0000 [-] Stopping factory 
<twisted.web.xmlrpc._QueryFactory instance at 0x7b53248>

And here is one that works:

2011-12-16 10:15:11+0000 [SSHService ssh-userauth on 
KeepAliveSettingSSHServerTransport (TimeoutProtocol)] Starting factory 
<twisted.web.xmlrpc._QueryFactory instance at 0x6737128>
2011-12-16 10:15:11+0000 [QueryProtocol,client] Stopping factory 
<twisted.web.xmlrpc._QueryFactory instance at 0x6737128>

Because it's a 30 second timeout, in my experience this error message 
indicates that the TCP SYN packet is not being ACKed (timeouts on open 
connections are much, much longer).  However, restarting the Poppy instance 
makes things work again, so I'm not sure whether it's a code problem or an 
infrastructure problem.
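
One way to narrow this down might be to run a bare TCP connect from germanium 
to the auth service while Poppy is in its broken state.  If the probe connects 
while Poppy still times out, the problem is more likely inside the Poppy 
process than in the network.  The following is only a rough sketch using the 
Python 3 standard library, not anything that exists in the Poppy tree; the 
function name probe_tcp_connect and the host and port in the usage note are 
made up for illustration.

```python
import socket
import time


def probe_tcp_connect(host, port, timeout=5.0):
    """Attempt a plain TCP connect and classify what happened.

    Hypothetical helper, not part of Poppy.  Returns a tuple of
    (outcome, elapsed_seconds), where outcome is one of
    "connected", "timeout", "refused" or "error: <details>".
    """
    start = time.monotonic()
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.timeout, TimeoutError):
        # No answer to the connect attempt within the timeout.
        outcome = "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        outcome = "refused"
    except OSError as exc:
        # Anything else: DNS failure, unreachable network, etc.
        outcome = "error: %s" % exc
    else:
        sock.close()
        outcome = "connected"
    return outcome, time.monotonic() - start
```

For example, probe_tcp_connect("xmlrpc-private.example", 8080) (placeholder 
host and port) could be run in a loop alongside the failing Poppy and its 
results compared with the timestamps in the Poppy log.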

We are currently running a very old revision of the code on germanium, so 
this is blocking further rollouts there.  Oddly, it only affects the PPA 
machine, not the Poppy on cocoplum (the Ubuntu machine).  I've also blasted 
hundreds of connections at the dogfood box to try to make it fail, and it 
doesn't.  It's worth noting that the instance on germanium also occasionally 
has problems contacting the keyserver when it's trying to verify GPG 
signatures, and that too requires a restart to fix.

Since I'm at a total loss as to what to do next, I am going to put the latest 
code back on germanium tomorrow and run it in production again so I can gather 
more data when it goes wrong.

In the meantime, if anyone can come up with any ideas on how to figure out 
what's going on here I'd really appreciate it!

Cheers
J

