launchpad-dev team mailing list archive
Message #08716
Need help and ideas for a weird Poppy problem
Hi folks
As you may know, Poppy is the Twisted-based FTP/SFTP server for uploading
packages to Soyuz. I recently landed a change to fix its logging (along with
that of a few other Twisted-based services such as the librarian and
branch-puller) so that it uses the python-oops stuff correctly.
It was released last Friday, and within 10 minutes the instance on the PPA
machine (germanium) went into a weird state where it was unable to contact the
xmlrpc-private auth service running on the appservers, so all SFTP requests
failed.
Here is an example from the log of an unsuccessful XMLRPC request:
2011-12-16 10:27:38+0000 [SSHService ssh-userauth on
KeepAliveSettingSSHServerTransport (TimeoutProtocol)] Starting factory
<twisted.web.xmlrpc._QueryFactory instance at 0x7b53248>
... wait ...
2011-12-16 10:28:08+0000 [-] [Failure instance: Traceback (failure with no
frames): <class 'twisted.internet.error.TimeoutError'>: User timeout caused
connection failure.
]
2011-12-16 10:28:08+0000 [-] udienz failed auth publickey
2011-12-16 10:28:08+0000 [-] unauthorized login: unable to get avatar id
2011-12-16 10:28:08+0000 [-] Stopping factory
<twisted.web.xmlrpc._QueryFactory instance at 0x7b53248>
And here is one that works:
2011-12-16 10:15:11+0000 [SSHService ssh-userauth on
KeepAliveSettingSSHServerTransport (TimeoutProtocol)] Starting factory
<twisted.web.xmlrpc._QueryFactory instance at 0x6737128>
2011-12-16 10:15:11+0000 [QueryProtocol,client] Stopping factory
<twisted.web.xmlrpc._QueryFactory instance at 0x6737128>
Because it's a 30 second timeout, this error message indicates, in my
experience, that the TCP SYN packet is not being ACKed (timeouts for open
connections are much, much longer). However, restarting the Poppy instance
makes things work again, so I'm not sure whether it's a code problem or an
infrastructure problem.
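One way to check the SYN-timeout theory is a plain TCP connect from the
affected machine, outside Twisted. This is a minimal sketch, assuming Python 3;
the probe() helper and the host and port in the usage comment are hypothetical
and do not appear in this message:

```python
import socket
import time


def probe(host, port, timeout=5.0):
    """Attempt one TCP connect and classify how it ended.

    Returns (outcome, seconds), where outcome is "connected", "refused",
    "timeout", or "error: ...". A "timeout" means the connect attempt got
    no answer within `timeout` seconds, which is what an unanswered SYN
    looks like from user space.
    """
    start = time.monotonic()
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        outcome = "timeout"
    except ConnectionRefusedError:
        outcome = "refused"
    except OSError as exc:
        outcome = "error: %s" % exc
    else:
        sock.close()
        outcome = "connected"
    return outcome, time.monotonic() - start


# Hypothetical usage; substitute the real xmlrpc-private host and port:
#   print(probe("xmlrpc-private.example.internal", 8097, timeout=30.0))
```

If this reports "timeout" while Poppy is in its failed state, that would point
towards the network path; if it reports "connected", that would suggest the
problem is inside the Poppy process itself.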
We are currently running a very old revision of the code on germanium, so this
problem is blocking further rollouts there. Oddly, it only affects the PPA
machine, not the Poppy on cocoplum (the Ubuntu machine). I've also blasted
hundreds of connections at the dogfood box to try to make it fail, and it
doesn't. It's worth noting that the instance on germanium also occasionally
has problems contacting the keyserver when it's trying to verify GPG
signatures, which requires a restart to fix.
Since I'm at a total loss as to what to do next, I am going to put the latest
code back on germanium tomorrow and run it in production again so I can gather
more data when it goes wrong.
In the meantime, if anyone can come up with any ideas on how to figure out
what's going on here I'd really appreciate it!
Cheers
J