openstack team mailing list archive

Nova RPC Timeout and lost QPID connection

 


We periodically lose the qpid connection for one of our compute services.
We're on Folsom in a 2-node setup, with the compute services running on one
node and qpidd, scheduler, network, etc., running on the other.

We've scaled this environment up to 2800 instances. When we hit this
problem, the scheduler continues to get updates from the compute service,
so the service still appears active. However, "qpid-config queues" shows
that the compute service's queue no longer exists, and the compute service
no longer receives spawn requests. The scheduler continues to select this
compute service for new boot requests, which then get stuck in BUILD state.
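
The check described above (a service that looks alive while its queue is
gone) could be scripted roughly as follows. This is only a sketch: it
assumes the queue name is the first whitespace-separated column of the
"qpid-config queues" output, and that nova-compute consumes from queues
named "compute" and "compute.<host>". Both assumptions should be verified
against the actual broker. The helper names missing_queues and
check_compute_queues are made up for illustration.

```python
import subprocess


def missing_queues(qpid_config_output, expected):
    """Return the expected queue names that do not appear in the listing.

    Assumes each row of `qpid-config queues` output starts with the queue
    name; header and separator rows are harmless extra entries.
    """
    listed = set()
    for line in qpid_config_output.splitlines():
        fields = line.split()
        if fields:
            listed.add(fields[0])
    return [name for name in expected if name not in listed]


def check_compute_queues(host):
    """Run `qpid-config queues` on the broker node and report missing queues.

    The queue names below are an assumption about how nova-compute names
    its queues; adjust them to match what the broker shows when healthy.
    """
    output = subprocess.check_output(["qpid-config", "queues"]).decode()
    return missing_queues(output, ["compute", "compute.%s" % host])
```

Running check_compute_queues("<compute hostname>") periodically on the
broker node would return a non-empty list once the queue disappears, which
might help pin down when the connection is lost relative to the trace.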

I have a trace on pastebin: http://pastebin.com/rDid7Egm

The first error appears to be an RPC timeout ("Timed out waiting for RPC
response: None"), followed by an AssertionError in
qpid/messaging/driver.py.

Any ideas about what might be happening would be appreciated. Also, if you
have thoughts on how to debug this further, I'd love to hear them.
Thanks!
-Paul