yahoo-eng-team team mailing list archive

[Bug 1664299] [NEW] Issue about lost rpc status report from agent.

 

Public bug reported:

Background:
We need a stable and functional public cloud, which means users can launch VMs and call the OpenStack API whenever they want.
So the server needs to be more robust and more fault-tolerant.

Scenario:
1. A Neutron agent reports its status to the server over RPC.
2. The agent has sent the message, and it is now in the message queue.
3. The Neutron server takes the message from the queue and starts to process the payload, but has not yet updated the agent in the DB.
4. At that moment, the Neutron server restarts. The RPC message is lost, and the agent keeps waiting for the server's response.

In this situation, assume that the maximum wait time for a server response ('rpc_response_timeout') is 60s and that the maximum agent DOWN time on the Neutron server side is 150s.
As described in the background above, users may issue requests during that down time, and the host where the agent is deployed may be selected as the destination. The agent is still waiting for the response from the Neutron server; it does not retry as soon as possible, it just waits. If the Neutron server marks the agent DOWN while instances are being launched, every instance scheduled to that host hits a binding failed error.
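
The timing above can be worked through with a small sketch. It is only back-of-the-envelope arithmetic, not Neutron code: the helper names `silent_gap` and `is_marked_down` are hypothetical, the 30s report interval is an assumption (the report does not state it), and it assumes the agent blocks for the full 'rpc_response_timeout' on each lost report and resends right after the timeout.

```python
def silent_gap(report_interval, rpc_response_timeout, lost_reports):
    """Seconds between two status reports that the server actually records.

    Assumption (not from the bug report): each lost report blocks the agent
    for the full rpc_response_timeout, and the next report is sent
    immediately after the timeout expires.
    """
    return report_interval + lost_reports * rpc_response_timeout


def is_marked_down(gap, agent_down_time):
    """Assumed rule: the server treats the agent as DOWN once the time
    since the last recorded report exceeds agent_down_time."""
    return gap > agent_down_time


# Values from the report: 60s response timeout, 150s agent down time.
# The 30s report interval is an assumed value.
for lost in (1, 2, 3):
    gap = silent_gap(report_interval=30, rpc_response_timeout=60, lost_reports=lost)
    print(lost, gap, is_marked_down(gap, agent_down_time=150))
```

Under these assumptions the silent period grows by a full 60s for every lost report, so how quickly the agent crosses the 150s threshold depends on the report interval and on how many reports are lost or delayed while the server restarts.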

This result is unacceptable in some respects, especially for public cloud products.
Could Neutron solve this issue in some nice way? :)  Thank you.

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1664299

Title:
  Issue about lost rpc status report from agent.

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1664299/+subscriptions