yahoo-eng-team team mailing list archive

[Bug 1817872] [NEW] [RFE] neutron resource health check

 

Public bug reported:

Problem Description
===================
How do we troubleshoot when a VM loses its connection? How do we find out why a floating IP is not reachable?
There is no easy way. Cloud operators need to dump the flows or iptables rules and then work out which parts were not set properly. When there is a huge number of flows or rules, the output is not human-readable, so how do we find out what happened to a given port? When there are plenty of iptables rules, how do we find out why a floating IP is not reachable? When many routers are hosted on the same agent node, how do we find out why one router is not up?
None of these tasks is human-friendly, and people make mistakes. But we know the procedure by which each resource is processed, so we can follow that workflow and let the machine do the status check, troubleshooting, and recovery for us.

Proposed Change
===============
This aims at the community goal "Service-side health checks":
http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000558.html

We already have a troubleshooting blueprint:
https://blueprints.launchpad.net/neutron/+spec/troubleshooting
but it does not seem to have made much progress.

Overview
--------
Add API extensions, CLI tools, and agent-side functions to check resource status.


Basic plan:
1. On the agent side, add functions to detect the status of a single resource.
For instance, for routers, check the iptables rules and the route rules; for ports, check the basic flow status, the OpenFlow security group, l2 pop, ARP, etc.
2. Bulk checks: the ports of a tenant, the ports of one subnet, or the routers of a tenant.
3. Check the resources of one entire agent.
4. API extensions for the related resources, such as router_check and port_check.
For automated scenarios, cloud operators may not want to log in to the neutron-server host; the API is then a good way to call these check methods.
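
The bulk check in item 2 could be sketched roughly as below. This is only an illustration under stated assumptions: `bulk_check_ports`, `unhealthy`, the port dictionary keys, and the convention that a per-port check returns a list of problem strings are all hypothetical and are not existing neutron code.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical port representation: a dict with "id", "tenant_id", "subnet_id".
Port = Dict[str, str]


def bulk_check_ports(
    ports: Iterable[Port],
    check_port: Callable[[Port], List[str]],
    tenant_id: Optional[str] = None,
    subnet_id: Optional[str] = None,
) -> Dict[str, List[str]]:
    """Run a single-port check over every port matching the filters.

    check_port is assumed to return a list of problems (empty when healthy).
    Returns a mapping of port id to the problems found for that port.
    """
    report: Dict[str, List[str]] = {}
    for port in ports:
        if tenant_id is not None and port["tenant_id"] != tenant_id:
            continue
        if subnet_id is not None and port["subnet_id"] != subnet_id:
            continue
        report[port["id"]] = check_port(port)
    return report


def unhealthy(report: Dict[str, List[str]]) -> List[str]:
    """Return the ids of the ports that reported at least one problem."""
    return [port_id for port_id, problems in report.items() if problems]


if __name__ == "__main__":
    ports = [
        {"id": "p1", "tenant_id": "t1", "subnet_id": "s1"},
        {"id": "p2", "tenant_id": "t2", "subnet_id": "s1"},
        {"id": "p3", "tenant_id": "t1", "subnet_id": "s2"},
    ]

    # Stand-in for a real agent-side check; it only pretends p3 is broken.
    def fake_check_port(port: Port) -> List[str]:
        return ["no flows installed"] if port["id"] == "p3" else []

    print(unhealthy(bulk_check_ports(ports, fake_check_port, tenant_id="t1")))
```

Keeping the single-resource check as a parameter means the same loop could serve the per-tenant, per-subnet, and per-agent cases, with only the filter changing.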


Implementation plan:
1. Add functions to detect the status of a single resource.
For instance, following the router processing procedure, add a check method for each step: check_router_gateway, check_nat_rules, check_route_rules, check_qos_rules, check_meta_proxy, and so on.
2. A CLI tool (cloud admin only; it needs to run on the neutron-server host with direct access to the DB) to check the resources of one entire agent.
For instance, check the routers of one L3 agent.
3. API extensions for the related resources: check_router, check_port.
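
A minimal sketch of how the per-step router checks in item 1 might fit together, assuming each step is a plain function over a snapshot of the router's expected and actual state. Only the step names check_router_gateway, check_nat_rules, and check_route_rules come from the plan above; `RouterState`, `CheckResult`, the field names, and the body of `check_router` are hypothetical and do not describe existing neutron code.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RouterState:
    """Hypothetical snapshot of what should exist and what the agent found."""
    router_id: str
    gateway_port_present: bool = True
    expected_nat_rules: List[str] = field(default_factory=list)
    actual_nat_rules: List[str] = field(default_factory=list)
    expected_routes: List[str] = field(default_factory=list)
    actual_routes: List[str] = field(default_factory=list)


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str = ""


def _missing(expected: List[str], actual: List[str]) -> List[str]:
    """Return the expected entries that are absent from actual, in order."""
    present = set(actual)
    return [item for item in expected if item not in present]


def check_router_gateway(state: RouterState) -> CheckResult:
    ok = state.gateway_port_present
    return CheckResult("check_router_gateway", ok,
                       "" if ok else "gateway port missing")


def check_nat_rules(state: RouterState) -> CheckResult:
    missing = _missing(state.expected_nat_rules, state.actual_nat_rules)
    detail = f"missing: {missing}" if missing else ""
    return CheckResult("check_nat_rules", not missing, detail)


def check_route_rules(state: RouterState) -> CheckResult:
    missing = _missing(state.expected_routes, state.actual_routes)
    detail = f"missing: {missing}" if missing else ""
    return CheckResult("check_route_rules", not missing, detail)


# One check per processing step, run in the same order as the workflow.
ROUTER_CHECKS: List[Callable[[RouterState], CheckResult]] = [
    check_router_gateway,
    check_nat_rules,
    check_route_rules,
]


def check_router(state: RouterState) -> List[CheckResult]:
    """Run every step check and collect the results instead of stopping early."""
    return [check(state) for check in ROUTER_CHECKS]
```

Collecting every result, rather than stopping at the first failure, would let an operator see all broken steps for a router in a single call.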


---------------
to be continued...

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1817872

Title:
  [RFE] neutron resource health check

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1817872/+subscriptions

