yahoo-eng-team team mailing list archive
Message #80939
[Bug 1855919] [NEW] Broken pipe errors cause neutron metadata agent to fail
Public bug reported:
After we increased the number of computes to 200, we started seeing
"broken pipe" errors in neutron-metadata-agent.log on the controllers.
After a neutron restart the errors drop off, then they increase again
until the log is mostly errors, the neutron metadata service fails, and
VMs cannot boot. Another symptom is that unacked RMQ messages build up
in the q-plugin queue. This is the first error we see; it occurs as the
server is starting:
2019-12-10 10:56:01.942 1838536 INFO eventlet.wsgi.server [-] (1838536) wsgi starting up on http:/var/lib/neutron/metadata_proxy
2019-12-10 10:56:01.943 1838538 INFO eventlet.wsgi.server [-] (1838538) wsgi starting up on http:/var/lib/neutron/metadata_proxy
2019-12-10 10:56:01.945 1838539 INFO eventlet.wsgi.server [-] (1838539) wsgi starting up on http:/var/lib/neutron/metadata_proxy
2019-12-10 10:56:21.138 1838538 INFO eventlet.wsgi.server [-] Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/eventlet/wsgi.py", line 521, in handle_one_response
    write(b''.join(towrite))
  File "/usr/lib/python2.7/dist-packages/eventlet/wsgi.py", line 462, in write
    wfile.flush()
  File "/usr/lib/python2.7/socket.py", line 307, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
  File "/usr/lib/python2.7/dist-packages/eventlet/greenio/base.py", line 390, in sendall
    tail = self.send(data, flags)
  File "/usr/lib/python2.7/dist-packages/eventlet/greenio/base.py", line 384, in send
    return self._send_loop(self.fd.send, data, flags)
  File "/usr/lib/python2.7/dist-packages/eventlet/greenio/base.py", line 371, in _send_loop
    return send_method(data, *args)
error: [Errno 32] Broken pipe
2019-12-10 10:56:21.138 1838538 INFO eventlet.wsgi.server [-] 10.195.74.25,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 0 time: 19.0296111
2019-12-10 10:56:25.059 1838516 INFO eventlet.wsgi.server [-] 10.195.74.28,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.2840948
2019-12-10 10:56:25.181 1838529 INFO eventlet.wsgi.server [-] 10.195.74.68,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.2695429
2019-12-10 10:56:25.259 1838518 INFO eventlet.wsgi.server [-] 10.195.74.28,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.1980510
Then we see "call queues" warnings, and the warning threshold is raised,
first to 20 and later to 40:
2019-12-10 10:56:31.414 1838515 WARNING oslo_messaging._drivers.amqpdriver [-] Number of call queues is 11, greater than warning threshold: 10. There could be a leak. Increasing threshold to: 20
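The call-queue warnings in this log come from several PIDs, and each PID starts from a threshold of 10. That suggests every worker process tracks its own threshold and doubles it whenever the count exceeds it. A minimal Python sketch of that pattern follows; CallQueueMonitor and observe are hypothetical names used for illustration, not oslo.messaging code.

```python
class CallQueueMonitor:
    """Toy model of a per-process warning threshold that doubles when exceeded."""

    def __init__(self, threshold=10):
        self.threshold = threshold
        self.warnings = []

    def observe(self, queue_count):
        """Record a warning and double the threshold if the count exceeds it."""
        if queue_count > self.threshold:
            new_threshold = self.threshold * 2
            self.warnings.append(
                "Number of call queues is %d, greater than warning "
                "threshold: %d. Increasing threshold to: %d"
                % (queue_count, self.threshold, new_threshold)
            )
            self.threshold = new_threshold


# One monitor per worker PID (assumption: thresholds are not shared).
workers = {1838515: CallQueueMonitor(), 1838531: CallQueueMonitor()}
# Hypothetical queue growth: one worker reaches 11 queues, the other 21.
growth = {1838515: range(1, 12), 1838531: range(1, 22)}
for pid, counts in growth.items():
    for count in counts:
        workers[pid].observe(count)
```

In this model a worker that reaches 11 queues warns once and moves to a threshold of 20, while a worker that reaches 21 queues warns twice and ends at 40, which matches the mix of "threshold to: 20" and "threshold to: 40" lines from different PIDs.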
Next we see RPC timeout errors:
2019-12-10 10:57:02.043 1838520 WARNING oslo_messaging._drivers.amqpdriver [-] Number of call queues is 11, greater than warning threshold: 10. There could be a leak. Increasing threshold to: 20
2019-12-10 10:57:02.059 1838534 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 37 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 1ed3e021607e466f8b9b84cd3b05b188
2019-12-10 10:57:02.059 1838534 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID 1ed3e021607e466f8b9b84cd3b05b188
2019-12-10 10:57:02.285 1838521 INFO eventlet.wsgi.server [-] 10.195.74.27,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.7959940
2019-12-10 10:57:16.215 1838531 WARNING oslo_messaging._drivers.amqpdriver [-] Number of call queues is 21, greater than warning threshold: 20. There could be a leak. Increasing threshold to: 40
2019-12-10 10:57:17.339 1838539 WARNING oslo_messaging._drivers.amqpdriver [-] Number of call queues is 11, greater than warning threshold: 10. There could be a leak. Increasing threshold to: 20
2019-12-10 10:57:24.838 1838524 INFO eventlet.wsgi.server [-] 10.195.73.242,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.6842020
2019-12-10 10:57:24.882 1838524 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 3 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 2bb5faa3ec8d4f5b9d3bd3e2fe095f9e
2019-12-10 10:57:24.883 1838524 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID 2bb5faa3ec8d4f5b9d3bd3e2fe095f9e
2019-12-10 10:57:24.887 1838525 INFO eventlet.wsgi.server [-] 10.195.74.26,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.9827850
2019-12-10 10:57:24.903 1838518 INFO eventlet.wsgi.server [-] 10.195.74.43,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 3.5630379
2019-12-10 10:57:25.045 1838529 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 21 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID b38361bf9906482b8b24c5b534a6652b
2019-12-10 10:57:25.046 1838529 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID b38361bf9906482b8b24c5b534a6652b
2019-12-10 10:57:25.055 1838537 INFO eventlet.wsgi.server [-] 10.195.73.247,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.7542410
2019-12-10 10:57:25.119 1838523 INFO eventlet.wsgi.server [-] 10.195.74.2,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.7057869
2019-12-10 10:57:25.185 1838524 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 47 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID f1a268d937f94def97bd238916715744
2019-12-10 10:57:25.261 1838529 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 26 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 93c31cf4f5d34bd1a5ba90165e89cb79
2019-12-10 10:57:25.284 1838536 INFO eventlet.wsgi.server [-] 10.195.73.207,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.4315739
2019-12-10 10:57:25.319 1838520 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 50 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 37c1b168536e4c70b522c330209b11ec
2019-12-10 10:57:25.319 1838520 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID 37c1b168536e4c70b522c330209b11ec
2019-12-10 10:57:25.374 1838530 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 30 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID fb837fc73c664209bfbada0fb32886ad
2019-12-10 10:57:25.375 1838530 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID fb837fc73c664209bfbada0fb32886ad
2019-12-10 10:57:25.388 1838526 INFO eventlet.wsgi.server [-] 10.195.65.7,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 3.5798080
2019-12-10 10:57:25.446 1838520 INFO eventlet.wsgi.server [-] 10.195.74.104,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 3.6868739
2019-12-10 10:57:25.448 1838528 INFO eventlet.wsgi.server [-] 10.195.74.202,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 3.7513518
2019-12-10 10:57:25.452 1838519 WARNING oslo_messaging._drivers.amqpdriver [-] Number of call queues is 21, greater than warning threshold: 20. There could be a leak. Increasing threshold to: 40
2019-12-10 10:57:25.504 1838535 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 15 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 7b677a7d40274b0ea22510dcf3865cf6
2019-12-10 10:57:25.505 1838535 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID 7b677a7d40274b0ea22510dcf3865cf6
2019-12-10 10:57:25.609 1838539 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 20 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 378f11ce14334be38ffaa95ec3fc26f2
2019-12-10 10:57:25.610 1838539 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID 378f11ce14334be38ffaa95ec3fc26f2
2019-12-10 10:57:25.661 1838524 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 28 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 0c911a0ac95f42209cfa8b265d4d5c3d
2019-12-10 10:57:25.787 1838525 INFO eventlet.wsgi.server [-] 10.195.74.86,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.7191069
2019-12-10 10:57:25.831 1838522 INFO eventlet.wsgi.server [-] 10.195.64.185,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.5980189
2019-12-10 10:57:25.837 1838532 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 51 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 9a7cfd81ba714d2680aa223ba96798f0
2019-12-10 10:57:25.837 1838532 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID 9a7cfd81ba714d2680aa223ba96798f0
2019-12-10 10:57:25.903 1838536 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 28 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID da2628209cde4562bf47dc0bdfecbf1d
2019-12-10 10:57:25.904 1838536 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID da2628209cde4562bf47dc0bdfecbf1d
2019-12-10 10:57:25.914 1838521 INFO eventlet.wsgi.server [-] 10.195.74.44,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 3.8841410
2019-12-10 10:57:25.936 1838524 INFO eventlet.wsgi.server [-] 10.195.73.231,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.3305228
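The waits in the errors above range from 3 to 51 seconds and the get_ports timeout is raised to 120 seconds, which is consistent with a randomized back-off whose timeout doubles after each failure. The sketch below models that pattern only. It assumes a 60-second base timeout (the oslo.messaging default for rpc_response_timeout) and an illustrative 600-second cap; BackoffModel and on_timeout are hypothetical names, not neutron's implementation.

```python
import random


class BackoffModel:
    """Simplified model of an RPC client that backs off after a timeout."""

    def __init__(self, base_timeout=60, max_timeout=600, rng=None):
        self.base_timeout = base_timeout  # assumed rpc_response_timeout
        self.max_timeout = max_timeout    # assumed cap, for illustration only
        self.timeout = base_timeout
        self.rng = rng if rng is not None else random.Random()

    def on_timeout(self):
        """Handle one timed-out call and return (wait, new_timeout).

        The wait is drawn uniformly from [0, base_timeout] so that many
        agents do not retry at the same instant; the timeout for the next
        call doubles until it reaches the cap.
        """
        wait = self.rng.uniform(0, self.base_timeout)
        self.timeout = min(self.timeout * 2, self.max_timeout)
        return wait, self.timeout


model = BackoffModel(rng=random.Random(0))
wait, new_timeout = model.on_timeout()
print("sleep %.0f s, next timeout %d s" % (wait, new_timeout))
```

Under these assumptions a single failed call blocks its request for the full timeout plus the random wait, so a client with a shorter timeout of its own may disconnect before the agent replies.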
A minute or two after starting the server we get more errors. At this
point VMs are unable to build. If I try to pull metadata from a VM, I
get a 503 or 504 and nothing is logged in neutron-metadata-agent.log.
HAProxy logs the 503/504 response.
$ curl -s http://169.254.169.254/2009-04-04/meta-data/hostname
<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>
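To watch how often the 503/504 responses occur, the same request can be repeated from inside a VM with a small Python probe. This is a sketch that uses only the standard library; the probe function and its opener parameter are hypothetical helpers added for illustration, and the URL is the one from the curl command above.

```python
import time
import urllib.error
import urllib.request

METADATA_URL = "http://169.254.169.254/2009-04-04/meta-data/hostname"


def probe(url=METADATA_URL, timeout=10, opener=urllib.request.urlopen):
    """Fetch url once and return (status, elapsed_seconds).

    HTTP error responses such as 503 or 504 are returned as a status code
    instead of being raised, so they can be counted over repeated probes.
    Connection-level failures (urllib.error.URLError) still propagate.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code
    return status, time.monotonic() - start
```

Calling probe() in a loop and logging the status with a timestamp makes it possible to line up failures seen by the VM with the entries in neutron-metadata-agent.log and the HAProxy log.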
Now the log is almost all errors:
2019-12-10 10:57:27.666 1838530 INFO eventlet.wsgi.server [-] 10.195.73.174,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 3.7101889
2019-12-10 10:57:27.719 1838537 INFO eventlet.wsgi.server [-] 10.195.65.6,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.3497119
2019-12-10 10:57:27.720 1838525 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 60 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 138877a7326e40f38de23b05fb97127a
2019-12-10 10:57:27.720 1838525 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID 138877a7326e40f38de23b05fb97127a
2019-12-10 10:57:27.741 1838523 INFO eventlet.wsgi.server [-] 10.195.74.86,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 4.7329929
2019-12-10 10:57:27.820 1838525 INFO eventlet.wsgi.server [-] 10.195.73.206,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 1.4146030
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent [-] Unexpected error.: MessagingTimeout: Timed out waiting for a reply to message ID 2bb5faa3ec8d4f5b9d3bd3e2fe095f9e
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent Traceback (most recent call last):
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/metadata/agent.py", line 89, in __call__
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent instance_id, tenant_id = self._get_instance_and_tenant_id(req)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/metadata/agent.py", line 162, in _get_instance_and_tenant_id
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent ports = self._get_ports(remote_address, network_id, router_id)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/metadata/agent.py", line 155, in _get_ports
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent return self._get_ports_for_remote_address(remote_address, networks)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/common/cache_utils.py", line 116, in __call__
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent return self.func(target_self, *args, **kwargs)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/metadata/agent.py", line 137, in _get_ports_for_remote_address
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent ip_address=remote_address)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/metadata/agent.py", line 106, in _get_ports_from_server
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent return self.plugin_rpc.get_ports(self.context, filters)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/metadata/agent.py", line 72, in get_ports
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent return cctxt.call(context, 'get_ports', filters=filters)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 173, in call
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent time.sleep(wait)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent self.force_reraise()
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent six.reraise(self.type_, self.value, self.tb)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 150, in call
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent return self._original_context.call(ctxt, method, **kwargs)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 179, in call
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent retry=self.retry)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 133, in _send
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent retry=retry)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent call_monitor_timeout, retry=retry)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent call_monitor_timeout)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent message = self.waiters.get(msg_id, timeout=timeout)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent 'to message ID %s' % msg_id)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent MessagingTimeout: Timed out waiting for a reply to message ID 2bb5faa3ec8d4f5b9d3bd3e2fe095f9e
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent
2019-12-10 10:57:27.828 1838524 INFO eventlet.wsgi.server [-] Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/eventlet/wsgi.py", line 521, in handle_one_response
    write(b''.join(towrite))
  File "/usr/lib/python2.7/dist-packages/eventlet/wsgi.py", line 462, in write
    wfile.flush()
  File "/usr/lib/python2.7/socket.py", line 307, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
  File "/usr/lib/python2.7/dist-packages/eventlet/greenio/base.py", line 390, in sendall
    tail = self.send(data, flags)
  File "/usr/lib/python2.7/dist-packages/eventlet/greenio/base.py", line 384, in send
    return self._send_loop(self.fd.send, data, flags)
  File "/usr/lib/python2.7/dist-packages/eventlet/greenio/base.py", line 371, in _send_loop
    return send_method(data, *args)
error: [Errno 32] Broken pipe
2019-12-10 10:57:27.828 1838524 INFO eventlet.wsgi.server [-] 10.195.73.248,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 500 len: 0 time: 63.0060959
2019-12-10 10:57:27.873 1838528 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 41 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 7d73358e2fe841e4a4b818395e2e5b2d
2019-12-10 10:57:27.877 1838524 INFO eventlet.wsgi.server [-] 10.195.73.238,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 4.4559531
2019-12-10 10:57:27.921 1838538 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 6 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 297db0c16653413cabc868027f9e6abb
2019-12-10 10:57:27.921 1838538 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID 297db0c16653413cabc868027f9e6abb
2019-12-10 10:57:27.967 1838520 INFO eventlet.wsgi.server [-] 10.195.74.29,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.4040241
2019-12-10 10:57:28.006 1838517 INFO eventlet.wsgi.server [-] 10.195.74.202,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 3.6681471
2019-12-10 10:57:28.026 1838522 INFO eventlet.wsgi.server [-] 10.195.74.202,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.3529530
2019-12-10 10:57:28.058 1838519 INFO eventlet.wsgi.server [-] 10.195.74.121,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 3.5390451
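To put numbers on "almost all errors", the log can be summarized with a short Python script. The sketch below assumes the line layout seen in these excerpts (date, time, PID, level, logger) and the request summary format logged by eventlet.wsgi.server; summarize and the 10-second slow_seconds default are hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Matches "<date> <time> <pid> <LEVEL> <logger> ..." as seen in the excerpts.
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+) (?P<logger>\S+)"
)
# Matches the per-request summary written by eventlet.wsgi.server.
REQUEST_RE = re.compile(
    r'"(?P<request>[^"]+)" status: (?P<status>\d+) '
    r"len: (?P<length>\d+) time: (?P<seconds>[\d.]+)"
)


def summarize(lines, slow_seconds=10.0):
    """Count log levels and collect failed, empty, or slow WSGI requests."""
    levels = Counter()
    suspect = []
    for line in lines:
        header = LINE_RE.match(line)
        if not header:
            continue  # lines without a timestamp prefix are skipped
        levels[header["level"]] += 1
        request = REQUEST_RE.search(line)
        if request:
            status = int(request["status"])
            length = int(request["length"])
            seconds = float(request["seconds"])
            if status != 200 or length == 0 or seconds >= slow_seconds:
                suspect.append((header["time"], status, length, seconds))
    return levels, suspect
```

Note that every prefixed traceback line counts as one ERROR entry, so the ERROR total overstates the number of distinct failures. On the excerpts above, the suspect list would contain the "status: 200 len: 0" request that took 19.03 seconds and the "status: 500 len: 0" request that took 63.01 seconds. The latter appears to match the 60-second RPC timeout plus the 3-second wait logged by the same PID for message ID 2bb5faa3ec8d4f5b9d3bd3e2fe095f9e, which would explain why the client had already disconnected when the agent tried to write the response.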
To reproduce this issue:
Build an OpenStack cluster on Rocky with 3 controllers (48 CPUs, Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz, 92G RAM) and add 200 computes.
This bug seems severe to us. It is ruining our production cluster and we cannot build VMs.
** Affects: neutron
Importance: Undecided
Status: New
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1855919
Title:
Broken pipe errors cause neutron metadata agent to fail
Status in neutron:
New
2019-12-10 10:57:25.787 1838525 INFO eventlet.wsgi.server [-] 10.195.74.86,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.7191069
2019-12-10 10:57:25.831 1838522 INFO eventlet.wsgi.server [-] 10.195.64.185,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.5980189
2019-12-10 10:57:25.837 1838532 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 51 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 9a7cfd81ba714d2680aa223ba96798f0
2019-12-10 10:57:25.837 1838532 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID 9a7cfd81ba714d2680aa223ba96798f0
2019-12-10 10:57:25.903 1838536 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 28 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID da2628209cde4562bf47dc0bdfecbf1d
2019-12-10 10:57:25.904 1838536 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID da2628209cde4562bf47dc0bdfecbf1d
2019-12-10 10:57:25.914 1838521 INFO eventlet.wsgi.server [-] 10.195.74.44,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 3.8841410
2019-12-10 10:57:25.936 1838524 INFO eventlet.wsgi.server [-] 10.195.73.231,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.3305228
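The paired messages above ("Waiting for N seconds before next attempt" and "Increasing timeout for get_ports calls to 120 seconds") look like a randomized backoff: after a timeout the caller pauses for a random interval, and the timeout used for later calls to that method is doubled. The following is a minimal Python sketch of that pattern, not Neutron's actual code. The function name, the base timeout of 60 seconds (inferred from the jump to 120) and the cap of 10 times the base are assumptions made for illustration.

```python
import random


def backoff_after_timeout(current_timeout, base_timeout, max_factor=10, rng=random):
    """Return (wait_seconds, new_timeout) after one RPC timeout.

    wait_seconds: random pause taken before the error is passed on.
    new_timeout:  timeout for later calls to the same method, doubled
                  each time but never above base_timeout * max_factor.
    max_factor is an assumed cap, not a value taken from the log.
    """
    wait = rng.uniform(0, base_timeout)
    new_timeout = min(current_timeout * 2, base_timeout * max_factor)
    return wait, new_timeout


# Assumed starting point: a 60 second timeout, as the "120 seconds" warning suggests.
wait, new_timeout = backoff_after_timeout(current_timeout=60, base_timeout=60)
print("wait %.0fs before next attempt, next timeout %ds" % (wait, new_timeout))
```

If the pause is taken while the metadata request is still open, a request that hits a timeout is held for at least the timeout plus the pause, which may be longer than the proxy in front of the agent is willing to wait. That would fit the long-running requests that end in "Broken pipe", but it is an inference from the log, not a confirmed cause.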
A minute or two after starting the server we get more errors. At this
point VMs are unable to build. If I try to pull metadata from a VM, I get
a 503 or 504, and nothing is logged in neutron-metadata-agent.log.
Haproxy logs the 503/504 response.
$ curl -s http://169.254.169.254/2009-04-04/meta-data/hostname
<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>
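To watch when the 503/504 responses start and stop, the same check can be scripted from inside a VM. This is a minimal Python 3 sketch using only the standard library. The helper name fetch_status, the 10 second timeout and the open_fn parameter are illustrative assumptions; only the URL comes from the curl command above.

```python
import urllib.error
import urllib.request

# Direct opener that ignores proxy environment variables, since the
# metadata address is link-local and should not go through a proxy.
_DIRECT = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch_status(url, timeout=10, open_fn=_DIRECT.open):
    """Return the HTTP status code for url, including 4xx/5xx responses.

    urllib raises HTTPError for error statuses such as 503 and 504, so
    the code is read from the exception in that case. Connection
    failures and socket timeouts are not caught and propagate.
    """
    try:
        with open_fn(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
```

For example, calling fetch_status("http://169.254.169.254/2009-04-04/meta-data/hostname") in a loop with a short sleep gives a simple timeline of status codes to compare against the agent and haproxy logs.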
Now the log is almost all errors:
2019-12-10 10:57:27.666 1838530 INFO eventlet.wsgi.server [-] 10.195.73.174,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 3.7101889
2019-12-10 10:57:27.719 1838537 INFO eventlet.wsgi.server [-] 10.195.65.6,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.3497119
2019-12-10 10:57:27.720 1838525 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 60 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 138877a7326e40f38de23b05fb97127a
2019-12-10 10:57:27.720 1838525 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID 138877a7326e40f38de23b05fb97127a
2019-12-10 10:57:27.741 1838523 INFO eventlet.wsgi.server [-] 10.195.74.86,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 4.7329929
2019-12-10 10:57:27.820 1838525 INFO eventlet.wsgi.server [-] 10.195.73.206,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 1.4146030
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent [-] Unexpected error.: MessagingTimeout: Timed out waiting for a reply to message ID 2bb5faa3ec8d4f5b9d3bd3e2fe095f9e
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent Traceback (most recent call last):
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/metadata/agent.py", line 89, in __call__
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent instance_id, tenant_id = self._get_instance_and_tenant_id(req)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/metadata/agent.py", line 162, in _get_instance_and_tenant_id
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent ports = self._get_ports(remote_address, network_id, router_id)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/metadata/agent.py", line 155, in _get_ports
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent return self._get_ports_for_remote_address(remote_address, networks)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/common/cache_utils.py", line 116, in __call__
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent return self.func(target_self, *args, **kwargs)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/metadata/agent.py", line 137, in _get_ports_for_remote_address
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent ip_address=remote_address)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/metadata/agent.py", line 106, in _get_ports_from_server
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent return self.plugin_rpc.get_ports(self.context, filters)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/metadata/agent.py", line 72, in get_ports
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent return cctxt.call(context, 'get_ports', filters=filters)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 173, in call
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent time.sleep(wait)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent self.force_reraise()
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent six.reraise(self.type_, self.value, self.tb)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/neutron/common/rpc.py", line 150, in call
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent return self._original_context.call(ctxt, method, **kwargs)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 179, in call
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent retry=self.retry)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 133, in _send
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent retry=retry)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent call_monitor_timeout, retry=retry)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent call_monitor_timeout)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent message = self.waiters.get(msg_id, timeout=timeout)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent 'to message ID %s' % msg_id)
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent MessagingTimeout: Timed out waiting for a reply to message ID 2bb5faa3ec8d4f5b9d3bd3e2fe095f9e
2019-12-10 10:57:27.824 1838524 ERROR neutron.agent.metadata.agent
2019-12-10 10:57:27.828 1838524 INFO eventlet.wsgi.server [-] Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/eventlet/wsgi.py", line 521, in handle_one_response
write(b''.join(towrite))
File "/usr/lib/python2.7/dist-packages/eventlet/wsgi.py", line 462, in write
wfile.flush()
File "/usr/lib/python2.7/socket.py", line 307, in flush
self._sock.sendall(view[write_offset:write_offset+buffer_size])
File "/usr/lib/python2.7/dist-packages/eventlet/greenio/base.py", line 390, in sendall
tail = self.send(data, flags)
File "/usr/lib/python2.7/dist-packages/eventlet/greenio/base.py", line 384, in send
return self._send_loop(self.fd.send, data, flags)
File "/usr/lib/python2.7/dist-packages/eventlet/greenio/base.py", line 371, in _send_loop
return send_method(data, *args)
error: [Errno 32] Broken pipe
2019-12-10 10:57:27.828 1838524 INFO eventlet.wsgi.server [-] 10.195.73.248,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 500 len: 0 time: 63.0060959
2019-12-10 10:57:27.873 1838528 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 41 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 7d73358e2fe841e4a4b818395e2e5b2d
2019-12-10 10:57:27.877 1838524 INFO eventlet.wsgi.server [-] 10.195.73.238,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 4.4559531
2019-12-10 10:57:27.921 1838538 ERROR neutron.common.rpc [-] Timeout in RPC method get_ports. Waiting for 6 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 297db0c16653413cabc868027f9e6abb
2019-12-10 10:57:27.921 1838538 WARNING neutron.common.rpc [-] Increasing timeout for get_ports calls to 120 seconds. Restart the agent to restore it to the default value.: MessagingTimeout: Timed out waiting for a reply to message ID 297db0c16653413cabc868027f9e6abb
2019-12-10 10:57:27.967 1838520 INFO eventlet.wsgi.server [-] 10.195.74.29,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.4040241
2019-12-10 10:57:28.006 1838517 INFO eventlet.wsgi.server [-] 10.195.74.202,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 3.6681471
2019-12-10 10:57:28.026 1838522 INFO eventlet.wsgi.server [-] 10.195.74.202,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 0.3529530
2019-12-10 10:57:28.058 1838519 INFO eventlet.wsgi.server [-] 10.195.74.121,<local> "GET /latest/meta-data/instance-id HTTP/1.0" status: 200 len: 146 time: 3.5390451
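To put a number on "almost all errors", the log can be summarized with a short script. The sketch below is an illustration only: the function name summarize and the regular expressions are assumptions based on the line formats shown in this report, and they may need adjusting for other log formats.

```python
import re
from collections import Counter

# Patterns based on the log lines quoted in this report.
WSGI_RE = re.compile(r"status: (\d{3}) len: (\d+) time: ([\d.]+)")
TIMEOUT_RE = re.compile(r"Timeout in RPC method (\w+)")


def summarize(lines):
    """Count response statuses, RPC timeouts and broken pipes in log lines."""
    statuses = Counter()
    rpc_timeouts = Counter()
    broken_pipes = 0
    slowest = 0.0
    for line in lines:
        match = WSGI_RE.search(line)
        if match:
            statuses[match.group(1)] += 1
            slowest = max(slowest, float(match.group(3)))
            continue
        match = TIMEOUT_RE.search(line)
        if match:
            rpc_timeouts[match.group(1)] += 1
        elif "Broken pipe" in line:
            broken_pipes += 1
    return {
        "statuses": dict(statuses),
        "rpc_timeouts": dict(rpc_timeouts),
        "broken_pipes": broken_pipes,
        "slowest_request_seconds": slowest,
    }


# Example use (the directory is an assumption; adjust to your deployment):
# with open("/var/log/neutron/neutron-metadata-agent.log") as log_file:
#     print(summarize(log_file))
```

Running it on the log before and after a restart should show whether the share of get_ports timeouts and broken pipes grows over time, as described above.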
To reproduce this issue:
Build an OpenStack cluster on Rocky and add 200 compute nodes. Our environment has 3 controllers with 48 CPUs (Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz) and 92G RAM.
This bug seems severe to us. It is ruining our production cluster and we cannot build VMs.
To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1855919/+subscriptions