yahoo-eng-team team mailing list archive
  
    Message #88440
  
[Bug 1964149] [NEW] nova dns lookups can block the nova api process leading to 503 errors.
  
Public bug reported:
We currently have four possibly related downstream bugs in which DNS lookups
result in 503 errors: nova does not monkey patch green DNS, so name resolution can block.
Specifically, we have seen calls to socket.getaddrinfo in py-amqp block the API
when using IPv6.
https://bugzilla.redhat.com/show_bug.cgi?id=2037690
https://bugzilla.redhat.com/show_bug.cgi?id=2050867
https://bugzilla.redhat.com/show_bug.cgi?id=2051631
https://bugzilla.redhat.com/show_bug.cgi?id=2056504
Copying a summary of the RCA from one of the bugs:
What happens:
- A request comes in that requires RPC, so a new connection to RabbitMQ
has to be established.
- The hostname(s) from the transport_url setting are ultimately passed
to py-amqp, which attempts to resolve the hostname to an IP address so
it can set up the underlying socket and connect.
- py-amqp explicitly tries to resolve with AF_INET first and tries
AF_INET6 only if that fails[1].
- The customer environment is primarily IPv6. Attempting to resolve the
hostname via AF_INET fails in nss_hosts (the /etc/hosts file has only
IPv6 addresses) and falls through to nss_dns.
- Something about the customer DNS infrastructure is slow, so it takes a
long time (~10 seconds) for this IPv4 lookup to fail.
- py-amqp finally tries with AF_INET6, and the hostname is resolved
immediately via nss_hosts because the entry is in /etc/hosts.
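
The lookup order described in the steps above can be sketched roughly as follows. This is not the py-amqp code (see [1] for that); it is a minimal illustration of "AF_INET first, AF_INET6 only on failure". The function names, the hostname and the fake resolver are hypothetical and exist only so the example runs without touching real DNS.

```python
import socket


def resolve_ipv4_first(host, port, getaddrinfo=socket.getaddrinfo):
    """Try AF_INET, and only if that fails try AF_INET6."""
    last_error = None
    for family in (socket.AF_INET, socket.AF_INET6):
        try:
            # Blocking call: a slow IPv4 failure is paid in full
            # before the IPv6 lookup is even attempted.
            return family, getaddrinfo(host, port, family, socket.SOCK_STREAM)
        except socket.gaierror as exc:
            last_error = exc
    raise last_error


def fake_hosts_only_ipv6(host, port, family, socktype):
    """Hypothetical stand-in resolver: the host has only an IPv6 entry."""
    if family == socket.AF_INET:
        raise socket.gaierror(socket.EAI_NONAME, "no IPv4 address")
    return [(family, socktype, 6, "", ("fd00::10", port, 0, 0))]


family, entries = resolve_ipv4_first(
    "rabbit.example.test", 5672, getaddrinfo=fake_hosts_only_ipv6)
```

With the fake resolver the IPv4 attempt fails instantly; in the reported environment that same failure took about 10 seconds because it fell through to DNS.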
Critically, because nova explicitly disables greendns[2] with eventlet, the *entire* nova-api worker is blocked for the duration of the slow name resolution: socket.getaddrinfo is a blocking call into glibc.
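
To illustrate why one blocking call stalls everything else in the worker, here is a small analogy using only the standard library. It is a sketch under stated assumptions: asyncio stands in for eventlet, time.sleep stands in for a slow socket.getaddrinfo, and offloading to a thread stands in for a cooperative resolver (greendns itself works differently). All names are hypothetical.

```python
import asyncio
import time


def slow_lookup(delay):
    """Stand-in for a blocking getaddrinfo call that is slow to fail."""
    time.sleep(delay)
    return "resolved"


async def heartbeat(ticks, interval=0.01):
    """Another 'request' on the same loop; records when it got to run."""
    for _ in range(5):
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def max_stall(offload, delay=0.2):
    """Return the longest gap between heartbeat ticks during the lookup."""
    ticks = []
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # let the heartbeat record its first tick
    if offload:
        # Cooperative: the blocking call runs in a thread, the loop stays free.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, slow_lookup, delay)
    else:
        # Blocking: nothing else on this loop runs until the call returns.
        slow_lookup(delay)
    await task
    return max(b - a for a, b in zip(ticks, ticks[1:]))


blocked_gap = asyncio.run(max_stall(offload=False))
offloaded_gap = asyncio.run(max_stall(offload=True))
```

In the blocking case the heartbeat is starved for roughly the whole lookup delay; in the offloaded case it keeps ticking at its normal interval. The same starvation applies to every other request handled by a nova-api worker while getaddrinfo is stuck.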
[1] https://github.com/celery/py-amqp/blob/1f599c7213b097df07d0afd7868072ff9febf4da/amqp/transport.py#L155-L208
[2] https://github.com/openstack/nova/blob/master/nova/monkey_patch.py#L25-L36
nova currently disables the greendns monkey patch because of a very old bug seen on CentOS 6 with Python 2.6 in the Havana release of nova:
https://bugs.launchpad.net/nova/+bug/1164822
IPv6 support was added to eventlet in v0.17, the same release that added Python 3 support, back in 2015:
https://github.com/eventlet/eventlet/issues/8#issuecomment-75490457
So we should no longer need to work around the lack of IPv6 support.
https://review.opendev.org/c/openstack/nova/+/830966
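
For context, the opt-out is controlled by an environment variable that has to be set before eventlet is first imported. The sketch below assumes the variable is EVENTLET_NO_GREENDNS with the value "yes", which is what the linked monkey_patch.py [2] is understood to set; verify against that file and the eventlet documentation. The helper name is hypothetical and it deliberately does not import eventlet.

```python
import os


def configure_greendns(enabled, environ=os.environ):
    """Hypothetical helper: toggle eventlet's greendns opt-out variable.

    Must run before the first `import eventlet`, because the variable
    is assumed to be read when eventlet's green socket module loads.
    """
    if enabled:
        # Removing the opt-out lets eventlet patch DNS lookups.
        environ.pop("EVENTLET_NO_GREENDNS", None)
    else:
        # Behaviour described in this bug: greendns explicitly disabled.
        environ["EVENTLET_NO_GREENDNS"] = "yes"
    return environ


disabled_env = configure_greendns(enabled=False, environ={})
enabled_env = configure_greendns(
    enabled=True, environ={"EVENTLET_NO_GREENDNS": "yes"})
```

Passing a plain dict keeps the example from touching the real process environment; a real caller would use the default os.environ.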
** Affects: nova
     Importance: Medium
     Assignee: sean mooney (sean-k-mooney)
         Status: Triaged
** Tags: api yoga-rc-potential
** Changed in: nova
   Importance: Undecided => Medium
** Changed in: nova
       Status: New => Triaged
** Changed in: nova
     Assignee: (unassigned) => sean mooney (sean-k-mooney)
** Tags added: api yoga-rc-potential
-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1964149
Title:
  nova dns lookups can block the nova api process leading to 503 errors.
Status in OpenStack Compute (nova):
  Triaged
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1964149/+subscriptions