yahoo-eng-team team mailing list archive

[Bug 1324934] Re: Neutron port leak when connection is dropped during port create in instance boot.

 

** Changed in: nova/icehouse
       Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1324934

Title:
  Neutron port leak when connection is dropped during port create in
  instance boot.

Status in neutron:
  Invalid
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) icehouse series:
  Fix Released

Bug description:
  Sometimes an instance fails to boot because the call to neutron to
  allocate a port fails.  However, we see cases where this happens but
  the port is actually created and allocated to the now-deleted
  instance.

  The same problem has been reported regarding hpcloud internal
  monitoring tools, and the openstack-infra nodepool tenant.  There
  seems to be a port leak.

  Evidence
  ========

  Sometimes instances fail to boot with the following error:

  2014-05-27 00:09:23 ERROR   : [NOV58] NovaServers.add Failed:
  OverLimit - Maximum number of ports exceeded (HTTP 413) (Request-ID:
  req-e05525c3-0876-4da4-8a81-8dcc3432b418)
  args('('SLAM_META_m1_az1_00_09_NC_TEMP',
  u'8c096c29-a666-4b82-99c4-c77dc70cfb40', u'100', 'metastmkey_m1_az1',
  'metastm_m1_az1', u'ee7d6d37-d855-4d30-a67b-0d88a03e72fc', 'az1'),{}')

  How did we run out of ports?  Investigating further, starting with the
  neutron database:

    mysql> select * from ports where device_owner like 'comput%';

  This gives a table which shows neutron ports and the instance uuids
  that they are allocated to (example:
  http://paste.openstack.org/show/82394/)

  Matching neutron's `device_id` with `uuid` in nova's instances table,
  we found that approximately 50% of the ports were allocated to
  instances that had been deleted.  As far as we know this must be a
  bug, as there is no way to create a port without linking it to an
  instance, and deleting an instance should delete its ports atomically.
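
  The matching step above can be sketched roughly as follows.  This is
  an illustration only, not the query we actually ran: it assumes the
  rows have already been fetched into plain dicts, and that nova marks
  soft-deleted instances with a nonzero `deleted` column (an assumption
  about the schema; check your deployment).  The function name and the
  sample ids are hypothetical.

```python
def find_orphaned_ports(ports, instances):
    """Return compute ports whose device_id has no live instance.

    ports:     iterable of dicts with 'id', 'device_id', 'device_owner'
    instances: iterable of dicts with 'uuid' and 'deleted'
               ('deleted' nonzero is assumed to mean soft-deleted)
    """
    live = {i["uuid"] for i in instances if not i["deleted"]}
    return [
        p for p in ports
        if p["device_owner"].startswith("compute")
        and p["device_id"] not in live
    ]


# Hypothetical sample rows, standing in for the two database tables.
ports = [
    {"id": "port-a", "device_id": "inst-1", "device_owner": "compute:az1"},
    {"id": "port-b", "device_id": "inst-2", "device_owner": "compute:az2"},
    {"id": "port-c", "device_id": "rtr-1",
     "device_owner": "network:router_interface"},
]
instances = [
    {"uuid": "inst-1", "deleted": 0},
    {"uuid": "inst-2", "deleted": 1},
]

orphans = find_orphaned_ports(ports, instances)
print([p["id"] for p in orphans])
```

  Only the port attached to the deleted instance is reported; ports
  owned by routers or other non-compute devices are ignored.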

  The effect is that the user has unused ports counting toward their
  port quota, which will prevent them from booting instances when the
  quota is fully allocated.

  Logs
  ====

  The nova-compute log for an instance that fails to boot because of
  port starvation is not interesting here.  However, we have a case
  where an instance fails to boot with "Neutron error creating port",
  yet a port is actually created:

  nova-compute.log:
  2014-05-28 08:08:53.413 16699 DEBUG neutronclient.client [-] throwing ConnectionFailed : [Errno 104] Connection reset by peer _cs_request /usr/lib/python2.7/dist-packages/neutronclient/client.py:153
  2014-05-28 08:08:53.417 16699 ERROR nova.network.neutronv2.api [-] [instance: 2e479806-d13e-4d11-81c1-cc2244a26ef7] Neutron error creating port on network 63657422-b84f-4d2d-b7d2-765ac560546b

  (fuller section of log: http://paste.openstack.org/show/82392/)

  0.2s later, nova-compute.log:
  2014-05-28 08:08:53.664 16699 DEBUG neutronclient.client [-] RESP:{'date': 'Wed, 28 May 2014 08:08:53 GMT', 'status': '200', 'content-length': '13', 'content-type': 'application/json', 'content-location': 'https://region-b.geo-1.network-internal.hpcloudsvc.com/v2.0/ports.json?tenant_id=10409882459003&device_id=2e479806-d13e-4d11-81c1-cc2244a26ef7'} {"ports": []}

  (This is repeated once more 0.2s later.  Slightly longer log
  section: http://paste.openstack.org/show/82395/)

  But eventually the port is present in the neutron database:

  +----------------+--------------------------------------+--------------------+--------------------------------------+-------------------+----------------+--------+--------------------------------------+--------------+
  | tenant_id      | id                                   | name               | network_id                           | mac_address       | admin_state_up | status | device_id                            | device_owner |
  +----------------+--------------------------------------+--------------------+--------------------------------------+-------------------+----------------+--------+--------------------------------------+--------------+
  | 10409882459003 | 916cba73-8925-45a2-80e9-6e9d03e602c8 |                    | 63657422-b84f-4d2d-b7d2-765ac560546b | fa:16:3e:a8:7d:14 |              1 | ACTIVE | 2e479806-d13e-4d11-81c1-cc2244a26ef7 | compute:az2  |
  +----------------+--------------------------------------+--------------------+--------------------------------------+-------------------+----------------+--------+--------------------------------------+--------------+

  It looks like this port has been leaked by neutron.  Our guess is that
  the "Neutron error creating port" failure is spurious, caused by the
  neutronclient's connection being dropped.  In fact the port is being
  created, but it takes some time, and during that time neutron reports
  that there are no ports on that instance, so nothing is cleaned up
  when the instance is deleted.  Then the port details are actually
  written to the db and the port is leaked.
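
  The suspected sequence can be modelled with a toy, in-memory stand-in
  for neutron.  This is only a sketch of our guess, not the real
  neutron or neutronclient code: the `ToyNeutron` class and its
  `finish_pending_writes` method are hypothetical and exist purely to
  make the ordering of events explicit.

```python
class ToyNeutron:
    """In-memory stand-in for the port store; not the real neutron API."""

    def __init__(self):
        self.committed = []   # ports visible to list_ports
        self.in_flight = []   # ports accepted but not yet written

    def create_port(self, device_id):
        # The server accepts the request, then the connection drops
        # before the client sees a response.
        self.in_flight.append({"device_id": device_id})
        raise ConnectionError("Connection reset by peer")

    def list_ports(self, device_id):
        return [p for p in self.committed if p["device_id"] == device_id]

    def delete_port(self, port):
        self.committed.remove(port)

    def finish_pending_writes(self):
        # The delayed write finally lands in the database.
        self.committed.extend(self.in_flight)
        self.in_flight.clear()


neutron = ToyNeutron()
instance = "2e479806-d13e-4d11-81c1-cc2244a26ef7"

try:
    neutron.create_port(instance)
except ConnectionError:
    # Boot fails; cleanup deletes whatever ports are visible right now.
    seen_at_cleanup = neutron.list_ports(instance)
    for port in seen_at_cleanup:
        neutron.delete_port(port)

neutron.finish_pending_writes()
leaked = neutron.list_ports(instance)
print(len(seen_at_cleanup), len(leaked))
```

  Cleanup sees no ports, so it deletes nothing, and the port that
  appears afterwards stays attached to an instance that no longer
  exists.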

  openstack-infra's nodepool was recently unable to boot instances, and
  several hundred ports were found in this state.

  Solutions
  =========

  Neither nova nor neutron has enough information to determine which
  ports are leaked - so a periodic task in either of those two services
  would not be possible.

  A user can free up their ports with a script like
  https://gist.github.com/moorryan/93fa4be65fc5ea60b3ed - and I think an
  operator could do the same.  But there is a risk with this script that
  instances/ports which are currently being created could be wrongly
  identified.  So care is needed.
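
  One way to reduce that risk, sketched here as an assumption rather
  than as something the linked script does, is to treat a port as
  leaked only if it looks orphaned in two scans taken some time apart
  (longer than an instance normally takes to boot).  The function name
  and the sample port ids are hypothetical.

```python
def confirmed_leaks(first_scan, second_scan):
    """Return port ids that looked orphaned in both scans.

    A port that was merely mid-creation during the first scan should
    be matched to a live instance by the second scan and drop out.
    """
    return sorted(set(first_scan) & set(second_scan))


# Hypothetical scan results: ids of ports with no matching instance.
first = ["port-a", "port-b"]    # port-a belonged to a booting instance
second = ["port-b", "port-c"]   # port-c is new; wait for another scan

to_delete = confirmed_leaks(first, second)
print(to_delete)
```

  Only ports returned by `confirmed_leaks` would be candidates for
  deletion; anything seen in a single scan is left alone until it has
  been confirmed again.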

  Another option is for neutron to synchronize get_ports calls with
  create_port (NB: I don't know the neutron codebase well enough to
  know how feasible this is).

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1324934/+subscriptions

