yahoo-eng-team team mailing list archive

[Bug 1785582] Re: Connectivity to instance after L3 router migration from Legacy to HA fails

Reviewed:  https://review.openstack.org/589885
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6c300b1a9b3f0db82b4edd84eda74600d28b7185
Submitter: Zuul
Branch:    master

commit 6c300b1a9b3f0db82b4edd84eda74600d28b7185
Author: Slawek Kaplonski <skaplons@xxxxxxxxxx>
Date:   Wed Aug 8 14:52:06 2018 +0200

    Remove fdb entries for ha router interfaces when going DOWN
    
    When an HA router's interface on a host goes DOWN but the router
    is still available on that host, the L2 population mechanism
    driver now tells the other hosts to remove the unicast fdb
    entries for this port on that host.
    
    It does not send FLOODING_ENTRY, because the port is still on the
    host in standby mode and might become master in the future.
    
    This fixes the issue with migrating a router from Legacy to HA.
    In that case, the port that was originally attached to the legacy
    router is turned into an HA backup port before its status
    changes to DOWN.
    The unicast entries for this port and the backup node are now
    removed properly, so packets for the HA router are actually
    sent to the host that is the master node for the router.
    
    Closes-Bug: #1785582
    
    Change-Id: Icc14e5f5d40fc6fbb49e0f7b18cc3b15ebec8508
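
The decision described in the commit message can be sketched roughly as follows. This is a simplified, self-contained model and not the actual neutron code: `PortInfo`, `entries_to_remove` and its parameters are hypothetical names chosen for illustration, and the rule that a non-HA port also withdraws the flooding entry when it is the last port on the host is an assumption about the surrounding behaviour.

```python
from collections import namedtuple

# Hypothetical stand-ins for an l2pop fdb entry and the flooding entry.
PortInfo = namedtuple("PortInfo", ["mac_address", "ip_address"])
FLOODING_ENTRY = PortInfo("00:00:00:00:00:00", "0.0.0.0")


def entries_to_remove(port_entries, is_ha_port, router_still_on_host,
                      last_port_on_host):
    """Return the fdb entries other hosts should drop when a port goes DOWN.

    port_entries: unicast PortInfo entries belonging to the port.
    """
    if is_ha_port and router_still_on_host:
        # Standby HA port: remove only the unicast entries. The flooding
        # entry stays because the port may become master later.
        return list(port_entries)

    entries = list(port_entries)
    if last_port_on_host:
        # Assumption: with no ports left on the host, flooding traffic
        # to it is no longer needed either.
        entries.insert(0, FLOODING_ENTRY)
    return entries
```

For a standby HA port the result contains only the port's own MAC/IP pairs, which matches the "unicast entries only, no FLOODING_ENTRY" behaviour the commit describes.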


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1785582

Title:
  Connectivity to instance after L3 router migration from Legacy to HA
  fails

Status in neutron:
  Fix Released

Bug description:
  Scenario test neutron.tests.tempest.scenario.test_migration.NetworkMigrationFromLegacy.test_from_legacy_to_ha
  fails because there is no connectivity to the VM after migration.
  We observed it mostly on Pike, but I think the same issue may also exist in newer versions.

  Traceback (most recent call last):
    File "/usr/lib/python2.7/site-packages/neutron/tests/tempest/scenario/test_migration.py", line 68, in test_from_legacy_to_ha
      after_dvr=False, after_ha=True)
    File "/usr/lib/python2.7/site-packages/neutron/tests/tempest/scenario/test_migration.py", line 55, in _test_migration
      self._check_connectivity()
    File "/usr/lib/python2.7/site-packages/neutron/tests/tempest/scenario/test_dvr.py", line 29, in _check_connectivity
      self.keypair['private_key'])
    File "/usr/lib/python2.7/site-packages/neutron/tests/tempest/scenario/base.py", line 204, in check_connectivity
      ssh_client.test_connection_auth()
    File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 207, in test_connection_auth
      connection = self._get_ssh_connection()
    File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 121, in _get_ssh_connection
      password=self.password)
  tempest.lib.exceptions.SSHTimeout: Connection to the 10.0.0.224 via SSH timed out.
  User: cirros, Password: None

  
  From my investigation, it looks like this is caused by a race between two different operations on the router.

  1. The router is switched to admin_state down, so the port is also set to DOWN.
  2. neutron-server gets the info from the OVS agent that the port is down.
  3. Meanwhile, another thread changes the router from legacy to HA, so the owner of this port changes from DEVICE_OWNER_ROUTER_INTF to DEVICE_OWNER_HA_REPLICATED_INT. The router is also still "on" this host (it is now a backup node for the router), so in https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/l2pop/mech_driver.py#L258 l2pop decides not to send remove_fdb_entries for this MAC address on this port, and the old entries stay on the other nodes. Later, when this port comes up on a different host (the new master node), add_fdb_entries is not sent to the hosts either, because of https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/l2pop/mech_driver.py#L307 which was added in https://github.com/openstack/neutron/commit/26d8702b9d7cc5a4293b97bc435fa85983be9f01

  I ran this test with an added wait until the router's port is really DOWN before starting the migration to HA, and it then passed 151 times for me. So it clearly shows that this race is the issue here.
  I think it should be fixed in neutron's code rather than in the test, as this isn't a test-only issue.
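
  The "wait until the port is really DOWN" workaround used for that experiment could look something like the sketch below. It is only an illustration: `wait_until` and `wait_for_port_down` are hypothetical helpers, and `show_port` stands for whatever callable returns a port dict with a "status" key in the test environment.

```python
import time


def wait_until(predicate, timeout=60.0, interval=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() until it returns True or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def wait_for_port_down(show_port, port_id, **kwargs):
    """Wait until the port reports status DOWN.

    show_port is a hypothetical callable: port_id -> dict with "status".
    """
    return wait_until(lambda: show_port(port_id)["status"] == "DOWN",
                      **kwargs)
```

  A test would call `wait_for_port_down(...)` after setting the router to admin_state down and only then trigger the migration to HA. This hides the race in the test but does not fix it in neutron itself.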

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1785582/+subscriptions

