yahoo-eng-team team mailing list archive

[Bug 1519926] [NEW] L3-agent restart causes VM connectivity loss

 

Public bug reported:

L3-agent restart causes VM connectivity loss

I tested whether the L3-agent on a network node can recover after it
is stopped and then restarted. I ran this test on a devstack setup
using the latest neutron code on the master branch.  The L3-agent is
running in legacy mode.

1. Create a network and a subnetwork.
2. Create a router, and attach it to the subnetwork and the external network.
3. Create a VM on the network and assign a floating IP to it.  The VM can be pinged and reached over ssh using the floating IP.
4. On the controller node, kill the L3 agent.
5. Delete the qrouter namespace of the router created in (2) on the controller node.
6. Start up the L3-agent again.
7. Now the VM can no longer be reached over ssh using the floating IP (FIP).
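
For step 5 and the later inspection, the namespace name is derived from the router ID as qrouter-<router-id>. Below is a minimal sketch that only builds the `ip netns` command lines without running them. The function names are hypothetical and chosen for illustration. Steps 4 and 6 are left out because how the agent is stopped and started depends on the devstack setup.

```python
def qrouter_namespace(router_id):
    """Return the namespace name used for a router, e.g. qrouter-<router-id>."""
    return "qrouter-" + router_id


def delete_namespace_cmd(router_id):
    """Build (but do not run) the command for step 5: delete the namespace."""
    return ["sudo", "ip", "netns", "delete", qrouter_namespace(router_id)]


def inspect_namespace_cmd(router_id):
    """Build (but do not run) the command that lists addresses in the namespace."""
    return ["sudo", "ip", "netns", "exec", qrouter_namespace(router_id), "ip", "a"]
```

The returned lists can be passed to subprocess.run() on the node that hosts the router, if running them from a script is preferred over typing them by hand.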


Connectivity to the VM is lost because the L3-agent failed to reconstruct all the interfaces in the qrouter namespace.  For example:

Before running steps 4-6, the qrouter namespace on the controller node looks like this (router-id=e86b277a-5f49-4fcb-8d85-241594db418e, VM's FIP=10.127.10.5):
stack@Ubuntu-38:~/DEVSTACK/demo$ sudo ip netns exec qrouter-e86b277a-5f49-4fcb-8d85-241594db418e ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
33: qr-50b99abf-a4: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether fa:16:3e:17:3e:b0 brd ff:ff:ff:ff:ff:ff
    inet 10.1.2.1/24 brd 10.1.2.255 scope global qr-50b99abf-a4
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe17:3eb0/64 scope link
       valid_lft forever preferred_lft forever
34: qg-3d1a888a-33: <BROADCAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether fa:16:3e:60:9a:43 brd ff:ff:ff:ff:ff:ff
    inet 10.127.10.4/24 brd 10.127.10.255 scope global qg-3d1a888a-33
       valid_lft forever preferred_lft forever
    inet 10.127.10.5/32 brd 10.127.10.5 scope global qg-3d1a888a-33
       valid_lft forever preferred_lft forever
    inet6 2001:db8::3/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe60:9a43/64 scope link
       valid_lft forever preferred_lft forever


After deleting the qrouter-e86b277a-5f49-4fcb-8d85-241594db418e namespace and then restarting the L3-agent on the controller node, the L3-agent did recreate the namespace; however, not all the interfaces and IP addresses were recreated:

stack@Ubuntu-38:~/DEVSTACK/demo$ sudo ip netns exec qrouter-e86b277a-5f49-4fcb-8d85-241594db418e ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

So the VM can't be reached over ssh because the required plumbing was
not fully re-created.
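
The comparison of the two `ip a` listings can also be done mechanically. Below is a minimal sketch, not part of neutron, that parses `ip a` output and reports what is missing from the namespace. The function names are hypothetical. It assumes that the router's internal and gateway interfaces carry the qr- and qg- prefixes seen in the listing above.

```python
import re

# Matches interface header lines such as "33: qr-50b99abf-a4: <BROADCAST,...>"
IFACE_RE = re.compile(r"^\s*\d+:\s+([^:@\s]+)[:@]")
# Matches IPv4 address lines such as "inet 10.127.10.5/32 brd ..."
INET_RE = re.compile(r"^\s*inet\s+(\d+\.\d+\.\d+\.\d+)/\d+")


def parse_ip_a(output):
    """Map each interface name to the IPv4 addresses listed under it."""
    interfaces = {}
    current = None
    for line in output.splitlines():
        match = IFACE_RE.match(line)
        if match:
            current = match.group(1)
            interfaces[current] = []
            continue
        match = INET_RE.match(line)
        if match and current is not None:
            interfaces[current].append(match.group(1))
    return interfaces


def missing_plumbing(output, floating_ip):
    """Return a list of problems found in the namespace listing."""
    interfaces = parse_ip_a(output)
    problems = []
    # Assumption: qr- is the router's internal port, qg- its gateway port.
    if not any(name.startswith("qr-") for name in interfaces):
        problems.append("no qr- interface")
    if not any(name.startswith("qg-") for name in interfaces):
        problems.append("no qg- interface")
    if not any(floating_ip in addrs for addrs in interfaces.values()):
        problems.append("floating IP %s not configured" % floating_ip)
    return problems
```

Feeding it the output of the `sudo ip netns exec qrouter-<router-id> ip a` command together with the VM's FIP should return an empty list for a healthy namespace, and one entry per missing item otherwise.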

When the L3 agent is running in dvr-snat mode on the controller and dvr
mode on the compute node, running steps 4-6 on the compute node also
leaves the VM unreachable over ssh.  In that case too, the qrouter
namespace is missing the needed interfaces.

Also, when I ran the same test using neutron based on stable/liberty or
stable/kilo, the VM could still be reached over ssh after running
steps 4-6.

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1519926

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1519926/+subscriptions