
yahoo-eng-team team mailing list archive

[Bug 1520517] [NEW] IPv6 networking badly broken by garp_master_* keepalived settings

 

Public bug reported:

http://git.openstack.org/cgit/openstack/neutron/commit/?id=5d38dc5 added
the "garp_master_repeat 5" and "garp_master_refresh 10" settings. This badly
breaks networking, to the point where it is completely unusable:

First of all, these settings cause Keepalived to constantly spam five
unsolicited neighbour advertisements every ten seconds, as shown in this
tcpdump (the active router is fe80::f816:3eff:feb8:3857):

ubuntu@test:~$ sudo tcpdump -i eth0 host fe80::f816:3eff:feb8:3857 
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
10:08:09.459090 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, router advertisement, length 56
10:08:12.566305 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, router advertisement, length 56
10:08:15.638044 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, router advertisement, length 56
10:08:18.039185 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, neighbor advertisement, tgt is fe80::f816:3eff:feb8:3857, length 32
10:08:18.039275 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, neighbor advertisement, tgt is fe80::f816:3eff:feb8:3857, length 32
10:08:18.039496 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, neighbor advertisement, tgt is fe80::f816:3eff:feb8:3857, length 32
10:08:18.039581 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, neighbor advertisement, tgt is fe80::f816:3eff:feb8:3857, length 32
10:08:18.039595 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, neighbor advertisement, tgt is fe80::f816:3eff:feb8:3857, length 32
10:08:21.952451 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, router advertisement, length 56
10:08:28.046863 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, neighbor advertisement, tgt is fe80::f816:3eff:feb8:3857, length 32
10:08:28.046944 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, neighbor advertisement, tgt is fe80::f816:3eff:feb8:3857, length 32
10:08:28.047045 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, neighbor advertisement, tgt is fe80::f816:3eff:feb8:3857, length 32
10:08:28.047986 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, neighbor advertisement, tgt is fe80::f816:3eff:feb8:3857, length 32
10:08:28.048033 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, neighbor advertisement, tgt is fe80::f816:3eff:feb8:3857, length 32
10:08:30.931114 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, router advertisement, length 56
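
As a quick sanity check of the "five NAs every ten seconds" pattern, the following Python sketch groups the neighbor advertisement lines from the tcpdump output into bursts. It then measures the spacing between those bursts.

The sketch makes these assumptions:
- It expects tcpdump's default one-line format, with an HH:MM:SS.ffffff timestamp first.
- `na_bursts` and its 1-second grouping threshold are hypothetical choices for illustration.

```python
from datetime import datetime


def na_bursts(lines, max_gap=1.0):
    """Group neighbor advertisement timestamps (seconds since midnight) into bursts.

    NAs less than max_gap seconds after the previous NA join the same burst.
    """
    bursts = []
    for line in lines:
        if "neighbor advertisement" not in line:
            continue  # skip router advertisements and anything else
        ts = datetime.strptime(line.split()[0], "%H:%M:%S.%f")
        t = ts.hour * 3600 + ts.minute * 60 + ts.second + ts.microsecond / 1e6
        if bursts and t - bursts[-1][-1] < max_gap:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


# Lines copied from the capture above.
capture = """\
10:08:15.638044 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, router advertisement, length 56
10:08:18.039185 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, neighbor advertisement, tgt is fe80::f816:3eff:feb8:3857, length 32
10:08:18.039275 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, neighbor advertisement, tgt is fe80::f816:3eff:feb8:3857, length 32
10:08:18.039496 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, neighbor advertisement, tgt is fe80::f816:3eff:feb8:3857, length 32
10:08:18.039581 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, neighbor advertisement, tgt is fe80::f816:3eff:feb8:3857, length 32
10:08:18.039595 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, neighbor advertisement, tgt is fe80::f816:3eff:feb8:3857, length 32
10:08:21.952451 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, router advertisement, length 56
10:08:28.046863 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, neighbor advertisement, tgt is fe80::f816:3eff:feb8:3857, length 32
10:08:28.046944 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, neighbor advertisement, tgt is fe80::f816:3eff:feb8:3857, length 32
10:08:28.047045 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, neighbor advertisement, tgt is fe80::f816:3eff:feb8:3857, length 32
10:08:28.047986 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, neighbor advertisement, tgt is fe80::f816:3eff:feb8:3857, length 32
10:08:28.048033 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, neighbor advertisement, tgt is fe80::f816:3eff:feb8:3857, length 32
10:08:30.931114 IP6 fe80::f816:3eff:feb8:3857 > ip6-allnodes: ICMP6, router advertisement, length 56
"""

bursts = na_bursts(capture.splitlines())
print([len(b) for b in bursts], round(bursts[1][0] - bursts[0][0], 3))  # [5, 5] 10.008
```

The result is two bursts of five NAs, about ten seconds apart. This matches "garp_master_repeat 5" and "garp_master_refresh 10".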

That's not a problem in itself. The problem, however, is that these
neighbour advertisements cause the instance to lose its default route,
which stays gone until the next router advertisement packet arrives:

ubuntu@test:~$ sudo ip -6 monitor route | ts
Nov 27 10:08:09 default via fe80::f816:3eff:feb8:3857 dev eth0  proto ra  metric 1024 
Nov 27 10:08:18 Deleted default via fe80::f816:3eff:feb8:3857 dev eth0  proto ra  metric 1024  expires 27sec hoplimit 64
Nov 27 10:08:21 default via fe80::f816:3eff:feb8:3857 dev eth0  proto ra  metric 1024 
Nov 27 10:08:28 Deleted default via fe80::f816:3eff:feb8:3857 dev eth0  proto ra  metric 1024  expires 23sec hoplimit 64
Nov 27 10:08:30 default via fe80::f816:3eff:feb8:3857 dev eth0  proto ra  metric 1024 
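
The length of each outage can be read straight off this log. Below is a minimal Python sketch of that.

It assumes:
- The `ts` prefix format shown above (month, day, HH:MM:SS).
- A capture that does not cross midnight.

`default_route_gaps` is a hypothetical helper name.

```python
from datetime import datetime


def default_route_gaps(lines):
    """Return (deleted_at, restored_at, seconds) for each gap in the default route.

    Expects `ip -6 monitor route | ts` lines such as
    "Nov 27 10:08:18 Deleted default via ..." and assumes no midnight rollover.
    """
    gaps, deleted = [], None
    for line in lines:
        if not line.strip():
            continue
        _month, _day, clock, event = line.split(None, 3)
        when = datetime.strptime(clock, "%H:%M:%S")
        if event.startswith("Deleted default"):
            deleted = (clock, when)
        elif event.startswith("default") and deleted is not None:
            gaps.append((deleted[0], clock, (when - deleted[1]).total_seconds()))
            deleted = None
    return gaps


monitor = """\
Nov 27 10:08:09 default via fe80::f816:3eff:feb8:3857 dev eth0  proto ra  metric 1024
Nov 27 10:08:18 Deleted default via fe80::f816:3eff:feb8:3857 dev eth0  proto ra  metric 1024  expires 27sec hoplimit 64
Nov 27 10:08:21 default via fe80::f816:3eff:feb8:3857 dev eth0  proto ra  metric 1024
Nov 27 10:08:28 Deleted default via fe80::f816:3eff:feb8:3857 dev eth0  proto ra  metric 1024  expires 23sec hoplimit 64
Nov 27 10:08:30 default via fe80::f816:3eff:feb8:3857 dev eth0  proto ra  metric 1024
"""
print(default_route_gaps(monitor.splitlines()))
# [('10:08:18', '10:08:21', 3.0), ('10:08:28', '10:08:30', 2.0)]
```

Since `ts` only prints whole seconds, these durations are accurate to about one second.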

These periods without a default route obviously break network
connectivity for the node badly, which is easily observable with a simple
ping test:

ubuntu@test:~$ ping6 2a02:c0::1 2>&1 | ts
Nov 27 10:08:15 PING 2a02:c0::1(2a02:c0::1) 56 data bytes
Nov 27 10:08:15 64 bytes from 2a02:c0::1: icmp_seq=1 ttl=62 time=0.570 ms
Nov 27 10:08:16 64 bytes from 2a02:c0::1: icmp_seq=2 ttl=62 time=0.873 ms
Nov 27 10:08:17 64 bytes from 2a02:c0::1: icmp_seq=3 ttl=62 time=0.666 ms
Nov 27 10:08:18 ping: sendmsg: Network is unreachable
Nov 27 10:08:19 ping: sendmsg: Network is unreachable
Nov 27 10:08:20 ping: sendmsg: Network is unreachable
Nov 27 10:08:21 ping: sendmsg: Network is unreachable
Nov 27 10:08:22 64 bytes from 2a02:c0::1: icmp_seq=8 ttl=62 time=1.42 ms
Nov 27 10:08:23 64 bytes from 2a02:c0::1: icmp_seq=9 ttl=62 time=0.785 ms
Nov 27 10:08:24 64 bytes from 2a02:c0::1: icmp_seq=10 ttl=62 time=0.712 ms
Nov 27 10:08:25 64 bytes from 2a02:c0::1: icmp_seq=11 ttl=62 time=0.724 ms
Nov 27 10:08:26 64 bytes from 2a02:c0::1: icmp_seq=12 ttl=62 time=0.864 ms
Nov 27 10:08:27 64 bytes from 2a02:c0::1: icmp_seq=13 ttl=62 time=0.652 ms
Nov 27 10:08:28 ping: sendmsg: Network is unreachable
Nov 27 10:08:29 ping: sendmsg: Network is unreachable
Nov 27 10:08:30 ping: sendmsg: Network is unreachable
Nov 27 10:08:31 64 bytes from 2a02:c0::1: icmp_seq=17 ttl=62 time=1.50 ms
Nov 27 10:08:32 64 bytes from 2a02:c0::1: icmp_seq=18 ttl=62 time=0.683 ms
Nov 27 10:08:33 64 bytes from 2a02:c0::1: icmp_seq=19 ttl=62 time=0.677 ms
Nov 27 10:08:34 64 bytes from 2a02:c0::1: icmp_seq=20 ttl=62 time=0.729 ms
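
The icmp_seq values jump from 3 to 8 and from 13 to 17. Judging by those jumps, every failed sendmsg still consumed a sequence number, so the lost probes can be recovered from the log alone.

A short Python sketch follows. `lost_probes` is a hypothetical helper name.

```python
import re

SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def lost_probes(lines):
    """Return (runs of missing icmp_seq values as (first, last), count of unreachable errors)."""
    seqs, unreachable = [], 0
    for line in lines:
        match = SEQ_RE.search(line)
        if match:
            seqs.append(int(match.group(1)))
        elif "Network is unreachable" in line:
            unreachable += 1
    runs = [(prev + 1, cur - 1) for prev, cur in zip(seqs, seqs[1:]) if cur - prev > 1]
    return runs, unreachable


ping_log = """\
Nov 27 10:08:15 PING 2a02:c0::1(2a02:c0::1) 56 data bytes
Nov 27 10:08:15 64 bytes from 2a02:c0::1: icmp_seq=1 ttl=62 time=0.570 ms
Nov 27 10:08:16 64 bytes from 2a02:c0::1: icmp_seq=2 ttl=62 time=0.873 ms
Nov 27 10:08:17 64 bytes from 2a02:c0::1: icmp_seq=3 ttl=62 time=0.666 ms
Nov 27 10:08:18 ping: sendmsg: Network is unreachable
Nov 27 10:08:19 ping: sendmsg: Network is unreachable
Nov 27 10:08:20 ping: sendmsg: Network is unreachable
Nov 27 10:08:21 ping: sendmsg: Network is unreachable
Nov 27 10:08:22 64 bytes from 2a02:c0::1: icmp_seq=8 ttl=62 time=1.42 ms
Nov 27 10:08:23 64 bytes from 2a02:c0::1: icmp_seq=9 ttl=62 time=0.785 ms
Nov 27 10:08:24 64 bytes from 2a02:c0::1: icmp_seq=10 ttl=62 time=0.712 ms
Nov 27 10:08:25 64 bytes from 2a02:c0::1: icmp_seq=11 ttl=62 time=0.724 ms
Nov 27 10:08:26 64 bytes from 2a02:c0::1: icmp_seq=12 ttl=62 time=0.864 ms
Nov 27 10:08:27 64 bytes from 2a02:c0::1: icmp_seq=13 ttl=62 time=0.652 ms
Nov 27 10:08:28 ping: sendmsg: Network is unreachable
Nov 27 10:08:29 ping: sendmsg: Network is unreachable
Nov 27 10:08:30 ping: sendmsg: Network is unreachable
Nov 27 10:08:31 64 bytes from 2a02:c0::1: icmp_seq=17 ttl=62 time=1.50 ms
Nov 27 10:08:32 64 bytes from 2a02:c0::1: icmp_seq=18 ttl=62 time=0.683 ms
Nov 27 10:08:33 64 bytes from 2a02:c0::1: icmp_seq=19 ttl=62 time=0.677 ms
Nov 27 10:08:34 64 bytes from 2a02:c0::1: icmp_seq=20 ttl=62 time=0.729 ms
"""

runs, unreachable = lost_probes(ping_log.splitlines())
print(runs, unreachable)  # [(4, 7), (14, 16)] 7
```

The seven missing sequence numbers match the seven "Network is unreachable" lines one for one.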

Removing the garp_master_* settings from keepalived.conf and HUP-ing it
solves the problem and makes the network usable again.
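
The workaround can be scripted. Below is a minimal Python sketch that assumes you already know two things:
- the path of the router's keepalived.conf;
- the PID of its keepalived process.

Both are deployment-specific and are left as parameters here. The helper names and the sample configuration fragment are illustrative only.

```python
import os
import re
import signal

# One whole line holding any garp_master_* option (garp_master_repeat,
# garp_master_refresh, ...), including its indentation and trailing newline.
GARP_MASTER_LINE = re.compile(r"^[ \t]*garp_master_\w+[^\n]*\n?", re.MULTILINE)


def strip_garp_master(conf_text):
    """Return keepalived.conf text with every garp_master_* line removed."""
    return GARP_MASTER_LINE.sub("", conf_text)


def apply_workaround(conf_path, keepalived_pid):
    """Rewrite conf_path without garp_master_* lines, then HUP keepalived.

    conf_path and keepalived_pid are hypothetical, deployment-specific inputs.
    """
    with open(conf_path) as f:
        text = f.read()
    with open(conf_path, "w") as f:
        f.write(strip_garp_master(text))
    os.kill(keepalived_pid, signal.SIGHUP)  # SIGHUP makes keepalived re-read its config


# Illustrative fragment only, not a complete Neutron-generated config.
example = "vrrp_instance VR_1 {\n    state BACKUP\n    garp_master_repeat 5\n    garp_master_refresh 10\n}\n"
print(strip_garp_master(example), end="")
```

Only the text transformation runs here; `apply_workaround` touches a real file and a real process, so it is left for the reader to call deliberately.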

So the question you are probably asking yourself at this point is
"why do the neighbour advertisement packets cause the default route to
be removed?" I am 99.9% certain it is because the NAs have the "router"
flag set to 0. (I have not verified this conclusively, because I do not
know how to configure Keepalived to set the "router" flag, if that is
even possible.)

ubuntu@test:~$ sudo tshark -i eth0 -f 'icmp6 and ip6[40] == 136 and src host fe80::f816:3eff:feb8:3857' -c 1 -V
tshark: Lua: Error during loading:
 [string "/usr/share/wireshark/init.lua"]:46: dofile has been disabled due to running Wireshark as superuser. See http://wiki.wireshark.org/CaptureSetup/CapturePrivileges for help in running Wireshark as an unprivileged user.
Running as user "root" and group "root". This could be dangerous.
Capturing on 'eth0'
Frame 1: 86 bytes on wire (688 bits), 86 bytes captured (688 bits) on interface 0
    Interface id: 0
    Encapsulation type: Ethernet (1)
    Arrival Time: Nov 27, 2015 10:18:48.182563000 UTC
    [Time shift for this packet: 0.000000000 seconds]
    Epoch Time: 1448619528.182563000 seconds
    [Time delta from previous captured frame: 0.000000000 seconds]
    [Time delta from previous displayed frame: 0.000000000 seconds]
    [Time since reference or first frame: 0.000000000 seconds]
    Frame Number: 1
    Frame Length: 86 bytes (688 bits)
    Capture Length: 86 bytes (688 bits)
    [Frame is marked: False]
    [Frame is ignored: False]
    [Protocols in frame: eth:ipv6:icmpv6]
Ethernet II, Src: fa:16:3e:b8:38:57 (fa:16:3e:b8:38:57), Dst: IPv6mcast_00:00:00:01 (33:33:00:00:00:01)
    Destination: IPv6mcast_00:00:00:01 (33:33:00:00:00:01)
        Address: IPv6mcast_00:00:00:01 (33:33:00:00:00:01)
        .... ..1. .... .... .... .... = LG bit: Locally administered address (this is NOT the factory default)
        .... ...1 .... .... .... .... = IG bit: Group address (multicast/broadcast)
    Source: fa:16:3e:b8:38:57 (fa:16:3e:b8:38:57)
        Address: fa:16:3e:b8:38:57 (fa:16:3e:b8:38:57)
        .... ..1. .... .... .... .... = LG bit: Locally administered address (this is NOT the factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Type: IPv6 (0x86dd)
Internet Protocol Version 6, Src: fe80::f816:3eff:feb8:3857 (fe80::f816:3eff:feb8:3857), Dst: ff02::1 (ff02::1)
    0110 .... = Version: 6
        [0110 .... = This field makes the filter "ip.version == 6" possible: 6]
    .... 0000 0000 .... .... .... .... .... = Traffic class: 0x00000000
        .... 0000 00.. .... .... .... .... .... = Differentiated Services Field: Default (0x00000000)
        .... .... ..0. .... .... .... .... .... = ECN-Capable Transport (ECT): Not set
        .... .... ...0 .... .... .... .... .... = ECN-CE: Not set
    .... .... .... 0000 0000 0000 0000 0000 = Flowlabel: 0x00000000
    Payload length: 32
    Next header: ICMPv6 (58)
    Hop limit: 255
    Source: fe80::f816:3eff:feb8:3857 (fe80::f816:3eff:feb8:3857)
    Destination: ff02::1 (ff02::1)
    [Source GeoIP: Unknown]
    [Destination GeoIP: Unknown]
Internet Control Message Protocol v6
    Type: Neighbor Advertisement (136)
    Code: 0
    Checksum: 0x0c2b [correct]
    Flags: 0x20000000
        0... .... .... .... .... .... .... .... = Router: Not set     <--------------- HERE
        .0.. .... .... .... .... .... .... .... = Solicited: Not set
        ..1. .... .... .... .... .... .... .... = Override: Set
        ...0 0000 0000 0000 0000 0000 0000 0000 = Reserved: 0
    Target Address: fe80::f816:3eff:feb8:3857 (fe80::f816:3eff:feb8:3857)
    ICMPv6 Option (Target link-layer address : fa:16:3e:b8:38:57)
        Type: Target link-layer address (2)
        Length: 1 (8 bytes)
        Link-layer address: fa:16:3e:b8:38:57 (fa:16:3e:b8:38:57)
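
The hypothesis above lines up with RFC 4861 in two places:
- Section 4.4 places the Router flag in the top bit of the NA flags word.
- Section 7.2.5 says that when an NA changes a neighbor's IsRouter flag from TRUE to FALSE, the node MUST remove that router from its Default Router List.

The following Python sketch does three things:
1. It rebuilds the 32-byte NA payload from the tshark decode above.
2. It decodes the payload's flags and option.
3. It replays the RA/NA timing from the earlier tcpdump through a toy model of that rule.

This is only a sketch. The model is a simplification of the RFC, not the Linux kernel's code, and the function names are hypothetical.

```python
import ipaddress
import struct

NA_TYPE = 136
FLAG_ROUTER = 0x80000000     # R bit
FLAG_SOLICITED = 0x40000000  # S bit
FLAG_OVERRIDE = 0x20000000   # O bit
OPT_TARGET_LLADDR = 2


def parse_na(payload):
    """Decode an ICMPv6 Neighbor Advertisement: fixed 24-byte part plus options."""
    icmp_type, _code, _checksum, flags, target = struct.unpack("!BBHI16s", payload[:24])
    if icmp_type != NA_TYPE:
        raise ValueError(f"not a Neighbor Advertisement (type {icmp_type})")
    options, offset = {}, 24
    while offset + 2 <= len(payload):
        opt_type, opt_len = payload[offset], payload[offset + 1]  # length in 8-byte units
        if opt_len == 0:
            raise ValueError("malformed option with length 0")
        options[opt_type] = payload[offset + 2 : offset + 8 * opt_len]
        offset += 8 * opt_len
    lladdr = options.get(OPT_TARGET_LLADDR)
    return {
        "router": bool(flags & FLAG_ROUTER),
        "solicited": bool(flags & FLAG_SOLICITED),
        "override": bool(flags & FLAG_OVERRIDE),
        "target": str(ipaddress.IPv6Address(target)),
        "target_lladdr": ":".join(f"{b:02x}" for b in lladdr) if lladdr else None,
    }


def replay(events):
    """Toy model of the RFC 4861 rules for a single router (not kernel code).

    events: (timestamp, kind, router_flag) tuples, kind being "RA" or "NA".
    Returns (timestamp, has_default_route) after each event.
    """
    is_router, has_default, timeline = False, False, []
    for when, kind, router_flag in events:
        if kind == "RA":
            # An RA marks the sender as a router and (re)installs it as default router.
            is_router, has_default = True, True
        else:
            # Section 7.2.5: IsRouter flipping TRUE -> FALSE removes the default router.
            if is_router and not router_flag:
                has_default = False
            is_router = router_flag
        timeline.append((when, has_default))
    return timeline


# NA payload rebuilt from the tshark decode above (checksum copied, not verified).
payload = struct.pack(
    "!BBHI16sBB6s", NA_TYPE, 0, 0x0C2B, 0x20000000,
    ipaddress.IPv6Address("fe80::f816:3eff:feb8:3857").packed,
    OPT_TARGET_LLADDR, 1, bytes.fromhex("fa163eb83857"),
)
na = parse_na(payload)
print(na["router"], na["solicited"], na["override"])  # False False True

# First RA, then the first NA of each burst and the following RA, from the tcpdump.
events = [
    ("10:08:09.459090", "RA", None),
    ("10:08:18.039185", "NA", na["router"]),
    ("10:08:21.952451", "RA", None),
    ("10:08:28.046863", "NA", na["router"]),
    ("10:08:30.931114", "RA", None),
]
for when, has_default in replay(events):
    print(when, "default route" if has_default else "NO default route")
```

In this model the default route disappears at 10:08:18 and 10:08:28 and returns with the next RA. This is the same pattern that `ip -6 monitor route` recorded.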

Tore

** Affects: neutron
     Importance: Undecided
         Status: New


** Tags: ipv6 l3-ha

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1520517

Title:
  IPv6 networking badly broken by garp_master_* keepalived settings

Status in neutron:
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1520517/+subscriptions

