
yahoo-eng-team team mailing list archive

[Bug 1775797] Re: The mac table size of neutron bridges (br-tun, br-int, br-*) is too small by default and eventually makes openvswitch explode

 

Reviewed:  https://review.openstack.org/573696
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1f8378e0ac4b8c3fc4670144e6efc51940d796ad
Submitter: Zuul
Branch:    master

commit 1f8378e0ac4b8c3fc4670144e6efc51940d796ad
Author: Slawek Kaplonski <skaplons@xxxxxxxxxx>
Date:   Fri Jun 8 15:37:39 2018 +0200

    [OVS] Add mac-table-size to be set on each ovs bridge
    
    By default, the number of MAC addresses that OVS stores in memory
    is quite low: 2048.
    
    Any eviction of a MAC learning table entry triggers revalidation.
    Such revalidation is very costly, so it causes high CPU usage by
    the ovs-vswitchd process.
    
    To work around this problem, a higher value of the mac-table-size
    option can be set on the bridge. Revalidation then happens
    less often and CPU usage is lower.
    This patch adds a config option for neutron-openvswitch-agent that
    allows users to tune this setting on bridges managed by the agent.
    By default this value is set to 50000, which should be enough for
    most systems.
    
    Change-Id: If628f52d75c2b5fec87ad61e0219b3286423468c
    Closes-Bug: #1775797


** Changed in: neutron
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1775797

Title:
  The mac table size of neutron bridges (br-tun, br-int, br-*) is too
  small by default and eventually makes openvswitch explode

Status in neutron:
  Fix Released

Bug description:
  Description of problem:

  The CPU utilization of ovs-vswitchd is high, without DPDK enabled:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  1512 root      10 -10 4352840 793864  12008 R  1101  0.3  15810:26 ovs-vswitchd

  At the same time, we observed failures to send packets (ICMP) over
  the VXLAN tunnel; we think this might be related to the high CPU usage.

  
  --- Reproducer and analysis on ovs side done by Jiri Benc:

  Reproducer:

  Create an ovs bridge:

  ------
  ovs-vsctl add-br ovs0
  ip l s ovs0 up
  ------

  Save this to a file named "reproducer.py":

  ------
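  #!/usr/bin/python
  # Floods the "ovs0" bridge with packets from N random source MACs.
  import sys

  from scapy.all import *

  # Pre-generate N random (MAC, IP) pairs; N is the first CLI argument.
  data = [(str(RandMAC()), str(RandIP())) for i in range(int(sys.argv[1]))]

  s = conf.L2socket(iface="ovs0")
  # Resend the same packets forever so the MAC table keeps overflowing.
  while True:
      for mac, ip in data:
          p = Ether(src=mac, dst=mac)/IP(src=ip, dst=ip)
          s.send(p)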
  ------

  Run the reproducer:

  ./reproducer.py 5000

  
  ----
  The problem is how flow revalidation works in ovs. There are several
  'revalidator' threads launched. They should normally sleep (modulo
  waking every 0.5 second just to do nothing) and they wake if anything
  of interest happens (udpif_revalidator => poll_block). On every wake
  up, each revalidator thread checks whether flow revalidation is needed
  and if it is, it does the revalidation.

  The revalidation is very costly with a high number of flows. I also
  suspect there's a lot of contention between the revalidator threads.

  The flow revalidation is triggered by many things. What is of interest
  for us is that any eviction of a MAC learning table entry triggers
  revalidation.

  The reproducer script repeatedly sends the same 5000 packets, each
  with a different MAC address. This causes constant overflows of
  the MAC learning table and constant revalidation. The revalidator
  threads are immediately woken up and busy-loop on the
  revalidation.
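
  The overflow described above can be illustrated with a small
  simulation. This is only a sketch: it assumes a table that evicts its
  oldest entry when full, which may not match the eviction policy ovs
  actually uses, and count_evictions is a hypothetical helper, not part
  of ovs or neutron. Integers stand in for MAC addresses.

```python
from collections import OrderedDict


def count_evictions(table_size, num_macs, rounds):
    """Replay `num_macs` distinct source MACs `rounds` times through a
    bounded learning table and return how many entries were evicted.

    Assumption: the oldest entry is evicted when the table is full.
    """
    table = OrderedDict()  # MAC -> None, oldest entry first
    evictions = 0
    for _ in range(rounds):
        for mac in range(num_macs):  # integers stand in for MAC addresses
            if mac in table:
                table.move_to_end(mac)  # refresh an already learned entry
                continue
            if len(table) >= table_size:
                table.popitem(last=False)  # table full: evict the oldest
                evictions += 1
            table[mac] = None
    return evictions


if __name__ == "__main__":
    # Compare the default table size with the one suggested in this report.
    for size in (2048, 50000):
        print(size, count_evictions(size, num_macs=5000, rounds=3))
```

  Under this assumption, a 2048-entry table keeps evicting on almost
  every packet once it is full, while a 50000-entry table never evicts
  for 5000 addresses. Each eviction would correspond to one revalidation
  trigger.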

  This is exactly the pattern from the customers' data: there are
  16000+ flows and the packet capture shows that the packets are
  repeating every second.

  A quick fix is to increase the MAC learning table size:

  ovs-vsctl set bridge <bridge> other-config:mac-table-size=50000

  This should lower the CPU usage substantially; allow a few
  seconds for things to settle down.
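
  For hosts with several bridges, the command above can be generated per
  bridge. The sketch below only builds and prints the commands (a dry
  run). build_set_command is a hypothetical helper, and the bridge
  names br-int and br-tun are taken from the bug title; adjust them to
  the bridges present on the host.

```python
MAC_TABLE_SIZE = 50000  # value suggested in this report


def build_set_command(bridge, size=MAC_TABLE_SIZE):
    """Return the ovs-vsctl argument list that raises the MAC learning
    table size on one bridge."""
    return [
        "ovs-vsctl", "set", "bridge", bridge,
        "other-config:mac-table-size=%d" % size,
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for bridge in ("br-int", "br-tun"):
        print(" ".join(build_set_command(bridge)))
```

  To apply the change rather than print it, each argument list could be
  passed to subprocess.run(cmd, check=True) by a user allowed to run
  ovs-vsctl.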

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1775797/+subscriptions

