
yahoo-eng-team team mailing list archive

[Bug 1467581] [NEW] Concurrent interface attachment corrupts info cache
Public bug reported:

Concurrently attaching multiple network interfaces to a single instance
can often result in corruption of the instance's information cache in
Nova. The result is that some network interfaces may be missing from
'nova list', and silently fail to detach when 'nova interface-detach' is
run. The ports are listed in 'nova interface-list', however, and can be
seen in 'neutron port-list'.

Initially seen on CentOS7 running Juno. Reproduced on Ubuntu 14.04
running devstack (master branch).

This issue is similar (possibly identical) to bug 1326183, and the steps
to reproduce it are also similar.

1) Devstack with trunk with the following local.conf:
disable_service n-net
enable_service q-svc
enable_service q-agt
enable_service q-dhcp
enable_service q-meta
RECLONE=yes
# and other options as set in the trunk's local

2) Create a few networks:
$> neutron net-create testnet1
$> neutron net-create testnet2
$> neutron net-create testnet3
$> neutron subnet-create testnet1 192.168.1.0/24
$> neutron subnet-create testnet2 192.168.2.0/24
$> neutron subnet-create testnet3 192.168.3.0/24

3) Create a testvm in testnet1:
$> nova boot --flavor m1.tiny --image cirros-0.3.4-x86_64-uec --nic net-id=`neutron net-list | grep testnet1 | cut -f 2 -d ' '` testvm

4) Run the following shell script, which attaches and detaches interfaces for this VM on the remaining two networks in a loop until we run into the issue at hand:
---------
#!/bin/bash
c=10000
netid1=`neutron net-list | grep testnet2 | cut -f 2 -d ' '`
netid2=`neutron net-list | grep testnet3 | cut -f 2 -d ' '`
while [ $c -gt 0 ]
do
   echo "Round: " $c
   echo -n "Attaching two interfaces concurrently... "
   nova interface-attach --net-id $netid1 testvm &
   nova interface-attach --net-id $netid2 testvm &
   wait
   echo "Done"
   echo "Sleeping until both those show up in nova show"
   waittime=0
   while [ $waittime -lt 60 ]
   do
       count=`nova show testvm | grep testnet | wc -l`
       if [ $count -eq 3 ]
       then
           break
       fi
       sleep 2
       (( waittime+=2 ))
   done
   echo "Waited for " $waittime " seconds"
   if [ $waittime -ge 60 ]
   then
      echo "bad case"
      exit 1
   fi
   echo "Detaching both... "
   nova interface-list testvm | grep $netid1 | awk '{print "deleting ",$4; system("nova interface-detach testvm "$4 " ; sleep 2");}'
   nova interface-list testvm | grep $netid2 | awk '{print "deleting ",$4; system("nova interface-detach testvm "$4 " ; sleep 2");}'
   echo "Done; check interfaces are gone in a minute."
   waittime=0
   while [ $waittime -lt 60 ]
   do
       count=`nova interface-list testvm | wc -l`
       echo "line count: " $count
       if [ $count -eq 5 ]
       then
           break
       fi
       sleep 2
       (( waittime+=2 ))
   done
   if [ $waittime -ge 60 ]
   then
      echo "failed to detach interfaces - raise another bug!"
      exit 1
   fi
   echo "Interfaces are gone"
   (( c-- ))
done
---------

Eventually the test stops with a failure ("bad case"), and the remaining
interface, from either testnet2 or testnet3, cannot be detached at all.

For me, eventually is every time.

Based on my analysis of the source code, the concurrent requests corrupt
the instance's network info cache. Each request takes a copy of the info
cache at the start of processing, and that copy contains only the
initial network. Each request thread then allocates a network port, adds
it to its copy of the network info, and saves that object back to the
DB. In each case the saved info contains the initial network plus only
the network added by that thread, so the last thread to save wins and
the other network is lost.
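The lost-update pattern described above can be sketched in plain Python. This is a toy illustration of the race, with made-up names, not Nova's actual code:

```python
# Simulated instance info cache: starts with only the boot network.
db_cache = ["testnet1"]

# Both concurrent attach requests snapshot the cache BEFORE either saves:
snap_a = list(db_cache)   # request attaching testnet2
snap_b = list(db_cache)   # request attaching testnet3

# Each request adds only its own newly allocated network to its copy...
snap_a.append("testnet2")
snap_b.append("testnet3")

# ...then saves the whole object back to the DB. The last save wins:
db_cache = snap_a
db_cache = snap_b

print(db_cache)  # ['testnet1', 'testnet3'] -- testnet2 has been lost
```

The port for testnet2 still exists in Neutron, which is why it shows up in 'neutron port-list' even though the cache no longer knows about it.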

I have a patch that appears to fix the issue, by refreshing the info
cache whilst holding the refresh-cache-<id> lock. However, I'm not
intimately familiar with the nova networking code so would appreciate
more experienced eyes on it. I will submit the change to gerrit for
analysis and comments.
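The refresh-then-merge idea behind the patch can be sketched the same way. Again a toy with hypothetical names (the class and its lock merely play the role of the per-instance refresh-cache-<id> lock), not the actual change:

```python
import threading

class InfoCache:
    """Toy stand-in for a per-instance network info cache."""

    def __init__(self, networks):
        self._networks = list(networks)
        self._lock = threading.Lock()  # plays the role of refresh-cache-<id>

    def attach(self, network):
        # Refresh the cache while holding the lock, then merge and save,
        # so a concurrent attach can no longer be silently overwritten.
        with self._lock:
            current = list(self._networks)  # fresh read, not a stale snapshot
            current.append(network)
            self._networks = current        # "save back to the DB"

    def networks(self):
        with self._lock:
            return list(self._networks)

cache = InfoCache(["testnet1"])
threads = [threading.Thread(target=cache.attach, args=(n,))
           for n in ("testnet2", "testnet3")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(cache.networks()))  # all three networks survive
```

Because each request re-reads the cache under the lock before merging in its own port, neither save can clobber the other.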

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1467581
