← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1751923] Re: _heal_instance_info_cache periodic task bases on port list from nova db, not from neutron server

 

Reviewed:  https://review.openstack.org/591607
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ba44c155ce1dcefede9741722a0525820d6da2b8
Submitter: Zuul
Branch:    master

commit ba44c155ce1dcefede9741722a0525820d6da2b8
Author: Matt Riedemann <mriedem.os@xxxxxxxxx>
Date:   Tue Aug 14 17:57:53 2018 +0800

    Force refresh instance info_cache during heal
    
    If the instance info_cache is corrupted somehow, like during
    a host reboot and the ports aren't wired up properly or
    a mistaken policy change in neutron results in nova resetting
    the info_cache to an empty list, the _heal_instance_info_cache
    is meant to fix it (once the current state of the ports for
    the instance in neutron is corrected). However, the task is
    currently only refreshing the cache *based* on the current contents
    of the cache, which defeats the purpose of neutron being the source
    of truth for the ports attached to the instance.
    
    This change makes the _heal_instance_info_cache periodic task
    pass a "force_refresh" kwarg, which defaults to False for backward
    compatibility with other methods that refresh the cache after
    operations like attach/detach interface, and if True will make
    nova get the current state of the ports for the instance from neutron
    and fully rebuild the info_cache.
    
    To not lose port order in info_cache this change takes original order
    from nova historical data that are stored as VirtualInterfaceList
    objects. For ports that are not registered as VirtualInterfaces
    objects it will add them at the end of port_order list. Due to this
    for instances older than Newton another patch was introduced to fill
    missing VirtualInterface objects in the DB [1].
    
    Long-term we should be able to refactor some of the older refresh
    code which leverages the cache to instead use the refresh_vif_id
    kwarg so that we do targeted cache updates when we do things like
    attach and detach ports, but that's a change for another day.
    
    [1] https://review.openstack.org/#/c/614167
    
    Co-Authored-By: Maciej Jozefczyk <maciej.jozefczyk@xxxxxxxxxxxx>
    Change-Id: I629415236b2447128ae9a980d4ebe730a082c461
    Closes-Bug: #1751923


** Changed in: nova
       Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1751923

Title:
  _heal_instance_info_cache periodic task bases on port list from nova
  db, not from neutron server

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  Description
  ===========

  During periodic task _heal_instance_info_cache the
  instance_info_caches are not updated using instance port_ids taken
  from neutron, but from nova db.

  Sometimes, perhaps because of some race-condition, its possible to
  lose some ports from instance_info_caches. Periodic task
  _heal_instance_info_cache should clean this up (add missing records),
  but in fact it's not working this way.

  How it looks now?
  =================

  _heal_instance_info_cache during crontask:

  https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/compute/manager.py#L6525

  is using network_api to get instance_nw_info (instance_info_caches):

              try:
                  # Call to network API to get instance info.. this will
                  # force an update to the instance's info_cache
                  self.network_api.get_instance_nw_info(context, instance)

  self.network_api.get_instance_nw_info() is listed below:

  https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1377

  and it uses _build_network_info_model() without networks and port_ids
  parameters (because we're not adding any new interface to instance):

  https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2356

  Next: _gather_port_ids_and_networks() generates the list of instance
  networks and port_ids:

        networks, port_ids = self._gather_port_ids_and_networks(
                  context, instance, networks, port_ids, client)

  https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L2389-L2390

  https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/network/neutronv2/api.py#L1393

  As we see that _gather_port_ids_and_networks() takes the port list
  from DB:

  https://github.com/openstack/nova/blob/ef4000a0d326deb004843ee51d18030224c5630f/nova/objects/instance.py#L1173-L1176

  And thats it. When we lose a port its not possible to add it again with this periodic task.
  The only way is to clean device_id field in neutron port object and re-attach the interface using `nova interface-attach`.

  When the interface is missing and there is no port configured on
  compute host (for example after compute reboot) - interface is not
  added to instance and from neutron point of view port state is DOWN.

  When the interface is missing in cache and we reboot hard the instance
  - its not added as tapinterface in xml file = we don't have the
  network on host.

  Steps to reproduce
  ==================
  1. Spawn devstack
  2. Spawn VM inside devstack with multiple ports (for example also from 2 different networks)
  3. Update the DB row, drop one interface from interfaces_list
  4. Hard-Reboot the instance
  5. See that nova list shows instance without one address, but nova interface-list shows all addresses
  6. See that one port is missing in instance xml files
  7. In theory the _heal_instance_info_cache should fix this things, it relies on memory, not on the fresh list of instance ports taken from neutron.

  Reproduced Example
  ==================
  1. Spawn VM with 1 private network port
  nova boot --flavor m1.small --image cirros-0.3.5-x86_64-disk --nic net-name=private  test-2
  2. Attach ports to have 2 private and 2 public interfaces
  nova list:
  | a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | -          | Running     | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fee8:3333, 10.0.0.3, fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |

  So we see 4 ports:
  stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
  +------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
  | Port State | Port ID                              | Net ID                               | IP addresses                                  | MAC Addr          |
  +------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
  | ACTIVE     | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
  | ACTIVE     | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c                       | fa:16:3e:6d:dc:85 |
  | ACTIVE     | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e                       | fa:16:3e:cf:0c:e0 |
  | ACTIVE     | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
  +------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
  stack@mjozefcz-devstack-ptg:~$

  We can also see 4 tap interfaces in xml file:

  stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002 | grep -i tap
        <target dev='tap6c230305-43'/>
        <target dev='tapb89d6863-fb'/>
        <target dev='tapa74c9ee8-c4'/>
        <target dev='tap71e6c6ad-80'/>
  stack@mjozefcz-devstack-ptg:~$

  3. Now lets 'corrupt' the instance_info_caches for this specific VM.
  We also noticed some race-condition that cause the same problem, but
  we're unable to reproduce it in devel environment.

  Original one:

  ---
  mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
  *************************** 1. row ***************************
     created_at: 2018-02-26 21:25:31
     updated_at: 2018-02-26 21:29:17
     deleted_at: NULL
             id: 2
   network_info: [{"profile": {}, "ovs_interfaceid": "6c230305-43f8-42ec-9936-61fe67551168", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fee8:3333"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.3"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tap6c230305-43", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:e8:33:33", "active": true, "type": "ovs", "id": "6c230305-43f8-42ec-9936-61fe67551168", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
  instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
        deleted: 0
  1 row in set (0.00 sec)
  ----

  Modified one (I removed first port from list):
  tap6c230305-43

  ----
  mysql> select * from instance_info_caches where instance_uuid="a64ed18d-9868-4bf0-90d3-d710d278922d"\G;
  *************************** 1. row ***************************
     created_at: 2018-02-26 21:25:31
     updated_at: 2018-02-26 21:29:17
     deleted_at: NULL
             id: 2
   network_info: [{"profile": {}, "ovs_interfaceid": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "fdda:5d77:e18e:0:f816:3eff:fe53:231c"}], "version": 6, "meta": {"ipv6_address_mode": "slaac", "dhcp_server": "fdda:5d77:e18e:0:f816:3eff:fee7:b04"}, "dns": [], "routes": [], "cidr": "fdda:5d77:e18e::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "fdda:5d77:e18e::1"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.0.0.5"}], "version": 4, "meta": {"dhcp_server": "10.0.0.2"}, "dns": [], "routes": [], "cidr": "10.0.0.0/26", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "10.0.0.1"}}], "meta": {"injected": false, "tenant_id": "0314943f52014a5b9bc56b73bec475e6", "mtu": 1450}, "id": "96343d33-5dd2-4289-b0cc-e6c664c2ddd9", "label": "private"}, "devname": "tapb89d6863-fb", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:53:23:1c", "active": true, "type": "ovs", "id": "b89d6863-fb4c-405c-89f9-698bd9773ad6", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::e"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.15"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tapa74c9ee8-c4", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:cf:0c:e0", "active": true, "type": "ovs", "id": "a74c9ee8-c426-48ef-890f-3988ecbe95ff", "qbg_params": null}, {"profile": {}, "ovs_interfaceid": "71e6c6ad-8016-450f-93f2-75e7e014084d", "preserve_on_delete": false, "network": {"bridge": "br-int", "subnets": [{"ips": [{"meta": {}, "version": 6, "type": "fixed", "floating_ips": [], "address": "2001:db8::c"}], "version": 6, "meta": {}, "dns": [], "routes": [], "cidr": "2001:db8::/64", "gateway": {"meta": {}, "version": 6, "type": "gateway", "address": "2001:db8::2"}}, {"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "172.24.4.16"}], "version": 4, "meta": {}, "dns": [], "routes": [], "cidr": "172.24.4.0/24", "gateway": {"meta": {}, "version": 4, "type": "gateway", "address": "172.24.4.1"}}], "meta": {"injected": false, "tenant_id": "9c6f74dab29f4c738e82320075fa1f57", "mtu": 1500}, "id": "9e702a96-2744-40a2-a649-33f935d83ad3", "label": "public"}, "devname": "tap71e6c6ad-80", "vnic_type": "normal", "qbh_params": null, "meta": {}, "details": {"port_filter": true, "datapath_type": "system", "ovs_hybrid_plug": true}, "address": "fa:16:3e:6d:dc:85", "active": true, "type": "ovs", "id": "71e6c6ad-8016-450f-93f2-75e7e014084d", "qbg_params": null}]
  instance_uuid: a64ed18d-9868-4bf0-90d3-d710d278922d
        deleted: 0
  ----

  4. Now lets take a look on `nova list`:

  stack@mjozefcz-devstack-ptg:~$ nova list | grep test-2
  | a64ed18d-9868-4bf0-90d3-d710d278922d | test-2 | ACTIVE | -          | Running     | public=2001:db8::e, 172.24.4.15, 2001:db8::c, 172.24.4.16; private=fdda:5d77:e18e:0:f816:3eff:fe53:231c, 10.0.0.5 |
  stack@mjozefcz-devstack-ptg:~$

  So as you see we missed one interface (private).

  Nova interface-list shows it (because it calls neutron instead nova
  itself):

  stack@mjozefcz-devstack-ptg:~$ nova interface-list a64ed18d-9868-4bf0-90d3-d710d278922d
  +------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
  | Port State | Port ID                              | Net ID                               | IP addresses                                  | MAC Addr          |
  +------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
  | ACTIVE     | 6c230305-43f8-42ec-9936-61fe67551168 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.3,fdda:5d77:e18e:0:f816:3eff:fee8:3333 | fa:16:3e:e8:33:33 |
  | ACTIVE     | 71e6c6ad-8016-450f-93f2-75e7e014084d | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.16,2001:db8::c                       | fa:16:3e:6d:dc:85 |
  | ACTIVE     | a74c9ee8-c426-48ef-890f-3988ecbe95ff | 9e702a96-2744-40a2-a649-33f935d83ad3 | 172.24.4.15,2001:db8::e                       | fa:16:3e:cf:0c:e0 |
  | ACTIVE     | b89d6863-fb4c-405c-89f9-698bd9773ad6 | 96343d33-5dd2-4289-b0cc-e6c664c2ddd9 | 10.0.0.5,fdda:5d77:e18e:0:f816:3eff:fe53:231c | fa:16:3e:53:23:1c |
  +------------+--------------------------------------+--------------------------------------+-----------------------------------------------+-------------------+
  stack@mjozefcz-devstack-ptg:~$

  5. During this time check the logs - yes, the
  _heal_instance_info_cache has been running for a while but without
  success - stil missing port in instance_info_caches table:

  Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG oslo_service.periodic_task [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Running periodic task ComputeManager._heal_instance_info_cache {{(pid=27459) run_periodic_tasks /usr/local/lib/python2.7/dist-packages/oslo_service/periodic_task.py:215}}
  Feb 26 22:12:03 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] Starting heal instance info cache {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6541}}
  Feb 26 22:12:04 mjozefcz-devstack-ptg nova-compute[27459]: DEBUG nova.compute.manager [None req-ac707da5-3413-412c-b314-ab38db2134bc service nova] [instance: a64ed18d-9868-4bf0-90d3-d710d278922d] Updated the network info_cache for instance {{(pid=27459) _heal_instance_info_cache /opt/stack/nova/nova/compute/manager.py:6603}}

  5. Ok, so lets pretend that customer restart the VM.
  stack@mjozefcz-devstack-ptg:~$ nova reboot a64ed18d-9868-4bf0-90d3-d710d278922d --hard
  Request to reboot server <Server: test-2> has been accepted.

  6. And now check connected interfaces - WOOPS there is no
  `tap6c230305-43` on the list ;(

  stack@mjozefcz-devstack-ptg:~$ sudo virsh dumpxml instance-00000002  | grep -i tap
        <target dev='tapb89d6863-fb'/>
        <target dev='tapa74c9ee8-c4'/>
        <target dev='tap71e6c6ad-80'/>

  Environment
  ===========
  Nova master branch, devstack

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1751923/+subscriptions


References