[Bug 1784155] Re: nova_placement service start not coordinated with api db sync on multiple controllers

 

Fix for tripleo master released with:
https://review.openstack.org/610966

Backport to rocky tracked in:
https://review.openstack.org/618958

Backport to queens tracked in:
https://review.openstack.org/618984


** Changed in: tripleo
       Status: Triaged => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1784155

Title:
  nova_placement service start not coordinated with api db sync on
  multiple controllers

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) rocky series:
  In Progress
Status in tripleo:
  Fix Released

Bug description:
  On a loaded HA / galera environment using VMs I can fairly
  consistently reproduce a race condition where the nova_placement
  service is started on controllers before the database is available.
  The nova_placement service itself does not seem able to tolerate this
  condition at startup, and it then fails to recover. Mitigation can
  involve either synchronizing these conditions or making
  nova-placement more resilient.
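
  As an illustration of the "more resilient" option, something like the
  following could retry the startup-time database sync instead of
  failing the WSGI import permanently (a minimal sketch; the retry
  wrapper and its parameters are my own hypothetical, only
  deploy.loadapp() is the real placement entry point, as seen in the
  tracebacks below):

    import time

    from nova.api.openstack.placement import deploy

    def loadapp_with_retry(conf, attempts=30, delay=10):
        # deploy.loadapp() runs the schema sync at startup; wrapping it
        # in a retry loop (hypothetical) lets the service wait out the
        # window where "nova-manage api_db sync" has not finished yet.
        for attempt in range(attempts):
            try:
                return deploy.loadapp(conf)
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(delay)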

  The symptom of the overcloud deploy failure is that two of the three
  controllers have the nova_placement container in an unhealthy state:

  TASK [Debug output for task which failed: Check for unhealthy containers after step 3] ***
  Saturday 28 July 2018  10:19:29 +0000 (0:00:00.663)       0:30:26.152 ********* 
  fatal: [stack2-overcloud-controller-2]: FAILED! => {
      "failed_when_result": true, 
      "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
          "3597b92e9714        192.168.25.1:8787/tripleomaster/centos-binary-nova-placement-api:959e1d7f755ee681b6f23b498d262a9e4dd6326f_4cbb1814   \"kolla_start\"       2 minutes ago       Up 2 minutes (unhealthy)                       nova_placement"
      ]
  }
  fatal: [stack2-overcloud-controller-1]: FAILED! => {
      "failed_when_result": true, 
      "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
          "322c5ea53895        192.168.25.1:8787/tripleomaster/centos-binary-nova-placement-api:959e1d7f755ee681b6f23b498d262a9e4dd6326f_4cbb1814   \"kolla_start\"       2 minutes ago       Up 2 minutes (unhealthy)                       nova_placement"
      ]
  }
  ok: [stack2-overcloud-controller-0] => {
      "failed_when_result": false, 
      "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": []
  }
  ok: [stack2-overcloud-compute-0] => {
      "failed_when_result": false, 
      "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": []
  }

  NO MORE HOSTS LEFT
  *************************************************************

  
  Inspecting placement_wsgi_error.log shows a first stack trace
  indicating that the nova_placement database is missing the "traits"
  table:

  [Sat Jul 28 10:17:06.525018 2018] [:error] [pid 14] [remote 10.1.20.15:0] mod_wsgi (pid=14): Target WSGI script '/var/www/cgi-bin/nova/nova-placement-api' cannot be loaded as Python module.
  [Sat Jul 28 10:17:06.525067 2018] [:error] [pid 14] [remote 10.1.20.15:0] mod_wsgi (pid=14): Exception occurred processing WSGI script '/var/www/cgi-bin/nova/nova-placement-api'.
  [Sat Jul 28 10:17:06.525101 2018] [:error] [pid 14] [remote 10.1.20.15:0] Traceback (most recent call last):
  [Sat Jul 28 10:17:06.525124 2018] [:error] [pid 14] [remote 10.1.20.15:0]   File "/var/www/cgi-bin/nova/nova-placement-api", line 54, in <module>
  [Sat Jul 28 10:17:06.525165 2018] [:error] [pid 14] [remote 10.1.20.15:0]     application = init_application()
  [Sat Jul 28 10:17:06.525174 2018] [:error] [pid 14] [remote 10.1.20.15:0]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi.py", line 88, in init_application
  [Sat Jul 28 10:17:06.525198 2018] [:error] [pid 14] [remote 10.1.20.15:0]     return deploy.loadapp(conf.CONF)
  [Sat Jul 28 10:17:06.525205 2018] [:error] [pid 14] [remote 10.1.20.15:0]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/deploy.py", line 111, in loadapp
  [Sat Jul 28 10:17:06.525300 2018] [:error] [pid 14] [remote 10.1.20.15:0]     update_database()
  [Sat Jul 28 10:17:06.525310 2018] [:error] [pid 14] [remote 10.1.20.15:0]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/deploy.py", line 92, in update_database
  [Sat Jul 28 10:17:06.525329 2018] [:error] [pid 14] [remote 10.1.20.15:0]     resource_provider.ensure_trait_sync(ctx)
  [Sat Jul 28 10:17:06.525337 2018] [:error] [pid 14] [remote 10.1.20.15:0]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/objects/resource_provider.py", line 146, in ensure_trait_sync
  [Sat Jul 28 10:17:06.526277 2018] [:error] [pid 14] [remote 10.1.20.15:0]     _trait_sync(ctx)

  ...

  [Sat Jul 28 10:17:06.531950 2018] [:error] [pid 14] [remote 10.1.20.15:0]     raise errorclass(errno, errval)
  [Sat Jul 28 10:17:06.532049 2018] [:error] [pid 14] [remote 10.1.20.15:0] ProgrammingError: (pymysql.err.ProgrammingError) (1146, u"Table 'nova_placement.traits' doesn't exist") [SQL: u'SELECT traits.name \\nFROM traits'] (Background on this error at: http://sqlalche.me/e/f405)

  
  The log then continues with the following exception repeated every 30
  seconds; this is likely the service trying to recover, but it hits a
  problem with how it uses oslo.config within the same process:

  [Sat Jul 28 10:18:36.916617 2018] [:error] [pid 14] [remote 10.1.20.15:148] mod_wsgi (pid=14): Target WSGI script '/var/www/cgi-bin/nova/nova-placement-api' cannot be loaded as Python module.
  [Sat Jul 28 10:18:36.916646 2018] [:error] [pid 14] [remote 10.1.20.15:148] mod_wsgi (pid=14): Exception occurred processing WSGI script '/var/www/cgi-bin/nova/nova-placement-api'.
  [Sat Jul 28 10:18:36.916664 2018] [:error] [pid 14] [remote 10.1.20.15:148] Traceback (most recent call last):
  [Sat Jul 28 10:18:36.916681 2018] [:error] [pid 14] [remote 10.1.20.15:148]   File "/var/www/cgi-bin/nova/nova-placement-api", line 54, in <module>
  [Sat Jul 28 10:18:36.916711 2018] [:error] [pid 14] [remote 10.1.20.15:148]     application = init_application()
  [Sat Jul 28 10:18:36.916719 2018] [:error] [pid 14] [remote 10.1.20.15:148]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi.py", line 75, in init_application
  [Sat Jul 28 10:18:36.916742 2018] [:error] [pid 14] [remote 10.1.20.15:148]     _parse_args([], default_config_files=[conffile])
  [Sat Jul 28 10:18:36.916748 2018] [:error] [pid 14] [remote 10.1.20.15:148]   File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi.py", line 61, in _parse_args
  [Sat Jul 28 10:18:36.916760 2018] [:error] [pid 14] [remote 10.1.20.15:148]     logging.register_options(conf.CONF)
  [Sat Jul 28 10:18:36.916766 2018] [:error] [pid 14] [remote 10.1.20.15:148]   File "/usr/lib/python2.7/site-packages/oslo_log/log.py", line 250, in register_options
  [Sat Jul 28 10:18:36.916779 2018] [:error] [pid 14] [remote 10.1.20.15:148]     conf.register_cli_opts(_options.common_cli_opts)
  [Sat Jul 28 10:18:36.916785 2018] [:error] [pid 14] [remote 10.1.20.15:148]   File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2468, in __inner
  [Sat Jul 28 10:18:36.916807 2018] [:error] [pid 14] [remote 10.1.20.15:148]     result = f(self, *args, **kwargs)
  [Sat Jul 28 10:18:36.916813 2018] [:error] [pid 14] [remote 10.1.20.15:148]   File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2690, in register_cli_opts
  [Sat Jul 28 10:18:36.916823 2018] [:error] [pid 14] [remote 10.1.20.15:148]     self.register_cli_opt(opt, group, clear_cache=False)
  [Sat Jul 28 10:18:36.916828 2018] [:error] [pid 14] [remote 10.1.20.15:148]   File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2472, in __inner
  [Sat Jul 28 10:18:36.916838 2018] [:error] [pid 14] [remote 10.1.20.15:148]     return f(self, *args, **kwargs)
  [Sat Jul 28 10:18:36.916843 2018] [:error] [pid 14] [remote 10.1.20.15:148]   File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2682, in register_cli_opt
  [Sat Jul 28 10:18:36.916852 2018] [:error] [pid 14] [remote 10.1.20.15:148]     raise ArgsAlreadyParsedError("cannot register CLI option")
  [Sat Jul 28 10:18:36.916894 2018] [:error] [pid 14] [remote 10.1.20.15:148] ArgsAlreadyParsedError: arguments already parsed: cannot register CLI option
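
  The recovery failure itself is visible in that traceback: the global
  oslo.config object survives the first failed import in the mod_wsgi
  process, so when mod_wsgi re-imports the script,
  logging.register_options() tries to register CLI options on a config
  object whose arguments were already parsed. A guard along these lines
  (my assumption, not the merged fix) would let the retry proceed:

    from oslo_config import cfg
    from oslo_log import log as logging

    def _register_logging_options(conf):
        try:
            logging.register_options(conf)
        except cfg.ArgsAlreadyParsedError:
            # A previous, failed initialization attempt in this same
            # interpreter already registered these options; they are
            # still in place, so it is safe to continue.
            pass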


  
  In nova_placement.yaml we can see that the placement service is
  started earlier than the others, in step_3:

        docker_config:
          step_2:
            get_attr: [NovaPlacementLogging, docker_config, step_2]
          # start this early so it is up before computes start reporting
          step_3:
            nova_placement:
              start_order: 1
              image: {get_param: DockerNovaPlacementImage}
              net: host
              user: root
              restart: always

  However, in nova-api.yaml we can see that the api DB sync is also in
  step_3; although it includes start_order: 0, I'm going to guess that
  the "start_order" system is not coordinated across controllers:

        docker_config:
          step_2:
            get_attr: [NovaApiLogging, docker_config, step_2]
          step_3:
            nova_api_db_sync:
              start_order: 0
              image: &nova_api_image {get_param: DockerNovaApiImage}
              net: host
              detach: false
              user: root
              volumes: &nova_api_bootstrap_volumes
                list_concat:
                  - {get_attr: [ContainersCommon, volumes]}
                  - {get_attr: [NovaApiLogging, volumes]}
                  -
                    - /var/lib/config-data/nova/etc/my.cnf.d/tripleo.cnf:/etc/my.cnf.d/tripleo.cnf:ro
                    - /var/lib/config-data/nova/etc/nova/:/etc/nova/:ro
              command: "/usr/bin/bootstrap_host_exec nova_api su nova -s /bin/bash -c '/usr/bin/nova-manage api_db sync'"

  
  The placement service always starts correctly on the "bootstrap"
  controller, since start_order coordinates ordering on that host, but
  this coordination does not carry over to the other two controllers.
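
  For the "synchronize" side, one option (illustration only, not the
  merged tripleo change; host and credentials are placeholders) would
  be a small gate run on every controller with start_order: 0 in
  step_3, blocking the placement start until the schema created by the
  bootstrap node's api_db sync is visible through galera:

    #!/usr/bin/env python
    import sys
    import time

    import pymysql

    def wait_for_placement_schema(timeout=600, interval=5):
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                conn = pymysql.connect(host='<galera-vip>',
                                       user='nova_placement',
                                       password='<password>',
                                       db='nova_placement')
                with conn.cursor() as cur:
                    # the first failure above is "Table
                    # 'nova_placement.traits' doesn't exist", so use
                    # that table as the readiness signal
                    cur.execute("SELECT 1 FROM traits LIMIT 1")
                conn.close()
                return 0
            except (pymysql.err.OperationalError,
                    pymysql.err.ProgrammingError):
                # database unreachable or schema not migrated yet
                time.sleep(interval)
        return 1

    if __name__ == '__main__':
        sys.exit(wait_for_placement_schema())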

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1784155/+subscriptions