yahoo-eng-team team mailing list archive
-
yahoo-eng-team team
-
Mailing list archive
-
Message #76606
[Bug 1784155] Re: nova_placement service start not coordinated with api db sync on multiple controllers
** Changed in: nova/rocky
Status: Fix Committed => Fix Released
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1784155
Title:
nova_placement service start not coordinated with api db sync on
multiple controllers
Status in OpenStack Compute (nova):
Fix Released
Status in OpenStack Compute (nova) rocky series:
Fix Released
Status in tripleo:
Fix Released
Bug description:
On a loaded HA / galera environment using VMs I can fairly
consistently reproduce a race condition where the nova_placement
service is started on controllers where the database is not yet
available. The nova_placement service itself does not seem to be
able to tolerate this condition upon startup and it then fails to
recover. Mitigation here can either involve synchronizing these
conditions or getting nova-placement to be more resilient.
The symptoms of overcloud deploy failure look like two out of three
controllers having the nova_placement container in an unhealthy state:
TASK [Debug output for task which failed: Check for unhealthy containers after step 3] ***
Saturday 28 July 2018 10:19:29 +0000 (0:00:00.663) 0:30:26.152 *********
fatal: [stack2-overcloud-controller-2]: FAILED! => {
"failed_when_result": true,
"outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
"3597b92e9714 192.168.25.1:8787/tripleomaster/centos-binary-nova-placement-api:959e1d7f755ee681b6f23b498d262a9e4dd6326f_4cbb1814 \"kolla_start\" 2 minutes ago Up 2 minutes (unhealthy) nova_placement"
]
}
fatal: [stack2-overcloud-controller-1]: FAILED! => {
"failed_when_result": true,
"outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
"322c5ea53895 192.168.25.1:8787/tripleomaster/centos-binary-nova-placement-api:959e1d7f755ee681b6f23b498d262a9e4dd6326f_4cbb1814 \"kolla_start\" 2 minutes ago Up 2 minutes (unhealthy) nova_placement"
]
}
ok: [stack2-overcloud-controller-0] => {
"failed_when_result": false,
"outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": []
}
ok: [stack2-overcloud-compute-0] => {
"failed_when_result": false,
"outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": []
}
NO MORE HOSTS LEFT
*************************************************************
inspecting placement_wsgi_error.log shows the first stack trace that the nova_placement database is missing the "traits" table:
[Sat Jul 28 10:17:06.525018 2018] [:error] [pid 14] [remote 10.1.20.15:0] mod_wsgi (pid=14): Target WSGI script '/var/www/cgi-bin/nova/nova-placement-api' cannot be loaded as Python module.
[Sat Jul 28 10:17:06.525067 2018] [:error] [pid 14] [remote 10.1.20.15:0] mod_wsgi (pid=14): Exception occurred processing WSGI script '/var/www/cgi-bin/nova/nova-placement-api'.
[Sat Jul 28 10:17:06.525101 2018] [:error] [pid 14] [remote 10.1.20.15:0] Traceback (most recent call last):
[Sat Jul 28 10:17:06.525124 2018] [:error] [pid 14] [remote 10.1.20.15:0] File "/var/www/cgi-bin/nova/nova-placement-api", line 54, in <module>
[Sat Jul 28 10:17:06.525165 2018] [:error] [pid 14] [remote 10.1.20.15:0] application = init_application()
[Sat Jul 28 10:17:06.525174 2018] [:error] [pid 14] [remote 10.1.20.15:0] File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi.py", line 88, in init_application
[Sat Jul 28 10:17:06.525198 2018] [:error] [pid 14] [remote 10.1.20.15:0] return deploy.loadapp(conf.CONF)
[Sat Jul 28 10:17:06.525205 2018] [:error] [pid 14] [remote 10.1.20.15:0] File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/deploy.py", line 111, in loadapp
[Sat Jul 28 10:17:06.525300 2018] [:error] [pid 14] [remote 10.1.20.15:0] update_database()
[Sat Jul 28 10:17:06.525310 2018] [:error] [pid 14] [remote 10.1.20.15:0] File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/deploy.py", line 92, in update_database
[Sat Jul 28 10:17:06.525329 2018] [:error] [pid 14] [remote 10.1.20.15:0] resource_provider.ensure_trait_sync(ctx)
[Sat Jul 28 10:17:06.525337 2018] [:error] [pid 14] [remote 10.1.20.15:0] File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/objects/resource_provider.py", line 146, in ensure_trait_sync
[Sat Jul 28 10:17:06.526277 2018] [:error] [pid 14] [remote 10.1.20.15:0] _trait_sync(ctx)
...
[Sat Jul 28 10:17:06.531950 2018] [:error] [pid 14] [remote 10.1.20.15:0] raise errorclass(errno, errval)
[Sat Jul 28 10:17:06.532049 2018] [:error] [pid 14] [remote 10.1.20.15:0] ProgrammingError: (pymysql.err.ProgrammingError) (1146, u"Table 'nova_placement.traits' doesn't exist") [SQL: u'SELECT traits.name \\nFROM traits'] (Background on this error at: http://sqlalche.me/e/f405)
the log then continues with the following exception repeated every 30 seconds, which is likely the service trying to recover but it's hitting some problem with how it uses oslo.config in process:
[Sat Jul 28 10:18:36.916617 2018] [:error] [pid 14] [remote 10.1.20.15:148] mod_wsgi (pid=14): Target WSGI script '/var/www/cgi-bin/nova/nova-placement-api' cannot be loaded as Python module.
[Sat Jul 28 10:18:36.916646 2018] [:error] [pid 14] [remote 10.1.20.15:148] mod_wsgi (pid=14): Exception occurred processing WSGI script '/var/www/cgi-bin/nova/nova-placement-api'.
[Sat Jul 28 10:18:36.916664 2018] [:error] [pid 14] [remote 10.1.20.15:148] Traceback (most recent call last):
[Sat Jul 28 10:18:36.916681 2018] [:error] [pid 14] [remote 10.1.20.15:148] File "/var/www/cgi-bin/nova/nova-placement-api", line 54, in <module>
[Sat Jul 28 10:18:36.916711 2018] [:error] [pid 14] [remote 10.1.20.15:148] application = init_application()
[Sat Jul 28 10:18:36.916719 2018] [:error] [pid 14] [remote 10.1.20.15:148] File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi.py", line 75, in init_application
[Sat Jul 28 10:18:36.916742 2018] [:error] [pid 14] [remote 10.1.20.15:148] _parse_args([], default_config_files=[conffile])
[Sat Jul 28 10:18:36.916748 2018] [:error] [pid 14] [remote 10.1.20.15:148] File "/usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi.py", line 61, in _parse_args
[Sat Jul 28 10:18:36.916760 2018] [:error] [pid 14] [remote 10.1.20.15:148] logging.register_options(conf.CONF)
[Sat Jul 28 10:18:36.916766 2018] [:error] [pid 14] [remote 10.1.20.15:148] File "/usr/lib/python2.7/site-packages/oslo_log/log.py", line 250, in register_options
[Sat Jul 28 10:18:36.916779 2018] [:error] [pid 14] [remote 10.1.20.15:148] conf.register_cli_opts(_options.common_cli_opts)
[Sat Jul 28 10:18:36.916785 2018] [:error] [pid 14] [remote 10.1.20.15:148] File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2468, in __inner
[Sat Jul 28 10:18:36.916807 2018] [:error] [pid 14] [remote 10.1.20.15:148] result = f(self, *args, **kwargs)
[Sat Jul 28 10:18:36.916813 2018] [:error] [pid 14] [remote 10.1.20.15:148] File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2690, in register_cli_opts
[Sat Jul 28 10:18:36.916823 2018] [:error] [pid 14] [remote 10.1.20.15:148] self.register_cli_opt(opt, group, clear_cache=False)
[Sat Jul 28 10:18:36.916828 2018] [:error] [pid 14] [remote 10.1.20.15:148] File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2472, in __inner
[Sat Jul 28 10:18:36.916838 2018] [:error] [pid 14] [remote 10.1.20.15:148] return f(self, *args, **kwargs)
[Sat Jul 28 10:18:36.916843 2018] [:error] [pid 14] [remote 10.1.20.15:148] File "/usr/lib/python2.7/site-packages/oslo_config/cfg.py", line 2682, in register_cli_opt
[Sat Jul 28 10:18:36.916852 2018] [:error] [pid 14] [remote 10.1.20.15:148] raise ArgsAlreadyParsedError("cannot register CLI option")
[Sat Jul 28 10:18:36.916894 2018] [:error] [pid 14] [remote 10.1.20.15:148] ArgsAlreadyParsedError: arguments already parsed: cannot register CLI option
in nova_placement.yaml we can see that the placement service is started earlier than others, in step 3:
docker_config:
step_2:
get_attr: [NovaPlacementLogging, docker_config, step_2]
# start this early so it is up before computes start reporting
step_3:
nova_placement:
start_order: 1
image: {get_param: DockerNovaPlacementImage}
net: host
user: root
restart: always
however in nova-api.yaml we can see that the api DB sync is also in
step_3, which although it includes start_order: 0, I'm going to guess
that the "start_order" system is not coordinated across controllers:
docker_config:
step_2:
get_attr: [NovaApiLogging, docker_config, step_2]
step_3:
nova_api_db_sync:
start_order: 0
image: &nova_api_image {get_param: DockerNovaApiImage}
net: host
detach: false
user: root
volumes: &nova_api_bootstrap_volumes
list_concat:
- {get_attr: [ContainersCommon, volumes]}
- {get_attr: [NovaApiLogging, volumes]}
-
- /var/lib/config-data/nova/etc/my.cnf.d/tripleo.cnf:/etc/my.cnf.d/tripleo.cnf:ro
- /var/lib/config-data/nova/etc/nova/:/etc/nova/:ro
command: "/usr/bin/bootstrap_host_exec nova_api su nova -s /bin/bash -c '/usr/bin/nova-manage api_db sync'"
the placement service always starts correctly on the "bootstrap" controller since start_order is coordinating, but this does not carry over to the other two controllers.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1784155/+subscriptions