← Back to team overview

yahoo-eng-team team mailing list archive

[Bug 1718409] [NEW] Failed to reschedule instance in cell.

 

Public bug reported:

Description
===========
When instance rescheduling happens in cell:
1. Nova compute casts to the build_instances method in cell1 conductor.
2. In cell1 conductor, build_instances method calls the scheduler_client.select_destinations method, it calls to scheduler for getting scheduled host **without cell0 targeting**, and then, wait scheduler reply.

Nobody could get this scheduler message in cell1 mq, because Scheduler
just only lisening to the api cell(cell0) mq.

flow like this:
compute --> (reschedule) -->  cell1-conductor  -->  (call rpc to CELL1) --> timeout

It seems conductor shouldn't call this message to CELL1 mq. Maybe, we
might transfer the mq connection to api cell mq(target cell0) before
call.

After this, the flow as blow:
compute --> (reschedule) -->  cell1-conductor  --> target CELL0 -->  (call rpc to CELL0) --> scheduler

BTW, a devstack issue founded.
In order to let scheduler could get instance group from nova_api database in code here: 
https://github.com/openstack/nova/blob/5bf1bb47c7e17c26592a699d07c2faa59d98bfb8/nova/conductor/manager.py#L623
It seems we need add 'connection' config in [api_database] section, but now devstack doesn't deal with it.
patch is here: https://review.openstack.org/#/c/505607/

Steps to reproduce
==================

* 1. Inject some exception in compute code to reproduce a reschedule operation:
raise exception.RescheduledException("reschedule trigger") in here:
https://github.com/openstack/nova/blob/5bf1bb47c7e17c26592a699d07c2faa59d98bfb8/nova/compute/manager.py#L1951
* 2. nova compute cast reschedule request to cell-conductor
* 3. cell-conductor calls to cell mq, nobody could process that.
* 4. Instacne state stuck in BUILD(scheduling)
stack@yikun:~/nova$ nova list
+--------------------------------------+---------------------+--------+------------+-------------+----------+
| ID                                   | Name                | Status | Task State | Power State | Networks |
+--------------------------------------+---------------------+--------+------------+-------------+----------+
| af3fe405-7155-429d-97a8-a0f9f4c82911 | test_resch_instance | BUILD  | scheduling | NOSTATE     |          |
+--------------------------------------+---------------------+--------+------------+-------------+----------+
* 4. Finally, Instance state is ERROR.
stack@yikun:~/nova$ nova list
+--------------------------------------+---------------------+--------+------------+-------------+----------+
| ID                                   | Name                | Status | Task State | Power State | Networks |
+--------------------------------------+---------------------+--------+------------+-------------+----------+
| af3fe405-7155-429d-97a8-a0f9f4c82911 | test_resch_instance | ERROR  | -          | NOSTATE     |          |
+--------------------------------------+---------------------+--------+------------+-------------+----------+


Logs & Configs
==============
cell config show as blow:

stack@yikun:~/nova$ cat /etc/nova/nova_cell1.conf 
[database]
connection = mysql+pymysql://root:1@127.0.0.1/nova_cell1?charset=utf8

[api_database]
connection = mysql+pymysql://root:1@127.0.0.1/nova_api?charset=utf8


[conductor]
workers = 2

[DEFAULT]
logging_user_identity_format = %(project_name)s %(user_name)s
transport_url = rabbit://stackrabbit:1@XXX:5672/nova_cell1
debug = True


Log in cell conductor:
2017-09-20 06:18:01.306 21701 ERROR nova.scheduler.utils [req-5890b731-4b48-4503-a446-ef3f858eda4a admin admin] [instance: af3fe405-7155-429d-97a8-a0f9f4c82911] Error from last host: yikun (node yikun): [u'Traceback (most recent call last):\n', u'  File "/opt/stack/nova/nova/compute/manager.py", line 1826, in _do_build_and_run_instance\n    filter_properties)\n', u'  File "/opt/stack/nova/nova/compute/manager.py", line 2057, in _build_and_run_instance\n    instance_uuid=instance.uuid, reason=six.text_type(e))\n', u'RescheduledException: Build of instance af3fe405-7155-429d-97a8-a0f9f4c82911 was re-scheduled: reschedule trigger\n']
2017-09-20 06:18:01.347 21701 DEBUG oslo_db.sqlalchemy.engines [req-5890b731-4b48-4503-a446-ef3f858eda4a admin admin] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION _check_effective_sql_mode /usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/engines.py:285
2017-09-20 06:19:01.433 21701 WARNING nova.scheduler.utils [req-5890b731-4b48-4503-a446-ef3f858eda4a admin admin] Retrying select_destinations after a MessagingTimeout, attempt 1 of 2.: MessagingTimeout: Timed out waiting for a reply to message ID 5dd1a54aa6f446a696f3ab0307572f3f
2017-09-20 06:20:01.441 21701 WARNING nova.scheduler.utils [req-5890b731-4b48-4503-a446-ef3f858eda4a admin admin] Retrying select_destinations after a MessagingTimeout, attempt 2 of 2.: MessagingTimeout: Timed out waiting for a reply to message ID c27726fbfcd64211a2a1eb64d05c98f4
2017-09-20 06:21:01.457 21701 WARNING nova.scheduler.utils [req-5890b731-4b48-4503-a446-ef3f858eda4a admin admin] Failed to compute_task_build_instances: Timed out waiting for a reply to message ID 5885e65436b14658a857134596cbf277: MessagingTimeout: Timed out waiting for a reply to message ID 5885e65436b14658a857134596cbf277
2017-09-20 06:21:01.458 21701 WARNING nova.scheduler.utils [req-5890b731-4b48-4503-a446-ef3f858eda4a admin admin] [instance: af3fe405-7155-429d-97a8-a0f9f4c82911] Setting instance to ERROR state.: MessagingTimeout: Timed out waiting for a reply to message ID 5885e65436b14658a857134596cbf277

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1718409

Title:
  Failed to reschedule instance in cell.

Status in OpenStack Compute (nova):
  New

Bug description:
  Description
  ===========
  When instance rescheduling happens in cell:
  1. Nova compute casts to the build_instances method in cell1 conductor.
  2. In cell1 conductor, build_instances method calls the scheduler_client.select_destinations method, it calls to scheduler for getting scheduled host **without cell0 targeting**, and then, wait scheduler reply.

  Nobody could get this scheduler message in cell1 mq, because Scheduler
  just only lisening to the api cell(cell0) mq.

  flow like this:
  compute --> (reschedule) -->  cell1-conductor  -->  (call rpc to CELL1) --> timeout

  It seems conductor shouldn't call this message to CELL1 mq. Maybe, we
  might transfer the mq connection to api cell mq(target cell0) before
  call.

  After this, the flow as blow:
  compute --> (reschedule) -->  cell1-conductor  --> target CELL0 -->  (call rpc to CELL0) --> scheduler

  BTW, a devstack issue founded.
  In order to let scheduler could get instance group from nova_api database in code here: 
  https://github.com/openstack/nova/blob/5bf1bb47c7e17c26592a699d07c2faa59d98bfb8/nova/conductor/manager.py#L623
  It seems we need add 'connection' config in [api_database] section, but now devstack doesn't deal with it.
  patch is here: https://review.openstack.org/#/c/505607/

  Steps to reproduce
  ==================

  * 1. Inject some exception in compute code to reproduce a reschedule operation:
  raise exception.RescheduledException("reschedule trigger") in here:
  https://github.com/openstack/nova/blob/5bf1bb47c7e17c26592a699d07c2faa59d98bfb8/nova/compute/manager.py#L1951
  * 2. nova compute cast reschedule request to cell-conductor
  * 3. cell-conductor calls to cell mq, nobody could process that.
  * 4. Instacne state stuck in BUILD(scheduling)
  stack@yikun:~/nova$ nova list
  +--------------------------------------+---------------------+--------+------------+-------------+----------+
  | ID                                   | Name                | Status | Task State | Power State | Networks |
  +--------------------------------------+---------------------+--------+------------+-------------+----------+
  | af3fe405-7155-429d-97a8-a0f9f4c82911 | test_resch_instance | BUILD  | scheduling | NOSTATE     |          |
  +--------------------------------------+---------------------+--------+------------+-------------+----------+
  * 4. Finally, Instance state is ERROR.
  stack@yikun:~/nova$ nova list
  +--------------------------------------+---------------------+--------+------------+-------------+----------+
  | ID                                   | Name                | Status | Task State | Power State | Networks |
  +--------------------------------------+---------------------+--------+------------+-------------+----------+
  | af3fe405-7155-429d-97a8-a0f9f4c82911 | test_resch_instance | ERROR  | -          | NOSTATE     |          |
  +--------------------------------------+---------------------+--------+------------+-------------+----------+

  
  Logs & Configs
  ==============
  cell config show as blow:

  stack@yikun:~/nova$ cat /etc/nova/nova_cell1.conf 
  [database]
  connection = mysql+pymysql://root:1@127.0.0.1/nova_cell1?charset=utf8

  [api_database]
  connection = mysql+pymysql://root:1@127.0.0.1/nova_api?charset=utf8

  
  [conductor]
  workers = 2

  [DEFAULT]
  logging_user_identity_format = %(project_name)s %(user_name)s
  transport_url = rabbit://stackrabbit:1@XXX:5672/nova_cell1
  debug = True

  
  Log in cell conductor:
  2017-09-20 06:18:01.306 21701 ERROR nova.scheduler.utils [req-5890b731-4b48-4503-a446-ef3f858eda4a admin admin] [instance: af3fe405-7155-429d-97a8-a0f9f4c82911] Error from last host: yikun (node yikun): [u'Traceback (most recent call last):\n', u'  File "/opt/stack/nova/nova/compute/manager.py", line 1826, in _do_build_and_run_instance\n    filter_properties)\n', u'  File "/opt/stack/nova/nova/compute/manager.py", line 2057, in _build_and_run_instance\n    instance_uuid=instance.uuid, reason=six.text_type(e))\n', u'RescheduledException: Build of instance af3fe405-7155-429d-97a8-a0f9f4c82911 was re-scheduled: reschedule trigger\n']
  2017-09-20 06:18:01.347 21701 DEBUG oslo_db.sqlalchemy.engines [req-5890b731-4b48-4503-a446-ef3f858eda4a admin admin] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION _check_effective_sql_mode /usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/engines.py:285
  2017-09-20 06:19:01.433 21701 WARNING nova.scheduler.utils [req-5890b731-4b48-4503-a446-ef3f858eda4a admin admin] Retrying select_destinations after a MessagingTimeout, attempt 1 of 2.: MessagingTimeout: Timed out waiting for a reply to message ID 5dd1a54aa6f446a696f3ab0307572f3f
  2017-09-20 06:20:01.441 21701 WARNING nova.scheduler.utils [req-5890b731-4b48-4503-a446-ef3f858eda4a admin admin] Retrying select_destinations after a MessagingTimeout, attempt 2 of 2.: MessagingTimeout: Timed out waiting for a reply to message ID c27726fbfcd64211a2a1eb64d05c98f4
  2017-09-20 06:21:01.457 21701 WARNING nova.scheduler.utils [req-5890b731-4b48-4503-a446-ef3f858eda4a admin admin] Failed to compute_task_build_instances: Timed out waiting for a reply to message ID 5885e65436b14658a857134596cbf277: MessagingTimeout: Timed out waiting for a reply to message ID 5885e65436b14658a857134596cbf277
  2017-09-20 06:21:01.458 21701 WARNING nova.scheduler.utils [req-5890b731-4b48-4503-a446-ef3f858eda4a admin admin] [instance: af3fe405-7155-429d-97a8-a0f9f4c82911] Setting instance to ERROR state.: MessagingTimeout: Timed out waiting for a reply to message ID 5885e65436b14658a857134596cbf277

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1718409/+subscriptions


Follow ups