yahoo-eng-team team mailing list archive

[Bug 1297375] [NEW] All nova apis relying on Instance.save(expected_*_state) for safety contain a race condition

To: yahoo-eng-team@xxxxxxxxxxxxxxxxxxx
From: Matthew Booth <mbooth@xxxxxxxxxx>
Date: Tue, 25 Mar 2014 15:55:47 -0000
Reply-to: Bug 1297375 <1297375@xxxxxxxxxxxxxxxxxx>
Sender: bounces@xxxxxxxxxxxxx

Public bug reported:

Take, for example, resize_instance(). In manager.py, we assert that the
instance is in RESIZE_PREP state with:

  instance.save(expected_task_state=task_states.RESIZE_PREP)

This should mean that the first resize will succeed, and any subsequent
resize will fail. However, the underlying db implementation does not
lock the instance during the update, and therefore doesn't guarantee
this.

Specifically, _instance_update() in db/sqlalchemy/api.py starts a
session, and reads task_state from the instance. However, it does not
use a 'select ... for update', meaning the row is not locked. Two
concurrent calls to this method can both read the same state, then race
to the update. The last writer will win. Without 'select ... for
update', the db transaction only ensures that all writes are atomic,
not that a read and the write that depends on it are atomic together.
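
The interleaving described above can be reproduced deterministically. The
following is a minimal sketch, not nova code: it uses the standard
library's sqlite3 module, and the table layout, the uuid 'inst-1' and the
helper names (make_db, read_state, write_state, racy_interleaving) are
all assumptions made for illustration. Both callers read the state before
either writes, so both pass the expected-state check.

```python
import sqlite3


def make_db():
    # Hypothetical one-table stand-in for the instances table.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE instances (uuid TEXT PRIMARY KEY, task_state TEXT)")
    conn.execute("INSERT INTO instances VALUES ('inst-1', 'resize_prep')")
    conn.commit()
    return conn


def read_state(conn, uuid):
    # Plain SELECT: nothing stops another caller reading the same row.
    row = conn.execute(
        "SELECT task_state FROM instances WHERE uuid = ?", (uuid,)).fetchone()
    return row[0]


def write_state(conn, uuid, new_state):
    # Unconditional UPDATE: the check happened earlier, in Python.
    conn.execute(
        "UPDATE instances SET task_state = ? WHERE uuid = ?",
        (new_state, uuid))
    conn.commit()


def racy_interleaving(conn, uuid, expected):
    # Step 1: callers A and B both read before either one writes.
    seen = {"A": read_state(conn, uuid), "B": read_state(conn, uuid)}
    # Step 2: each caller checks its own (now stale) read, then writes.
    results = []
    for name in ("A", "B"):
        if seen[name] == expected:
            write_state(conn, uuid, "resize_migrating")
            results.append((name, True))
        else:
            results.append((name, False))
    return results


conn = make_db()
outcome = racy_interleaving(conn, "inst-1", "resize_prep")
print(outcome)  # [('A', True), ('B', True)]
```

Both callers report success, although only the first should have been
allowed to proceed.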

SQLAlchemy seems to support select ... for update, as do MySQL and
PostgreSQL, although MySQL will fall back to whole table locks for non-
InnoDB tables, which would likely be a significant performance hit.
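
For comparison, a lock-free way to get the intended guarantee is to fold
the expected-state check into the UPDATE itself (a compare-and-swap).
This is a different technique from 'select ... for update' and is offered
only as a sketch of an alternative, not as the fix proposed here. It uses
the standard library's sqlite3 module; the table layout and the helper
names (make_db, save_if_expected) are hypothetical.

```python
import sqlite3


def make_db():
    # Hypothetical one-table stand-in for the instances table.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE instances (uuid TEXT PRIMARY KEY, task_state TEXT)")
    conn.execute("INSERT INTO instances VALUES ('inst-1', 'resize_prep')")
    conn.commit()
    return conn


def save_if_expected(conn, uuid, expected, new_state):
    # The database evaluates the state check and the write as a single
    # statement, so no other writer can slip in between the two.
    cur = conn.execute(
        "UPDATE instances SET task_state = ? "
        "WHERE uuid = ? AND task_state = ?",
        (new_state, uuid, expected))
    conn.commit()
    # rowcount is 1 if the row still had the expected state, 0 otherwise.
    return cur.rowcount == 1


conn = make_db()
first = save_if_expected(conn, "inst-1", "resize_prep", "resize_migrating")
second = save_if_expected(conn, "inst-1", "resize_prep", "resize_migrating")
print(first, second)  # True False
```

The second call matches no row because the state has already changed, so
the caller can treat a False return as an unexpected task state.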

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1297375

