yahoo-eng-team team mailing list archive
-
yahoo-eng-team team
-
Mailing list archive
-
Message #43700
[Bug 1528834] [NEW] HostManager host_state_map should have synchronized access
Public bug reported:
Reporting this as a bug to get some discussion rolling on how best to
fix it (or whether to fix it). I kind of assume this is a well known
problem, but since the code isn't covered with NOTEs about it, I thought
perhaps best to throw it up here.
In nova's master branch as of Dec 2015 the HostState objects and
host_state_map are not robust in the face of concurrent access. That
concurrent access doesn't happen frequently under test situations is
primarily the fault of the test situations and luck, not the correctness
of the code. Unless I'm completely misunderstanding the way in which
eventlet is being used, each of the following methods has the potential
to yield on I/O (rpc calls, log messages) or sleeps (not used, but handy
in debugging situations). This is not an exhaustive list, just what came
up in my digging in nova/scheduler/host_manager.py (digging described in
more detail below):
* HostManager.get_all_host_states
* HostState.consume_from_request
* HostState.update_from_compute_node
If these yield at exactly the wrong time, an ongoing set of
select_destinations calls will have some which have incorrect host
information. As we know, this is often true anyway, but the suggested
fix (below) is pretty lightweight, so perhaps a small improvement is
worth it?
How I figured this out:
I was reading the code and realized that access to the HostState is not
synchronized so tried to see if I could come up with a way to get
incorrect information to show up at the scheduler when running just one
scheduler, just one compute node and doing multiple "concurrent" server
creates (calling three creates over the API in a backgrounding loop). It
was pretty easy to report some incorrect RAM usage. To make things fail
more reliably I introduced small sleeps into consume_from_request to
force it to yield. I managed to get very incorrect usage information
(negative RAM).
Adding a utils.synchronized decorator to consume_from_request improved
the situation somewhat: At the end of the scheduling run the RAM usage
was off by one deduction. Adding a synchronized to get_all_host_states
cleared this up.
I'm not clear on the implications of these changes: they make sense to
me (we don't want to concurrent access to a shared data structure) but I
wonder if perhaps there are other things to consider that I'm not aware
of such as "well, actually, this stuff was supposed to be written so it
never yielded, that they can or may is the bug" (in which case the
methods need some phat warnings on them for future maintainers).
I'm also not clear on the semaphores that ought to be used (if
synchronization is the way to go). In my POC solution
get_all_host_states had its own semaphore while the other two methods
shared one.
Beyond that, though, it might make sense for the host_state_map and the
entire HostState object to be "thread" safe, to get all places where
this could be a problem rather than piecemeal dealing with methods as
curious people notice them.
** Affects: nova
Importance: Undecided
Status: New
** Tags: scheduler
--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1528834
Title:
HostManager host_state_map should have synchronized access
Status in OpenStack Compute (nova):
New
Bug description:
Reporting this as a bug to get some discussion rolling on how best to
fix it (or whether to fix it). I kind of assume this is a well known
problem, but since the code isn't covered with NOTEs about it, I
thought perhaps best to throw it up here.
In nova's master branch as of Dec 2015 the HostState objects and
host_state_map are not robust in the face of concurrent access. That
concurrent access doesn't happen frequently under test situations is
primarily the fault of the test situations and luck, not the
correctness of the code. Unless I'm completely misunderstanding the
way in which eventlet is being used, each of the following methods has
the potential to yield on I/O (rpc calls, log messages) or sleeps (not
used, but handy in debugging situations). This is not an exhaustive
list, just what came up in my digging in
nova/scheduler/host_manager.py (digging described in more detail
below):
* HostManager.get_all_host_states
* HostState.consume_from_request
* HostState.update_from_compute_node
If these yield at exactly the wrong time, an ongoing set of
select_destinations calls will have some which have incorrect host
information. As we know, this is often true anyway, but the suggested
fix (below) is pretty lightweight, so perhaps a small improvement is
worth it?
How I figured this out:
I was reading the code and realized that access to the HostState is
not synchronized so tried to see if I could come up with a way to get
incorrect information to show up at the scheduler when running just
one scheduler, just one compute node and doing multiple "concurrent"
server creates (calling three creates over the API in a backgrounding
loop). It was pretty easy to report some incorrect RAM usage. To make
things fail more reliably I introduced small sleeps into
consume_from_request to force it to yield. I managed to get very
incorrect usage information (negative RAM).
Adding a utils.synchronized decorator to consume_from_request improved
the situation somewhat: At the end of the scheduling run the RAM usage
was off by one deduction. Adding a synchronized to get_all_host_states
cleared this up.
I'm not clear on the implications of these changes: they make sense to
me (we don't want to concurrent access to a shared data structure) but
I wonder if perhaps there are other things to consider that I'm not
aware of such as "well, actually, this stuff was supposed to be
written so it never yielded, that they can or may is the bug" (in
which case the methods need some phat warnings on them for future
maintainers).
I'm also not clear on the semaphores that ought to be used (if
synchronization is the way to go). In my POC solution
get_all_host_states had its own semaphore while the other two methods
shared one.
Beyond that, though, it might make sense for the host_state_map and
the entire HostState object to be "thread" safe, to get all places
where this could be a problem rather than piecemeal dealing with
methods as curious people notice them.
To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1528834/+subscriptions
Follow ups