← Back to team overview

openstack team mailing list archive

[nova] mysql connection optimization

 

Hi all,

I've been taking a look at the way Nova uses its MySQL database. Having
done MySQL performance audits for years as a consultant, a few things
jumped out right away at me. First is the way that SQLAlchemy is wrapping
nearly every query in an unnecessary "ping check" and rollback, eg.:

select 1;
> select ... from ... where ...;
> rollback;
> select 1;
> update ... where ...;
> commit;
> rollback;


You can find a complete sample here: http://paste.openstack.org/show/18731/

I think I understand the reason for both the "select 1" and the "rollback"
statements. However, in the interest of performance for large-scale
deployments, I feel pretty strongly that they need to go away.

As I see it, there are three factors here.

(I) Most of the code in db/sqlalchemy/api.py abstracts a "unit of work" to
a very low level, generally a single database read or write, and does not
currently support transactions spanning multiple consistent writes. This is
why "select 1" and "rollback" appear around almost every query -- most
functions in api.py is checking out a session object, doing one or two
queries, and then releasing the session. This actually creates a much
larger issue about transactional atomicity for larger operations, such as
the open bug about network creation here, and is probably better for
another discussion.
https://bugs.launchpad.net/nova/+bug/755138.

(II) Connections are tested by the MySQLPingListener function, using
"SELECT 1" statements, every time a connection / session is checked out of
the pool. Since connections are usually OK, this adds overhead
unnecessarily. It would be more efficient to handle the errors when
connections aren't OK. I've opened a bug with a description of one possible
way to fix this issue. There are probably other viable solutions as well.
https://bugs.launchpad.net/nova/+bug/1007027

My understanding is that this was implemented to prevent "Database has gone
away" errors from occurring every time that the database is restarted, or
when connections are closed by the database after being idle for long
periods. In my opinion, a better solution to these problems is to:
* wrap queries in retry logic, which will catch disconnect errors and
attempt to reconnect to the database. I have a patch for this, if folks are
interested.
* set Nova's sql_idle_timeout to less than MySQL's wait_timeout, and set
them both to a reasonably short interval (eg, 1 minute), rather than the
default (which I think is 8 hours).

(III) Transaction state is reset when connections are returned to the pool,
even if no transaction was in progress, or the transaction-in-progress
already committed. This is completely wasteful, and easy to disable.
https://bugs.launchpad.net/nova/+bug/1007038

Caveat here is that this reset-on-return functionality *is* useful when
code doesn't properly clean up its own transaction state. When I turned it
off, it exposed a few bugs, which I haven't tracked down yet. Lowering the
sql_idle_timeout will provide the same "solution" as the current
reset-on-return behavior in that it will prevent long-running idle
transactions that tie up resources unnecessarily.


In summary, I'd like to see Nova stop spamming the database with "SELECT 1"
and  "ROLLBACK", and think this should be pretty easy to do. Testing these
two changes in my devstack seems to work fine, but I'd like to know what
others think before I propose a changeset with this kind of potential
impact.

Cheers,
Devananda

Follow ups