Re: Caching strategies in Nova ...

 

This is great: hard numbers are exactly what we need.  I would love to see
a statement-by-statement SQL log with timings from someone who has a
performance issue.  I'm happy to look into any DB problems it
demonstrates.
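
For anyone wanting to produce such a log: PostgreSQL can do it with
log_min_duration_statement = 0, and on the SQLAlchemy side something along
these lines should work (the engine URL is just a placeholder, not Nova's
actual configuration):

import logging
import time

from sqlalchemy import create_engine, event

log = logging.getLogger("sqltimer")
engine = create_engine("mysql://nova:nova@localhost/nova")  # placeholder URL

@event.listens_for(engine, "before_cursor_execute")
def before_cursor_execute(conn, cursor, statement, parameters,
                          context, executemany):
    context._query_start = time.time()

@event.listens_for(engine, "after_cursor_execute")
def after_cursor_execute(conn, cursor, statement, parameters,
                         context, executemany):
    # Log every statement with the time it took, in seconds.
    log.info("%.4fs %s", time.time() - context._query_start, statement)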

The nova database is small enough that it should always fit in memory (if
you're running a million VMs, I don't think asking for one gigabyte of RAM
on your DB server is unreasonable!)

If it isn't hitting disk, PostgreSQL or MySQL with InnoDB can serve 10k
'indexed' requests per second through SQL on a low-end (<$1000) box.  With
tuning you can get 10x that.  Using one of the SQL bypass engines (e.g.
MySQL HandlerSocket) can supposedly give you 10x again.  Throwing money at
the problem in the form of multi-processor boxes (or disks if you're I/O
bound) can probably get you 10x again.

However, if you put a DB on a remote host, you'll have to wait for a
network round-trip per query.  If your ORM is issuing 1+N queries (one
query for the list, then one more per row; see the sketch below), the
total read time will be slow.  If your DB is doing a sync on every write,
writes will be slow.  If the DB isn't tuned with a sensible amount of
cache (at least as large as the DB itself), it will be slower.  Each of
these has a very simple fix for OpenStack.
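
To make the 1+N point concrete, here's a toy SQLAlchemy sketch (throwaway
models on an in-memory SQLite DB, nothing to do with Nova's real schema).
With echo=True you can watch the lazy-loading loop issue an extra SELECT
per row, while joinedload pulls everything back in one query:

from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import joinedload, relationship, sessionmaker

Base = declarative_base()

class Instance(Base):
    __tablename__ = "instances"
    id = Column(Integer, primary_key=True)
    hostname = Column(String(255))
    security_groups = relationship("SecurityGroup")

class SecurityGroup(Base):
    __tablename__ = "security_groups"
    id = Column(Integer, primary_key=True)
    instance_id = Column(Integer, ForeignKey("instances.id"))
    name = Column(String(255))

engine = create_engine("sqlite://", echo=True)  # echo shows each statement
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add(Instance(hostname="vm-1",
                     security_groups=[SecurityGroup(name="default")]))
session.commit()

# 1+N: one SELECT for the instance list, then another SELECT per
# instance when .security_groups is lazily loaded inside the loop.
for inst in session.query(Instance).all():
    print(inst.hostname, [sg.name for sg in inst.security_groups])

# Single round trip: the same data pulled back in one joined query.
for inst in session.query(Instance).options(
        joinedload("security_groups")).all():
    print(inst.hostname, [sg.name for sg in inst.security_groups])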

Relational databases have very efficient caching mechanisms built in.  Any
out-of-process cache will have a hard time beating them.  Let's make sure
the bottleneck really is the DB, and not (for example) RabbitMQ, before we
go off on a huge rearchitecture.

Justin



On Thu, Mar 22, 2012 at 7:53 PM, Mark Washenberger <
mark.washenberger@xxxxxxxxxxxxx> wrote:

> Working on this independently, I created a branch with some simple
> performance logging around the nova-api, and individually around
> glance, nova.db, and nova.rpc calls. (Sorry, I only have a local
> copy and it's on a different computer right now, and probably needs
> a rebase. I will rebase and publish it on GitHub tomorrow.)
>
> With this logging, I could get some simple profiling that I found
> very useful. Here is a GH project with the analysis code as well
> as some nova-api logs I was using as input.
>
> https://github.com/markwash/nova-perflog
>
> With these tools, you can get a wall-time profile for individual
> requests. For example, looking at one server create request (and
> you can run this directly from the checkout as the logs are saved
> there):
>
> markw@poledra:perflogs$ cat nova-api.vanilla.1.5.10.log | python
> profile-request.py req-3cc0fe84-e736-4441-a8d6-ef605558f37f
> key                                        count    avg
> nova.api.openstack.wsgi.POST                   1  0.657
> nova.db.api.instance_update                    1  0.191
> nova.image.show                                1  0.179
> nova.db.api.instance_add_security_group        1  0.082
> nova.rpc.cast                                  1  0.059
> nova.db.api.instance_get_all_by_filters        1  0.034
> nova.db.api.security_group_get_by_name         2  0.029
> nova.db.api.instance_create                    1  0.011
> nova.db.api.quota_get_all_by_project           3  0.003
> nova.db.api.instance_data_get_for_project      1  0.003
>
> key                      count  total
> nova.api.openstack.wsgi      1  0.657
> nova.db.api                 10  0.388
> nova.image                   1  0.179
> nova.rpc                     1  0.059
>
> All times are in seconds. The nova.rpc time is probably high
> since this was the first call since server restart, so the
> connection handshake is probably included. This is also probably
> 1.5 months stale.
>
> The conclusion I reached from this profiling is that we just plain
> overuse the db (and we might do the same in glance). For example,
> whenever we do updates, we actually re-retrieve the item from the
> database, update its dictionary, and save it. This is double the
> cost it needs to be. We also handle updates for data across tables
> inefficiently, where they could be handled in a single database round
> trip.
>
> In particular, in the case of server listings, extensions are just
> rough on performance. Most extensions hit the database again
> at least once. This isn't really so bad, but it clearly is an area
> where we should improve, since these are the most frequent api
> queries.
>
> I just see a ton of specific performance problems that are easier
> to address one by one, rather than diving into a general (albeit
> obvious) solution such as caching.
>
>
> "Sandy Walsh" <sandy.walsh@xxxxxxxxxxxxx> said:
>
> > We're doing tests to find out where the bottlenecks are, caching is the
> > most obvious solution, but there may be others. Tools like memcache do a
> > really good job of sharing memory across servers so we don't have to
> > reinvent the wheel or hit the db at all.
> >
> > In addition to looking into caching technologies/approaches we're gluing
> > together some tools for finding those bottlenecks. Our first step will
> > be finding them, then squashing them ... however.
> >
> > -S
> >
> > On 03/22/2012 06:25 PM, Mark Washenberger wrote:
> >> What problems are caching strategies supposed to solve?
> >>
> >> On the nova compute side, it seems like streamlining db access and
> >> api-view tables would solve any performance problems caching would
> >> address, while keeping the stale data management problem small.
> >>
> >
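
To illustrate the update pattern Mark describes above, reusing the toy
models from the earlier sketch: the first form costs a SELECT plus an
UPDATE, the second is a single UPDATE statement.  How closely this maps
onto nova.db.api.instance_update is an assumption worth checking against
the real code.

instance_id = 1  # placeholder

# Read-modify-write: one SELECT, then one UPDATE (two round trips).
inst = session.query(Instance).get(instance_id)
inst.hostname = "vm-1-renamed"
session.commit()

# Direct update: a single UPDATE statement, no prior SELECT.
session.query(Instance).filter_by(id=instance_id).update(
    {"hostname": "vm-1-renamed"})
session.commit()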
