Performance diagnosis of metadata query
The performance of the metadata query with cloud-init has been causing
problems for some people (it's so slow that cloud-init times out!), and has
led to the suggestion that we need lots of caching. (My hypothesis is that
we don't...)
By turning on SQL debugging in SQLAlchemy (for which I've proposed a patch
for Essex: https://review.openstack.org/#change,5783), I was able to
capture the SQL statements.
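(If you don't want to carry the patch, SQLAlchemy can also be told to log
statements directly. A minimal sketch, using an in-memory SQLite engine so
it runs stand-alone - in a real deployment you'd point it at the nova
database:)

import logging

from sqlalchemy import create_engine, text

# Turn up the sqlalchemy.engine logger so every SQL statement is logged.
logging.basicConfig()
logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)

# In-memory SQLite just so the snippet is self-contained; in practice this
# would be the nova MySQL connection URL.
engine = create_engine('sqlite://')
conn = engine.connect()
conn.execute(text('SELECT 1'))  # this statement now appears in the log
conn.close()

(create_engine(..., echo=True) does the same thing on a per-engine basis.)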
I'm focusing on the SQL statements for the metadata call.
The code does this:
1) Checks the cache to see if it has the data
2) Makes a message-bus call to the network service to get the fixed_ip info
from the address
3) Looks up all sort of metadata in the database
4) Formats the reply
#1 means that the first call is slower than the others, so we need to focus
on the first call.
#2 could be problematic if the message queue is overloaded or if the
network service is slow to respond
#3 could be problematic if the DB isn't working properly
#4 is hopefully not the problem.
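To make the shape of that flow concrete, here's a stripped-down sketch. It
is not the real Nova code, and the network_rpc/db helper names are made up
for illustration only:

class MetadataHandler(object):
    """Stripped-down sketch of the metadata request path (not real Nova code)."""

    def __init__(self, network_rpc, db):
        self._cache = {}                 # step 1: per-address reply cache
        self._network_rpc = network_rpc  # stand-in for the message-bus client
        self._db = db                    # stand-in for the DB API

    def get_metadata(self, remote_address):
        # 1) Cache check: only the first call for an address misses.
        if remote_address in self._cache:
            return self._cache[remote_address]

        # 2) Message-bus call to the network service for the fixed_ip record.
        fixed_ip = self._network_rpc.get_fixed_ip_by_address(remote_address)

        # 3) Database lookups for the instance and its metadata.
        instance = self._db.instance_get(fixed_ip['instance_id'])

        # 4) Format the reply and remember it for next time.
        reply = {'instance-id': instance['uuid'], 'local-ipv4': remote_address}
        self._cache[remote_address] = reply
        return reply

Only step 1 gets cheaper on later calls, which is why the first call per
instance is the one worth profiling.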
The relevant SQL log from the API server:
http://paste.openstack.org/show/12109/
And from the network server: http://paste.openstack.org/show/12116/
I've analyzed each of the SQL statements:
API
http://paste.openstack.org/show/12110/ (Need to check that there isn't a
table scan when instance_info_caches is large)
http://paste.openstack.org/show/12111/ Table scan on services table, but
this is presumably smallish
http://paste.openstack.org/show/12112/ No index. Table scan on s3_images
table. Also this table is MyISAM. Also seems to insert rows on the first
call (not shown). Evil.
http://paste.openstack.org/show/12113/
http://paste.openstack.org/show/12114/ block_device_mapping is MyISAM.
Network
http://paste.openstack.org/show/12117/
http://paste.openstack.org/show/12118/ (Fetch virtual_interface by
instance_id)
http://paste.openstack.org/show/12119/ (Fetch network by id)
http://paste.openstack.org/show/12120/ Missing index => table scan on
networks. Unnecessary row re-fetch.
http://paste.openstack.org/show/12121/ Missing index => table scan on
virtual_interfaces. Unnecessary row re-fetch.
http://paste.openstack.org/show/12122/ (Fetch fixed_ips on virtual
interface)
http://paste.openstack.org/show/12123/ Missing index => table scan on
networks. Unnecessary row re-fetch. (A double re-fetch, in fact - why?)
http://paste.openstack.org/show/12124/ Missing index => table scan on
virtual_interfaces. Another re-re-fetch.
http://paste.openstack.org/show/12125/ Missing index => table scan on
fixed_ips (Uh-oh - I hope you didn't allocate a /8!!). We already have this
row from the virtual-interface lookup; perhaps we could remove this query?
http://paste.openstack.org/show/12126/
http://paste.openstack.org/show/12127/
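If the missing indexes above turn out to be the culprit, the fix is probably
a one-off migration along these lines. This is only a sketch: the column
choices (fixed_ips.address, virtual_interfaces.instance_id) are my reading
of the pasted queries and need checking, and I've left networks out because
I haven't confirmed which column that scan filters on.

from sqlalchemy import MetaData, Table, Index


def upgrade(migrate_engine):
    # Add indexes for the lookups that currently appear to table-scan.
    meta = MetaData(bind=migrate_engine)

    fixed_ips = Table('fixed_ips', meta, autoload=True)
    Index('ix_fixed_ips_address', fixed_ips.c.address).create(migrate_engine)

    vifs = Table('virtual_interfaces', meta, autoload=True)
    Index('ix_virtual_interfaces_instance_id',
          vifs.c.instance_id).create(migrate_engine)


def downgrade(migrate_engine):
    meta = MetaData(bind=migrate_engine)

    fixed_ips = Table('fixed_ips', meta, autoload=True)
    Index('ix_fixed_ips_address', fixed_ips.c.address).drop(migrate_engine)

    vifs = Table('virtual_interfaces', meta, autoload=True)
    Index('ix_virtual_interfaces_instance_id',
          vifs.c.instance_id).drop(migrate_engine)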
We still have a bunch of MyISAM tables (at least with a devstack install):
http://paste.openstack.org/show/12115/
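Converting those to InnoDB is also a small, MySQL-only job. A rough sketch,
with the table list taken from the paste (re-check it against your own
install before running anything):

from sqlalchemy import create_engine

# Table list taken from the paste above -- re-check before running.
MYISAM_TABLES = ['s3_images', 'block_device_mapping']


def convert_to_innodb(engine):
    # MySQL-only: rewrite each table with the InnoDB storage engine.
    for table in MYISAM_TABLES:
        engine.execute('ALTER TABLE %s ENGINE=InnoDB' % table)


if __name__ == '__main__':
    # Example connection URL only -- point this at the real nova database.
    convert_to_innodb(create_engine('mysql://nova:secret@localhost/nova'))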
As I see it, these are the issues (in sort of priority order):
*Critical*
Table scan of fixed_ips on the network service (row per IP address?)
Use of MyISAM tables, particularly for s3_images and block_device_mapping
Table scan of virtual_interfaces (row per instance?)
Verify that MySQL isn't doing a table scan on
http://paste.openstack.org/show/12110/ when the number of instances is
large (see the EXPLAIN sketch below this list)
*Naughty*
*(Mostly because the tables are small)*
Table scan of s3_images
Table scan of services
Table scan of networks
*Low importance*
*(Re-fetches aren't a big deal if the queries are fast)*
Row re-fetches & re-re-fetches
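For the "verify there's no table scan" item above, MySQL's EXPLAIN output
is enough. A small sketch - the connection URL and query are placeholders
to substitute with the real nova database and the statement from the paste:

from sqlalchemy import create_engine

# Placeholder URL and query -- substitute the real nova database and the
# statement from http://paste.openstack.org/show/12110/.
engine = create_engine('mysql://nova:secret@localhost/nova')
query = "SELECT * FROM instance_info_caches WHERE instance_id = 1"

# A 'type' of ALL in the EXPLAIN output means a full table scan.
for row in engine.execute('EXPLAIN ' + query):
    print(row)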
My install is nowhere near big enough for any of these to actually cause a
real problem, so I'd love to get timings / a log from someone who is
having a problem. Even the table scan of fixed_ips should be OK if you
have enough RAM.
Justin