Re: AggregateInstanceExtraSpecs very slow?

On Tue, Feb 26, 2013 at 3:15 PM, Sam Morrison <sorrison@xxxxxxxxx> wrote:
>
> On 26/02/2013, at 2:15 PM, Chris Behrens <cbehrens@xxxxxxxxxxxx> wrote:
>
>>
>> On Feb 25, 2013, at 6:39 PM, Joe Gordon <jogo@xxxxxxxxxxxxxxxx> wrote:
>>
>>>
>>> It looks like the scheduler issues are related to the rabbitmq issues.   "host 'qh2-rcc77' ... is disabled or has not been heard from in a while"
>>>
>>> What does 'nova host-list' say? Are the clocks all synced up?
>>
>> Good things to check.  It feels like something is spinning way too much within this filter, though.  This can also cause the above message.  The scheduler pulls all of the records before it starts filtering… and if there's a huge delay somewhere, it can start seeing a bunch of hosts as disabled.
>>
>> The filter doesn't look like a problem… unless there's a large amount of aggregate metadata and/or a large amount of key/values for the instance_type's extra specs.  There *is* a DB call in the filter.  If that's blocking for an extended period of time, the whole process is blocked…  But I suspect, from the '100% cpu' comment, that this is not the case…  So the only thing I can think of is that it returns a tremendous amount of metadata.
>>
>> Adding some extra logging in the filter could be useful.
>>
>> - Chris
>
> Thanks Chris. I have 2 aggregates and 2 keys defined, and each of the 80 hosts has one or the other. At the moment every flavour has one or the other too, so I don't think it's too much data.
>
> I've tracked it down to this call:
>
> metadata = db.aggregate_metadata_get_by_host(context, host_state.host)
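
(For anyone who wants to reproduce this: timing that call inside the filter is enough to see where the time goes. A rough sketch, with the surrounding filter code paraphrased rather than copied from nova:)

    import time

    # Inside the filter's host_passes(), around the DB call; LOG here is
    # the filter module's existing logger.
    start = time.time()
    metadata = db.aggregate_metadata_get_by_host(context, host_state.host)
    LOG.debug("aggregate_metadata_get_by_host took %.2fs for host %s",
              time.time() - start, host_state.host)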

More debugging has narrowed it down to one query.

In db.api.aggregate_metadata_get_by_host:

    query = model_query(context, models.Aggregate).join(
            "_hosts").filter(models.AggregateHost.host == host).join(
            "_metadata")
    ......
    rows = query.all()
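
(By "query debug" below I mean SQLAlchemy's engine logging; one way to turn it on, assuming you can get at the Python logging config of the nova service, is:)

    import logging

    # Make SQLAlchemy log every SQL statement it sends to the database.
    logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)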

With query debugging turned on, this resolves to:

    SELECT aggregates.created_at AS aggregates_created_at,
           aggregates.updated_at AS aggregates_updated_at,
           aggregates.deleted_at AS aggregates_deleted_at,
           aggregates.deleted AS aggregates_deleted,
           aggregates.id AS aggregates_id,
           aggregates.name AS aggregates_name,
           aggregates.availability_zone AS aggregates_availability_zone,
           aggregate_hosts_1.created_at AS aggregate_hosts_1_created_at,
           aggregate_hosts_1.updated_at AS aggregate_hosts_1_updated_at,
           aggregate_hosts_1.deleted_at AS aggregate_hosts_1_deleted_at,
           aggregate_hosts_1.deleted AS aggregate_hosts_1_deleted,
           aggregate_hosts_1.id AS aggregate_hosts_1_id,
           aggregate_hosts_1.host AS aggregate_hosts_1_host,
           aggregate_hosts_1.aggregate_id AS aggregate_hosts_1_aggregate_id
    FROM aggregates
    INNER JOIN aggregate_hosts AS aggregate_hosts_2
        ON aggregates.id = aggregate_hosts_2.aggregate_id
        AND aggregate_hosts_2.deleted = 0 AND aggregates.deleted = 0
    INNER JOIN aggregate_hosts
        ON aggregate_hosts.aggregate_id = aggregates.id
        AND aggregate_hosts.deleted = 0 AND aggregates.deleted = 0
    INNER JOIN aggregate_metadata AS aggregate_metadata_1
        ON aggregates.id = aggregate_metadata_1.aggregate_id
        AND aggregate_metadata_1.deleted = 0 AND aggregates.deleted = 0
    INNER JOIN aggregate_metadata
        ON aggregate_metadata.aggregate_id = aggregates.id
        AND aggregate_metadata.deleted = 0 AND aggregates.deleted = 0
    LEFT OUTER JOIN aggregate_hosts AS aggregate_hosts_3
        ON aggregates.id = aggregate_hosts_3.aggregate_id
        AND aggregate_hosts_3.deleted = 0 AND aggregates.deleted = 0
    LEFT OUTER JOIN aggregate_hosts AS aggregate_hosts_1
        ON aggregate_hosts_1.aggregate_id = aggregates.id
        AND aggregate_hosts_1.deleted = 0 AND aggregates.deleted = 0
    WHERE aggregates.deleted = 0 AND aggregate_hosts.host = 'qh2-rcc34';

In our case this returns 328509 rows in set (25.97 sec).

That seems a bit off, considering there are only 80 rows in
aggregate_hosts, 2 rows in aggregates, and 2 rows in aggregate_metadata.

In the code, rows ends up with just one element, so SQLAlchemy seems to
be collapsing all of those rows somewhere internally. I don't know much
about how SQLAlchemy works.
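
(My best guess at what's going on, judging by the aliased LEFT OUTER JOINs in the SQL above: the Aggregate model eagerly loads its hosts and metadata, the explicit .join() calls add the same tables again, and each extra join multiplies the raw rows; the ORM then folds the duplicates back together by primary key, which would explain rows having length 1. A self-contained toy reproduction of that folding, not nova code, written against SQLAlchemy 1.4+:)

    from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base, relationship, sessionmaker

    Base = declarative_base()

    class Aggregate(Base):
        __tablename__ = 'aggregates'
        id = Column(Integer, primary_key=True)
        # lazy='joined' mimics an eagerly loaded relationship, which is
        # what the aliased LEFT OUTER JOINs above suggest nova is doing.
        hosts = relationship('AggregateHost', lazy='joined')

    class AggregateHost(Base):
        __tablename__ = 'aggregate_hosts'
        id = Column(Integer, primary_key=True)
        host = Column(String(255))
        aggregate_id = Column(Integer, ForeignKey('aggregates.id'))

    engine = create_engine('sqlite://', echo=True)  # echo shows the SQL
    Base.metadata.create_all(engine)
    session = sessionmaker(bind=engine)()

    session.add(Aggregate(id=1))
    session.add_all([AggregateHost(host='h%d' % i, aggregate_id=1)
                     for i in range(40)])
    session.commit()

    # Joining a relationship that is *also* eagerly loaded makes SQLAlchemy
    # join aggregate_hosts twice: once for the filter, once (aliased) for
    # the eager load.  The raw result set has 40 rows (1 filter match x 40
    # eagerly loaded hosts), but the ORM folds them into a single Aggregate
    # by primary key.
    rows = session.query(Aggregate).join(Aggregate.hosts).filter(
        AggregateHost.host == 'h0').all()
    print(len(rows))   # 1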

Seems like a bug to me, or maybe our database has something wrong in it?
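
(If it is that eager-load interaction, one possible workaround would be to fetch the metadata rows directly, without going through the Aggregate model at all. A hypothetical, untested sketch against nova's models, not a real patch:)

    # Hypothetical rewrite of aggregate_metadata_get_by_host(); untested.
    # It reads aggregate_metadata directly, joined to aggregate_hosts on
    # aggregate_id, so the Aggregate model's eager loads never fire.
    rows = model_query(context, models.AggregateMetadata).join(
            models.AggregateHost,
            models.AggregateHost.aggregate_id ==
                models.AggregateMetadata.aggregate_id).filter(
            models.AggregateHost.host == host).all()

The caller would then get AggregateMetadata rows rather than Aggregate objects, so the dict-building code after the query would need adjusting too.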

Cheers,
Sam

