openstack team mailing list archive

Re: AggregateInstanceExtraSpecs very slow?

To: Joe Gordon <jogo@xxxxxxxxxxxxxxxx>
From: Sam Morrison <sorrison@xxxxxxxxx>
Date: Tue, 26 Feb 2013 14:06:08 +1100
Cc: "openstack@xxxxxxxxxxxxxxxxxxx list" <openstack@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <CAKe5d-SL6bSZ2YyFT3r2yUWJ6aDAC66G8kd0RZEV5TWt6w5ruA@mail.gmail.com>

Hi Joe,

On 26/02/2013, at 1:39 PM, Joe Gordon <jogo@xxxxxxxxxxxxxxxx> wrote:

> 
> 
> On Mon, Feb 25, 2013 at 6:14 PM, Sam Morrison <sorrison@xxxxxxxxx> wrote:
> Hi Joe,
> 
> On 26/02/2013, at 11:19 AM, Joe Gordon <jogo@xxxxxxxxxxxxxxxx> wrote:
> 
>> On Sun, Feb 24, 2013 at 3:31 PM, Sam Morrison <sorrison@xxxxxxxxx> wrote:
>> I have been playing with the AggregateInstanceExtraSpecs filter and can't get it to work.
>> 
>> In our staging environment it works fine with 4 compute nodes; I have 2 aggregates that split them into 2 groups.
>> 
>> When I try to do the same in our production environment, which has 80 compute nodes (again split into 2 aggregates), it doesn't work.
>> 
>> nova-scheduler becomes very slow. I scheduled an instance and gave up after 5 minutes; it seemed to be taking ages and the host was at 100% CPU. There were also about 500 unacknowledged messages in RabbitMQ.
>> 
>> 
>> What does the nova-scheduler log say? Where are the unacknowledged RabbitMQ messages sent from?
> 
> Logs are below. Note the large time gap when selecting a host; this is pretty much instantaneous without this filter.
> 
> I can't figure out how to inspect an unacknowledged message in RabbitMQ, but my guess is that they are the compute service updates from all the compute nodes. These updates aren't being processed, and I think this is why the scheduling attempts further down in the logs are rejected with "is disabled or has not been heard from in a while".
> 
> Do you see anything that could be an issue? The flags we use for the scheduler are also below:
> 
> Thanks for your help,
> Sam
> 
> 
> It looks like the scheduler issues are related to the RabbitMQ issues: "host 'qh2-rcc77' ... is disabled or has not been heard from in a while".
> 
> What does 'nova host-list' say? Are the clocks all synced up?
>  

Yeah, all the clocks are synced up fine. Running nova-manage service list shows :-) for every service, and the updated-at times are correct.

We only have one nova-scheduler. It locks up and runs at 100% CPU. While this is happening, nova-scheduler seems to take the compute service updates off the queue but doesn't ack them and, going by the logs, doesn't process them. This is why I suspect the hosts are eventually rejected with a "not been heard from in a while" message.
I believe this is only a symptom, though: the real issue is nova-scheduler locking up. It seems to take 30-60 seconds to process each host to determine whether it passes the filters.
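
To make the suspected chain of events concrete, here is a rough Python sketch of the reasoning, not Nova's actual code. It assumes a host counts as alive only if its last recorded update is within a fixed window (60 seconds here, which I believe matches the default of Nova's service_down_time option), and that no updates get recorded while the scheduler is busy filtering. The names service_is_up and simulate_filter_pass are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumed liveness window; believed to match Nova's default service_down_time.
SERVICE_DOWN_TIME = timedelta(seconds=60)


def service_is_up(last_heartbeat, now, down_time=SERVICE_DOWN_TIME):
    """Return True if the last recorded update is recent enough."""
    return abs(now - last_heartbeat) <= down_time


def simulate_filter_pass(hosts, last_heartbeat, start, seconds_per_host):
    """Check hosts one at a time while no new updates are recorded.

    Each host costs `seconds_per_host` of scheduler time, so hosts checked
    later are compared against an increasingly stale heartbeat.
    Returns the hosts that would be rejected as not heard from.
    """
    rejected = []
    now = start
    for host in hosts:
        if not service_is_up(last_heartbeat[host], now):
            rejected.append(host)
        # Time spent filtering this host; heartbeats stay frozen meanwhile.
        now += timedelta(seconds=seconds_per_host)
    return rejected


start = datetime(2013, 2, 26, 14, 0, 0)
hosts = ["h0", "h1", "h2", "h3"]  # hypothetical host names
heartbeats = {h: start for h in hosts}

fast = simulate_filter_pass(hosts, heartbeats, start, seconds_per_host=0)
slow = simulate_filter_pass(hosts, heartbeats, start, seconds_per_host=45)
print(fast, slow)
```

With instantaneous filtering nothing is rejected, but at 45 seconds per host every host from the third onwards looks stale, even though the compute nodes themselves are healthy. With 80 hosts that would affect nearly all of them.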

Does that make sense? Any other ideas on how to debug? 

Cheers,
Sam
