openstack team mailing list archive

Re: [Openstack-operators] Nova Controller HA issues

 

I'm running 2 full nova controllers behind an NGINX load balancer.  While there is still a chance of half-completed tasks, it's been working very well.
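
The half-completed task risk is described in the quoted reply further down as consuming a message from Rabbit without completing the task. A toy sketch of why the timing of the acknowledgement matters is below. Everything in it is an assumption made for illustration: `ToyBroker` and `run_worker` are hypothetical names, this is an in-memory model and not the RabbitMQ or nova RPC API, and I have not verified how nova acknowledges messages in Essex.

```python
from collections import deque


class ToyBroker:
    """In-memory stand-in for a message queue. This is NOT the RabbitMQ API."""

    def __init__(self, messages):
        self.ready = deque(messages)   # messages waiting for delivery
        self.unacked = {}              # delivered but not yet acknowledged
        self._next_tag = 0

    def get(self):
        # Deliver one message and remember it until it is acknowledged.
        msg = self.ready.popleft()
        self._next_tag += 1
        self.unacked[self._next_tag] = msg
        return self._next_tag, msg

    def ack(self, tag):
        # The broker forgets the message once it is acknowledged.
        del self.unacked[tag]

    def consumer_died(self):
        # Unacknowledged messages go back to the queue for redelivery.
        self.ready.extend(self.unacked.values())
        self.unacked.clear()


def run_worker(broker, ack_early, crash_on):
    """Handle one message; simulate a crash mid-task if it equals crash_on."""
    tag, msg = broker.get()
    if ack_early:
        broker.ack(tag)            # acknowledged before the work is done
    if msg == crash_on:
        broker.consumer_died()     # worker dies halfway through the task
        return None
    if not ack_early:
        broker.ack(tag)            # acknowledged only after the work is done
    return msg


early = ToyBroker(["boot-vm-1"])
run_worker(early, ack_early=True, crash_on="boot-vm-1")
print(list(early.ready))   # [] : the message is lost with the worker

late = ToyBroker(["boot-vm-1"])
run_worker(late, ack_early=False, crash_on="boot-vm-1")
print(list(late.ready))    # ['boot-vm-1'] : the message is redelivered
```

Even in the second case, redelivery only helps if the task itself is safe to run again, which is the "recover from half completed tasks" part of the problem.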

Each nova controller is running (full time) nova-scheduler, nova-cert, keystone, and 6 nova-api processes. All API requests go through NGINX which reverse proxies the traffic to these 2 systems.  

Example NGINX nova-api config:
upstream nova-api  {
  server hostA:8774 fail_timeout=30s;
  server hostB:8774 fail_timeout=30s;
  server hostA:18774 fail_timeout=30s;
  server hostB:18774 fail_timeout=30s;
  server hostA:28774 fail_timeout=30s;
  server hostB:28774 fail_timeout=30s;
  server hostA:38774 fail_timeout=30s;
  server hostB:38774 fail_timeout=30s;
  server hostA:48774 fail_timeout=30s;
  server hostB:48774 fail_timeout=30s;
  server hostA:58774 fail_timeout=30s;
  server hostB:58774 fail_timeout=30s;
}

server {
  listen x.x.x.x:8774;
  server_name public.name;

  location / {
    proxy_pass  http://nova-api;
    proxy_set_header        Host            "public.address:8774";
    proxy_set_header        X-Real-IP       $remote_addr;
    proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
  }
}
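
The upstream list above follows a regular pattern: six nova-api processes per host, each on port 8774 plus a multiple of 10000. Under that assumption (the pattern is inferred from the listing, and the function name and parameters are hypothetical), a small Python sketch can generate the block instead of maintaining it by hand:

```python
def nova_api_upstream(hosts, workers, base_port=8774, step=10000,
                      fail_timeout="30s"):
    """Render an nginx upstream block with one server line per (port, host)."""
    lines = ["upstream nova-api  {"]
    for i in range(workers):
        port = base_port + i * step          # 8774, 18774, 28774, ...
        for host in hosts:
            lines.append(f"  server {host}:{port} fail_timeout={fail_timeout};")
    lines.append("}")
    return "\n".join(lines)


# Reproduces the 12 server lines shown above for hostA and hostB.
print(nova_api_upstream(["hostA", "hostB"], workers=6))
```

With the default step, six workers is the most that fits, since a seventh would need port 68774, which is above the 65535 limit.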


Attached is a diagram that gives a brief overview of the HA environment I've set up.

--Jason Hedden

Attachment: Essex_Service_Map.pdf
Description: Adobe PDF document

On Jun 15, 2012, at 5:36 AM, John Garbutt wrote:

> I know there is some work in the XenAPI driver to make it resilient to these kinds of failures (to allow frequent updates of the nova code), and I think there were plans for the work to be reused in the Libvirt driver.
>  
> AFAIK, in Essex and lower, bad things can happen if you don’t wait for all the tasks to finish. You may well be OK some of the time.
>  
> It boils down to an issue of consuming the message from Rabbit but not completing the task, and not being able to recover from half completed tasks.
>  
> Hope that helps,
> John
>  
> From: Igor Laskovy [mailto:igor.laskovy@xxxxxxxxx] 
> Sent: 15 June 2012 11:31
> To: Christian Parpart
> Cc: John Garbutt; openstack-operators@xxxxxxxxxxxxxxxxxxx; <openstack@xxxxxxxxxxxxxxxxxxx>
> Subject: Re: [Openstack-operators] Nova Controller HA issues
>  
> I have been using OpenStack in my little lab for only a short time too :)
> 
> OK, you are right of course, but I meant a different design when I talked about virtualizing the controller nodes.
> 
> It could be just two dedicated hypervisors with dedicated shared storage/DRBD between them. These hypervisors would be standalone and not part of nova. Then maybe Pacemaker or another tool could handle availability by restarting the VM on the surviving node when the active one dies.
> 
> The main question here is how bad things can get if the controller nodes power off unexpectedly. In other words, when the VM restarts it will be in a crash-consistent state.
> Will some nova services lose anything here?
> Will RabbitMQ lose some data here? (I am new to RabbitMQ too)
> 
> Igor Laskovy
> facebook.com/igor.laskovy
> Kiev, Ukraine
> 
> On Jun 15, 2012 10:54 AM, "Christian Parpart" <trapni@xxxxxxxxx> wrote:
> Hey,
>  
> well, I said "I might be wrong" because I have no "clear" vision on how OpenStack works in
> its deepest detail, however, I would not like to depend on a controller node that
> is inside a virtual machine, controlled by compute nodes, that are controlled by the controller
> node. This sounds quite like a chicken-and-egg problem.
>  
> However, at the time of this writing, I think you'll have to have a working nova-scheduler process,
> which is responsible for deciding which compute node to spawn your VM on (what else?).
> Think about what you do when this VM (or all your controller VMs) dies terribly
> and you want to rebuild it: how do you plan to do this when your controller node is out of service?
>  
> In my case, I have put the controller services onto two compute nodes and use Pacemaker
> to switch between them: if one node goes down, the other can take over (via a shared service IP).
>  
> Again, these are my thoughts, and I am using OpenStack for just about a month now :-)
> But I hope this helps a bit...
>  
> Best regards,
> Christian Parpart.
>  
> On Fri, Jun 15, 2012 at 8:16 AM, Igor Laskovy <igor.laskovy@xxxxxxxxx> wrote:
> Why? Can you please clarify.
> 
> Igor Laskovy
> facebook.com/igor.laskovy
> Kiev, Ukraine
> 
> On Jun 15, 2012 1:55 AM, "Christian Parpart" <trapni@xxxxxxxxx> wrote:
> I don't think putting the controller node completely into a VM is good advice,
> at least when speaking of nova-scheduler and nova-api (if central).
>  
> I may be wrong, and if so, please correct me.
> 
> Christian.
>  
> On Thu, Jun 14, 2012 at 7:20 PM, Igor Laskovy <igor.laskovy@xxxxxxxxx> wrote:
> Hi, are there any updates on this?
> Can anybody clarify what happens if the controller nodes go through a hard shutdown?
> 
> I am thinking about a solution with two hypervisors, putting the controller
> node in a VM on shared storage, which can be relaunched when the active
> hypervisor dies.
> Any ideas or advice?
> 
> 
> On Tue, Jun 12, 2012 at 3:52 PM, John Garbutt <John.Garbutt@xxxxxxxxxx> wrote:
> > Sure, I get your point.
> >
> > I think Florian is working on some docs to help on that.
> >
> > Not sure how much has been done already.
> >
> >
> >
> > Cheers,
> >
> > John
> >
> >
> >
> > From: Christian Parpart [mailto:trapni@xxxxxxxxx]
> > Sent: 12 June 2012 13:47
> > To: John Garbutt
> > Cc: openstack-operators@xxxxxxxxxxxxxxxxxxx
> > Subject: Re: [Openstack-operators] Nova Controller HA issues
> >
> >
> >
> > Hey, yeah, I also found this page, but didn't find it that helpful yet; it
> > reads more like a theoretical paper on
> >
> > how they implemented it rather than telling me how to actually make it
> > happen (from the sysop point of view :-)
> >
> >
> >
> > I hoped that someone had faced this already, since I find it very
> > unintuitive to set up; otherwise I need to wait until
> >
> > I get more time to investigate it properly. :-)
> >
> >
> >
> > Regards,
> >
> > Christian.
> >
> > On Tue, Jun 12, 2012 at 12:52 PM, John Garbutt <John.Garbutt@xxxxxxxxxx>
> > wrote:
> >
> > I thought Rabbit had a built in HA solution these days:
> >
> > http://www.rabbitmq.com/ha.html
> >
> >
> >
> > From: openstack-operators-bounces@xxxxxxxxxxxxxxxxxxx
> > [mailto:openstack-operators-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of
> > Christian Parpart
> > Sent: 12 June 2012 09:59
> > To: openstack-operators@xxxxxxxxxxxxxxxxxxx
> > Subject: [Openstack-operators] Nova Controller HA issues
> >
> >
> >
> > Hi all,
> >
> >
> >
> > after spending the whole evening making our cloud controller node highly
> > available
> >
> > using Corosync/Pacemaker, of which I am really proud, I have
> > just a few
> >
> > problems left, and the one that freaks me out the most is rabbitmq-server.
> >
> >
> >
> > For that beast, I just can't seem to find good documentation on how to set
> > rabbitmq-server up
> >
> > properly for HA.
> >
> >
> >
> > Has anyone ever tried to set a nova controller (including the rabbitmq
> > dependency) up for HA?
> >
> > If so, I'd be pleased to share experiences, especially on the latter part.
> > :-)
> >
> >
> >
> > Best regards,
> >
> > Christian Parpart
> >
> >
> >
> >
> > _______________________________________________
> > Openstack-operators mailing list
> > Openstack-operators@xxxxxxxxxxxxxxxxxxx
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
> >
> 
> 
> 
> --
> Igor Laskovy
> Kiev, Ukraine
>  
>  
> _______________________________________________
> Mailing list: https://launchpad.net/~openstack
> Post to     : openstack@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~openstack
> More help   : https://help.launchpad.net/ListHelp
