
openstack team mailing list archive

running HA cluster of guests within openstack

 

I am probably not the first one to ask this, but since I didn't find a
thread about it, I am starting one.

Is there any shared experience on OpenStack's capabilities for running
a cluster of guests in the cloud? Do you have experience with the
following questions, or links to more info? The questions relate to
running a legacy HA cluster in a virtual environment and moving it
into the cloud.

1. Private networks between guests
  -> Doable now using Quantum
1.1. Defining VLANs visible to the guest machines to separate the
       cluster's internal traffic;
       VLAN tags should not be stripped by the host (QinQ)
1.2. Setting pre-defined MAC addresses for the guests, needed by non-IP
       traffic within the guest cluster (layer 2 addressing)
      - will Melange do this? According to the docs it's not in the plans.
2. HA capabilities
2.1. Failure notification times need to be fast, i.e. no TCP timeout allowed
      - there seems to be some activity to integrate Pacemaker
2.2. Failure notification of both guests and hosts needs to be included
2.3. The guest cluster controller should be able to monitor the states
      and get fast notifications of the events.
      - rather in milliseconds than in seconds
      - basically, the host should have the parent process of the guest
        PID report a child process failure.
      - The host should have a virtual watchdog that notices when a
        guest is stuck
2.4. Failure recovery time: how fast can OpenStack bring up a failed guest?
      - any measurements of the time from a failure to noticing it,
        and of the time until the guest is restarted and back up?
2.5. Virtual HW manager (guest isolation)
      - Are there any plans to integrate a component from which the state
        of a guest could be reliably queried, e.g. guaranteeing that if
        I ask to power off another guest, it gets done within a given
        time (milliseconds) rather than depending on e.g. some TCP
        timeout, which could lead to a split-brain case of running two
        similar guests simultaneously? E.g. starting another guest to
        replace a shut-down one, but due to some communication error
        the first one didn't really shut down before the new one is
        already up.
      - It should be possible to reliably cut off the guest's network
        and disk access to guard against the above case
2.6. Shared disks
     - Could there be a shared SCSI device concept for the legacy HW
       abstraction?
     - QEMU/KVM supports this; what would it take to make OpenStack
       understand such disk devices?
2.7. Isolation of redundant nodes
     - In some cases there are nodes that need to back each other up
       (2N, N+1); there should be a way to make sure they run on
       different hosts.
     - This project might be aiming for that?
       http://wiki.openstack.org/DistributedScheduler
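
To make the requirement in 2.7 concrete, here is a minimal sketch in
Python of the placement rule meant there: guests of the same redundancy
group must never land on the same host. Everything in it is assumed for
illustration only (the function name place_guests, the input format and
the host names); it is not an OpenStack or DistributedScheduler API, and
it ignores host capacity entirely.

```python
from collections import defaultdict


def place_guests(guests, hosts):
    """Assign guests to hosts with a simple anti-affinity rule.

    guests: list of (guest_name, group_name) tuples (hypothetical format)
    hosts:  list of host names
    Returns a dict mapping guest_name -> host_name.
    Raises ValueError if a group has more members than there are hosts.
    """
    groups_on_host = defaultdict(set)  # host -> groups already placed there
    load = {h: 0 for h in hosts}       # host -> number of guests placed
    placement = {}
    for guest, group in guests:
        # Only hosts that do not yet run a member of this group qualify.
        candidates = [h for h in hosts if group not in groups_on_host[h]]
        if not candidates:
            raise ValueError(
                "no host left for %s without breaking anti-affinity "
                "for group %s" % (guest, group))
        # Among the candidates, pick the least loaded host.
        host = min(candidates, key=lambda h: load[h])
        placement[guest] = host
        groups_on_host[host].add(group)
        load[host] += 1
    return placement


if __name__ == "__main__":
    # A 2N database pair plus one unrelated front-end guest.
    demo = place_guests(
        [("db-a", "db"), ("db-b", "db"), ("fe-a", "fe")],
        ["host1", "host2"])
    for name, host in sorted(demo.items()):
        print(name, "->", host)
```

A real scheduler would also have to keep the rule true after a failed
guest is restarted, which is where it ties into the questions in 2.4
and 2.5.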

This was something off the top of my head; it would be interesting to
hear your thoughts about these issues. The need comes from the telco
world, which would need a telco cloud with more real-time features of
this kind. Certainly the same applies to many other legacy
environments too.

BR,

 Ilkka Tengvall

