
Re: Re-architecting without cobbler

On Thu, May 10, 2012 at 1:23 AM, Gavin Panella
<gavin.panella@xxxxxxxxxxxxx> wrote:

>>> - zero downtime: rolling upgrades.
>>
>> This isn't the same - a stateful pserv will have short downtime per
>> pserv; stateless won't.
>
> I meant zero downtime across the cluster as a whole. Individual parts
> may blip but the cluster as a whole stays available. Even large schema
> changes cause only a degradation of service in one partition at a
> time.

Strictly speaking, this is downtime, and users will perceive it as
such. It's not complete downtime - Launchpad calls this 'partial
downtime' - and yes, it's better than complete downtime. So a single
big store will have schema changes requiring something like FDT to
stay short and sweet. OTOH the database is going to be tiny: 100K
nodes is 100K node rows + (say) 400K MAC rows. If every node were to
have 1K of data across its node row + MAC rows, we'd have a 100MB DB -
that's /tiny/. Add in a fill factor of 25% on heap pages, call it
125MB - still extremely small. 1M nodes -> a 1.25GB DB (time to put
some dedicated RAM in the DB server).
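
To make that arithmetic concrete (illustrative figures only - 1K per
node including its MAC rows, 25% fill-factor overhead):

    # Back-of-envelope DB sizing; the per-node size is an assumption.
    def db_size_mb(nodes, bytes_per_node=1000, fill_overhead=0.25):
        return nodes * bytes_per_node * (1 + fill_overhead) / 1e6

    print(db_size_mb(100000))    # ~125 MB
    print(db_size_mb(1000000))   # ~1250 MB, i.e. ~1.25 GB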

>>
>>> - a very high degree of scalability.
>>
>> Seems the same to me, except that we don't need to write a stateless
>> API proxy - so fewer things to create.
>>
>>> My dodgy diagram, attached, and which probably employs zero
>>> pre-existing iconographies, tries to convey some of this.
>>
>> Perhaps I'm missing something, but I don't see pserv on that diagram?
>
> Yeah, sigh, I f**ked up. The big box named MAAS with the cloud haircut
> was meant to be (web API + metadata + cobbler-assimilated).

kk

>> I don't see any particular a-priori reason to avoid having N
>> state-maintaining services cooperating to provide MAAS as a whole -
>> that's very much what I advocate - an SOA approach; but OTOH when you
>> have a state-maintaining service, that service needs an HA story, it
>> needs failure-mode management in its clients, it needs a
>> dealing-with-absent-services story, and it needs a backup story. I
>> don't think the MAAS dataset is large or complex enough for these
>> things to be a good tradeoff vs maintaining all your state in an HA
>> core service, with horizontally scaling helper services interrogating
>> it as you scale.
>
> Okay, that's fair. I think it will be a problem eventually. Servers
> are inexorably getting smaller.

I agree that it could be a problem eventually. AIUI we have roughly 3
goals for development of MAAS today:
 - optimise market adoption: MAAS is an enabler for Juju, and as such
the wider the adoption MAAS gets, the wider the adoption Juju can see
for bare-metal workloads
 - deliver a system capable of robustly handling very large new
clouds, with the next size goal being 100K nodes
 - deliver the next iteration -reliably- in 4-5 months (we need time
for the dust to settle at the end of the cycle; last-minute stuff is
not good)

Aiming for 100K nodes supported means, to me, that we need to design
for 1M nodes supported. A 1-2GB DB could be served, with the entire
thing hot in RAM, from extremely modest hardware. Distribute out the
provisioning agents in batches of (say) 20K nodes, and you'll have 50
provisioning agents + a rabbit getting ~ 60 messages/second. We know
rabbit scales to 10K+ messages/second.
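
Back-of-envelope for that (the messages-per-node-per-day figure below
is a guess I've picked to land near the ~60/s above, not a
measurement):

    nodes = 1000000
    nodes_per_agent = 20000
    agents = nodes // nodes_per_agent        # 50 provisioning agents
    msgs_per_node_per_day = 5                # assumed boot/status/power traffic
    msgs_per_second = nodes * msgs_per_node_per_day / 86400.0
    print(agents, round(msgs_per_second))    # 50, ~58 - well under rabbit's 10K+/s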

I assume that folk won't be super-stingy for the core infrastructure
nodes - we can't expect them to buy super-big machines for things
doing this overhead, but conversely, we can expect them to be buying
modern machines and dedicating them to the task.
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server says
"You may be limited to approximately 100 transaction commits per
second per client in situations where you don't have such a durable
write cache (and perhaps only 500/second even with lots of clients)."
That's talking about speeds *without* a battery-backed RAID write
cache - i.e. non-hardware-RAID. Postgresql 9.2 has work aimed at
improving this, with benchmarks showing 12K commits/second in the same
environment (see the 'group commit' feature).

So, modelled like this, do you see a scaling issue with a single
control node? I don't, but if I'm missing something, I'd sure like to
know!

I totally grant that there is increased dependency on the uptime of
the core server with a centralised model. So, what could we do to
mitigate that? Well, we could combine both proposals and say:
 - cobbler dies
 - dhcp/tftp/dnsmasq will be managed via celery - a 'provider' (a
minimal sketch follows this list)
   - these can be run in active-active HA mode (uptime-sensitive installs)
 - we can run HA rabbit, or two non-HA rabbits (uptime-sensitive installs)
 - A MAAS can run many providers
 - We will provide an API proxy to talk to multiple MAAS for folk that
want to partition their environment at the MAAS level rather than the
provider level.
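
To be concrete about the celery-driven provider idea above, a minimal
sketch - task name, broker URL and file paths are assumptions on my
part, not real MAAS code:

    import subprocess
    from celery import Celery

    app = Celery('provider', broker='amqp://guest@localhost//')

    @app.task
    def add_dhcp_host(mac, ip, hostname):
        # Assumes dnsmasq runs with --dhcp-hostsfile=/var/lib/maas/dhcp-hosts;
        # it re-reads that file on SIGHUP.
        with open('/var/lib/maas/dhcp-hosts', 'a') as hosts:
            hosts.write('%s,%s,%s\n' % (mac, ip, hostname))
        subprocess.check_call(['pkill', '-HUP', 'dnsmasq'])

The central MAAS would just queue jobs over rabbit and the provider
workers would apply them locally, which is what lets the providers run
active-active.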

The only icky point there, then, is that authentication would be
replicated out to each MAAS provider and we'd have to write glue to do
that (and default settings for things, and so forth).

Another way to address this is to take that last bullet point and do
what Amazon does; say something like:
 - "For HA run services in multiple regions; each region is totally
independent."
 - A single MAAS install is a 'region'; it's moderately HA itself, and
won't go down spuriously or casually
 - install two MAAS's, and your API clients (like Juju, and yes, our
web UI) can be told of both and configure what they want appropriately
(see the small client sketch further down).

I think doing this, and not providing a single proxy, is actually
better, for a few reasons:
 - it's a pattern cloud-api consumers are used to (see AWS :P)
 - the MAAS clusters will be truly independent, so a failure on one
cannot cascade (e.g. via bad state updates) to any other one
 - we have less work to do.
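
As a sketch of what "can be told of both" might mean for a client -
hypothetical endpoints, not a real API client:

    from urllib.request import urlopen
    from urllib.error import URLError

    REGIONS = [
        'http://maas-a.example.com/api/1.0/',
        'http://maas-b.example.com/api/1.0/',
    ]

    def get(path):
        # Try each independent region in turn; no shared proxy involved.
        for base in REGIONS:
            try:
                return urlopen(base + path, timeout=5).read()
            except URLError:
                continue  # this region is down/unreachable; try the next
        raise RuntimeError('no MAAS region reachable')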

>> I guess the key thing you allude to, is that you could in principle
>> permit provisioning to happen when the main MAAS server is AWOL, but
>> that implies some significant complexity around authentication - and a
>> state synchronisation mechanism for when MAAS itself comes back.
>
> I don't think any state synchronisation would be necessary. Well... in
> one direction only: whatever global state is needed should be pushed
> out and/or pulled by the (API+...) services. It should never move the
> other way.

There are two sets of data - the ick I refer to above:
 - usercodes, default settings, cluster-wide /anything/ is one set
 - node-specific data, which scales as you add nodes

If we, for instance, were to have a way of saying 'these nodes are in
group 'blue'', then that is something which has to synchronise across
all the state stores in the system, or be centralised. If it's
centralised, it needs to know when nodes are deleted; if it's not
centralised, then clients need to handle a particular sub-node being
AWOL so that they can update it when it comes back. (One simple way of
handling that is to say to the user 'try again later', but that then
gets back to 'will users be blocked when a single provisioning agent
is down?'.)
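
For illustration, the 'try again later' option in client terms - the
names here (agent, tag_node) are hypothetical, not an existing API:

    import time

    def set_group(agent, node_id, group, retries=3, delay=60):
        # If the provisioning agent holding this node is AWOL, retry a
        # few times, then surface 'try again later' to the user.
        for _ in range(retries):
            try:
                agent.tag_node(node_id, group)  # assumed RPC on the agent
                return True
            except IOError:
                time.sleep(delay)
        return False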

> Coming out of Oakland seems to be the message that MAAS should have a
> simpler - than now, even - user management story, which reduces this
> problem further.
>
> Overall, I'm suggesting not putting the important parts all in one
> place, and instead putting a unified API front (which would be the
> stupid stateless bit) on a bunch of (API+) services.

>> If we come back to the core of MAAS - a single tenant API provider for
>> provisioning hardware like a cloud, this doesn't seem justified to me:
>> even a very large environment (say 100K nodes) won't have a high
>> frequency of machine role turnover (100's of machines/minute):
>> machines will be brought up and put into openstack or hadoop, and
>> within that environment get lots of use; periodically maintenance will
>> happen, gracefully, but that's still going to be something where a
>> short outage at the MAAS controller has minimal impact.
>>
>> (Sketch numbers for my model: each piece of hardware gets deployed for
>> a month or more at a time, except for staging/test environments which
>> are a) relatively small and b) torn down and replaced a lot)
>> 100K machines
>> 100K * (at most) 12 -- <= 1.2M allocations a year
>>                                <= 1.2M deallocations a year
>> 525600 minutes/year
>> -> about 3 allocation-or-deallocation operations per minute, on average.
>>
>> A 10-minute outage is about 30 queued operations.

(Or 300 for a 1M node provider)
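
Working those quoted numbers through (treating an
allocation-or-deallocation as a single operation, which is what the
"about 3/minute" figure implies):

    machines = 100000
    ops_per_year = machines * 12          # <= 1.2M allocation-or-deallocation ops
    per_minute = ops_per_year / 525600.0  # ~2.3, i.e. "about 3" per minute
    print(per_minute * 10)                # ~23-30 ops queued by a 10-minute outage
    # Multiply by 10 for a 1M-node provider: roughly 300 queued operations.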

> An imagined MAAS reseller, and its reputation, would probably want
> better. Also, it's a cloud-like environment; if a machine can be
> deployed in a few minutes then they'll be used like people use
> instances in AWS, i.e. a lot more provisioning operations than you've
> guessed at.

The simpler story we're to focus on is a MAAS environment where every
user is an admin: that implies no multi-tenant environments, and
resale of a single-tenant MAAS is limited to the size of a single
tenant, so the risk to the reseller of a widespread outage is limited
to one client at a time.

We need to understand what users want so that we can make good
decisions about building it for them. What feedback have we had since
MAAS was announced? What sort of things are people trying to do? At
what scale do they say 'right, I'll use MAAS to bring up openstack,
and fiddle on top of *that*'?

In the absence of that data, we're reduced to putting forward what we
think users want, which is always a bit risky :)

We do have, though, 2 primary use cases we want to enable for Juju
(here, Juju's needs are our proxy for user needs):
 * Run up openstack on metal
 * Run up hadoop on metal

The former needs low machine turnover (basically install then forget
until BIOS upgrades are needed, and they would be rolling by nature, +
a small set of test machines for, well, openstack admin testing).
The latter also needs low machine turnover, for the same reason.

Yes, it is entirely possible there are users out there with 100K or 1M
nodes that want to use MAAS multi-tenant, or use MAAS with large
clusters *and* high rates of node churn. I propose that for the former
we advise them to bring up openstack on top of MAAS: that gives them
robust and reliable user management, quotas etc. For the latter, let's
wait and see. The foundations I'm proposing should (with hardware RAID
in the MAAS box) trivially handle 5K transactions/second through
postgresql itself; we can horizontally scale the HTTP interface, and
rabbit will handle another order of magnitude of messages on top of
that. The average I estimated for 1-month allocations on 1M nodes was
30 transactions/minute; that's 0.5/second - 4 orders of magnitude of
headroom against postgres alone, 5 against rabbit - enough even for
lease times of an hour rather than a month.
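
Or, as arithmetic (the 5K/s ceiling is the assumption stated above,
not a measurement):

    db_tps_ceiling = 5000.0
    ops_per_second = 30 / 60.0               # ~30 ops/minute at 1M nodes, monthly leases
    print(db_tps_ceiling / ops_per_second)   # 10000x, i.e. ~4 orders of magnitude
    # Cutting leases from a month to an hour (~730x more churn) still
    # leaves roughly an order of magnitude to spare.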

-Rob

