I think it's great that we're having this discussion. In the hope that it's informative, I'd like to give some info on issues we're looking at when moving our Glance deployment to Folsom. A lot of this is in common with Ryan, but there are a couple of twists due to our goals of maximising uptime (i.e. we are hoping to do a rolling rather than lock-step upgrade) and decoupling upgrades. I also mention the case where you may have existing Glance clients which you don't control.

In our case the upgrade of the various components (Swift/Glance/Nova etc.) will be staggered rather than performed simultaneously. (For large organisations I imagine this may often be the case.) If we are to avoid downtime for Glance, we must simultaneously support ids and uuids. These must be presented appropriately to Nova nodes, which we must assume are moving targets -- i.e. they will initially be on older code but will upgrade to Folsom. We have some ideas on how this may be possible but haven't worked through all the details yet (in particular the Nova database entries)... but there could be some coding for Nova/Glance and some deploys of the interoperable code before eventually switching to standard Folsom. (Jay -- I don't think scripts are sufficient here?)

If Glance were publicly available, it's not clear how the id/uuid change could be worked through gracefully where we didn't have control over the Glance client -- i.e. the upgrade would break existing customers' clients which expected an id rather than a uuid.

I agree with everyone about testing upgrade paths between releases -- if we could include paths for interoperability/rolling upgrades, and perhaps define an official order in which components (Swift/Nova etc.) should be upgraded, that would be great too.

I should mention that, in the same way as Nova has a dependency on Glance, Glance has a dependency on Swift... so far that upgrade path seems absolutely painless -- kudos to the Swift folks.

-Stuart
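As a rough illustration of the dual id/uuid support Stuart describes, a resolver along the following lines could sit in front of image lookups so that clients on either side of the Glance upgrade keep working. Everything here is hypothetical -- the CSV of old-id-to-uuid pairs and the function names are illustrative placeholders, not anything shipped with Glance or Nova:

# Hypothetical shim: resolve either a legacy integer image id or a uuid
# to the canonical uuid. Assumes the old-id -> uuid mapping was captured
# (e.g. exported to CSV) around the Glance cutover.
import csv
import uuid


def load_legacy_map(path="legacy_image_ids.csv"):
    """Load rows of 'old_integer_id,new_uuid' into a dict."""
    mapping = {}
    with open(path) as fh:
        for row in csv.reader(fh):
            if len(row) < 2:
                continue
            mapping[row[0].strip()] = row[1].strip()
    return mapping


def resolve_image_ref(ref, legacy_map):
    """Return the uuid for an image reference in either form."""
    try:
        # Already a uuid? Normalise and return it.
        return str(uuid.UUID(str(ref)))
    except ValueError:
        pass
    try:
        return legacy_map[str(int(ref))]
    except (ValueError, KeyError):
        raise LookupError("Unknown image reference: %r" % (ref,))

An older Nova node still holding an integer id and a Folsom node holding the uuid would both resolve to the same image through something like this; the harder part Stuart alludes to (the Nova database entries themselves) still has to be handled separately.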
On Wed, 29 Aug 2012, Tim Bell wrote:

We should also not forget the documentation for the migration process itself. We need to have stable documentation (such as "perform software installation, run upgrade script and restart service") while allowing post-release migration bugs to be resolved when migration problems are found in the field. Automated testing can find some problems, but the flexibility of OpenStack configuration makes it likely that there are other scenarios not covered by the test suite.

Another item is to clearly identify the order in which services can be upgraded, and what the possibilities are for pushing a small part of the cloud to a more recent version while maintaining the majority on the previous stable release.

This sort of feedback loop is exactly what I hope for from the interaction between the user and technical communities. Since we're Essex based, we've not had to face this yet, but Essex to Folsom will be a good chance for improvements to be included in the core code.

Tim

-----Original Message-----
From: openstack-bounces+tim.bell=cern.ch@xxxxxxxxxxxxxxxxxxx [mailto:openstack-bounces+tim.bell=cern.ch@xxxxxxxxxxxxxxxxxxx] On Behalf Of Jay Pipes
Sent: 29 August 2012 08:32
To: openstack@xxxxxxxxxxxxxxxxxxx
Subject: Re: [Openstack] A plea from an OpenStack user

Ryan, thank you for your excellent and detailed comments about the problems you encountered during the upgrade process. This is precisely the kind of constructive feedback that is needed and desired.

Someone mentioned automated testing of upgrade paths. This is exactly what needs to happen. Hopefully the Tempest folks can work with the CI team in the G timeframe to incorporate upgrade path testing for the OpenStack components. It likely won't solve ALL the issues -- such as the poor LDAP port in Keystone Light -- but it will at least serve to highlight where the major issues are BEFORE folks run into them. It will also help identify those tricky things like the Glance issue below: Glance itself upgraded its data effectively, but failed to produce scripts to modify the Nova image database IDs at the same time.

Thanks again,
-jay
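As a sketch of the kind of script Jay is describing (and of the manual mapping Ryan describes below), something like the following could rewrite Nova's stored image references from the old integer ids to the new uuids. It assumes the old-id-to-uuid mapping has already been extracted from Glance into a CSV; the file name and credentials are placeholders, and the table/column names reflect an Essex-era Nova schema (instances.image_ref), which should be verified against your own deployment:

# Hypothetical one-off migration: rewrite Nova's image references from
# the old integer Glance ids to the new uuids, using a CSV of
# 'old_integer_id,new_uuid'. Not something shipped with either project.
import csv

import MySQLdb


def remap_image_refs(mapping_csv, db_host, db_user, db_pass):
    conn = MySQLdb.connect(host=db_host, user=db_user,
                           passwd=db_pass, db="nova")
    cur = conn.cursor()
    with open(mapping_csv) as fh:
        for row in csv.reader(fh):
            if len(row) < 2:
                continue
            old_id, new_uuid = row[0].strip(), row[1].strip()
            cur.execute(
                "UPDATE instances SET image_ref = %s WHERE image_ref = %s",
                (new_uuid, old_id))
            print("image %s -> %s (%d rows)" % (old_id, new_uuid,
                                                cur.rowcount))
    conn.commit()
    cur.close()
    conn.close()


if __name__ == "__main__":
    # Placeholder arguments -- run against a copy of the database first.
    remap_image_refs("glance_id_map.csv", "localhost", "nova", "secret")

Any other tables in a given deployment that record Glance ids would need the same treatment, which is exactly why shipping such a script alongside the data migration matters.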
On 08/28/2012 05:26 PM, Ryan Lane wrote:

Yesterday I spent the day finally upgrading my nova infrastructure from diablo to essex. I've upgraded from bexar to cactus, and cactus to diablo, and now diablo to essex. Every single upgrade is becoming more and more difficult. It's not getting easier, at all. Here's some of the issues I ran into:

1. Glance changed from using image numbers to uuids for images. Nova's references to these weren't updated, and there was no automated way to do so. I had to map the old values to the new values from glance's database, then update them in nova.

2. Instance hostnames are changed every single release. In bexar and cactus it was the ec2-style id. In diablo it was changed and hardcoded to instance-<ec2-style-id>. In essex it is hardcoded to the instance name; the instance's ID is configurable (with a default of instance-<ec2-style-id>), but it only affects the name used in virsh/the filesystem. I put a hack into diablo (thanks to Vish for that hack) to fix the naming convention so as not to break our production deployment, but it only affected the hostnames in the database; instances in virsh and on the filesystem were still named instance-<ec2-style-id>, so I had to fix all libvirt definitions and rename a ton of files during this upgrade, since our naming convention is the ec2-style format. The hostname change still affected our deployment, though: it's hardcoded. I decided to simply switch hostnames to the instance name in production, since our hostnames are required to be unique globally; however, that changes how our puppet infrastructure works too, since the certname is by default based on fqdn (I changed this to use the ec2-style id). Small changes like this have giant rippling effects in infrastructures.

3. There used to be global groups in nova. In keystone there are no global groups. This makes performing actions on sets of instances across tenants incredibly difficult; for instance, I did an in-place ubuntu upgrade from lucid to precise on a compute node and needed to reboot all instances on that host. There's no way to do that without database queries fed into a custom script (a sketch of such a script follows at the end of this message). Also, I have to have a management user added to every single tenant and every single tenant-role.

4. Keystone's LDAP implementation in stable was broken. It returned no roles, many values were hardcoded, etc. The LDAP implementation in nova worked, and it looks like its code was simply ignored when auth was moved into keystone.

My plea is for the developers to think about how their changes are going to affect production deployments when upgrade time comes. It's fine that glance changed its id structure, but the upgrade should have handled that. If a user needs to go into the database in their deployment to fix your change, it's broken. The constant hardcoded hostname changes are totally unacceptable; if you change something like this it *must* be configurable, and there should be a warning that the default is changing.

The removal of global groups was a major usability killer for users. The removal of the global groups wasn't necessarily the problem, though. The problem is that there were no alternative management methods added. There's currently no reasonable way to manage the infrastructure.

I understand that bugs will crop up when a stable branch is released, but the LDAP implementation in keystone was missing basic functionality. Keystone simply doesn't work without roles. I believe this was likely due to the fact that the LDAP backend has basically no tests and that Keystone light was rushed in for this release. It's imperative that new required services at least handle the functionality they are replacing, when released.

That said, excluding the above issues, my upgrade went fairly smoothly and this release is *way* more stable and performs *way* better, so kudos to the community for that. Keep up the good work!

- Ryan
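For the per-host reboot in Ryan's item 3, the kind of "database queries fed into a custom script" he describes might look roughly like this. The column names reflect an Essex-era Nova schema and the endpoint, credentials and tenant handling are placeholders; in particular it assumes the management user Ryan mentions has been added to every tenant, so nothing here should be taken as the one supported way to do it:

# Hypothetical helper: reboot every active instance scheduled on one
# compute host, by reading the nova database directly and then issuing
# a reboot per instance through python-novaclient, authenticating
# against each instance's own tenant with a management user.
import MySQLdb
from novaclient.v1_1 import client as nova_client

AUTH_URL = "http://keystone.example.com:5000/v2.0"   # placeholder
MGMT_USER = "mgmt"                                    # placeholder
MGMT_PASS = "secret"                                  # placeholder


def instances_on_host(host, db_host, db_user, db_pass):
    # Column names (uuid, project_id, host, vm_state, deleted) follow an
    # Essex-era schema -- verify against your deployment.
    conn = MySQLdb.connect(host=db_host, user=db_user,
                           passwd=db_pass, db="nova")
    cur = conn.cursor()
    cur.execute(
        "SELECT uuid, project_id FROM instances "
        "WHERE host = %s AND deleted = 0 AND vm_state = 'active'",
        (host,))
    rows = cur.fetchall()
    cur.close()
    conn.close()
    return rows


def reboot_host_instances(host, db_host, db_user, db_pass):
    for instance_uuid, tenant in instances_on_host(host, db_host,
                                                   db_user, db_pass):
        # One client per tenant, since the management user must act
        # within each tenant rather than globally. Depending on the
        # keystone setup this may need the tenant name rather than id.
        nova = nova_client.Client(MGMT_USER, MGMT_PASS, tenant, AUTH_URL)
        print("Rebooting %s (tenant %s)" % (instance_uuid, tenant))
        nova.servers.reboot(instance_uuid)


if __name__ == "__main__":
    reboot_host_instances("compute-01", "localhost", "nova", "secret")

Later releases grew API-side filters that make this less painful, but at the Essex level the database route Ryan describes was the practical option.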