
Re: OpenStack Plugin for Jenkins

 


On 04/05/2012 01:22 AM, Justin Santa Barbara wrote:
> I've got Compute functionality working with the OpenStack Jenkins
> plugin, so it can launch nova instances as on-demand slaves now, run
> builds on them, and archive the results into  swift.  I'd like to open
> GitHub issues to track your requirements, but I have a few questions.

I shall do my best to elaborate...

>> We need disposable machines that are only used for one test, which
> means spinning up and terminating hundreds of machines per day.
> 
> Sounds like we want a function to terminate the machine after the job
> has run.
> https://github.com/platformlayer/openstack-jenkins/issues/1

Yes. That seems sensible.

>> We need to use machines from multiple providers simultaneously so that
> we're resilient against errors with one provider.
> 
> Label expressions should work here; you would apply a full set of axis
> labels to each machine ("rax oneiric python26") but then you would
> filter based only on the required axes ("oneiric python26").  Are labels
> sufficient for this?

Labels are sufficient for tying the jobs to the specific resource
description. I think the idea here is that we definitely want to be able
to configure multiple cloud providers, and for each provider (in some
manner) be able to configure what a machine labeled "oneiric" would look
like. (likely as a combination of image, flavor and setup script)

After that - honestly - as long as we can actually get an "oneiric"
labeled machine from _someone_ when we ask for it, we're good.
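
Just to make that concrete, the sort of per-provider mapping I'm
imagining is roughly this (a python sketch only - every provider, image
and flavor name in it is made up):

    # Hypothetical description of what an "oneiric" labeled machine
    # should be built from on each provider; none of these names are real.
    PROVIDERS = {
        'rax': {
            'oneiric': {'image': 'ubuntu-11.10-oneiric',
                        'flavor': '4GB',
                        'setup_script': 'prepare_node.sh'},
        },
        'hp': {
            'oneiric': {'image': 'oneiric-server-cloudimg',
                        'flavor': 'standard.medium',
                        'setup_script': 'prepare_node.sh'},
        },
    }

    def node_spec(provider, label):
        """Return (image, flavor, setup script) for a label on a provider."""
        spec = PROVIDERS[provider][label]
        return spec['image'], spec['flavor'], spec['setup_script']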

>> We need to pull nodes from a pool of machines that have been spun up
> ahead of time for speed.
> 
> This sounds like a custom NodeProvisioner implementation.  The current
> implementation is optimized around minimizing CPU hours, by doing load
> prediction.  You have a different criteria, based on minimizing launch
> latency.  It looks like it should be relatively easy to implement a new
> algorithm, although perhaps a bit tricky to figure out how to plug it in.
> 
> https://github.com/platformlayer/openstack-jenkins/issues/2

Yeah - average time to spin up a node and get it configured
_when_it_works_ is between 5 and 10 minutes. devstack takes around that
amount of time, so if we have to actually wait for the node to spin up
each time, we'd be doubling the time it takes to test a change.

Then there's the fact that clouds fail to give us a working node ALL THE
TIME. So waiting for retries and such (even though doing that at jenkins
node provisioning time would be technically correct) could potentially
lead to a terribly long build queue!
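
Put another way, the provisioner behavior we're after boils down to
something like this (pure-python sketch; boot_node, node_is_sane and
delete_node are hypothetical hooks standing in for whatever the plugin
or jclouds actually does):

    import time

    POOL_TARGET = 10   # how many ready-to-go nodes to keep warm

    def maintain_pool(pool, boot_node, node_is_sane, delete_node):
        """Keep the ready pool topped up, retrying boots that come up broken."""
        while len(pool) < POOL_TARGET:
            node = boot_node()        # takes 5-10 minutes when it works
            if node_is_sane(node):    # e.g. the network actually functions
                pool.append(node)
            else:
                delete_node(node)     # toss the broken node and retry
                time.sleep(30)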

> 
>> We need to be able to select from different kinds of images for
> certain tests.
> 
> Are labels sufficient for this?

Yes. Configuring the characteristics of an image and assigning a label
to those characteristics will definitely let us associate tests with the
right running environment.

>> Machines need to be checked for basic functionality before being added
> to the pool (we frequently get nodes without a functioning network).
> 
> I believe Jenkins does this anyway; a node which doesn't have networking
> won't be able to get the agent.  And you can run your own scripts after
> the slave boots up ("apt-get install openjdk", for example).  Those
> scripts can basically do any checks you want.  Is that enough?

Yes - just pointing out that it's a case we have to deal with at the
moment, so it needs to be handled.
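
For illustration, a check that would catch most of what we see is as
simple as "can we even reach the ssh port" - something like:

    import socket

    def node_has_network(address, port=22, timeout=30):
        """Return True if we can open a TCP connection to the node's ssh port."""
        try:
            sock = socket.create_connection((address, port), timeout)
            sock.close()
            return True
        except (socket.error, socket.timeout):
            return False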

>> They need to be started from snapshots with cached data on them to
> avoid false negatives from network problems.
> 
> Can you explain this a bit more?  This is to protect against the apt
> repositories / python sources / github repos being down?  Would an http
> proxy be enough?

Yes. apt repositories, pypi and github are CONSTANTLY down, so we do a
lot of work to pre-cache network-fetched resources onto a machine so
that running the tests almost never involves a network fetch. (We've
learned over the last year or so that any time a test system has to
fetch network resources, the number of false negatives due to github or
pypi going away is unworkably high.)

It's possible that an http proxy _might_ help that - but the approach
we've been taking so far is to have one process that spins up a node,
does all the network fetching into local resources, and then snapshots
that into an image which is the basis for subsequent node creation. The
base image is updated nightly so that the amount of network update that
has to happen at node instantiation time is minimized.
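
That nightly update amounts to roughly the following (a
novaclient-flavored sketch; the credentials, image/flavor names and the
"run the cache script" step are all placeholders):

    import os
    import time

    from novaclient.v1_1 import client

    nova = client.Client(os.environ['OS_USERNAME'],
                         os.environ['OS_PASSWORD'],
                         os.environ['OS_TENANT_NAME'],
                         os.environ['OS_AUTH_URL'])

    # Boot a throwaway node from the pristine provider image (names made up).
    server = nova.servers.create('image-update',
                                 nova.images.find(name='ubuntu-11.10'),
                                 nova.flavors.find(name='m1.large'))
    while nova.servers.get(server.id).status != 'ACTIVE':
        time.sleep(10)

    # ... ssh in here and run the caching script (apt/pypi/git pre-fetch) ...

    # Snapshot the warmed-up node; this becomes the new base image.
    nova.servers.create_image(server, 'oneiric-base-nightly')
    nova.servers.delete(server)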

jclouds itself (rather than the plugin) has a caching feature which does
the auto-image creation based on node creation criteria. So if you
combine the characteristics of a node (image, flavor, init-script, ram,
volumes, etc) with a TTL, then the first time a node meeting those
criteria is requested, it will create you one from scratch, but at the
end of the user data script run, it will create an image snapshot which
it can use for subsequent creation of nodes which match the same
description.

When we combine that with the idea of a pool of spun-up nodes (also
either already in jclouds or to be implemented inside of it, since that
capability has been requested by a bunch of the current jclouds
userbase) - then we get the pooling and image optimization that we're
looking for (and are currently doing in the python scripts of
devstack-gate) pretty transparently.
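
Conceptually, that jclouds behavior is something like this (a python-ish
sketch of the idea only, not their actual code; build_image is a
hypothetical hook that boots a node, runs the init-script and snapshots
it):

    import hashlib
    import json
    import time

    IMAGE_TTL = 24 * 60 * 60    # rebuild the cached image once a day
    _image_cache = {}           # template hash -> (image id, created at)

    def template_key(template):
        """Hash the node description (image, flavor, init-script, ram, ...)."""
        return hashlib.sha1(
            json.dumps(template, sort_keys=True).encode('utf-8')).hexdigest()

    def image_for(template, build_image):
        """Return a cached image id for this template, rebuilding after the TTL."""
        key = template_key(template)
        cached = _image_cache.get(key)
        if cached and time.time() - cached[1] < IMAGE_TTL:
            return cached[0]
        image_id = build_image(template)
        _image_cache[key] = (image_id, time.time())
        return image_id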

>> We need to keep them around after failures to diagnose problems, and
> we  need to delete those after a certain amount of time.
> 
> From the github docs, it sounds like you don't get access anyway because
> of the providers' policies.  Would it not therefore be better to take a
> ZIP or disk snapshot after a failed test, and then shut down the machine
> as normal?

Sometimes looking at the actual running state is nice. We currently keep
them around for a bit and have the ability to manually inject a dev's
keys onto the box on a one-off basis. We've used this ability a couple
of times to get devs to help track down particularly odd or onerous
problems. The policy decision is something I think we can (eventually)
get - I just want to make sure we have the physical ability.

That being said - we've _also_ considered that a disk or machine
snapshot might be a nice thing. If we get a provider which allows us to
upload publicly accessible glance images, then we could do an image
snapshot of the failed machine, upload it to glance and then tell the
dev "here's the image id of your failed machine, spin one up on your own
account if you want to troubleshoot".
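
So on a failure, the flow we'd want is roughly this (novaclient-flavored
sketch again; the hold period and the naming scheme are made up):

    import time

    HOLD_SECONDS = 3 * 24 * 60 * 60   # keep failed nodes around for three days
    held_nodes = {}                   # server id -> time the hold was placed

    def hold_failed_node(nova, server, build_id):
        """Snapshot a failed node for later debugging and mark it as held."""
        image_id = nova.servers.create_image(server, 'failed-%s' % build_id)
        held_nodes[server.id] = time.time()
        print('failed machine image for build %s: %s' % (build_id, image_id))

    def reap_expired_holds(nova):
        """Delete held nodes once their hold period has expired."""
        now = time.time()
        for server_id, held_at in list(held_nodes.items()):
            if now - held_at > HOLD_SECONDS:
                nova.servers.delete(server_id)
                del held_nodes[server_id]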

> 
> Also...
> 
> You currently auto-update your images, which is cool
> (devstack-update-vm-image). 

Thanks! We'd be _so_ dead if we didn't do that...

> Do you think this is something a plugin
> should do, or do you think this is better done through scripts and a
> matrix job?  I'm leaning towards keeping it in scripts.  The one thing I
> think we definitely need here is some sort of 'best match' image
> launching, rather than hard-coding to a particular ID, so that the cloud
> plugin will always pick up the newest image.
> 
> https://github.com/platformlayer/openstack-jenkins/issues/3

Well - as I mentioned before, our current plan for removing those
scripts is based on jclouds auto-imaging of NodeTemplate criteria.
Hard-coding the ID is definitely a tricky thing to think about.

Before I spoke with Adrian about his auto-caching stuff, my thoughts
here were that the plugin should just generally have the ability to cut
an image as a post-build step. If you have that, then you could have a
matrix job which requested a machine with a label describing the base
image, say an "oneiric-base" label, and that job would have a post-build
step of 'snapshot to image named "oneiric"'. Then there would be a
different job that actually runs the tests, using the normal oneiric
label, on a machine spun up from the created image.

The gotchas to handle there are around what happens with failures in
image creation... you don't want to fail partway through overwriting the
oneiric image and leave yourself with nothing. Also, handling that
sensibly across multiple providers will be interesting. (Do you have a
special job label for base oneiric on each provider? Like rax-oneiric,
and then a matrix job that runs the image update on rax-oneiric,
hp-oneiric and trystack-oneiric, with the post-build step just being
"snapshot to image id oneiric" - which would upload it to the provider
it was called from? I guess that would work...)
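
For the "don't clobber the only good image" gotcha, timestamped image
names plus "pick the newest ACTIVE one" on the consuming side would
probably cover it (novaclient-ish sketch; the naming scheme is just for
illustration):

    def newest_image(nova, label):
        """Pick the newest ACTIVE snapshot whose name starts with the label.

        If the image-update job uploads 'oneiric-20120405' style names and
        never deletes the previous image until the new one goes ACTIVE, a
        failed image build just leaves yesterday's image in place.
        """
        candidates = [img for img in nova.images.list()
                      if img.name.startswith(label + '-')
                      and img.status == 'ACTIVE']
        if not candidates:
            raise Exception('no usable %s image found' % label)
        return max(candidates, key=lambda img: img.created)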

Does that make sense?

Thanks!
Monty

