openstack team mailing list archive
Message #13925
Re: Jenkins and transient failures
"Kevin L. Mitchell" <kevin.mitchell@xxxxxxxxxxxxx> writes:
> One of the things that's really bugging me these days is transient
> failures, such as the inability to download a package, causing a gate
> job to fail. It seems to me that we can distinguish "test failure" from
> "environment build failure" easily enough, and automatically retry in
> the latter case. Is this possible in practice with our current CI
> infrastructure?
Yes, that's certainly been a big annoyance lately. That's a good
suggestion, though there are a couple of things that make it not
straightforward: jenkins doesn't have a facility to easily express
(through some means such as an exit code) that a job has had anything
other than a simple success/failure outcome; I believe that's an open
feature request with jenkins. Even if we worked around that, for better
or worse, since we started using virtualenvs instead of packages, a lot
of what we're testing now includes things like dependencies,
configuration, installation, and other items that are ancillary to unit
tests themselves. If a change adds "blorgh==1.0" to pip-requires, is
the inability to install that a transient or permanent error?
These may be solvable problems, but they'll take some engineering
effort, and I have some ideas of where we may get a better return for
our work.
Most of the transient failures can be attributed to two causes: failures
downloading packages, and failures connecting to gerrit.
Monty has been working on a pypi mirror setup so that we can be
responsible for ensuring that all of the python packages that pip needs
to install are available to the jenkins slaves. We had hoped that
simply adding a mirror would be enough, but as long as pip knows about
both pypi.openstack.org and pypi.python.org, it will end up crawling the
web pages of projects listed in the pypi mirror looking for new
versions. So to really get to the point where we can run jobs with no
unnecessary network dependencies, we have to be sure that our pypi
mirror has every package needed, including when new dependencies are
added. At the design summit, it was decided that we should move to a
global list of dependencies for OpenStack -- with that in place, it
should be easy to maintain the package inventory for our pypi mirror --
we can update the mirror when changes to the global dependency list are
merged. However, that work seems to be stalled:
https://bugs.launchpad.net/openstack-ci/+bug/995607
A reason we've seen even more errors downloading packages in recent
weeks is that there have been some flaws with our pypi mirror
implementation. Monty has been working this weekend to rectify those,
so hopefully we'll see a significant drop in these errors when that is
finished.
And finally, as we've increased the number of builds jenkins runs (to
run tests on new patchsets when they are uploaded, and to run merge
gates in parallel, which sometimes requires multiple runs of tests), we
have increased the load on the gerrit server, which occasionally
results in transient errors. Tuning gerrit is a bit of a
black art; there's plenty of capacity on the server, but I believe
further tuning is going to require a bit more instrumentation than we
have now. Clark Boylan has been working on adding Java Melody to gerrit
to help with that, so I hope we can get a handle on that soon. In the
meantime, we have some ideas about how to work around that (retry with
exponential backoff in the git scripts that jenkins uses, or cloning
directly from a git repo instead of via gerrit).
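To make the first workaround concrete, here is a minimal sketch of retry
with exponential backoff around a command. It is an illustration only,
not what the jenkins git scripts actually do; the function names, the
retry count, and the delay values are all assumptions.

```python
import subprocess
import time


def backoff_delays(retries, base=1.0, factor=2.0, cap=60.0):
    """Return the wait in seconds before each retry, growing by `factor` up to `cap`."""
    return [min(cap, base * factor ** i) for i in range(retries)]


def run_with_backoff(cmd, retries=5, sleep=time.sleep, run=subprocess.call):
    """Run cmd; on a non-zero exit status, wait and retry up to `retries` times.

    `sleep` and `run` are injectable so the logic can be exercised without
    real waiting or real commands. Returns the exit status of the last attempt.
    """
    status = run(cmd)
    for delay in backoff_delays(retries):
        if status == 0:
            break
        sleep(delay)
        status = run(cmd)
    return status
```

A script could then wrap its fetch step as something like
run_with_backoff(["git", "fetch", "origin"]), where the exact git
command is whatever the job already runs.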
So with all that background, I think we should discuss the following at
the CI team meeting on Tuesday:
1) What's the status of the global dependency list? Can we update:
https://bugs.launchpad.net/openstack-ci/+bug/995607
Can we get it implemented in a reasonable amount of time to address
these other issues (perhaps a couple of weeks)?
2) If not, can we make the pypi mirror be the only source of python
packages for jenkins sooner? When we used pip bundles with tox, we set
up the jobs to use the bundle unless there was a change to a -requires
file. Could we do something similar and make pypi.openstack.org the
only pip mirror unless there is a dependency change being tested?
3) Decide on a course of action to mitigate failures from transient
gerrit errors (but continue to work on eliminating them in the first
place).
4) Decide how to implement retriggering with Zuul.
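As a rough sketch of what item 2 could look like, a job could pick its
pip arguments based on which files the change touches. This is only an
assumption about how it might be wired up: the helper names are made
up, the mirror URL is passed in rather than guessed, and falling back
to pip's default index for dependency changes is one possible choice.

```python
import fnmatch


def touches_requirements(changed_files):
    """True if any changed file's name ends in -requires (e.g. tools/pip-requires)."""
    return any(
        fnmatch.fnmatch(path.rsplit("/", 1)[-1], "*-requires")
        for path in changed_files
    )


def pip_index_args(changed_files, mirror_url):
    """Return extra pip arguments for a job testing the given change."""
    if touches_requirements(changed_files):
        # Dependency change under test: leave pip on its default index so
        # packages not yet on the mirror can still be found.
        return []
    # Otherwise restrict pip to the mirror only.
    return ["--index-url", mirror_url]
```

The returned list would be appended to the job's existing pip install
command line.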
It's my very strong belief that our build systems should be robust
enough that we don't need to retrigger jobs because of transient
failures. It is not a good use of the time of busy and skilled
developers to babysit jenkins jobs and retry them if they fail. So I
think our priority should always be eliminating the causes of those
failures, which is why I listed items 1-3 above in that order. However,
there are always likely to be new causes for transient failures, and
while we work on correcting them, we shouldn't make retrying builds any
harder than it needs to be.
We have a couple of suggestions as to how to implement that in Zuul. It
should be easy to do; we just need to think through some user experience
items.
So, in short, the recent badness with transient failures sucks, but I
think we have some productive avenues we can take to get to a much
better place soon.
-Jim