launchpad-dev team mailing list archive
Message #09169
Parallel tests checkpoint meeting notes - 2012-03-21
= Parallel Testing Status: 2012-03-21 =
== Overview ==
These two weeks were a roller coaster for the project, moving back and
forth between seemingly final failure and exhilarating success. We
addressed over 2800 test failures, conquered hangs, and fixed issues in
Launchpad, Testrepository, and LXC. Today we have our first unimpeded
runs to completion on an eight-core EC2 machine. These runs took around
55 minutes, and had a handful of test failures.
We also made progress on related jobs, such as getting our Python shell
tools packaged in Ubuntu, getting Python charm helpers added to the
official charm helpers package, and moving along our
replace-rocketfuel-* slack time project.
We still have a lot of work to do. We need to improve the continuous
integration steps in a variety of ways, for stability, reporting, and
speed; address the remaining test failures as we find them, including
getting help on a kernel issue; complete experiments to determine the
incremental value of cores to the parallelization; and get the smoothly
oiled machine we have with Juju and EC2 also running manually in the
data center. However, this update marks a major milestone for the
project, and we are pleased to have accomplished what we did this week.
== Progress towards biweekly action items ==
* [yellow] [carried over] Scaling assessment based on experimental
results
* Was waiting on webops. In the week of March 14 they announced that
they would not have the resources to follow through in the short
term. We decided to pursue this on EC2 using a large machine there
(an eight-core instance size is available).
* A very poor test success rate (~2800 failures/errors) and hangs in
the test run made us question whether any experiment would be too
flawed to be useful. Thanks to fixing (and working around) LXC
issues, test isolation issues, and testrepository issues, the team
got this down to under 10 failures/errors (more details later) and
got rid of the known hangs, so it seems this may no longer be a
concern.
* After a false start on a machine with too little memory (a
c1.xlarge instance, 8 cores and 7GB) we worked with the other
available eight-core machine, which has lots of memory (an
m2.4xlarge, 8 cores and 68.4GB). This worked fine. Running eight
test processes simultaneously seems to need at least 15GB, based on
spot checks while tests ran.
* Initial information (determined today) is that tests on eight
cores take about 55 minutes; no comparison values yet, but in
progress (not comparable but interesting: test run takes 6 hours
on lpbuildbot).
* In-progress methodology: on the same machine, hack
testrepository’s testcommand.local_concurrency to report 1-8 cores
(see bug 957145). Get one or two results per core count. These are
only initial test ordering (round robin) runs. We have a kanban
card to keep the .testrepository directory across builds. Once we
have this, we will run tests to see how results change on second
and third runs; we will then clean out .testrepository data after
each concurrency change.
* [yellow] Identify and fix Launchpad bugs for test failures
discovered in parallel test runs
* Fixed
* Launchpad, bug 954319 (Benji): readonly mode isolation bug,
caused > 2700 of our failures
* LXC, bug 959352 (Benji): partial workaround (also see in
tracking)
* Launchpad bugs 953912 (Benji), 953911 (Francesco), 953902
(Francesco): Test isolation errors
* LXC, bug 951150 (Gary, Benji): non-ephemeral home directories
were causing us problems
* LXC, bug 949956 (Benji): shared MAC address/IP address issues
* Testrepository, bug 955006 (Francesco): Unicode issues,
workaround in place
* Identified and not yet fixed
* Launchpad test isolation bugs: 953913 (Brad, in progress);
more coming
* /dev/random exhaustion needs to be addressed in setuplxc
configuration (Gary, Francesco, Benji): was causing hangs on
8-core machine. Tried replacing /dev/random with
/dev/urandom: insufficient, at least at first attempt.
rng-tools worked, reproducibly:
1. apt-get install rng-tools
2. echo "HRNGDEVICE=/dev/urandom" >> /etc/default/rng-tools
3. /etc/init.d/rng-tools start
* [Francis] [carried over] Get proposed deployment plan, as approved
by Robert, also approved by mthaddon.
* We have yet to get webops cycles for this.
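As an illustration of how the scaling assessment above might be summarized once the 1-8 core runs are in, here is a minimal Python sketch. The function name `scaling_table` and its input layout are hypothetical and are not part of testrepository or of our tooling. It only turns measured wall-clock times into speedup and per-core efficiency relative to the single-core run, which is one way to express the incremental value of cores.

```python
def scaling_table(wall_minutes):
    """Summarize parallel test timings relative to the single-core run.

    wall_minutes maps a core count (e.g. 1..8) to a list of measured
    wall-clock durations in minutes (one or two runs per core count).
    Returns a list of (cores, mean_minutes, speedup, efficiency) tuples
    sorted by core count, where efficiency is speedup divided by cores.
    """
    if 1 not in wall_minutes:
        raise ValueError("a single-core run is needed as the baseline")
    for cores, runs in wall_minutes.items():
        if not runs:
            raise ValueError("no timings recorded for %d cores" % cores)

    # Average the one or two results collected for each core count.
    mean = {cores: sum(runs) / len(runs) for cores, runs in wall_minutes.items()}
    baseline = mean[1]

    rows = []
    for cores in sorted(mean):
        speedup = baseline / mean[cores]
        rows.append((cores, mean[cores], speedup, speedup / cores))
    return rows


def format_table(rows):
    """Render the rows as plain text suitable for pasting into notes."""
    lines = ["cores  minutes  speedup  efficiency"]
    for cores, minutes, speedup, efficiency in rows:
        lines.append("%5d  %7.1f  %7.2f  %10.2f" % (cores, minutes, speedup, efficiency))
    return "\n".join(lines)
```

Feeding it the measured durations for each forced concurrency setting would give a table where a falling efficiency column shows the point at which extra cores stop paying for themselves.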
== Other accomplishments ==
* [Brad, Graham] Disposition of common Python code:
* Generic code was extracted to create python-shelltoolbox, a
collection of helpers for interacting with the shell via
Python. A PPA was created and is in ~yellow. Clint has
accepted the task of redoing the packaging so that it is
backwards compatible with Lucid, and of sponsoring it for
inclusion in Ubuntu.
* Charm-specific code has been moved into the existing
charm-tools source package. The packaging for it is a bit
hairy as that source package already builds two binary
packages and the new Python bits would add a third. Clint
offered suggestions on how to do it, but they didn’t work, so he
has taken on the task of getting the packaging to work for our
additions to charm-tools.
* Code specific to the buildbot charms, shared between master
and slave, is currently duplicated via the locals.py files.
This approach works but will be replaced by a PPA.
* [Francesco: slack] lpsetup:
* Added buildout files with testing support.
* Added unit tests (for argparser, handlers and utils).
* The env var LANG=C is set during the installation of Launchpad
developer packages. This way we can avoid installing
language-pack-en.
* Updated the install and lxc-install sub commands to support a
custom ssh key name. The root ssh key is no longer needed, so
it is not created anymore.
* Created the recipe for Debian packaging.
== Progress on tracked items ==
=== Completed by others ===
* LXC: 925024 - apparmor makes it impossible to install
postgresql-common on Precise
* LXC: aufs option should be added to lxc-start-ephemeral
=== New and incomplete ===
* LXC 959352: Ephemeral containers have "/rootfs" prefix in
/proc/self/maps entries HIGH OR CRITICAL
* Testrepository 949950 (mentioned but not filed last time):
testrepository show full subunit stream of running tests HIGH
* Testrepository 957145: force amount of parallelization, overriding
reported cores
* 961103: testrepository “String or Integer object expected for
key, unicode found”
=== Carried over and incomplete ===
* 914166 - Zope layer setup and teardown 'tests' cannot be filtered
by testr
* no activity in the last eight weeks
* RT 50242 - get a buildbot machine for testing
* actively in progress again
* We prefer EC2 for tests
== Goals for next meeting ==
1. Bug fixes for test failures discovered in parallel test runs.
Already known targets:
* Launchpad
* 953913: test isolation error
* /dev/random exhaustion solution in setuplxc
* Testrepository/zope.testing
* 609986: subunit support for layer failures
* Buildbot improvements
* clean up old broken ephemeral lxc containers
* keep .testrepository data around between builds
* report failures more accurately
* make tests always randomly ordered
* Tracking
* LXC 959352
2. Deliver scaling assessment based on experimental results, using EC2
(carried over from previous two weeks)
3. Get data center box running tests, and have a single comparison run
with EC2.
* /dev/random exhaustion solution approved and installed
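For the /dev/random exhaustion item, a small check like the following might help confirm whether a machine is running low on entropy before and after rng-tools is installed. This is only a sketch: it reads the Linux kernel's entropy estimate from /proc/sys/kernel/random/entropy_avail, the helper names are hypothetical, and the 200-bit threshold is an arbitrary assumption rather than a value we measured.

```python
from pathlib import Path

# Linux exposes its estimate of available entropy (in bits) here.
ENTROPY_AVAIL = Path("/proc/sys/kernel/random/entropy_avail")


def read_entropy_avail(path=ENTROPY_AVAIL):
    """Return the kernel's available-entropy estimate, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def entropy_is_low(bits, threshold=200):
    """Report whether the estimate is below the (assumed) threshold.

    The default threshold is an illustrative guess, not a tuned value.
    """
    return bits is not None and bits < threshold


if __name__ == "__main__":
    bits = read_entropy_avail()
    if bits is None:
        print("entropy estimate not available on this system")
    elif entropy_is_low(bits):
        print("entropy looks low: %d bits" % bits)
    else:
        print("entropy looks fine: %d bits" % bits)
```

Sampling this a few times during an eight-way test run, with and without rng-tools running, would be one way to check that the fix is doing what we expect.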
--
Francis J. Lacoste
francis.lacoste@xxxxxxxxxxxxx