launchpad-dev team mailing list archive

Parallel tests checkpoint meeting notes - 2012-03-21

To: Launchpad Development List <launchpad-dev@xxxxxxxxxxxxxxxxxxx>
From: "Francis J. Lacoste" <francis.lacoste@xxxxxxxxxxxxx>
Date: Thu, 22 Mar 2012 12:20:47 -0400
Organization: Canonical
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:11.0) Gecko/20120309 Thunderbird/11.0
= Parallel Testing Status: 2012-03-21 =

== Overview ==

These two weeks were a roller coaster for the project, moving back and
forth between seemingly final failure and exhilarating success.  We
addressed over 2800 test failures, conquered hangs, and fixed issues in
Launchpad, Testrepository, and LXC.  Today we have our first unimpeded
runs to completion on an eight-core EC2 machine.  These runs took around
55 minutes, and had a handful of test failures.


We also made progress on related jobs, such as getting our Python shell
tools packaged in Ubuntu, getting Python charm helpers added to the
official charm helpers package, and moving along our
replace-rocketfuel-* slack time project.


We still have a lot of work to do.  We need to improve the continuous
integration steps for stability, reporting, and speed; address the
remaining test failures as we find them, including getting help on a
kernel issue; complete experiments to determine the incremental value
of each additional core to the parallelization; and get the well-oiled
machine we have with Juju and EC2 also running manually in the data
center.  However, this update marks a major milestone for the project,
and we are pleased to have accomplished what we did this week.


== Progress towards biweekly action items ==

 * [yellow] [carried over] Scaling assessment based on experimental
 results

    * Was waiting on webops.  During the week of March 14 they
    announced that they would not have the resources to follow
    through in the short term.  We decided to pursue this on EC2
    instead, using a large machine there (an eight-core instance size
    is available).

    * A very poor test success rate (~2800 failures/errors) and hangs
    in the test run made us question whether any experiment would be
    too flawed to be useful.  Thanks to fixing (and working around)
    LXC issues, test isolation issues, and testrepository issues, the
    team got this down to under 10 failures/errors (more details
    later) and got rid of the known hangs, so this may no longer be a
    concern.


    * After a false start on a machine with too little memory (a
    c1.xlarge instance, 8 cores and 7GB) we switched to the other
    available eight-core machine, which has lots of memory (an
    m2.4xlarge, 8 cores and 68.4GB).  This worked fine.  Eight test
    processes running simultaneously seem to need at least 15GB,
    based on spot checks while tests ran.

    * Initial information (determined today) is that tests on eight
    cores take about 55 minutes.  Comparison values are not yet
    available, but are in progress (not comparable but interesting: a
    test run takes 6 hours on lpbuildbot).

    * In-progress methodology: on the same machine, hack
    testrepository’s testcommand.local_concurrency to report 1-8 cores
    (see bug 957145), and get one or two results per core count.
    These are only initial test ordering (round robin) runs.  We have
    a kanban card to keep the .testrepository directory across builds.
    Once we have this, we will run tests to see how results change on
    second and third runs; we will then clean out .testrepository data
    after each concurrency change.
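
The round-robin initial ordering mentioned above can be sketched roughly as follows.  This is only an illustration under stated assumptions, not testrepository's actual code: partition_round_robin and the test ids are hypothetical names made up for the example.  It shows how, with no timing data from earlier runs, test ids are simply dealt out to the workers in turn, so partitions are balanced by count rather than by duration.

```python
def partition_round_robin(test_ids, concurrency):
    """Deal test ids out to `concurrency` partitions in turn.

    Hypothetical helper for illustration; not testrepository's API.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    partitions = [[] for _ in range(concurrency)]
    for index, test_id in enumerate(test_ids):
        partitions[index % concurrency].append(test_id)
    return partitions


# Made-up test ids, only to show the shape of the result.
tests = ["test_%02d" % n for n in range(10)]
for workers in (1, 4, 8):
    sizes = [len(p) for p in partition_round_robin(tests, workers)]
    print(workers, sizes)
```

Because this ordering ignores how long each test takes, one worker can end up with most of the slow tests, which is one reason the second and third runs (with .testrepository data kept) may give different timings.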

 * [yellow] Identify and fix Launchpad bugs for test failures
 discovered in parallel test runs

    * Fixed
        * Launchpad, bug 954319 (Benji): readonly mode isolation bug,
        caused > 2700 of our failures

        * LXC, bug 959352 (Benji): partial workaround (see also the
        tracked items below)

        * Launchpad bugs 953912 (Benji), 953911 (Francesco), 953902
        (Francesco): Test isolation errors

        * LXC, bug 951150 (Gary, Benji): non-ephemeral home directories
        were causing us problems

        * LXC, bug 949956 (Benji): shared MAC address/IP address issues

        * Testrepository, bug 955006 (Francesco): Unicode issues,
        workaround in place

    * Identified and not yet fixed

        * Launchpad test isolation bugs: 953913 (Brad, in progress);
        more coming

        * /dev/random exhaustion needs to be addressed in the setuplxc
        configuration (Gary, Francesco, Benji): it was causing hangs
        on the 8-core machine.  Replacing /dev/random with
        /dev/urandom was insufficient, at least on the first attempt.
        Installing and configuring rng-tools worked, reproducibly:

            1. apt-get install rng-tools
            2. echo "HRNGDEVICE=/dev/urandom" >> /etc/default/rng-tools
            3. /etc/init.d/rng-tools start
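
To check whether the entropy pool is actually being drained during a run, something like the following sketch could be used.  It assumes a Linux kernel that exposes its entropy estimate at /proc/sys/kernel/random/entropy_avail; the function names and the 200-bit threshold are illustrative choices, not values taken from our testing.

```python
ENTROPY_PATH = "/proc/sys/kernel/random/entropy_avail"


def parse_entropy(text):
    """Parse the contents of entropy_avail into an int (bits)."""
    return int(text.strip())


def read_entropy(path=ENTROPY_PATH):
    """Return the kernel's current entropy estimate, in bits."""
    with open(path) as f:
        return parse_entropy(f.read())


def is_starved(available_bits, threshold=200):
    """True if the pool is below `threshold` bits (arbitrary cutoff)."""
    return available_bits < threshold


if __name__ == "__main__":
    try:
        bits = read_entropy()
    except OSError:
        print("entropy estimate not available on this system")
    else:
        print("entropy_avail=%d starved=%s" % (bits, is_starved(bits)))
```

Running this inside a container before and after starting rng-tools, while tests are running, would be one way to confirm that the workaround keeps the pool filled.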

 * [Francis] [carried over] Get the proposed deployment plan, as approved
 by Robert, also approved by mthaddon.

    * We have yet to get webops cycles for this.

== Other accomplishments ==

    * [Brad, Graham] Disposition of common Python code:

        * Generic code was extracted to create python-shelltoolbox, a
        collection of helpers for interacting with the shell via
        Python.  A PPA was created and is in ~yellow.  Clint has
        accepted the task of redoing the packaging so that it is
        backwards compatible to Lucid and sponsoring it for inclusion
        in Ubuntu.

        * Charm-specific code has been moved into the existing
        charm-tools source package.  The packaging for it is a bit
        hairy as that source package already builds two binary
        packages and the new Python bits would add a third.  Clint
        offered suggestions on how to do it but they didn’t work so he
        has taken the task of getting the packaging to work for our
        additions to charm-tools.

        * Code specific to the buildbot charms, shared between master
        and slave, is currently duplicated via the locals.py files.
        This approach works but will be replaced by a PPA.


    * [Francesco: slack] lpsetup:

        * Added buildout files with testing support.

        * Added unit tests (for argparser, handlers and utils)

        * The env var LANG=C is set during the installation of launchpad
        developer packages. This way we can avoid installing
        language-pack-en.

        * Updated the install and lxc-install sub commands to support a
        custom ssh key name. The root ssh key is no longer needed, so
        it is not created anymore.

        * Created the recipe for debian packaging.

== Progress on tracked items ==

=== Completed by others ===

    * LXC: 925024 - apparmor makes it impossible to install
    postgresql-common on Precise

    * LXC: aufs option should be added to lxc-start-ephemeral

=== New and incomplete ===

    * LXC 959352: Ephemeral containers have "/rootfs" prefix in
    /proc/self/maps entries HIGH OR CRITICAL

    * Testrepository 949950 (mentioned but not filed last time):
    testrepository show full subunit stream of running tests HIGH

    * Testrepository 957145: force amount of parallelization, overriding
    reported cores

    * Testrepository 961103: “String or Integer object expected for
    key, unicode found”


=== Carried over and incomplete ===

    * 914166 - Zope layer setup and teardown 'tests' cannot be filtered
    by testr

        * no activity in the last eight weeks

    * RT 50242 - get a buildbot machine for testing

        * actively in progress again

        * We prefer EC2 for tests

== Goals for next meeting ==


 1.  Bug fixes for test failures discovered in parallel test runs.
 Already known targets:

    * Launchpad

        * 953913: test isolation error

        * /dev/random exhaustion solution in setuplxc

    * Testrepository/zope.testing

        * 609986: subunit support for layer failures

    * Buildbot improvements

        * clean up old broken ephemeral lxc containers

        * keep .testrepository data around between builds

        * report failures more accurately

        * make tests always randomly ordered

    * Tracking

        * LXC 959352

 2. Deliver scaling assessment based on experimental results, using EC2
 (carried over from previous two weeks)

 3. Get data center box running tests, and have a single comparison run
 with EC2.
    * /dev/random exhaustion solution approved and installed

-- 
Francis J. Lacoste
francis.lacoste@xxxxxxxxxxxxx