Notes from trying to run tests: much success, a big milestone, still a lot to do
The great news:
- It looks like we have testr running to completion;
- we have confirmation that Francesco's testr unicode fix is what we need;
- most broken tests will be fixed when we address the graph thing Benji
talked about in his Friday email;
- our ability to quickly change machines, run tests, and tear things
down has been a real win lately, at least for what I've been doing, and
I think it will continue to be so; we can point to this as some
validation of our approach of setting up the automation machinery (the
juju work) first; and
- because of our most recent successes, we can probably do some rough
timing tests early this week on a super-big ec2 machine to try and
deliver on our bi-weekly goals of evaluating the effect of multiple
cores, even without the IS machine being ready.
The additional work, as noted so far:
- Benji's work on the graphing stuff points to the next thing to work
on in order to fix a bunch of tests at once;
- we have just a few more test isolation bugs remaining, at least with
the info we have so far;
- testrepository and associated subunit support in the testrunner have a
number of fragilities, it seems;
- our buildbot setup needs to clean up leftovers from previous runs;
- and other things I've forgotten.
Additionally, the bottom (and bulk) of this email is a diary of sorts of
the tests I've run and looked at over the past three days. Perhaps you
can look through it and help me identify what else we need to do. I'll
be doing that as well on Monday.
As to my nausea, my stomach issues are almost over, though I am still
dealing with some aftereffects like being easily tired.
Thanks,
Gary
Test diary:
- First attempt had the OS incorrectly reporting the number of CPUs,
which meant that the test only ran with a single process. I filed
https://bugs.launchpad.net/testrepository/+bug/957145 .
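For what it's worth, a quick sanity check of what the box itself reports
(a sketch in Python; I'm assuming multiprocessing and /proc/cpuinfo are
reasonable proxies for whatever testr actually consults):
import multiprocessing

# What Python thinks the machine has.
print("multiprocessing.cpu_count(): %d" % multiprocessing.cpu_count())
# What the kernel reports directly.
with open("/proc/cpuinfo") as cpuinfo:
    processors = sum(1 for line in cpuinfo if line.startswith("processor"))
print("processor lines in /proc/cpuinfo: %d" % processors)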
- I kept it going to see how it would fare with a single instance,
because that would let us separate lxc issues from test isolation issues.
Results, indicating lxc-related issues (or at least not isolation
issues) are here: http://pastebin.ubuntu.com/887314/ . It sounds like
Benji is hot on the heels of a big chunk of those.
- As you can see, buildbot killed the test because it took longer than
the timeout for "no output". We probably ought to increase it.
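Bumping it should just be a matter of raising the no-output timeout on
the test step in the master config. A sketch, assuming our test step is
an ordinary ShellCommand (the step name, command, and values here are
illustrative, not our actual config):
from buildbot.steps.shell import ShellCommand

# Sketch: raise the "no output" timeout (seconds of silence before
# buildbot kills the command); maxTime, if set, caps the total run time
# regardless of output.  Name, command, and numbers are illustrative.
test_step = ShellCommand(
    name="run-tests",
    command=["make", "check"],
    timeout=2 * 60 * 60,   # allow up to two hours of silence
    maxTime=8 * 60 * 60,   # and eight hours overall
)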
- After buildbot killed the test run, the layer subprocess kept going.
I had to kill it manually. For some reason, the lxc cleanup hadn't
happened either: I had to kill the lxc, and I should have unmounted the
various directories. That would ideally be fixed. We could make an
initial buildbot step that cleaned up any old bits. Here's an example
of what one has to do manually as root in this case, for reference.
# umount /var/lib/lxc/lptests-temp-JippS8O/ephemeralbind/
# umount /var/lib/lxc/lptests-temp-JippS8O/
# rm -rf /var/lib/lxc/lptests-temp-JippS8O/
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 8.0G 4.4G 3.3G 58% /
udev 828M 12K 828M 1% /dev
tmpfs 334M 180K 334M 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 834M 0 834M 0% /run/shm
/dev/xvda2 147G 188M 140G 1% /mnt
cgroup 834M 0 834M 0% /sys/fs/cgroup
none 834M 291M 543M 35% /tmp/lxc-lp-LMq56LW
# umount /tmp/lxc-lp-LMq56LW
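For the automated version, an initial cleanup step in the buildbot
master config could do roughly the above before each run. A sketch,
assuming a plain shell loop is enough and that the lptests-temp-* /
lxc-lp-* naming is stable (the step name and glob patterns are
illustrative):
from buildbot.steps.shell import ShellCommand

# Sketch of an initial cleanup step: unmount and remove any lxc bits
# left over from a previous, killed run.  The "|| true" bits keep the
# step from failing when there is nothing to clean up.
cleanup_step = ShellCommand(
    name="cleanup-stale-lxc",
    command=["bash", "-c",
             "for d in /var/lib/lxc/lptests-temp-*; do "
             "sudo umount $d/ephemeralbind || true; "
             "sudo umount $d || true; "
             "sudo rm -rf $d; "
             "done; "
             "for m in /tmp/lxc-lp-*; do sudo umount $m || true; done"],
    flunkOnFailure=False,
)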
- On the next run, I forced a parallel run by hacking the code. These
were the results: http://pastebin.ubuntu.com/887627/ . I wondered if
the memory problems were because I had not unmounted the overlayfs
directories from the previous run. I killed the lxc, and then I
unmounted both the older and newer sets of directories. I forgot to
check whether there was still a test process running and, if so,
whether it was doing anything; that would have been good to know,
because it would have indicated whether we still had a hanging problem.
We currently have the timeout set at 6 hours. I considered increasing
it, but decided not to, hoping that this was in fact enough.
- The big news for the next run was that it ran to completion, without
hanging! It showed some fragility in testrepository + the zope subunit
implementation (see all those success messages that testrepository is
supposed to hide ATM). It also showed, it seems, that the memory on my
ec2 machine was not big enough. I thought I was running an m1.large,
but apparently I was not--just the juju default, which is pretty small. I
decided to destroy the environment and run the tests again on a bigger
machine. In any case, the results are here:
http://pastebin.ubuntu.com/888502/ . Note that we still need the
testrepository fix (see line 3665 of that paste). I wanted to reproduce
the problem and *then* try to apply the patch...but I'll do that on a
big machine now.
- I ran the juju setup on an m1.large instance. The results were *very*
encouraging: http://pastebin.ubuntu.com/889006/ . The only mystery I
saw there was the
lp.services.job.tests.test_runner.TestTwistedJobRunner.test_memory_hog_job
failure. Other than that, it looked like Benji's discovery of the
graphing problem from Friday would account for the bulk of the failures,
and then there was a known isolation problem, and...the testrepository
Unicode issue on line 163 of that paste.
- As an aside, I was reminded that our time is more expensive than
ec2's. I probably should have been using the faster ec2 machines all of
this time, at least when we were waiting on setuplxc, and so should the
rest of us. I had forgotten how to do it: in addition to setting
"default-instance-type: m1.large" in ~/.juju/environments.yaml, you have
to set "default-image-id:" to an appropriate value from
http://uec-images.ubuntu.com/query/precise/server/released.txt .
Finding your region's "amd64," "ebs" release of the most recent precise
version ("beta1" for me) does the trick, though ISTR that
"instance-store" (rather than "ebs") also works.
- As another aside, it struck me that, even as we seem to be ready for
parallelization tests, different test orderings will very likely result
in discovering new test isolation issues. Also, I wonder whether we
ought to compare the second of two identical runs rather than the first,
since on the second run testrepository orders tests not round-robin but
by...some other mechanism that is supposed to be better in some way.
- It's become clear in the last two test runs, which actually completed,
that we need buildbot to actually understand these test results. It at
least understands the exit code, which is being set properly, but it is
not adding up test failures correctly, so the reporting is off.
Fixing that, at least minimally, is probably a requirement. Robert
added subunit support to buildbot at one point, but it is not available
in the version that we are using, AFAIK. That is probably the right way
to go.
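If/when we move to a version that has it, my understanding is that it is
close to a drop-in replacement for the plain shell step. A sketch,
assuming the class is available as
buildbot.steps.subunit.SubunitShellCommand and that we feed it our
runner's subunit stream (the command is illustrative):
from buildbot.steps.subunit import SubunitShellCommand

# Sketch: have buildbot parse the subunit stream itself, so that test
# failure counts show up in the waterfall rather than just an exit code.
# Assumes a buildbot version that ships this step; the command is
# illustrative.
test_step = SubunitShellCommand(
    name="run-tests-subunit",
    command=["bin/test", "--subunit", "-vv"],
)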
- An aside: I need to write an email to Serge that asks him whether/how
we are supposed to hook up the local dnsmasq, per Martin's email from
Friday; and...something else that I forget that is actually directly
pertinent to these tests. :-/
- It strikes me that we have a number of must-fix testrepository issues,
some of which may be difficult to trigger:
* The unicode issue,
* The fragility in testrepository + the zope subunit implementation
that I mentioned above,
* "String or Integer object expected for key, unicode found"
(https://bugs.launchpad.net/testrepository/+bug/775214 ?)
- And now off for another test run that tries to include Francesco's
testrepository fix. This will/should be run using a different test
ordering, AIUI. For reference, here is the diff for the fix:
35a36
> encoding = getattr(self.stream, 'encoding', None) or 'utf-8'
41c42
< ])
---
> ]).encode(encoding, 'replace')
I did not see the complaint in the next results:
http://pastebin.ubuntu.com/889579/ . I think we have a fix!
- Looking further at the results, we still have "String or Integer
object expected for key, unicode found". It looks like there might be
another isolation error to sink our teeth into:
lp.translations.tests.test_rosetta_branches_script.TestRosettaBranchesScript.test_rosetta_branches_script_oops.
Also, the ending shows a memcache setup problem similar to what can
be found in the previous test run. It seems to fall over badly from the
perspective of subunit, but that may very well simply be fallout from
that bug I have hanging open on the kanban board (609986).
- I also investigated the YUIAppServerLayer and YUITestLayer tests,
since they had failed for me locally. One, the YUIAppServerLayer,
appears to be both failing and reporting incorrectly, so two bugs.
Here's what I found for YUIAppServerLayer:
test: lp.testing.layers.YUIAppServerLayer:setUp
time: 2012-03-18 16:42:34.045308Z
successful: lp.testing.layers.YUIAppServerLayer:setUp [ multipart
]
time: 2012-03-18 16:42:34.045308Z
test: Could not communicate with subprocess
time: 2012-03-18 16:42:34.045308Z
successful: Could not communicate with subprocess [ multipart
]
That is probably just a timeout issue, and is probably related to a card
I've had on the board since the beginning about tests in this layer, but
that certainly should not be reported as a success. On the other hand,
the pure JS tests seemed to run and report fine:
test: lp.testing.layers.YUITestLayer:setUp
time: 2012-03-18 16:43:11.340811Z
successful: lp.testing.layers.YUITestLayer:setUp [ multipart
]
time: 2012-03-18 16:43:11.340811Z
test: lib/lp/app/javascript/formwidgets/tests/test_formwidgets.html
time: 2012-03-18 16:43:11.340811Z
successful:
lib/lp/app/javascript/formwidgets/tests/test_formwidgets.html [ multipart
]
time: 2012-03-18 16:43:11.340811Z
test: lib/lp/app/javascript/formwidgets/tests/test_resizing_textarea.html
time: 2012-03-18 16:43:11.340811Z
successful:
lib/lp/app/javascript/formwidgets/tests/test_resizing_textarea.html [
multipart
]
time: 2012-03-18 16:41:32.437724Z
test:
lp.services.scripts.tests.test_all_scripts.ScriptsTestCase.script_garbo-frequently
time: 2012-03-18 16:41:32.437724Z
successful:
lp.services.scripts.tests.test_all_scripts.ScriptsTestCase.script_garbo-frequently
[ multipart
]
...and so on...