
Notes from trying to run tests: much success, a big milestone, still a lot to do

 

The great news:

- It looks like we have testr running to completion;
- we have confirmation that Francesco's testr unicode fix is what we need;
- most broken tests will be fixed when we address the graph thing Benji talked about in his Friday email;
- our ability to quickly bring machines up, run tests, and tear things down has been a real win lately, at least for what I've been doing, and I think it will continue to be so, which we can point to as some validation of our approach of setting up the automation machinery (the juju work) first; and
- because of our most recent successes, we can probably do some rough timing tests early this week on a super-big ec2 machine to try to deliver on our bi-weekly goal of evaluating the effect of multiple cores, even without the IS machine being ready.

The additional work, as noted so far:

- Benji's work on the graphing stuff is the next lead for fixing a bunch of tests at once;
- we have just a few more test isolation bugs remaining, at least with the info we have so far;
- testrepository and the associated subunit support in the test runner have a number of fragilities, it seems;
- our buildbot setup needs to do some cleanup;
- and other things I've forgotten.

Additionally, the bottom (and bulk) of this email is a diary of sorts of the tests I've run and looked at over the past three days. Perhaps you can look through it and help me identify what else we need to do. I'll be doing that as well on Monday.

As to my nausea, the stomach issues are almost over, though I'm still dealing with some aftereffects, like tiring easily.

Thanks,

Gary

Test diary:

- First attempt had the OS incorrectly reporting the number of CPUs, which meant the tests ran in only a single process. I filed https://bugs.launchpad.net/testrepository/+bug/957145 .
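
For context, here is a tiny, purely illustrative sketch (not testrepository's actual detection code) of how a runner might probe the CPU count, and how the answers can come back wrong or disagree:

import multiprocessing

def guess_concurrency():
    """Two common ways a runner might guess how many workers to spawn."""
    try:
        by_api = multiprocessing.cpu_count()   # the portable guess
    except NotImplementedError:
        by_api = 1
    try:
        # The Linux-specific guess: count 'processor' stanzas in /proc/cpuinfo.
        with open('/proc/cpuinfo') as cpuinfo:
            by_proc = sum(1 for line in cpuinfo if line.startswith('processor'))
    except IOError:
        by_proc = 0
    return by_api, by_proc

if __name__ == '__main__':
    print('cpu_count() says %s, /proc/cpuinfo says %s' % guess_concurrency())
    # If either answer comes back as 1 on a many-core box, a parallel run
    # silently degrades to a single worker.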

- I kept it going to see how it would fare with a single process, because that would let us separate lxc issues from test isolation issues. The results, which point at lxc-related issues (or at least not isolation issues), are here: http://pastebin.ubuntu.com/887314/ . It sounds like Benji is hot on the heels of a big chunk of those.

- As you can see, buildbot killed the test because it took longer than the timeout for "no output". We probably ought to increase it.
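
If we do raise it, it is presumably just the usual ShellCommand knobs in the buildbot master config. A rough sketch, assuming our test step is an ordinary ShellCommand (the step name and command here are illustrative, not our actual configuration):

from buildbot.steps.shell import ShellCommand

test_step = ShellCommand(
    name='parallel-tests',
    command=['./run-parallel-tests.sh'],  # hypothetical wrapper script
    timeout=4 * 60 * 60,  # seconds of *silence* before buildbot kills the step
    maxTime=8 * 60 * 60,  # hard ceiling on total runtime, chatty or not
)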

- After buildbot killed the test run, the layer subprocess kept going. I had to kill it manually. For some reason, the lxc cleanup hadn't happened either: I had to kill the lxc, and I should have unmounted the various directories. That would ideally be fixed. We could make an initial buildbot step that cleaned up any old bits. Here's an example of what one has to do manually as root in this case, for reference.

# umount /var/lib/lxc/lptests-temp-JippS8O/ephemeralbind/
# umount /var/lib/lxc/lptests-temp-JippS8O/
# rm -rf /var/lib/lxc/lptests-temp-JippS8O/
# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      8.0G  4.4G  3.3G  58% /
udev            828M   12K  828M   1% /dev
tmpfs           334M  180K  334M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            834M     0  834M   0% /run/shm
/dev/xvda2      147G  188M  140G   1% /mnt
cgroup          834M     0  834M   0% /sys/fs/cgroup
none            834M  291M  543M  35% /tmp/lxc-lp-LMq56LW
# umount /tmp/lxc-lp-LMq56LW
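
A rough sketch of what an initial "clean up any old bits" buildbot step could run, generalizing the commands above (the lptests-temp-* and lxc-lp-* globs are an assumption based on this one run's names):

import glob
import subprocess

def cleanup_stale_lxc():
    """Best-effort removal of leftovers from a killed test run; run as root."""
    for container in glob.glob('/var/lib/lxc/lptests-temp-*'):
        subprocess.call(['umount', container + '/ephemeralbind'])
        subprocess.call(['umount', container])
        subprocess.call(['rm', '-rf', container])
    # The scratch overlay mounts live under /tmp with a similar random suffix.
    for mount in glob.glob('/tmp/lxc-lp-*'):
        subprocess.call(['umount', mount])

if __name__ == '__main__':
    cleanup_stale_lxc()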


- On the next run, I forced a parallel run by hacking the code. These were the results: http://pastebin.ubuntu.com/887627/ . I wondered if the memory problems were because I had not unmounted the overlayfs directories from the previous run. I killed the lxc, and then I unmounted both the older and newer sets of directories. I forgot to check whether there was still a test process running and, if so, what it was doing; that would have been good to know, because it would have told us whether we still had a hanging problem.

We currently have the timeout set at 6 hours. I considered increasing it, but decided not to, hoping that this was in fact enough.

- The big news for the next run was that it ran to completion, without hanging! It showed some fragility in testrepository + the zope subunit implementation (see all those success messages that testrepository is supposed to hide ATM). It also showed, it seems, that the memory on my ec2 machine was not big enough. I thought I was running an m1.large, but apparently I was not; it was just the juju default, which is pretty small. I decided to destroy the environment and run the tests again on a bigger machine. In any case, the results are here: http://pastebin.ubuntu.com/888502/ . Note that line 3665 still shows the need for the testrepository fix. I wanted to dupe the problem and *then* try to apply the patch...but I'll do that on a big machine now.

- I ran the juju setup on an m1.large instance. The results were *very* encouraging: http://pastebin.ubuntu.com/889006/ . The only mystery I saw there was the lp.services.job.tests.test_runner.TestTwistedJobRunner.test_memory_hog_job failure. Other than that, it looked like Benji's discovery of the graphing problem from Friday would account for the bulk of the failures, and then there was a known isolation problem, and...the testrepository Unicode issue on line 163.

- As an aside, I was reminded that our time is more expensive than ec2's. I probably should have been using the faster ec2 machines all this time, at least while we were waiting on setuplxc, and so should the rest of us. I had forgotten how to do it: in addition to setting "default-instance-type: m1.large" in ~/.juju/environments.yaml, you have to set "default-image-id:" to an appropriate value from http://uec-images.ubuntu.com/query/precise/server/released.txt . Finding your region's "amd64", "ebs" release of the most recent precise version ("beta1" for me) does the trick, though ISTR that "instance-store" (rather than "ebs") also works.
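
For reference, a sketch of the relevant stanza in ~/.juju/environments.yaml (abridged; the image id is a placeholder you would replace with your region's value from the released.txt query above):

environments:
  ec2:
    type: ec2
    default-instance-type: m1.large
    default-image-id: ami-00000000  # placeholder: your region's amd64/ebs precise image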

- As another aside, it struck me that, even though we seem to be ready for parallelization tests, different test orderings will very likely turn up new test isolation issues. Also, I wonder whether we ought to compare the second identical runs rather than the first, since on a second run testrepository orders tests not round-robin but by...some other mechanism that is supposed to be better in some way.

- It has become clear from the last two test runs, which actually completed, that we need buildbot to actually understand these test results. It at least understands the exit code, which is being set properly, but it is not adding up test failures correctly, so reporting is off. Fixing that, at least minimally, is probably a requirement. Robert added subunit support to buildbot at one point, but it is not available in the version we are using, AFAIK. That is probably the right way to go.
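
When we do have a buildbot with that support available, the wiring is presumably along these lines (a sketch only: SubunitShellCommand is the step that parses a subunit stream from the command's output, but the names and command here are illustrative):

from buildbot.steps.subunit import SubunitShellCommand

subunit_step = SubunitShellCommand(
    name='parallel-tests',
    command=['./run-parallel-tests.sh', '--subunit'],  # hypothetical wrapper
)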

- An aside: I need to write an email to Serge that asks him whether/how we are supposed to hook up the local dnsmasq, per Martin's email from Friday; and...something else that I forget that is actually directly pertinent to these tests. :-/

- It strikes me that we have a number of must-fix testrepository issues, some of which may be difficult to trigger:
  * the unicode issue,
  * the fragility in testrepository + the zope subunit implementation that I mentioned above, and
  * "String or Integer object expected for key, unicode found" (https://bugs.launchpad.net/testrepository/+bug/775214 ?).

- And now off for another test run that tries to include Francesco's testrepository fix. This will/should be run using a different test ordering, AIUI. For reference, the patch is:

35a36
>         encoding = getattr(self.stream, 'encoding', None) or 'utf-8'
41c42
<             ])
---
>             ]).encode(encoding, 'replace')

I did not see the complaint in the next results: http://pastebin.ubuntu.com/889579/ . I think we have a fix!
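
For the record, the shape of the fix, paraphrased from the diff above (the function and names here are mine, not the actual patched file): pick up the stream's declared encoding if it has one, fall back to utf-8, and encode with 'replace' so that odd test output cannot blow up the run.

def write_line(stream, fragments):
    # Join the unicode fragments, then encode for the underlying byte
    # stream instead of writing unicode to it directly.
    encoding = getattr(stream, 'encoding', None) or 'utf-8'
    stream.write(u''.join(fragments).encode(encoding, 'replace'))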

- Looking further at the results, we still have "String or Integer object expected for key, unicode found". It also looks like there might be another isolation error to sink our teeth into: lp.translations.tests.test_rosetta_branches_script.TestRosettaBranchesScript.test_rosetta_branches_script_oops . Also, the ending shows a memcache setup problem similar to what can be found in the previous test run. It seems to fall over badly from the perspective of subunit, but that may very well simply be fallout from the bug I have hanging open on the kanban board (609986).

- I also investigated the YUIAppServerLayer and YUITestLayer tests, since they had failed for me locally. One of them, YUIAppServerLayer, appears to be both failing and reporting incorrectly, so that's two bugs. Here's what I found for YUIAppServerLayer:

test: lp.testing.layers.YUIAppServerLayer:setUp
time: 2012-03-18 16:42:34.045308Z
successful: lp.testing.layers.YUIAppServerLayer:setUp [ multipart
]
time: 2012-03-18 16:42:34.045308Z
test: Could not communicate with subprocess
time: 2012-03-18 16:42:34.045308Z
successful: Could not communicate with subprocess [ multipart
]

That is probably just a timeout issue, and is probably related to a card I've had on the board since the beginning about tests in this layer, but it certainly should not be reported as a success (there's a small stream-scanning sketch after the excerpts below). On the other hand, the pure JS tests seemed to run and report fine:

test: lp.testing.layers.YUITestLayer:setUp
time: 2012-03-18 16:43:11.340811Z
successful: lp.testing.layers.YUITestLayer:setUp [ multipart
]
time: 2012-03-18 16:43:11.340811Z
test: lib/lp/app/javascript/formwidgets/tests/test_formwidgets.html
time: 2012-03-18 16:43:11.340811Z
successful: lib/lp/app/javascript/formwidgets/tests/test_formwidgets.html [ multipart
]
time: 2012-03-18 16:43:11.340811Z
test: lib/lp/app/javascript/formwidgets/tests/test_resizing_textarea.html
time: 2012-03-18 16:43:11.340811Z
successful: lib/lp/app/javascript/formwidgets/tests/test_resizing_textarea.html [ multipart
]
time: 2012-03-18 16:41:32.437724Z
test: lp.services.scripts.tests.test_all_scripts.ScriptsTestCase.script_garbo-frequently
time: 2012-03-18 16:41:32.437724Z
successful: lp.services.scripts.tests.test_all_scripts.ScriptsTestCase.script_garbo-frequently [ multipart
]
...and so on...
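
As a quick way to hunt for records like that bogus "Could not communicate with subprocess" success in a raw subunit stream, something like the following might do. This is only a sketch using python-subunit's v1 parser; the marker list and the saved-stream filename are made up:

import sys
import unittest

import subunit  # python-subunit


class SuspiciousSuccessSpotter(unittest.TestResult):
    """Flag 'successful' records whose ids look like error banners."""

    SUSPECT_MARKERS = ('Could not communicate with subprocess',)

    def addSuccess(self, test):
        unittest.TestResult.addSuccess(self, test)
        if any(marker in test.id() for marker in self.SUSPECT_MARKERS):
            print('bogus success: %s' % test.id())


def main(path):
    result = SuspiciousSuccessSpotter()
    with open(path, 'rb') as stream:
        # ProtocolTestCase replays a subunit stream into a TestResult.
        subunit.ProtocolTestCase(stream).run(result)
    print('%s tests replayed' % result.testsRun)


if __name__ == '__main__':
    main(sys.argv[1])  # e.g. a saved stream such as results.subunit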

