Notes from trying to run tests: much success, a big milestone, still a lot to do
The great news:
- It looks like we have testr running to completion;
- we have confirmation that Francesco's testr unicode fix is what we need;
- most broken tests will be fixed when we address the graph thing Benji
talked about in his Friday email;
- our ability to quickly change machines, run tests, and tear things
down has been a real win lately, at least for what I've been doing, and
I think it will continue to be so; we can point to this as some
validation of our approach of setting up the automation machinery (the
juju work) first; and
- because of our most recent successes, we can probably do some rough
timing tests early this week on a super-big ec2 machine to try and
deliver on our bi-weekly goals of evaluating the effect of multiple
cores, even without the IS machine being ready.
The additional work, as noted so far:
- Benji's work on the graphing stuff points to the next thing to work
on in order to fix a bunch of tests at once;
- we have just a few more test isolation bugs remaining, at least with
the info we have so far;
- testrepository and associated subunit support in the testrunner have a
number of fragilities, it seems;
- our buildbot setup needs to clean up leftovers from previous runs;
- and other things I've forgotten.
Additionally, the bottom (and bulk) of this email is a diary of sorts of
the tests I've run and looked at over the past three days. Perhaps you
can look through it and help me identify what else we need to do. I'll
be doing that as well on Monday.
As to my nausea, my stomach issues are almost over, though I am still
dealing with some aftereffects like being easily tired.
Thanks,
Gary
Test diary:
- First attempt had the OS incorrectly reporting the number of CPUs,
which meant that the test only ran with a single process. I filed
https://bugs.launchpad.net/testrepository/+bug/957145 .
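For what it's worth, a quick sanity check of what the box itself reports
(a sketch in Python; I'm assuming multiprocessing and /proc/cpuinfo are
reasonable proxies for whatever testr actually consults):
import multiprocessing

# What Python thinks the machine has.
print("multiprocessing.cpu_count(): %d" % multiprocessing.cpu_count())
# What the kernel reports directly.
with open("/proc/cpuinfo") as cpuinfo:
    processors = sum(1 for line in cpuinfo if line.startswith("processor"))
print("processor lines in /proc/cpuinfo: %d" % processors)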
- I kept it going to see how it would fare with a single instance,
because that would let us separate lxc issues from test isolation issues.
Results, indicating lxc-related issues (or at least not isolation
issues) are here: http://pastebin.ubuntu.com/887314/ . It sounds like
Benji is hot on the heels of a big chunk of those.
- As you can see, buildbot killed the test because it took longer than
the timeout for "no output". We probably ought to increase it.
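Bumping it should just be a matter of raising the no-output timeout on
the test step in the master config. A sketch, assuming our test step is
an ordinary ShellCommand (the step name, command, and values here are
illustrative, not our actual config):
from buildbot.steps.shell import ShellCommand

# Sketch: raise the "no output" timeout (seconds of silence before
# buildbot kills the command); maxTime, if set, caps the total run time
# regardless of output.  Name, command, and numbers are illustrative.
test_step = ShellCommand(
    name="run-tests",
    command=["make", "check"],
    timeout=2 * 60 * 60,   # allow up to two hours of silence
    maxTime=8 * 60 * 60,   # and eight hours overall
)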
- After buildbot killed the test run, the layer subprocess kept going.
I had to kill it manually. For some reason, the lxc cleanup hadn't
happened either: I had to kill the lxc, and I should have unmounted the
various directories. That would ideally be fixed. We could make an
initial buildbot step that cleaned up any old bits. Here's an example
of what one has to do manually as root in this case, for reference.
# umount /var/lib/lxc/lptests-temp-JippS8O/ephemeralbind/
# umount /var/lib/lxc/lptests-temp-JippS8O/
# rm -rf /var/lib/lxc/lptests-temp-JippS8O/
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 8.0G 4.4G 3.3G 58% /
udev 828M 12K 828M 1% /dev
tmpfs 334M 180K 334M 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 834M 0 834M 0% /run/shm
/dev/xvda2 147G 188M 140G 1% /mnt
cgroup 834M 0 834M 0% /sys/fs/cgroup
none 834M 291M 543M 35% /tmp/lxc-lp-LMq56LW
# umount /tmp/lxc-lp-LMq56LW
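For the automated version, an initial cleanup step in the buildbot
master config could do roughly the above before each run. A sketch,
assuming a plain shell loop is enough and that the lptests-temp-* /
lxc-lp-* naming is stable (the step name and glob patterns are
illustrative):
from buildbot.steps.shell import ShellCommand

# Sketch of an initial cleanup step: unmount and remove any lxc bits
# left over from a previous, killed run.  The "|| true" bits keep the
# step from failing when there is nothing to clean up.
cleanup_step = ShellCommand(
    name="cleanup-stale-lxc",
    command=["bash", "-c",
             "for d in /var/lib/lxc/lptests-temp-*; do "
             "sudo umount $d/ephemeralbind || true; "
             "sudo umount $d || true; "
             "sudo rm -rf $d; "
             "done; "
             "for m in /tmp/lxc-lp-*; do sudo umount $m || true; done"],
    flunkOnFailure=False,
)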
- On the next run, I forced a parallel run by hacking the code. These
were the results: http://pastebin.ubuntu.com/887627/ . I wondered if
the memory problems were because I had not unmounted the overlayfs
directories from the previous run. I killed the lxc, and then I
unmounted both the older and newer sets of directories. I forgot to
check whether there was still a test process running and, if so,
whether it was doing anything; that would have been good to know,
because it would have indicated whether we still had a hanging problem.
We currently have the timeout set at 6 hours. I considered increasing
it, but decided not to, hoping that this was in fact enough.
- The big news for the next run was that it ran to completion, without
hanging! It showed some fragility in testrepository + the zope subunit
implementation (see all those success messages that testrepository is
supposed to hide ATM). It also showed, it seems, that the memory on my
ec2 machine was not big enough. I thought I was running an m1.large,
but apparently I was not--just the juju default, which is pretty small. I
decided to destroy the environment and run the tests again on a bigger
machine. In any case, the results are here:
http://pastebin.ubuntu.com/888502/ . Note that we still need the
testrepository fix (see line 3665 of that paste). I wanted to reproduce
the problem and *then* try to apply the patch...but I'll do that on a
big machine now.
- I ran the juju setup on an m1.large instance. The results were *very*
encouraging: http://pastebin.ubuntu.com/889006/ . The only mystery I
saw there was the
lp.services.job.tests.test_runner.TestTwistedJobRunner.test_memory_hog_job
failure. Other than that, it looked like Benji's discovery of the
graphing problem from Friday would account for the bulk of the failures,
and then there was a known isolation problem, and...the testrepository
Unicode issue on line 163 of that paste.
- As an aside, I was reminded that our time is more expensive than
ec2's. I probably should have been using the faster ec2 machines all of
this time, at least when we were waiting on setuplxc, and so should the
rest of us. I had forgotten how to do it: in addition to setting
"default-instance-type: m1.large" in ~/.juju/environments.yaml, you have
to set "default-image-id:" to an appropriate value from
http://uec-images.ubuntu.com/query/precise/server/released.txt .
Finding your region's "amd64," "ebs" release of the most recent precise
version ("beta1" for me) does the trick, though ISTR that
"instance-store" (rather than "ebs") also works.
- As another aside, it struck me that, even as we seem to be ready for
parallelization tests, different test orderings will very likely result
in discovering new test isolation issues. Also, I wonder whether we
ought to compare the second of two identical runs rather than the first,
since on the second run testrepository orders tests not round-robin but
by...some other mechanism that is supposed to be better in some way.
- It's become clear in the last two test runs, which actually completed,
that we need buildbot to actually understand these test results. It at
least understands the exit code, which is being set properly, but it is
not adding up test failures correctly, so the reporting is off.
Fixing that, at least minimally, is probably a requirement. Robert
added subunit support to buildbot at one point, but it is not available
in the version that we are using, AFAIK. That is probably the right way
to go.
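If/when we move to a version that has it, my understanding is that it is
close to a drop-in replacement for the plain shell step. A sketch,
assuming the class is available as
buildbot.steps.subunit.SubunitShellCommand and that we feed it our
runner's subunit stream (the command is illustrative):
from buildbot.steps.subunit import SubunitShellCommand

# Sketch: have buildbot parse the subunit stream itself, so that test
# failure counts show up in the waterfall rather than just an exit code.
# Assumes a buildbot version that ships this step; the command is
# illustrative.
test_step = SubunitShellCommand(
    name="run-tests-subunit",
    command=["bin/test", "--subunit", "-vv"],
)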
- An aside: I need to write an email to Serge that asks him whether/how
we are supposed to hook up the local dnsmasq, per Martin's email from
Friday; and...something else that I forget that is actually directly
pertinent to these tests. :-/
- It strikes me that we have a number of must-fix testrepository issues,
some of which may be difficult to trigger:
* The unicode issue,
* The fragility in testrepository + the zope subunit implementation
that I mentioned above,
* "String or Integer object expected for key, unicode found"
(https://bugs.launchpad.net/testrepository/+bug/775214 ?)
- And now off for another test run that tries to include Francesco's
testrepository fix. This will/should be run using a different test
ordering, AIUI. For reference, here is the diff for the fix:
35a36
> encoding = getattr(self.stream, 'encoding', None) or 'utf-8'
41c42
< ])
---
> ]).encode(encoding, 'replace')
I did not see the complaint in the next results:
http://pastebin.ubuntu.com/889579/ . I think we have a fix!
- Looking further at the results, we still have "String or Integer
object expected for key, unicode found". It looks like there might be
another isolation error to sink our teeth into:
lp.translations.tests.test_rosetta_branches_script.TestRosettaBranchesScript.test_rosetta_branches_script_oops.
Also, the ending shows a memcache setup problem similar to what can
be found in the previous test run. It seems to fall over badly from the
perspective of subunit, but that may very well simply be fallout from
that bug I have hanging open on the kanban board (609986).
- I also investigated the YUIAppServerLayer and YUITestLayer tests,
since they had failed for me locally. One, the YUIAppServerLayer,
appears to be both failing and reporting incorrectly, so two bugs.
Here's what I found for YUIAppServerLayer:
test: lp.testing.layers.YUIAppServerLayer:setUp
time: 2012-03-18 16:42:34.045308Z
successful: lp.testing.layers.YUIAppServerLayer:setUp [ multipart
]
time: 2012-03-18 16:42:34.045308Z
test: Could not communicate with subprocess
time: 2012-03-18 16:42:34.045308Z
successful: Could not communicate with subprocess [ multipart
]
That is probably just a timeout issue, and is probably related to a card
I've had on the board since the beginning about tests in this layer, but
that certainly should not be reported as a success. On the other hand,
the pure JS tests seemed to run and report fine:
test: lp.testing.layers.YUITestLayer:setUp
time: 2012-03-18 16:43:11.340811Z
successful: lp.testing.layers.YUITestLayer:setUp [ multipart
]
time: 2012-03-18 16:43:11.340811Z
test: lib/lp/app/javascript/formwidgets/tests/test_formwidgets.html
time: 2012-03-18 16:43:11.340811Z
successful:
lib/lp/app/javascript/formwidgets/tests/test_formwidgets.html [ multipart
]
time: 2012-03-18 16:43:11.340811Z
test: lib/lp/app/javascript/formwidgets/tests/test_resizing_textarea.html
time: 2012-03-18 16:43:11.340811Z
successful:
lib/lp/app/javascript/formwidgets/tests/test_resizing_textarea.html [
multipart
]
time: 2012-03-18 16:41:32.437724Z
test:
lp.services.scripts.tests.test_all_scripts.ScriptsTestCase.script_garbo-frequently
time: 2012-03-18 16:41:32.437724Z
successful:
lp.services.scripts.tests.test_all_scripts.ScriptsTestCase.script_garbo-frequently
[ multipart
]
...and so on...