launchpad-reviewers team mailing list archive
-
launchpad-reviewers team
-
Mailing list archive
-
Message #05654
[Merge] lp:~jtv/launchpad/bug-717969 into lp:launchpad
The proposal to merge lp:~jtv/launchpad/bug-717969 into lp:launchpad has been updated.
Description changed to:
= Bug 717969 =
This is an oversized branch. Don't just pick it up out of a sense of duty; if it's too big or you're not certain about the domain-specific stuff, contact one of the Soyuz experts — Julian, Steve, or William.
The buildmaster was showing very strange symptoms. In a nutshell, cases where observed where it went through steps A, B, and C; we could see that steps A and C were completed successfully; but the expected results from step B were just not there.
The cause turns out to be a dangerous mix of global state: our database transaction management is global to a thread (all of the thread's changes are committed or all are aborted) but our use of twisted makes a single python thread/process flit back and forth between servicing multiple logical threads of execution.
Each of those threads of execution might, from its point of view: make some changes to ORM-backed objects, then initiate an asynchronous request (talking to the librarian or a build slave), block for the result to arrive, make some more ORM changes, block again, and later commit or abort the database transaction.
That is purely from the point of view of a logical thread of execution. Twisted experts are quick to point out that “nothing blocks” in Twisted, but that is exactly the problem: while one logical thread of execution “blocks,” another gets to execute. And who's to say whether that other one might not commit or abort the first one's changes prematurely? With the numbers of builders we drive, it's bound to happen regularly.
We discussed a few architectural solutions: farming off the ORM changes to a worker thread, doing all the work in threads, and so on. The real problem is that the ORM changes are inlined in the buildmaster, but largely hidden away in the various build-related class hierarchies. Twisted is meant for glue between systems, not for interleaving complex and dispersed stateful machinery with the glue's callback structure.
Ideally, we should separate the “glue” from the “machinery.” But it's complex and fragile, so the next step from here is to get the code under control to the point where we can reason reliably about it. Once we know more about what the code really does, we'll have more freedom to reorganize it.
This branch represents that next step from here: make the buildmaster run in read-only transaction policy. It will not be able to make any changes to the database (or ORM-backed objects), except in sections of code that are explicitly wrapped in read-write transaction policies. Now we know exactly where the buildmaster may change the database — doing it anywhere else would be an error. We keep the read-write sections absolutely minimal, and try to avoid method calls where Twisted context switches might hide. Unfortunately neither Twisted nor Storm seems to have a facility for forbidding them altogether.
Every read-write section commits immediately. That means more fine-grained state changes. I can't be 100% certain that early commits will not produce any unwanted object states, and I can't be 100% certain that the read-write policies cover all code paths that need them. But we've run several types of build through the patched build farm, and we've stress-tested it against about 5 concurrent builders working all-out. As expected, we found code that still lacked an explicit read-write policy — but only one of them. There may be more, but only production use will find them all.
For Q/A, we once again perform builds on staging and/or dogfood, of as many kinds as we can. Include concurrent builds, and successes as well as failures. Verify that the build products (tarballs, packages, changes files, build logs) all go into the librarian, and that the builds end up in a proper terminal state. The contents of the build logs should be consistent with that end state.
As for tests… sorry, I just run them all!
No lint that I could help, copyrights updated, imports formatted as per policy. Thank you for reading this far.
Jeroen
For more details, see:
https://code.launchpad.net/~jtv/launchpad/bug-717969/+merge/83022
--
https://code.launchpad.net/~jtv/launchpad/bug-717969/+merge/83022
Your team Launchpad code reviewers is requested to review the proposed merge of lp:~jtv/launchpad/bug-717969 into lp:launchpad.
References