ubuntu-phone team mailing list archive
Message #06641
D-Bus flakiness during tests (was Re: Landing team 25.02.14)
On Feb 26, 2014, at 03:37 PM, Bill Filler wrote:
>I'm looking at this with the help of Omer. It's very strange as the
>failure is simply trying to switch from the main view to the Albums tab
>(via the ubuntu-ui-toolkit-emulator classes) and that operation does not
>succeed. It's getting a time out from dbus. This issue seems to occur
>in other smoketest failures as well, like this one
>http://ci.ubuntu.com/smokeng/trusty/touch/mako/209:20140226.1:20140224/6842/gallery_app/818684/.
>
>The test seems to be written correctly and I can't reproduce it on a
>device despite running it multiple times. Omer (or other AP experts) are
>needed here for the next step of evaluation.
This looks suspiciously like timeouts I see when running the system-image test
suite. I don't know how your tests are set up or under what environment
they're run, but I'm pretty well convinced there's some unaccounted-for
flakiness in D-Bus in some test environments.
Actually, I think there are two general D-Bus problems.
* dbus-daemon SIGHUP race conditions.
In my system-image test suite, I start up a dbus-daemon with some custom
system bus services. My service config files change depending on the test
being run, but you cannot kill and restart dbus-daemon when this happens
because libdbus only reads its private-bus environment variables once when
the library initializes. The solution that works is to write the new config
files, then SIGHUP dbus-daemon, which tells it to re-read its config files.
Alternatively, you can call ReloadConfig() on the / object through the
org.freedesktop.DBus interface.
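
As a rough illustration of the "write the new config files, then SIGHUP" sequence described above, here is a minimal sketch. The function name, its parameters, the `user` default, and the write-then-rename step are all assumptions made for this example and are not taken from the system-image test suite; the `kill` argument exists only so the signal can be faked in a test.

```python
import os
import signal
import tempfile

# Standard D-Bus activation file layout; system bus services also need User=.
SERVICE_TEMPLATE = """\
[D-BUS Service]
Name={name}
Exec={command}
User={user}
"""


def write_service_and_hup(service_dir, name, command, daemon_pid,
                          user='root', kill=os.kill):
    """Write NAME.service into service_dir, then SIGHUP the daemon.

    Hypothetical helper: the file is written under a temporary name and
    renamed into place so that a reader never sees a half-written file.
    """
    path = os.path.join(service_dir, name + '.service')
    fd, tmp_path = tempfile.mkstemp(dir=service_dir, suffix='.tmp')
    with os.fdopen(fd, 'w') as fp:
        fp.write(SERVICE_TEMPLATE.format(
            name=name, command=command, user=user))
    os.replace(tmp_path, path)
    # Only signal after the new file is fully in place.
    kill(daemon_pid, signal.SIGHUP)
    return path
```

Even with the file fully in place before the signal is sent, nothing here waits for dbus-daemon to finish re-reading its configuration, which is where the race discussed next comes in.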
On a sufficiently beefy desktop box, this is quite reliable. Not so much on
the buildds. Clearly there is a race condition even when ReloadConfig() is
used. Even if the new configs are already in place when you
SIGHUP/ReloadConfig(), dbus-daemon will sometimes complain that there is no
.service file available for the service you're trying to D-Bus activate. I
can see why SIGHUP would be racy, and my guess is that ReloadConfig() boils
down to just SIGHUPing rather than doing something more sane, like
synchronously reloading the configs and not returning until all of its data
structures are up-to-date.
I've taken to programming defensively around this one:
import os
import time
import dbus

OVERRIDE = os.environ.get('SYSTEMIMAGE_DBUS_DAEMON_HUP_SLEEP_SECONDS')
HUP_SLEEP = (0 if OVERRIDE is None else int(OVERRIDE))
...
def blah():
    # Ask dbus-daemon to re-read its config, then give it time to settle.
    service = dbus.SystemBus().get_object('org.freedesktop.DBus', '/')
    iface = dbus.Interface(service, 'org.freedesktop.DBus')
    iface.ReloadConfig()
    time.sleep(HUP_SLEEP)
When developing on my desktop, this calls time.sleep(0), which, as I say,
always works on my development machine. Then in my d/rules I set
SYSTEMIMAGE_DBUS_DAEMON_HUP_SLEEP_SECONDS to 2 so that on the buildds
there's a short blocking delay before continuing on to D-Bus activation.
2 seconds is the Goldilocks value; 1 is definitely too short. So far, with
this change I've been able to very reliably avoid this particular problem on
the buildds.
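
A possible alternative to a fixed sleep is to poll until the daemon reports the service as activatable. The sketch below is only that: `wait_until` and `wait_for_activatable` are hypothetical helpers, and it is an untested assumption that the daemon's list of activatable names changes at the same moment activation starts working on the buildds.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.1):
    """Call predicate() repeatedly until it is true or timeout expires.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def wait_for_activatable(list_names, name, timeout=5.0, interval=0.1):
    """Wait until `name` shows up in whatever list_names() returns.

    list_names is any zero-argument callable returning an iterable of
    bus names; passing it in keeps this helper free of D-Bus imports.
    """
    return wait_until(lambda: name in list_names(), timeout, interval)
```

With dbus-python, `list_names` could be the `ListActivatableNames` method of the same org.freedesktop.DBus interface used for ReloadConfig() above, for example `wait_for_activatable(iface.ListActivatableNames, 'com.example.Service')`, where the service name is a placeholder. If the poll times out, falling back to the fixed sleep would preserve the behavior that is known to work.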
* Random D-Bus timeouts.
This one is tougher, and more likely similar to what you're seeing. Every
once in a while I get random D-Bus timeouts in response to some
messages. I've seen this on my desktop and laptop, and in PPAs and archive
buildds, even when the systems do not seem to be overloaded. There's no
pattern that I can see related to *which* methods time out - sometimes they
break a test, other times they don't. Sometimes they are messages between
system-image-dbus and its client, and other times they are between
ubuntu-download-manager and system-image-dbus. I have no clue what's
causing them, but fortunately(?), they're rare-ish for me.
The only workaround I've come up with is to retry the test when developing
locally, or to retry the build. Usually on the second or, more rarely, the
third attempt it will Just Work.
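
The retry workaround could also be applied per call rather than per test run. The following is a minimal sketch under stated assumptions: `retry_on` is a hypothetical decorator, the limit of three attempts simply mirrors the "second or third time" observation above, and the built-in TimeoutError stands in for whatever exception the D-Bus binding actually raises.

```python
import functools


def retry_on(exc_type, attempts=3):
    """Retry the wrapped callable when it raises exc_type.

    The call is tried at most `attempts` times; the exception from the
    final attempt is allowed to propagate to the caller.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kws):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kws)
                except exc_type:
                    if attempt == attempts:
                        raise
        return wrapper
    return decorator
```

With dbus-python, the exception to catch would probably be dbus.exceptions.DBusException, narrowed to timeouts by checking its D-Bus error name; that mapping is an assumption here and worth verifying against the binding in use. Retrying hides the symptom without explaining it, so it is best limited to calls that are safe to repeat.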
I'd love to know more about what's going on here, but I suspect we're both
seeing symptoms of the same problem.
Cheers,
-Barry