
ubuntu-phone team mailing list archive

D-Bus flakiness during tests (was Re: Landing team 25.02.14)

 

On Feb 26, 2014, at 03:37 PM, Bill Filler wrote:

>I'm looking at this with the help of Omer. It's very strange as the
>failure is simply trying to switch from the main view to the Albums tab
>(via the ubuntu-ui-toolkit-emulator classes) and that operation does not
>succeed. It's getting a timeout from dbus. This issue seems to occur
>in other smoketest failures as well, like this one
>http://ci.ubuntu.com/smokeng/trusty/touch/mako/209:20140226.1:20140224/6842/gallery_app/818684/.
>
>The test seems to be written correctly and I can't reproduce it on a
>device despite running it multiple times. Omer (or other AP experts) are
>needed here for the next step of evaluation.

This looks suspiciously like timeouts I see when running the system-image test
suite.  I don't know how your tests are set up or under what environment
they're run, but I'm pretty well convinced there's some unaccounted-for
flakiness in D-Bus in some test environments.

Actually, I think there are two general D-Bus problems.

* dbus-daemon SIGHUP race conditions.

  In my system-image test suite, I start up a dbus-daemon with some custom
  system bus services.  My service config files change depending on the test
  being run, but you cannot kill and restart dbus-daemon when this happens
  because libdbus only reads its private-bus environment variables once when
  the library initializes.  The solution that works is to write the new config
  files, then SIGHUP dbus-daemon, which tells it to re-read its config files.
  Alternatively, you can call ReloadConfig() on the org.freedesktop.DBus
  interface of the / object.

  On a sufficiently beefy desktop box, this is quite reliable.  Not so much on
  the buildds.  Clearly there is a race condition even when ReloadConfig() is
  used.  Even if the new configs are already in place when you
  SIGHUP/ReloadConfig(), dbus-daemon will sometimes complain that there is no
  .service file available for the service you're trying to D-Bus activate.  I
  can see why SIGHUP would be racy, and my guess is that ReloadConfig() boils
  down to just SIGHUPing rather than doing something more sane, like
  synchronously reloading the configs and not returning until all of its data
  structures are up-to-date.

  I've taken to programming defensively around this one:

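  import os
  import time
  import dbus

  # Optional delay, in seconds, after asking dbus-daemon to reload its configs.
  OVERRIDE = os.environ.get('SYSTEMIMAGE_DBUS_DAEMON_HUP_SLEEP_SECONDS')
  HUP_SLEEP = 0 if OVERRIDE is None else int(OVERRIDE)
  ...
  def blah():
      # Ask the bus to re-read its config files, then wait before activating.
      service = dbus.SystemBus().get_object('org.freedesktop.DBus', '/')
      iface = dbus.Interface(service, 'org.freedesktop.DBus')
      iface.ReloadConfig()
      time.sleep(HUP_SLEEP)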

  When developing on my desktop, this calls time.sleep(0), which, as I say,
  always works on my development machine.  Then in my d/rules I set
  SYSTEMIMAGE_DBUS_DAEMON_HUP_SLEEP_SECONDS to 2 so that on the buildds
  there's a short blocking delay before continuing on to D-Bus activation.
  2 seconds is the Goldilocks value; 1 is definitely too short.  So far, with
  this change I've been able to very reliably avoid this particular problem on
  the buildds.
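
A possible alternative to the fixed sleep above is to poll for a condition
with a deadline, so fast machines don't wait longer than they need to.  This
is only a minimal sketch: wait_until() and its parameters are hypothetical
names, not part of system-image, and whether polling actually closes the
reload race on the buildds is untested.

```python
import time


def wait_until(predicate, timeout=2.0, interval=0.05):
    """Poll predicate() until it returns true or `timeout` seconds elapse.

    Returns True if the predicate succeeded, False if the deadline passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical use after iface.ReloadConfig() (needs dbus-python and a
# running bus; 'com.example.Service' is a made-up service name):
#
#   wait_until(lambda: 'com.example.Service' in iface.ListActivatableNames())
```

ListActivatableNames() is a standard method on the org.freedesktop.DBus
interface, but it is an assumption that its result reflects a finished
reload; if it doesn't, the fixed sleep remains the safer choice.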

* Random D-Bus timeouts.

  This one is tougher, and more likely similar to what you're seeing.  Every
  once in a while I get random D-Bus timeouts in response to some messages.
  I've seen this on my desktop and laptop, and in PPAs and archive buildds,
  even when the systems do not seem to be overloaded.  There's no pattern
  that I can see related to *which* methods time out - sometimes they break
  a test, other times they don't.  Sometimes they are messages between
  system-image-dbus and its client, and other times they are between
  ubuntu-download-manager and system-image-dbus.  I have no clue what's
  causing them, but fortunately(?), they're rare-ish for me.

  The only workaround I've come up with is to retry the test when developing
  locally, or to retry the build.  Usually the second or, more rarely, the
  third time, it will Just Work.

  I'd love to know more about what's going on here, but I suspect we're both
  seeing symptoms of the same problem.
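
The retry workaround described above could be automated for individual calls
rather than whole test runs.  The following is a rough sketch under stated
assumptions: retry() is a hypothetical helper, not something in system-image
or autopilot, and it only masks the timeouts instead of explaining them.

```python
import functools


def retry(times=3, exceptions=(Exception,)):
    """Decorator: call the function up to `times` times.

    Only the listed exception types trigger another attempt; if every
    attempt fails, the last exception is re-raised.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == times:
                        raise
        return wrapper
    return decorator
```

With dbus-python, one would presumably pass
exceptions=(dbus.exceptions.DBusException,) and ideally check that the error
really is a timeout before retrying, so that genuine failures still surface
on the first attempt.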

Cheers,
-Barry

