launchpad-dev team mailing list archive

Branch scanner failures

 

Since the datacentre move, the codehosting branch scanner has been
failing intermittently. The symptom is a permanent "Updating
branch..." message on the website, which often goes unnoticed until a
diff fails to appear in an associated merge proposal.

The failures in ackee/bzrsyncd/celeryd-job.log are along the lines of:

[2012-08-22 16:39:26,958: INFO/MainProcess] Got task from broker:
lp.services.job.celeryjob.CeleryRunJobIgnoreResult[BranchScanJob_14657367_c8b90ba9-db0a-4d2e-82b2-82413fd6b81e]
[2012-08-22 16:39:27,012: INFO/PoolWorker-2] Running <SCAN_BRANCH
branch job (4348709) for
~mandel/ubuntuone-client/use-new-fsevents-api> (ID 14657367) in status
Waiting
[2012-08-22 16:39:29,526: INFO/PoolWorker-2] Scanning branch:
~mandel/ubuntuone-client/use-new-fsevents-api
[2012-08-22 16:39:29,526: INFO/PoolWorker-2]     from
lp-internal:///~mandel/ubuntuone-client/use-new-fsevents-api
[2012-08-22 16:39:29,526: INFO/PoolWorker-2] Retrieving history from bzrlib.
[2012-08-22 16:39:29,984: INFO/PoolWorker-2] Retrieving ancestry from database.
[2012-08-22 16:39:30,533: INFO/PoolWorker-2] Planning changes.
[2012-08-22 16:39:30,533: INFO/PoolWorker-2] Calculating history delta.
[2012-08-22 16:39:30,540: INFO/PoolWorker-2] Adding 1 new revisions.
[2012-08-22 16:39:31,699: INFO/PoolWorker-2] Job resulted in OOPS:
OOPS-030ff1ea23f05521d4fd9800a66a2a3a
[2012-08-22 16:39:31,700: INFO/MainProcess] Task
lp.services.job.celeryjob.CeleryRunJobIgnoreResult[BranchScanJob_14657367_c8b90ba9-db0a-4d2e-82b2-82413fd6b81e]
succeeded in 4.71384119987s: None

Unfortunately the traceback in the OOPS is not useful, as it shows
cleanup fallout rather than the original error:

Traceback (most recent call last):
  Module lazr.jobrunner.jobrunner, line 194, in runJobHandleError
    self.runJob(job, fallback)
  Module lp.services.job.runner, line 295, in runJob
    super(BaseJobRunner, self).runJob(IRunnableJob(job), fallback)
  Module lazr.jobrunner.jobrunner, line 162, in runJob
    job.run()
  Module lp.code.model.branchjob, line 331, in run
    bzrsync.syncBranchAndClose()
  Module contextlib, line 34, in __exit__
    self.gen.throw(type, value, traceback)
  Module lp.services.database.locking, line 50, in try_advisory_lock
    store.execute(Select(AdvisoryUnlock(lock_type.value, lock_id)))
  Module storm.store, line 108, in execute
    return self._connection.execute(statement, params, noresult)
  Module storm.databases.postgres, line 266, in execute
    return Connection.execute(self, statement, params, noresult)
  Module storm.database, line 238, in execute
    raw_cursor = self.raw_execute(statement, params)
  Module storm.databases.postgres, line 276, in raw_execute
    return Connection.raw_execute(self, statement, params)
  Module storm.database, line 322, in raw_execute
    self._check_disconnect(raw_cursor.execute, *args)
  Module storm.database, line 371, in _check_disconnect
    return function(*args, **kwargs)
InternalError: current transaction is aborted, commands ignored until
end of transaction block
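
The masking pattern visible in this traceback can be reproduced in
isolation. The following is a minimal sketch, not the Launchpad code:
FakeStore, masking_lock, preserving_lock and the "LOCK"/"UNLOCK"
statements are hypothetical stand-ins, and it is written for Python 3.
It assumes the original failure aborted the transaction, so that the
unlock issued during cleanup raised InternalError in its place.

```python
from contextlib import contextmanager


class InternalError(Exception):
    """Stand-in for the database driver's InternalError."""


class FakeStore:
    """Hypothetical store that rejects every command after a failure."""

    def __init__(self):
        self.aborted = False

    def execute(self, statement, fail=False):
        if self.aborted:
            raise InternalError(
                "current transaction is aborted, commands ignored until "
                "end of transaction block")
        if fail:
            # The first failure aborts the (simulated) transaction.
            self.aborted = True
            raise ValueError("original error from %r" % statement)
        return statement


@contextmanager
def masking_lock(store):
    store.execute("LOCK")
    try:
        yield
    finally:
        # Runs after a failure too; on an aborted transaction this raises
        # InternalError, which replaces the original exception.
        store.execute("UNLOCK")


@contextmanager
def preserving_lock(store):
    store.execute("LOCK")
    try:
        yield
    except Exception:
        # The body failed: attempt the unlock, but never let a cleanup
        # error replace the exception that is already propagating.
        try:
            store.execute("UNLOCK")
        except InternalError:
            pass
        raise
    else:
        store.execute("UNLOCK")


def run(lock):
    """Return the name of the exception type that escapes the lock."""
    store = FakeStore()
    try:
        with lock(store):
            store.execute("INSERT revision", fail=True)
    except Exception as exc:
        return type(exc).__name__
    return None


if __name__ == "__main__":
    print(run(masking_lock))     # InternalError
    print(run(preserving_lock))  # ValueError
```

On Python 3 the original exception would still be reachable through
__context__ on the InternalError; Python 2 has no implicit exception
chaining, so there the original error is lost entirely unless the
cleanup code guards against it along these lines.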

The normal workaround for branch scanner problems is to use a trick to
run it again, such as (thanks wgrant):

    $ bzr push -r-2 --overwrite
    $ bzr push

For at least some branches, though, the failures seem to be
consistent: the scan fails three times in a row.

Apart from fixing the job so it no longer masks the error, deploying
that code, and then seeing the actual problem, is there anything else
we can try to resolve this?

Martin

