launchpad-dev team mailing list archive
Message #09581
Branch scanner failures
Since the datacentre move, the codehosting branch scanner has been
failing intermittently. This shows up as a permanent "Updating
branch..." message on the website, which often goes unnoticed until a
diff fails to appear in an associated merge proposal.
The failures in ackee/bzrsyncd/celeryd-job.log are along the lines of:
[2012-08-22 16:39:26,958: INFO/MainProcess] Got task from broker: lp.services.job.celeryjob.CeleryRunJobIgnoreResult[BranchScanJob_14657367_c8b90ba9-db0a-4d2e-82b2-82413fd6b81e]
[2012-08-22 16:39:27,012: INFO/PoolWorker-2] Running <SCAN_BRANCH branch job (4348709) for ~mandel/ubuntuone-client/use-new-fsevents-api> (ID 14657367) in status Waiting
[2012-08-22 16:39:29,526: INFO/PoolWorker-2] Scanning branch: ~mandel/ubuntuone-client/use-new-fsevents-api
[2012-08-22 16:39:29,526: INFO/PoolWorker-2] from lp-internal:///~mandel/ubuntuone-client/use-new-fsevents-api
[2012-08-22 16:39:29,526: INFO/PoolWorker-2] Retrieving history from bzrlib.
[2012-08-22 16:39:29,984: INFO/PoolWorker-2] Retrieving ancestry from database.
[2012-08-22 16:39:30,533: INFO/PoolWorker-2] Planning changes.
[2012-08-22 16:39:30,533: INFO/PoolWorker-2] Calculating history delta.
[2012-08-22 16:39:30,540: INFO/PoolWorker-2] Adding 1 new revisions.
[2012-08-22 16:39:31,699: INFO/PoolWorker-2] Job resulted in OOPS: OOPS-030ff1ea23f05521d4fd9800a66a2a3a
[2012-08-22 16:39:31,700: INFO/MainProcess] Task lp.services.job.celeryjob.CeleryRunJobIgnoreResult[BranchScanJob_14657367_c8b90ba9-db0a-4d2e-82b2-82413fd6b81e] succeeded in 4.71384119987s: None
Unfortunately the traceback in the oops is not useful, as it's cleanup
fallout rather than the original error:
Traceback (most recent call last):
  Module lazr.jobrunner.jobrunner, line 194, in runJobHandleError
    self.runJob(job, fallback)
  Module lp.services.job.runner, line 295, in runJob
    super(BaseJobRunner, self).runJob(IRunnableJob(job), fallback)
  Module lazr.jobrunner.jobrunner, line 162, in runJob
    job.run()
  Module lp.code.model.branchjob, line 331, in run
    bzrsync.syncBranchAndClose()
  Module contextlib, line 34, in __exit__
    self.gen.throw(type, value, traceback)
  Module lp.services.database.locking, line 50, in try_advisory_lock
    store.execute(Select(AdvisoryUnlock(lock_type.value, lock_id)))
  Module storm.store, line 108, in execute
    return self._connection.execute(statement, params, noresult)
  Module storm.databases.postgres, line 266, in execute
    return Connection.execute(self, statement, params, noresult)
  Module storm.database, line 238, in execute
    raw_cursor = self.raw_execute(statement, params)
  Module storm.databases.postgres, line 276, in raw_execute
    return Connection.raw_execute(self, statement, params)
  Module storm.database, line 322, in raw_execute
    self._check_disconnect(raw_cursor.execute, *args)
  Module storm.database, line 371, in _check_disconnect
    return function(*args, **kwargs)
InternalError: current transaction is aborted, commands ignored until end of transaction block
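To illustrate how this kind of masking can happen, here is a minimal sketch.
It is not the actual try_advisory_lock code: FakeStore, masking_lock,
preserving_lock and run are hypothetical names made up for the example, and
the "aborted transaction" behaviour is only simulated. The first context
manager unlocks in a finally block, so the unlock failure replaces the
original error. The second records the original error before attempting
cleanup and then re-raises it.

```python
from contextlib import contextmanager


class AbortedTransaction(Exception):
    """Stand-in for 'current transaction is aborted'."""


class FakeStore:
    """Hypothetical store: after one failure, every later statement fails."""

    def __init__(self):
        self.aborted = False

    def execute(self, statement):
        if self.aborted:
            raise AbortedTransaction("current transaction is aborted")
        if statement == "BAD":
            self.aborted = True
            raise ValueError("the original error")
        return statement


@contextmanager
def masking_lock(store):
    store.execute("LOCK")
    try:
        yield
    finally:
        # In an aborted transaction this raises, and the new exception
        # is what the caller ends up seeing.
        store.execute("UNLOCK")


@contextmanager
def preserving_lock(store, log):
    store.execute("LOCK")
    try:
        yield
    except Exception as original:
        # Record the real failure before attempting any cleanup.
        log.append(str(original))
        try:
            store.execute("UNLOCK")
        except AbortedTransaction:
            pass  # cleanup fallout, not the interesting error
        raise  # re-raise the original exception
    else:
        store.execute("UNLOCK")


def run(lock_factory):
    """Run a failing statement under the lock; return the error type seen."""
    store = FakeStore()
    try:
        with lock_factory(store):
            store.execute("BAD")
    except Exception as error:
        return type(error).__name__
    return None
```

With these definitions, run(masking_lock) reports the cleanup error, while
run(lambda store: preserving_lock(store, log)) reports the original one and
leaves its message in log. Python 3 would also chain the original exception
onto the cleanup error, but Python 2 does not, so there the original is lost
unless it is logged or re-raised explicitly.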
The normal workaround for branch scanner problems is to trigger a
rescan by pushing an older revision over the branch and then pushing
the tip again (thanks wgrant):
$ bzr push -r-2 --overwrite
$ bzr push
For at least some branches, though, the failure seems to be
consistent: the scan has failed three times in a row.
Other than fixing the job so that it no longer masks the error,
deploying that fix, and then waiting to see the actual problem, is
there anything else we can try to resolve this?
Martin