openerp-india team mailing list archive

[Bug 992525] Re: TransactionRollbackError due to concurrent update could be better handled

 

We've just written a patch to implement this auto-retry logic in all affected branches: 6.0, 6.1, 7.0 (and trunk).
It should be fairly safe, with few side effects: it will try to replay RPC calls that result in a transaction rollback caused by one of these 3 PostgreSQL error codes[1]:

 - SERIALIZATION_FAILURE (40001 - "could not serialize access due to concurrent update")
 - DEADLOCK_DETECTED (40P01)
 - LOCK_NOT_AVAILABLE (55P03 - "could not obtain lock on row in relation ...")

Each of these errors is transient and caused by the presence of another concurrent transaction working on the same database entries. The likelihood of seeing that other transaction committed increases with every passing millisecond, so in most cases it should be sufficient to retry once after a little while.
After testing this patch with several clients hammering the server at the same time, we noticed that 3-4 retries with a randomized delay of several hundred milliseconds seem to allow them all to pass, whereas if we retry only once we still get a few failures when there are more than 2 concurrent transactions doing the same thing.
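
To make the idea concrete, here is a minimal sketch of such a retry loop. It is not the patch itself: the function name, the retry count and the delay bounds are hypothetical values picked for illustration, and the only thing it relies on from psycopg2 is that its exceptions carry the SQLSTATE in a `pgcode` attribute.

```python
import random
import time

# SQLSTATE codes treated as transient concurrency errors:
# serialization_failure, deadlock_detected, lock_not_available
# (55P03 in the PostgreSQL error code appendix).
RETRYABLE_PGCODES = frozenset(["40001", "40P01", "55P03"])


def call_with_retry(func, max_tries=4, min_delay=0.1, max_delay=0.5,
                    sleep=time.sleep):
    """Call func() and replay it when it fails with a transient error.

    All parameter names and defaults are assumptions made for this
    sketch. func is expected to run one complete transaction, so that a
    failed attempt has already been rolled back before it is replayed.
    """
    tries = 0
    while True:
        try:
            return func()
        except Exception as exc:
            # psycopg2 errors expose the SQLSTATE as exc.pgcode;
            # anything else is re-raised untouched.
            if getattr(exc, "pgcode", None) not in RETRYABLE_PGCODES:
                raise
            tries += 1
            if tries >= max_tries:
                raise
            # Randomized wait so that competing clients do not all
            # retry at the same instant.
            sleep(random.uniform(min_delay, max_delay))
```

A caller would wrap the whole RPC dispatch in a function taking no arguments (for instance with functools.partial) and pass it to call_with_retry. The sleep argument is only there so the waiting can be stubbed out in tests.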

Concerning the side-effects, the failed transactions have just been
rolled back, so replaying them is correct on a semantic level. In rare
cases the rolled back transaction might have had a side effect on the
rest of the world (e.g. sent an email or written a file), so replaying
it might cause the side-effect to occur a second time. However this
would be true even with manual replay instead of automatic replay - the
user could simply press the same button again to retry. Basically we're
just assuming the user did mean the transaction to happen so we're
pressing the button again for her.

We've thought of making the retry delay and/or count configurable, but
the defaults should be fine for most cases. And if the default values
are not good enough a proper analysis of the concurrency issue would
probably be better than bumping up the settings without understanding
them. With the default settings the auto-retry could delay the
transaction for up to several dozen seconds, which already seems like a
very large limit. Most auto-retried transactions will not be delayed for
more than a few hundred milliseconds though.

Any feedback/tests for these sensitive patches would be appreciated.
We're planning to merge them soon unless a problem is detected.

Thanks!

[1] see http://www.postgresql.org/docs/current/static/errcodes-appendix.html#ERRCODES-TABLE and
http://initd.org/psycopg/docs/errorcodes.html

** Changed in: openobject-server
       Status: Confirmed => Fix Committed

-- 
You received this bug notification because you are a member of OpenERP
Indian Team, which is subscribed to OpenERP Server.
https://bugs.launchpad.net/bugs/992525

Title:
  TransactionRollbackError due to concurrent update could be better
  handled

Status in OpenERP Server:
  Fix Committed

Bug description:
  While using openerp, psycopg2 raises TransactionRollbackError quite
  often, even on a small database.

  This does not seem to be easily reproducible, as it seems to be a
  conflict between two threads accessing the same table. Nevertheless, I
  provided a quick video reproducing this while installing "base_crypt"
  on my computer.

  This occurs mostly at module installation, and can completely mess up
  the module installation, for instance by showing empty wizard windows.

  I guess it could also occur in other situations (in a multi-user
  context), where the bug would be quite difficult to reproduce and with
  unforeseeable consequences ;)

  I've spotted another bug that seems to be due to this:
  https://bugs.launchpad.net/bugs/956715

  In my case (single user), it seems to hit more often on fast computers.
  As a probably better guess, it seems to hit more often when
  using a local connection between the browser and the server. It could
  be that the web module tries to update the res_users session info
  and collides with normal operation.

  On my computer, from a new database, installing the 'base_crypt' will trigger the exception.
  When using a distant connection, the bug won't show up.

  Please check the video I've posted with the bug report if you want to
  have more detail on the procedure I used. Sorry for the bad sound
  recording. Note that the video will show you the bug occurring on my
  computer and NOT occurring on a distant computer.

  I'm providing a merge proposal along with this patch which solves the
  issue for me, but it needs a patient review.

To manage notifications about this bug go to:
https://bugs.launchpad.net/openobject-server/+bug/992525/+subscriptions

