← Back to team overview

openerp-expert-framework team mailing list archive

Re: Exception TransactionRollbackError not correctly handled ?

 

On 05/01/2012 01:09 PM, Valentin LAB wrote:
> While using openerp, Psycopg2 raises TransactionRollbackError quite 
> often even on small database. This does not seem to be easily 
> reproduceable as it seems to be a conflict between two thread accessing 
> the same table.

Yes, it is a normal and safe behavior of Postgres when running serializable
transactions, whenever it detects that concurrent threads have performed
operations that may not give the same result if they had been executed
sequentially. For example 2 concurrent updates on the same row after reading
its value will likely suffer from a "nonrepeatable read" issue [1], but there
are much many more indirect cases.


> This occurs mostly at module installation. And can completely mess up
> the module installation by giving empty wizard windows of instance.
> 
> So it's not just annoying, this can break badly a database. I guess it 
> could also occurs in other situations (in multi-user context), where the 
> bug would be quite difficult to reproduce and with unforeseeable 
> consequences ;)

It can indeed happen in other circumstances, and the normal consequence is to
rollback completely one of the operations. For GUI-based operation it is quite
safe and the user can simply retry the operation again (provided they
understand the error - the error reporting in that case needs improvement).

More annoying cases would be the single-user issues during installation where
the concurrent requests performed by the 6.1 web client may inadvertently cause
this error: this is what you are seeing. Unattended requests driven by
automated RPC calls may also get this, and without proper error management on
the caller side this could be disastrous. This has been discussed in the past,
e.g. on bug 746620 [2].


> In my case (single user), it seem to hit more often on fast computers. 
> To make a probable better guess, it seems to hurt more often whenever 
> using a local connection between the browser and the server. It could be 
> about the web module trying to update the res_users session info and may 
> collide with normal operation.

Yup, this is most likely caused by the new web client executing most RPC calls
in parallel (quite necessary to speed up page rendering).


> To give more info on non-base_crypt errors, this is a SQL query that is 
> often triggering the TransactionRollbackError::
> 
>    UPDATE res_users
>      SET
>          date=now() AT TIME ZONE 'UTC'
>      WHERE
>          id=1
>        AND password=E'$1$ctHfTPNo$0BbeJr1qjUDCINSQh8zci.'
>        AND active
>      RETURNING id

This is the small transaction that updates the last login date whenever a user
logs into the system. It is likely that the web client is doing one extra
login() call at the same time as getting the result of the module install,
causing Postgres to detect a possible serialization problem.
Most transactions inserting or updating a row will indeed lock the res_user row
of the creator of this row (due to the create_uid foreign key), and thus any
transaction that updates a user row is much more likely to cause a
serialization problem.
Fortunately there are very few transactions that actually modify the user
record: login() is one of them and should only be performed when a user
connects to the system.



> And here is a bug report which I found while doing mine:
> https://bugs.launchpad.net/openobject-server/+bug/956715
> 
> 
> After looking quickly on the net, I found that it's perfectly "normal" 
> to receive this exception from postgres, and that the openerp engine 
> should simply retry a little bit later.
> 
> Is this info correct ?

Yup, it is a normal error, and the transaction should simply be retried later,
either manually (as it is now) or automatically.



> If yes, I provided a quick patch that works well 
> in my case. But it should be reviewed carefully as I'm not particularly 
> aware of database world.
> 
> merge proposal:
> https://code.launchpad.net/~vaab/openobject-server/fix-transactionrollbackerror-by-retrying/+merge/104233
> 
> bug report:
> https://bugs.launchpad.net/openobject-server/+bug/992525

The idea having OpenERP retry transactions that failed because of a
serialization problem has been discussed in the past, and it sounds fine to me.
However your implementation is too low-level: you are replaying only the last
cr.execute() call, while it is the whole transaction that was rolled back. It
will work for a few trivial cases (like the one-query transaction that updates
the last login date), but a real transaction could consist in hundreds of
queries, and discarding the previous queries will cause unpredictable madness.

If you could move the retry logic higher in the stack and make it retry the
whole RPC call, it should become a workable patch, but probably less simple.

Other remarks on the patch:
- Different reasons may cause TransactionRollbackErrors, and I'm not 100% sure
we want to retry all of them. The implementation discussion[3] indicates that
it corresponds to Class 40 Postgres errors[4] which are apparently intended to
be retried, so perhaps we don't need to be more restrictive.
Any opinions there? How is error 40002 different from the normal integrity
constraint violations of class 23, for example?
- The serialization problem will often come from a single competing
transaction, so retrying 1-2 times should usually be sufficient, shouldn't it?

Thanks for working on this, these issues certainly deserve some love!


[1] http://www.postgresql.org/docs/9.0/static/transaction-iso.html
[2] See the _long_ discussion on
https://bugs.launchpad.net/openobject-server/+bug/746620
[3] http://thread.gmane.org/gmane.comp.python.db.psycopg.devel/4384
[4] http://www.postgresql.org/docs/current/static/errcodes-appendix.html


Follow ups

References