launchpad-dev team mailing list archive

Re: Does our DB retry code need tweaks for the PG 9.1 upgrade?

 

On Thu, May 31, 2012 at 10:04 PM, Francis J. Lacoste
<francis.lacoste@xxxxxxxxxxxxx> wrote:
> Hi Stuart,
>
> We have a cluster of recent bugs that seems to hint that the retry
> transaction code might need some tweaking since our upgrade to PG 9.1.
>
> https://bugs.launchpad.net/launchpad/+bug/1000805
>
> That first one is a
>
> psycopg2.OperationalError: could not send data to server: Connection
> timed out
>
> when serving private attachments from the librarian. Usually, attempting
> again will work. Is that a new error in PG 9.1 that we should add to the
> retry list? It only re-attempts DisconnectionError, IntegrityError and
> TransactionRollbackError.

It's not PG 9.1 - this is entirely client side. The trigger was likely
psycopg2 2.4 or libpq5, both of which needed to be upgraded before the
PG 9.1 upgrade. I've updated the bug report - Storm needs to catch
these exceptions so connections get reopened, and it will reraise them
as a DisconnectionError IIRC.
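
A minimal sketch of that idea, not Storm's or Launchpad's actual code: a
client-side operational error is translated into a disconnection error, so
a retry loop that already handles disconnections picks it up. The exception
classes are local stand-ins so the sketch runs without psycopg2 or Storm,
and the names translate_errors and run_with_retry are hypothetical.

```python
import functools


# Local stand-ins (assumptions) for the real exception classes.
class OperationalError(Exception):
    """Stand-in for psycopg2.OperationalError."""


class DisconnectionError(Exception):
    """Stand-in for storm.exceptions.DisconnectionError."""


class IntegrityError(Exception):
    """Stand-in for the database IntegrityError."""


class TransactionRollbackError(Exception):
    """Stand-in for psycopg2.extensions.TransactionRollbackError."""


# The three error types the retry code is described as re-attempting.
RETRYABLE = (DisconnectionError, IntegrityError, TransactionRollbackError)


def translate_errors(func):
    """Reraise client-side OperationalError as DisconnectionError."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except OperationalError as exc:
            raise DisconnectionError(str(exc)) from exc
    return wrapper


def run_with_retry(func, attempts=3):
    """Call func, retrying on RETRYABLE errors up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except RETRYABLE:
            if attempt == attempts:
                raise  # out of attempts, let the error propagate
```

Without the translation step, the OperationalError would fall outside
RETRYABLE and propagate on the first failure instead of being retried.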

It might also be new because our sockets were not failing like this
before. We really shouldn't be losing sockets like this - perhaps a
pg_bouncer upgrade is in order? I think the relevant connection limit
in pg_bouncer was set to 20 connections and was recently bumped to 40.


> https://bugs.launchpad.net/launchpad/+bug/1006530
> https://bugs.launchpad.net/launchpad/+bug/1006531
>
> These two are OOPSes triggered during fastdowntime. I was under the
> impression that we weren't logging those during fastdowntime and thus
> our filters might need updating. Or maybe, I'm mistaken and it's just
> that Diogo is our normal filter here, and since he's on leave, this
> explains why Laura reported bugs about those.
>
> Thanks for your insights.

We log OOPSes during fastdowntime, because fastdowntime looks exactly
like a database outage from the client side and we want to know about
database outages. I'm not sure what filtering was being done to hide
them from the reports. We should report these failures if they happen
outside of the scheduled fastdowntime window.
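
One way such a filter could work, as a rough sketch under assumptions: the
scheduled windows are available as (start, end) datetime pairs, and each
OOPS carries a timestamp. The function names here are hypothetical and do
not correspond to any existing Launchpad or OOPS tooling.

```python
from datetime import datetime


def in_scheduled_window(when, windows):
    """True if `when` falls inside any (start, end) window; end is exclusive."""
    return any(start <= when < end for start, end in windows)


def should_report(oops_time, windows):
    """Report an outage OOPS only if it occurred outside scheduled windows."""
    return not in_scheduled_window(oops_time, windows)


# Example with a made-up five-minute window.
windows = [(datetime(2012, 6, 1, 8, 0), datetime(2012, 6, 1, 8, 5))]
during = datetime(2012, 6, 1, 8, 2)
after = datetime(2012, 6, 1, 9, 30)
```

With the example values, should_report(during, windows) is False and
should_report(after, windows) is True, so the OOPSes are still logged but
only the unscheduled ones reach the report.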



-- 
Stuart Bishop <stuart.bishop@xxxxxxxxxxxxx>

