← Back to team overview

yellow team mailing list archive

Re: One more thought...

 

On Thu, Jul 5, 2012 at 9:35 AM, Gary Poster <gary.poster@xxxxxxxxxxxxx> wrote:
> Hey Stuart.  Benji and I did a bit more hunting for the bug I talked
> about last time.  If you recall, rarely/intermittently we see errors in
> lp.services.webapp.tests.test_error.TestDatabaseErrorViews.
>
> It turns out that when the test fails, the log output does look very
> different than when it succeeds.  Specifically, the log shows a lot of
> repetition of essentially the bits in the chunk I show below.
>
> The part I want to call attention to you is the first part: the mention
> of bug 504291.  The store has been left in a disconnected state, it seems.
>
> The code that Robert shows in comment 11 of that bug still exists in the
> tree, but the repair code (store.rollback()) doesn't seem to work. We
> keep getting the problem over and over again, which as I understand it
> is a characteristic of that broken state not being repaired.
>
> I tried putting in some more transaction.abort() calls in seemingly
> strategic places, per Jeroen's comment #5 that some aborts might help
> (though see my comment #6 indicating that we are already aborting when
> he suggests, via transaction.begin()), but I still see the problem.
>
> At this point, I'm particularly interested in two things from you.
>
> (1) Is there a more reliable way to unbreak the seemingly persistent
> disconnected state, or another way that you'd suggest I try?

If transaction.abort() doesn't unbreak things, then that suggests the
store isn't registered with the transaction?

>From what I recall about that assert, that code should only be
triggered if we failed to call transaction.abort(), or if
transaction.abort() failed to reset the Store from its bogus state.
Either of these cases is a bug.

Did you add an abort() just before the assert, and did it make the
assert go away? If it doesn't go away, then perhaps the connection is
broken before store is being registered with the transaction
machinery. But that shouldn't happen, as I seem to recall a Storm fix
ensuring that it was added to the transaction machinery before any
actual queries where run. But perhaps if the initial connection setup
fails and we start in a broken state? Or perhaps we are relying on
txn.abort() to reset the store to a sane state at the start of a
request, and it does nothing because the store isn't registered with
the transaction manager at this point? Or perhaps the store reset is
failing and the exception swallowed by the transaction machinery?


> (2) Do you see anything in the log blather after the first thing I quote
> for you below that suggests to you that I (and the traceback) are
> misidentifying the problem as associated with 504291?

No, that OOPS is pretty clear. The assert is there to catch a state
that should never happen.  (It does, however, do repairs and lets
things continue with an explicit store.rollback, no relying on the
transaction machinery). There was some sort of failure before the
assert was reached.

  - failed on initial setup, eg. the database isn't available. Nothing
is going to work. Things should fail earlier.
  - failed because left in a bogus state from a previous test run.
  - failed unexpectedly during test run, but exceptions swallowed and
store not reset.

There is some interesting code just after the assert too:

            # Reset all Storm stores when not running the test suite.
            # We could reset them when running the test suite but
            # that'd make writing tests a much more painful task. We
            # still reset the slave stores though to minimize stale
            # cache issues.
            if thread_name != 'MainThread' or name.endswith('-slave'):
                store.reset()

I can't recall what sort of tests that require this, but I think it
dates from the migration to storm. Maybe our tests are nicer now and
we can always do the store.reset()?



-- 
Stuart Bishop <stuart.bishop@xxxxxxxxxxxxx>


References