
Re: [GSoC] Optimize mysql-test-runs - Setback

 

Hello Elena and all,
I have pushed the fixed code. There are a lot of changes because I went
through all of the code, making sure that it made sense. The commit is here
<https://github.com/pabloem/Kokiri/commit/7c47afc45a7b1f390e8737df58205fa53334ba09>,
and although the diff is large, the main line where failures are caught or
missed is this one
<https://github.com/pabloem/Kokiri/blob/7c47afc45a7b1f390e8737df58205fa53334ba09/simulator.py#L496>.
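
To make the discussion below concrete, here is a minimal sketch of the
bookkeeping that line is responsible for (score_test_run, recall and stats
are illustrative names, not the actual simulator.py identifiers): a failure
only counts as caught if the failing test was part of the set selected to
run, and recall is the fraction of all observed failures that were caught.

    def score_test_run(selected_tests, actual_failures, stats):
        """Tally caught vs. missed failures for one simulated test run."""
        for test in actual_failures:
            if test in selected_tests:
                stats['caught'] += 1   # the running set included the failing test
            else:
                stats['missed'] += 1   # we chose not to run the test, so the failure slipped by

    def recall(stats):
        """Fraction of all observed failures that the running sets caught."""
        total = stats['caught'] + stats['missed']
        return stats['caught'] / total if total else 0.0

    # usage: stats = {'caught': 0, 'missed': 0}, then call score_test_run()
    # once per simulated test run, and recall(stats) at the end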

Two quick observations:

   1. The test result file edit information helps improve recall, if only
   marginally.
   2. The time since last run information does not improve recall much at
   all - see [Weaknesses - 2].

A couple of concepts that I want to define before going on:

   - *First failures*. These are failures that happen because of new bugs.
   They don't occur close in time as part of a chain of failures. They occur
   as a consequence of a transaction that introduces a bug, but they might
   show up soon or long after that transaction (usually soon rather than
   long). They might be correlated with the frequency of failure of a test
   (core or basic tests that fail often might be especially good at exposing
   bugs); but many of them are not (tests of a feature that don't fail often,
   but rather fail when that feature is modified).
   - *Strict simulation mode.* This is the mode where, if a test is not
   part of the running set, its failure is not considered (see the sketch
   after this list).
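
As a minimal sketch of what strict mode changes in the accounting above
(record_failure and stats are again illustrative names, not the real code):
in strict mode a failure of a test outside the running set is dropped from
the statistics entirely, instead of being counted as missed.

    def record_failure(test, selected_tests, stats, strict=False):
        """Account for one observed failure under strict or non-strict simulation."""
        if test in selected_tests:
            stats['caught'] += 1
        elif not strict:
            stats['missed'] += 1  # non-strict: the uncaught failure still hurts recall
        # strict: a failure of a test outside the running set is not considered at all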

Weaknesses:

   - It's very difficult to predict 'first failures'. With the current
   strategy, if it has been a long time since a test failed (or if it has
   never failed before), the relevancy of the test just goes down, and it
   never runs (see the sketch after this list).
   - Especially in database and parallel software, there are bugs that hide
   in the code for a long time until one test discovers them. Unfortunately,
   the analysis that I'm doing requires that the test runs exactly when the
   data indicates it will fail. If a test that would fail doesn't run in test
   run Z, even though it might run in test run Z+1, the failure is just
   counted as missed, as if the bug was never encountered.
      - This affects the *time since last run* factor. This factor helps
      encounter 'hidden' bugs that can be exposed by tests that have not run,
      but the available data makes it difficult to measure that benefit.
      - This would also affect the *correlation* factor. If tests A and B
      fail together often, and in test_run Z both of them would fail but only
      A runs, the heightened relevancy of B in the next test_run would not
      make it catch anything (again, this is a limitation of the data, not of
      reality).
   - Humans are probably a lot better at predicting first failures than the
   current strategy is.
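
To illustrate the first weakness, here is a toy relevancy function (only an
assumption about the general shape of the current strategy, not the actual
formula in the script): the score only decays with the number of runs since
the last failure, so a test that has never failed, or has not failed in a
long time, keeps dropping out of the running set.

    import math

    def relevancy(runs_since_last_failure, decay=0.1, never_failed=False):
        """Toy relevancy score that only decays as a test goes without failing."""
        if never_failed:
            return 0.0  # no failure history, so the score never rises and the test never runs
        return math.exp(-decay * runs_since_last_failure)

    # A test that last failed 50 runs ago scores exp(-5) ~= 0.0067, far below
    # a recently failing test at ~1.0, so it keeps falling out of the running
    # set -- exactly when a hidden 'first failure' might need it to run.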

Some ideas:

   - I need to be more strict with my testing, and reviewing my code : )
   - I need to improve prediction of 'first failures'. What would be a good
   way to improve this?
      - Correlation between files changed and tests failed? Apparently Sergei
      tried this, but the results were not too good - although that was before
      running in strict simulation mode. With strict simulation mode, anything
      that could help spot first failures is worth considering (see the sketch
      after this list).
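
As a sketch of the file-correlation idea (all names here are hypothetical,
and it assumes that each historical test run records the set of files edited
and the set of tests that failed):

    from collections import defaultdict

    # cooccurrence[file][test] = how often 'test' failed in runs where 'file' was edited
    cooccurrence = defaultdict(lambda: defaultdict(int))

    def learn(changed_files, failed_tests):
        """Update co-occurrence counts from one historical test run."""
        for f in changed_files:
            for t in failed_tests:
                cooccurrence[f][t] += 1

    def candidate_tests(changed_files):
        """Rank tests by how strongly they correlate with the files being changed."""
        scores = defaultdict(int)
        for f in changed_files:
            for t, count in cooccurrence[f].items():
                scores[t] += count
        return sorted(scores, key=scores.get, reverse=True)

Tests ranked this way could get a relevancy boost before the run, which might
help with first failures even when a test has no recent failure history.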

I am currently running tests to get the adjusted results. I will graph them
and send them out in a couple of hours.
Regards

Pablo


On Fri, Jun 13, 2014 at 12:40 AM, Elena Stepanova <elenst@xxxxxxxxxxxxxxxx>
wrote:

> Hi Pablo,
>
> Thanks for the update.
>
>
> On 12.06.2014 19:13, Pablo Estrada wrote:
>
>> Hello Sergei, Elena and all,
>> Today while working on the script, I found and fixed an issue:
>>
>> There is some faulty code in my script that is in charge of collecting
>> the statistics about whether a test failure was caught or not (here
>> <https://github.com/pabloem/Kokiri/blob/master/basic_simulator.py#L393>).
>> I looked into fixing it, and then I could see another *problem*: the
>> *recall numbers* that I had collected previously were *too high*.
>>
>> The actual recall numbers, once we consider the test failures that are
>> *not caught*, are disappointingly lower. I won't show you results yet,
>> since I want to make sure that the code has been fixed, and I have
>> accurate tests first.
>>
>> This is all for now. The strategy that I was using is a lot less effective
>> than it seemed initially. I will send out a more detailed report with
>> results, my opinion on the weak points of the strategy, and ideas,
>> including a roadmap to try to improve results.
>>
>> Regards. All feedback is welcome.
>>
>
> Please push your fixed code that triggered the new results, even if you
> are not ready to share the results themselves yet. It will be easier to
> discuss then.
>
> Regards,
> Elena
>
>
>  Pablo
>>
>>
>>
