maria-developers team mailing list archive

Re: [GSoC] Optimize mysql-test-runs - Setback

To: Pablo Estrada <polecito.em@xxxxxxxxx>
From: Elena Stepanova <elenst@xxxxxxxxxxxxxxxx>
Date: Mon, 16 Jun 2014 04:09:32 +0400
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <CABDWuanQvsDPNSFAvV_WnLFsmZ+c9JzzFB+VtUKGjH0BvCNb_g@mail.gmail.com>
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0

Hi Pablo,

On 13.06.2014 14:12, Pablo Estrada wrote:

Hello Elena and all,
I have pushed the fixed code. There are a lot of changes in it because I

I went through your code (the latest revision). I think I'll postpone detailed in-code comments till the next iteration, and instead will just list my major concerns and questions here.


1. Structure

The structure of the code seems to be not quite what we need at the end. As you know, the goal is to return a set (list) of tests that MTR would run. I understand that you are currently experimenting and hence don't have the final algorithm to produce a single list. But what I would expect to see is

- a core module which takes various parameters -- type of metric, running set, calculation mode, etc. -- does all the work and produces such a list (possibly with some statistical data which you need for the research, and which might be useful later anyway);

- a wrapper which would feed the core module with different sets of parameters, get the results and compare them.

After the research is finished, the best parameters would become default, the wrapper would be abandoned or re-written to pass the resulting test list to MTR, while the core module would stay pretty much intact.

At first glance it seemed to be the case in your code, but it turned out it was not.

run_basic_simulations.py looks like the wrapper described above, only it does the extra work of initializing the simulator, which it should not.

On the other hand, simulator.py does not look like the described core module at all. It executes logic for all modes regardless of the startup parameters, and this logic is heavily interleaved. After you choose the best approach, you will have to rewrite most of it, which is not only a waste of time but is also error-prone.
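
As an illustration only, here is a minimal sketch of the core/wrapper split described in this section. All names in it (SimulationParams, build_queue, compare_parameter_sets, load_metrics) are hypothetical and are not taken from the Kokiri code; the ranking is deliberately trivial.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SimulationParams:
    # Hypothetical parameter bundle, for illustration only.
    metric: str        # type of metric
    mode: str          # "standard", "platform", "branch" or "mixed"
    running_set: int   # how many tests may be run


def build_queue(params, metrics, candidates):
    """Core: return the tests to run, highest metric first.

    `metrics` maps test name -> score for the requested mode only,
    so no logic for other modes is executed here.
    """
    ranked = sorted(candidates, key=lambda t: metrics.get(t, 0.0), reverse=True)
    return ranked[:params.running_set]


def compare_parameter_sets(metric_names, modes, running_sets, load_metrics, candidates):
    """Wrapper: feed the core with each parameter set and collect the results."""
    results = {}
    for metric, mode, size in product(metric_names, modes, running_sets):
        params = SimulationParams(metric, mode, size)
        results[params] = build_queue(params, load_metrics(params), candidates)
    return results
```

With this shape, dropping the wrapper later and passing the output of build_queue to MTR would not require touching the core.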


2. Cleanup

To prove the previous point: currently the experiments that you run are not independent. That is, if you call several simulations from run_basic_simulations.py, only the very first one will use the correct data and get real results. All subsequent ones will use the data modified by the previous ones, and the results will be totally irrelevant.

It happens because there is an initial prepare step in run_basic_simulations.py, but there is no cleanup between simulations. The whole test_info structure remains what it was by the end of the previous simulation, most importantly the metrics. Also, test_edit_factor cannot work at all for any simulation except the first one, because during the simulation you truncate the editions list, but never restore it.
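
One possible way to do the cleanup, sketched under assumptions: the prepared data is kept untouched and every simulation gets its own deep copy. The run_simulation function below is only a stand-in that mutates its input the way the text describes; the dictionary layout ("metric", "editions") is hypothetical, not the real test_info structure.

```python
import copy


def run_simulation(test_info, mode):
    """Stand-in for a simulation that mutates its input (hypothetical)."""
    for info in test_info.values():
        info["metric"] += 1.0                    # metrics change during the run
        info["editions"] = info["editions"][:1]  # editions list gets truncated
    return sum(info["metric"] for info in test_info.values())


def run_independent_simulations(initial_test_info, modes):
    """Give every simulation a pristine copy of the prepared data."""
    results = {}
    for mode in modes:
        fresh = copy.deepcopy(initial_test_info)
        results[mode] = run_simulation(fresh, mode)
    return results
```

Deep copying is the simplest option; if the data is large, re-running the prepare step or restoring only the mutated fields would work as well.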


3. Modes

These flaws should be easy to fix by doing proper cleanup before each simulation. But there are also other fragments of code where, for example, logic for 'standard' mode is supposed to be always run and is relied upon, even if the desired mode is different.

In fact, you build all queues every time. It would be an understandable trade-off to save the time on simulations, but you re-run them separately anyway, and only return the requested queue.


4. Failed tests vs executed tests

Further, as I understand, you only calculate the metrics for tests which were either edited, or failed at least once; and thus, only such tests can ever make it to a corresponding queue. Not only does it create a bubble, but it also makes the comparison of modes faulty, and the whole simulation less efficient.


Let's suppose for simplicity that we do not use the editing factor.

In standard mode, the number of relevant failed tests for a single test run is obviously greater than, let's say, in mixed mode (because in standard mode all failures count, while in mixed mode -- only those that happened on platform+branch). So, when in the standard mode you'll calculate metrics for, let's say, 1K tests, in the mixed mode for a particular combination of platform+branch you'll do so only for 20 tests. It means that even though you set the running set to 500, in fact you'll only run 20 tests at most. It's not desirable -- if we say we can afford running 500 tests, we'd rather run 500 than 20, even if some of them never failed before. This will also help us break the bubble, especially if we randomize the "tail" (tests with the minimal priority that we add to fill the queue). If some of them fail, they'll get a proper metric and will migrate to the meaningful part of the queue.

I know you don't have all the data about which tests were run or can be run in a certain test run; but for initial simulation the information is fairly easy to obtain -- just use the corresponding stdio files which you can obtain via the web interface, or run MTR to produce the lists; and in real life it should be possible to make MTR pass it over to your tool.

To populate the queue, you don't really need the information which tests had ever been run; you only need to know which ones MTR *wants* to run, if the running set is unlimited. If we assume that it passes the list to you, and you iterate through it, you can use your metrics for tests that failed or were edited before, and a default minimal metric for other tests. Then, if the calculated tests are not enough to fill the queue, you'll randomly choose from the rest. It won't completely solve the problem of tests that never failed and were never edited, but at least it will make it less critical.
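
The queue filling described above could look roughly like the sketch below. It assumes that `wanted_tests` is the list MTR would run with an unlimited running set and that `metrics` contains only tests that failed or were edited before; the function name and signature are made up for illustration.

```python
import random


def fill_queue(wanted_tests, metrics, running_set, rng=None):
    """Ranked tests first, then a random tail of tests without a metric."""
    if rng is None:
        rng = random.Random()
    # Tests with a calculated metric, highest priority first.
    known = sorted((t for t in wanted_tests if t in metrics),
                   key=lambda t: metrics[t], reverse=True)
    queue = known[:running_set]
    missing = running_set - len(queue)
    if missing > 0:
        # Everything else shares the default minimal priority,
        # so the tail is picked at random.
        tail = [t for t in wanted_tests if t not in metrics]
        queue += rng.sample(tail, min(missing, len(tail)))
    return queue
```

Passing a seeded random.Random makes a simulation repeatable, while the default keeps the tail different from run to run.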


5. Running set

It's a smaller issue, but back to the real usage of the algorithm: we cannot really set an absolute value of the running set. MTR options can be very different; in one builder it can run hundreds of tests at most, in another thousands. We should use a percentage instead.
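
A small sketch of the percentage idea, assuming the number of tests MTR wants to run is known for each builder. The function name is hypothetical, and rounding up (so that a non-empty list never yields an empty running set) is an assumption, not something decided in this thread.

```python
import math


def running_set_size(wanted_count, percent):
    """Convert a percentage into an absolute running set for one test run."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return min(wanted_count, math.ceil(wanted_count * percent / 100))
```

For example, 10 percent gives 30 tests on a builder that wants to run 300 and 300 tests on a builder that wants to run 3000.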


6. Full / non-full simulation mode

I couldn't understand what the *non*-full simulation mode is for. Can you explain this?


7. Matching logic (get_test_file_change_history)

The logic where you are trying to match result file names to test names is not quite correct. There are some highlights:


There can also be subsuites. Consider the example:
./mysql-test/suite/engines/iuds/r/delete_decimal.result

The result file can live not only in /r dir, but also in /t dir, together with the test file. It's not cool, but it happens, see for example mysql-test/suite/mtr/t/


Here are some other possible patterns for engine/plugin suites:
./storage/tokudb/mysql-test/suite/tokudb/r/rows-32m-1.result
./storage/innodb_plugin/mysql-test/innodb.result
Also, in release builds they can be in mysql-test/plugin folder:
mysql-test/plugin/example/mtr/t
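
As a sketch only, the path patterns listed above can be handled by splitting a result file path into its directory part and the test base name. The function name and the returned shape are hypothetical; in particular, how the remaining directories map to an MTR suite name (for plugin and storage engine layouts especially) is left open here and would need to be checked against MTR itself.

```python
from pathlib import PurePosixPath


def split_result_path(path):
    """Return (directories, test_basename), or None if not a .result file."""
    p = PurePosixPath(path)
    if p.suffix != ".result":
        return None
    dirs = list(p.parts[:-1])
    # The result file can live in either the r/ or the t/ directory.
    if dirs and dirs[-1] in ("r", "t"):
        dirs.pop()
    # Drop everything up to and including mysql-test, wherever it sits.
    if "mysql-test" in dirs:
        dirs = dirs[dirs.index("mysql-test") + 1:]
    # What follows "suite" may have several levels (subsuites).
    if dirs[:1] == ["suite"]:
        dirs = dirs[1:]
    return tuple(dirs), p.stem
```

An empty directory tuple means the file sits directly under a mysql-test directory (or its r/ or t/ subdirectory).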

Be aware that the logic where you compare branch names doesn't currently work as expected. Your list of "fail branches" consists of clean names only, e.g. "10.0", while row[BRANCH] can be like "lp:~maria-captains/maria/10.0". I'm not sure yet why it is sometimes stored this way, but it is.
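
One way to make the comparison work, sketched under the assumption that the clean name is always the last path component of the stored value (true for the example above, not verified for all rows); clean_branch_name is a made-up helper name.

```python
def clean_branch_name(raw):
    """Reduce 'lp:~maria-captains/maria/10.0' to '10.0'; keep '10.0' as is."""
    return raw.rstrip("/").rsplit("/", 1)[-1]
```

Applying it to row[BRANCH] before looking the value up in the list of "fail branches" would put both sides in the same form.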

I had more comments/questions, but let's address these ones first, and then we'll see what of the rest remains relevant.


Comments on your notes from the email are below inline.

went through all the code making sure that it made sense. The commit is here
<https://github.com/pabloem/Kokiri/commit/7c47afc45a7b1f390e8737df58205fa53334ba09>,
and although there are a lot of changes, the main line where failures are
caught or missed is this
<https://github.com/pabloem/Kokiri/blob/7c47afc45a7b1f390e8737df58205fa53334ba09/simulator.py#L496>
.

    1. The test result file edition information helps improve recall - if
    marginally
    2. The time since last run information does not improve recall much at
    all - See [Weaknesses - 2]

Let's get back to it (both of them) after the logic with dependent simulations is fixed; after that we'll review it and see why it doesn't work, if it still doesn't. Right now any effect that file edition might have is rather coincidental, and possibly the other one is also broken.


A couple of concepts that I want to define before going on:

    - *First failures*. These are failures that happen because of new bugs.
    They don't occur close in time as part of a chain of failures. They occur as
    a consequence of a transaction that introduces a bug, but they might occur
    soon or long after this transaction (usually soon, rather than long). They
    might be correlated with the frequency of failure of a test (core or basic
    tests that fail often might be especially good at exposing bugs); but many
    of them are not (tests of a feature, that don't fail often, but rather,
    when that feature is modified).
    - *Strict simulation mode.* This is the mode where, if a test is not
    part of the running set, its failure is not considered.

Weaknesses:

    - It's very difficult to predict 'first failures'. With the current
    strategy, if it's been long since a test failed (or if it has never failed
    before), the relevancy of the test just goes down, and it never runs.
    - Especially in database and parallel software, there are bugs that hide
    in the code for a long time until one test discovers them. Unfortunately,
    the analysis that I'm doing requires that the test runs exactly when the
    data indicates it will fail. If a test that would fail doesn't run in test
    run Z, even though it might run in test run Z+1, the failure is just
    considered as missed, as if the bug was 'not encountered' ever.

What you call "First failures" is the main target of the regression test suite. So, however difficult they are to predict, we should attempt to do so. On the bright side, we don't need to care that much about the other type, those that "hide in the code for a long time". There are indeed sporadic failures of either code or a test, which happen every now and then, some often, some rarely; but they are not what the test suite is after. Ideally, they should not exist at all; the regression test suite is supposed to be totally deterministic, which means that a test that passed before may only fail if the related code or the test itself changed.

So, both "test edit" factor and "time" factor are not really expected to improve recall a lot; their purpose is to help to break the bubble. New and edited tests must run, it seems obvious. The time factor is less obvious but it's our only realistic way to make sure that we don't forget some tests forever.

       - This affects the *time since last run* factor. This factor helps
       encounter 'hidden' bugs that can be exposed by tests that have not run,
       but the data available makes it difficult
       - This would also affect the *correlation* factor. If test A and B
       fail together often, and on test_run Z both of them would fail, but
       only A runs, the heightened relevancy of B on the next test_run would
       not make it catch anything (again, this is a limitation of the data,
       not of reality)
    - Humans are probably a lot better at predicting first failures than the
    current strategy.

This is true; unfortunately, it's a full-time job which we can't afford to waste a human resource on.


Some ideas:

    - I need to be more strict with my testing, and reviewing my code : )
    - I need to improve prediction of 'first failures'. What would be a good
    way to improve this?

Putting aside code changes which are too difficult to analyze, the only obvious realistic way is to combine test editing with time factor, tune the time factor better, and also add randomization of the tests with equal priority that you put at the end of the queue.

       - Correlation between files changed - Tests failed? Apparently Sergei
       tried this, but the results were not too good - But this is before
       running in strict simulation mode. With strict simulation mode,
       anything that could help spot first failures could be considered.

As discussed before, it seems difficult to implement. Let's fix what we have now, and if the results are still not satisfactory, re-consider it later.


I am currently running tests to get the adjusted results. I will graph
them, and send them out in a couple hours.


Please at least fix the dependent logic first.

You should be able to see it easily by changing the order of modes in run_basic_simulations -- e.g. try to run standard / platform / branch / mixed for one running set, and then run again, with mixed / branch / platform / standard.
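
The same check can be automated. The sketch below assumes a hypothetical callable `run_all` that runs the simulations in the given order and returns a dict of results keyed by mode; it is not an existing function in the code.

```python
def order_dependent_modes(run_all, modes):
    """Return the modes whose result changes when the run order is reversed."""
    forward = run_all(list(modes))
    backward = run_all(list(reversed(modes)))
    return {m for m in modes if forward[m] != backward[m]}
```

An empty set is what independent simulations should produce; any mode in the returned set got a result that depends on what ran before it.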



Regards,
Elena

Regards

Pablo


On Fri, Jun 13, 2014 at 12:40 AM, Elena Stepanova <elenst@xxxxxxxxxxxxxxxx>
wrote:

Hi Pablo,

Thanks for the update.


On 12.06.2014 19:13, Pablo Estrada wrote:

Hello Sergei, Elena and all,
Today while working on the script, I found and fixed an issue:

There is some faulty code in my script that is in charge of collecting
the statistics about whether a test failure was caught or not (here
<https://github.com/pabloem/Kokiri/blob/master/basic_simulator.py#L393>).
I looked into fixing it, and then I could see another *problem*: The *recall
numbers* that I had collected previously were *too high*.

The actual recall numbers, once we consider the test failures that are
*not caught*, are disappointingly lower. I won't show you results yet, since I
want to make sure that the code has been fixed, and I have accurate tests
first.

This is all for now. The strategy that I was using is a lot less effective
than it seemed initially. I will send out a more detailed report with
results, my opinion on the weak points of the strategy, and ideas,
including a roadmap to try to improve results.

Regards. All feedback is welcome.


Please push your fixed code that triggered the new results, even if you
are not ready to share the results themselves yet. It will be easier to
discuss then.

Regards,
Elena


  Pablo




