Re: [GSoC] Optimize mysql-test-runs - Setback
Hi Pablo,
On 13.06.2014 14:12, Pablo Estrada wrote:
Hello Elena and all,
I have pushed the fixed code. There are a lot of changes in it because I
I went through your code (the latest revision). I think I'll postpone
detailed in-code comments till the next iteration, and instead will just
list my major concerns and questions here.
1. Structure
The structure of the code is not quite what we will need in the end.
As you know, the goal is to return a set (list) of tests that MTR would
run. I understand that you are currently experimenting and hence don't
have the final algorithm to produce a single list. But what I would
expect to see is
- a core module which takes various parameters -- type of metric,
running set, calculation mode, etc. -- does all the work and produces
such a list (possibly with some statistical data which you need for the
research, and which might be useful later anyway);
- a wrapper which would feed the core module with different sets of
parameters, get the results and compare them.
After the research is finished, the best parameters would become
default, the wrapper would be abandoned or re-written to pass the
resulting test list to MTR, while the core module would stay pretty much
intact.
At first glance this seemed to be the case in your code, but it turned out it was not.
run_basic_simulations.py looks like the wrapper described above, except that it also does the extra work of initializing the simulator, which it should not.
On the other hand, simulator.py does not look like the described core module at all. It executes the logic for all modes regardless of the start-up parameters, and this logic is heavily interleaved. After you choose the best approach, you will have to rewrite most of it, which is not only a waste of time but also error-prone.
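To illustrate, here is a very rough sketch of the split I have in mind (all the names below are made up; it's not meant to prescribe the exact shape of your code):

from copy import deepcopy

class SimulationParams:
    """Everything that distinguishes one experiment from another."""
    def __init__(self, metric, mode, running_set,
                 use_edit_factor=False, use_time_factor=False):
        self.metric = metric              # e.g. 'weighted_failures'
        self.mode = mode                  # 'standard' / 'platform' / 'branch' / 'mixed'
        self.running_set = running_set    # fraction of the tests we may run, e.g. 0.3
        self.use_edit_factor = use_edit_factor
        self.use_time_factor = use_time_factor

def build_queue(params, test_info, test_run):
    """Core module: produce the ordered list of tests to run for one test run.
    This is the part that survives after the research is finished."""
    raise NotImplementedError   # the actual prioritization logic lives here

def run_and_compare(parameter_sets, prepared_data, test_runs):
    """Research wrapper: feed the core with different parameter sets and
    compare the results; later it is replaced by the MTR glue."""
    results = {}
    for params in parameter_sets:
        data = deepcopy(prepared_data)    # every experiment starts from clean data
        results[(params.mode, params.metric)] = [
            build_queue(params, data, test_run) for test_run in test_runs
        ]
    return results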
2. Cleanup
To prove the previous point: currently the experiments that you run are not independent. That is, if you call several simulations from run_basic_simulations.py, only the very first one will use the correct data and get real results. All subsequent ones will use the data modified by the previous ones, and their results will be totally irrelevant.
This happens because there is an initial prepare step in run_basic_simulations.py, but no cleanup between simulations. The whole test_info structure remains as it was at the end of the previous simulation, most importantly the metrics. Also, test_edit_factor cannot work at all for any simulation except the first one, because during the simulation you truncate the editions list but never restore it.
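A sketch of one simple way to achieve independence (the names here are invented; the point is just the deep copy per simulation):

import copy

def run_all_simulations(make_simulator, prepared_test_info, modes, running_set):
    """Give each simulation its own deep copy of the prepared data, so that
    metrics and truncated edition lists never leak from one run into the next."""
    results = {}
    for mode in modes:
        test_info = copy.deepcopy(prepared_test_info)   # fresh data every time
        results[mode] = make_simulator(test_info, mode=mode,
                                       running_set=running_set).run()
    return results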
3. Modes
These flaws should be easy to fix by doing a proper cleanup before each simulation. But there are also other fragments of code where, for example, the logic for 'standard' mode is always executed and relied upon, even if the desired mode is different.
In fact, you build all the queues every time. That would be an understandable trade-off to save time on simulations, but you re-run them separately anyway, and only return the requested queue.
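Dispatching on the requested mode would avoid that; something like the following (a sketch; the build_* names are invented, they stand for the per-mode logic in simulator.py):

def build_queue_for(simulator, test_run, mode):
    """Build only the queue that the requested mode actually needs."""
    builders = {
        'standard': simulator.build_standard_queue,
        'platform': simulator.build_platform_queue,
        'branch':   simulator.build_branch_queue,
        'mixed':    simulator.build_mixed_queue,
    }
    return builders[mode](test_run)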
4. Failed tests vs executed tests
Further, as I understand it, you only calculate the metrics for tests which were either edited or failed at least once, and thus only such tests can ever make it into the corresponding queue. Not only does this create a bubble, it also makes the comparison of modes unreliable and the whole simulation less efficient.
Let's suppose for simplicity that we do not use the editing factor.
In standard mode, the number of relevant failed tests for a single test run is obviously greater than, say, in mixed mode (because in standard mode all failures count, while in mixed mode only those that happened on the same platform+branch do). So, where in standard mode you'll calculate metrics for, say, 1K tests, in mixed mode for a particular combination of platform+branch you'll do so for only 20 tests. It means that even though you set the running set to 500, in fact you'll run at most 20 tests. That is not desirable: if we say we can afford to run 500 tests, we'd rather run 500 than 20, even if some of them never failed before. This will also help us break the bubble, especially if we randomize the "tail" (the tests with minimal priority that we add to fill the queue). If some of them fail, they'll get a proper metric and will migrate to the meaningful part of the queue.
I know you don't have all the data about which tests were run or can be
run in a certain test run; but for initial simulation the information is
fairly easy to obtain -- just use the corresponding stdio files which
you can obtain via the web interface, or run MTR to produce the lists;
and in real life it should be possible to make MTR pass it over to your
tool.
To populate the queue, you don't really need to know which tests had ever been run; you only need to know which ones MTR *wants* to run if the running set is unlimited. If we assume that it passes this list to you and you iterate through it, you can use your metrics for tests that failed or were edited before, and a default minimal metric for all other tests. Then, if the calculated tests are not enough to fill the queue, you'll choose randomly from the rest. It won't completely solve the problem of tests that never failed and were never edited, but at least it will make it less critical.
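Roughly like this (just a sketch; mtr_candidates stands for the list of tests MTR wants to run, and the default value is arbitrary):

import random

def populate_queue(mtr_candidates, metrics, running_set_size, default_metric=0.0):
    """Order all candidate tests by metric; tests that never failed and were
    never edited get the default minimal metric. Shuffling first means that
    ties -- in particular the whole low-priority 'tail' -- end up in random order."""
    candidates = list(mtr_candidates)
    random.shuffle(candidates)
    # sort() is stable, so equal-metric tests keep their shuffled order
    candidates.sort(key=lambda t: metrics.get(t, default_metric), reverse=True)
    return candidates[:running_set_size]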
5. Running set
It's a smaller issue, but coming back to the real usage of the algorithm, we cannot really set an absolute value for the running set. MTR options can be very different: one builder might run a few hundred tests at most, another several thousand. We should use a percentage instead.
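That is, something like this (a sketch; 30% is an arbitrary example):

def running_set_size(mtr_candidates, fraction=0.3):
    """Run e.g. 30% of whatever this particular builder would normally run,
    rather than a fixed absolute number of tests."""
    return max(1, int(len(mtr_candidates) * fraction))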
6. Full / non-full simulation mode
I couldn't understand what the *non*-full simulation mode is for, can
you explain this?
7. Matching logic (get_test_file_change_history)
The logic where you try to match result file names to test names is not quite correct. Some highlights:
There can also be subsuites. Consider the example:
./mysql-test/suite/engines/iuds/r/delete_decimal.result
The result file can live not only in the /r dir, but also in the /t dir, together with the test file. It's not cool, but it happens; see for example mysql-test/suite/mtr/t/
Here are some other possible patterns for engine/plugin suites:
./storage/tokudb/mysql-test/suite/tokudb/r/rows-32m-1.result
./storage/innodb_plugin/mysql-test/innodb.result
Also, in release builds they can be in mysql-test/plugin folder:
mysql-test/plugin/example/mtr/t
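A sketch of more tolerant matching that covers the layouts above, except the result file sitting directly under mysql-test (the innodb_plugin case), which would still need special handling; the group names and the last example path are mine:

import re

RESULT_PATH = re.compile(
    r'(?:storage/\w+/)?'                          # optional engine dir, e.g. storage/tokudb/
    r'mysql-test/'
    r'(?:(?:suite|plugin)/(?P<suite>[\w-]+)/)?'   # optional suite or plugin name
    r'(?:(?P<subsuite>[\w-]+)/)?'                 # optional subsuite, e.g. iuds or mtr
    r'[rt]/'                                      # result files may live in r/ or t/
    r'(?P<test>[\w.-]+)\.result$'
)

for path in ('./mysql-test/suite/engines/iuds/r/delete_decimal.result',
             './storage/tokudb/mysql-test/suite/tokudb/r/rows-32m-1.result',
             './mysql-test/suite/mtr/t/simple.result'):
    m = RESULT_PATH.search(path)
    print(m.groupdict() if m else 'no match: ' + path)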
Be aware that the logic where you compare branch names doesn't currently work as expected. Your list of "fail branches" consists of clean names only, e.g. "10.0", while row[BRANCH] can be something like "lp:~maria-captains/maria/10.0". I'm not sure yet why it is sometimes stored this way, but it is.
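For the comparison it should be enough to normalize both sides first, e.g. (a sketch):

def normalize_branch(name):
    """Map both 'lp:~maria-captains/maria/10.0' and '10.0' to the same key."""
    return name.rstrip('/').rsplit('/', 1)[-1]

assert normalize_branch('lp:~maria-captains/maria/10.0') == '10.0'
assert normalize_branch('10.0') == '10.0'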
I had more comments and questions, but let's address these ones first, and then we'll see which of the rest remain relevant.
Comments on your notes from the email are below inline.
went through all the code making sure that it made sense. The commit is here <https://github.com/pabloem/Kokiri/commit/7c47afc45a7b1f390e8737df58205fa53334ba09>, and although there are a lot of changes, the main line where failures are caught or missed is this <https://github.com/pabloem/Kokiri/blob/7c47afc45a7b1f390e8737df58205fa53334ba09/simulator.py#L496>.
1. The test result file edition information helps improve recall - if marginally
2. The time since last run information does not improve recall much at all - See [Weaknesses - 2]
Let's get back to both of them after the logic with dependent simulations is fixed; then we'll review it and see why it doesn't work, if it still doesn't. Right now any effect that file edition might have is rather coincidental, and possibly the other factor is broken as well.
A couple of concepts that I want to define before going on:
- *First failures*. These are failures that happen because of new bugs. They don't occur close in time as part of a chain of failures. They occur as a consequence of a transaction that introduces a bug, but they might occur soon or long after this transaction (usually soon, rather than long). They might be correlated with the frequency of failure of a test (core or basic tests that fail often might be especially good at exposing bugs); but many of them are not (tests of a feature that don't fail often, but rather fail when that feature is modified).
- *Strict simulation mode.* This is the mode where, if a test is not
part of the running set, its failure is not considered.
Weaknesses:
- It's very difficult to predict 'first failures'. With the current strategy, if it's been a long time since a test failed (or if it has never failed before), the relevancy of the test just goes down, and it never runs.
- Especially in database and parallel software, there are bugs that hide in the code for a long time until one test discovers them. Unfortunately, the analysis that I'm doing requires that the test runs exactly when the data indicates it will fail. If a test that would fail doesn't run in test run Z, even though it might run in test run Z+1, the failure is just considered missed, as if the bug was never 'encountered'.
What you call "First failures" is the main target of the regression test
suite. So, however difficult they are to predict, we should attempt to
do so. On the bright side, we don't need to care that much about the
other type, those that "hide in the code for a long time". There are
indeed sporadic failures of either code or a test, which happen every
now and then, some often, some rarely; but they are not what the test
suite is after. Ideally, they should not exist at all, the regression
test suite is supposed to be totally deterministic, which means that a
test that passed before may only fail if the related code or the test
itself changed.
So, both "test edit" factor and "time" factor are not really expected to
improve recall a lot, their purpose is to help to break the bubble. New
and edited tests must run, it seems obvious. The time factor is less
obvious but it's our only realistic way to make sure that we don't
forget some tests forever.
- This affects the *time since last run* factor. This factor helps encounter 'hidden' bugs that can be exposed by tests that have not run, but the data available makes it difficult.
- This would also affect the *correlation* factor. If test A and B fail together often, and on test_run Z both of them would fail, but only A runs, the heightened relevancy of B on the next test_run would not make it catch anything (again, this is a limitation of the data, not of reality).
- Humans are probably a lot better at predicting first failures than the current strategy.
This is true; unfortunately, it's a full-time job which we can't afford to spend a human resource on.
Some ideas:
- I need to be more strict with my testing, and reviewing my code : )
- I need to improve prediction of 'first failures'. What would be a good
way to improve this?
Putting aside code changes, which are too difficult to analyze, the only obvious realistic way is to combine test editing with the time factor, tune the time factor better, and also add randomization of the tests with equal priority that you put at the end of the queue.
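Just to give an idea of the kind of combination I mean (a sketch; the weights are arbitrary and would need tuning, and the two helper callables are assumptions, not existing code):

def combined_priority(test, metrics, was_edited, runs_since_last_seen,
                      edit_weight=2.0, time_weight=0.01):
    """Base failure metric, boosted when the test (or its result file) was
    recently edited, and slowly growing the longer the test has not been run."""
    priority = metrics.get(test, 0.0)
    if was_edited(test):
        priority += edit_weight
    return priority + time_weight * runs_since_last_seen(test)

The randomization among equal priorities is the same idea as the randomized tail in point 4 above.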
- Correlation between files changed - Tests failed? Apparently Sergei tried this, but the results were not too good - But this is before running in strict simulation mode. With strict simulation mode, anything that could help spot first failures could be considered.
As discussed before, it seems difficult to implement. Let's fix what we have now, and if the results are still not satisfactory, reconsider it later.
I am currently running tests to get the adjusted results. I will graph
them, and send them out in a couple hours.
Please at least fix the dependent-simulation logic first.
You should be able to see the problem easily by changing the order of modes in run_basic_simulations -- e.g. try to run standard / platform / branch / mixed for one running set, and then run again with mixed / branch / platform / standard.
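E.g. something like this (a sketch; run_one_simulation stands for whatever your wrapper actually calls):

def check_independence(run_one_simulation, running_set=500):
    """With independent simulations, both mode orders must give identical results."""
    for order in (('standard', 'platform', 'branch', 'mixed'),
                  ('mixed', 'branch', 'platform', 'standard')):
        print(order, {mode: run_one_simulation(mode, running_set) for mode in order})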
Regards,
Elena
Regards
Pablo
On Fri, Jun 13, 2014 at 12:40 AM, Elena Stepanova <elenst@xxxxxxxxxxxxxxxx>
wrote:
Hi Pablo,
Thanks for the update.
On 12.06.2014 19:13, Pablo Estrada wrote:
Hello Sergei, Elena and all,
Today while working on the script, I found and fixed an issue:
There is some faulty code in my script that is in charge of collecting the statistics about whether a test failure was caught or not (here <https://github.com/pabloem/Kokiri/blob/master/basic_simulator.py#L393>). I looked into fixing it, and then I could see another *problem*: the *recall numbers* that I had collected previously were *too high*.
The actual recall numbers, once we consider the test failures that are *not caught*, are disappointingly lower. I won't show you results yet, since I want to make sure that the code has been fixed and that I have accurate tests first.
This is all for now. The strategy that I was using is a lot less effective
than it seemed initially. I will send out a more detailed report with
results, my opinion on the weak points of the strategy, and ideas,
including a roadmap to try to improve results.
Regards. All feedback is welcome.
Please push your fixed code that triggered the new results, even if you
are not ready to share the results themselves yet. It will be easier to
discuss then.
Regards,
Elena
Pablo