
maria-developers team mailing list archive

Re: [GSoC] Optimize mysql-test-runs - Results of new strategy


Hi Pablo,

On 17.07.2014 16:17, Pablo Estrada wrote:
Hello Elena,
It took me a while to figure out how the files and the test_runs
correspond to each other, and there might still be some hard-to-solve
inconsistencies with them: there were a few cases where it is not easy to
determine automatically which file corresponds to which test_run (cases
where there are more platform+build test_runs than files)... but
excluding those cases, yes, there are about 28k files that can be matched
to test_runs appropriately.

As I said in the private email, the files are identified by a platform/bbnum pair. It will definitely happen that there are test runs in the dump which don't have corresponding output files, and files which don't have corresponding records in the data dump (because the files are fresher than the dump). Both situations are expected.

Looking at the dump, I see it can also happen that the dump contains several records for one platform/bbnum pair. I am not sure why that happens; I think it shouldn't, and it might be a bug in buildbot and/or its configuration, or an environmental problem. Anyway, due to the way we store output files, they can well override each other in this case, so for several records with the same platform/bbnum you will have only one file. I suppose that's what was hard to resolve, sorry about that.

Anyway, if you got 28K files, that should be more than enough for the experiments, since you normally run them on ~5K test runs.
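For what it's worth, the pairing step could be sketched roughly as below. The filename pattern and the dict shape of the dump records are assumptions for illustration only, not the actual formats:

```python
import os
import re
from collections import defaultdict

# Assumed filename pattern "<platform>-<bbnum>.txt"; adjust the regex to
# whatever naming scheme the uploaded files actually use.
FILE_RE = re.compile(r"^(?P<platform>.+)-(?P<bbnum>\d+)\.txt$")

def index_output_files(directory):
    """Group the MTR output files by (platform, bbnum)."""
    index = defaultdict(list)
    for name in os.listdir(directory):
        m = FILE_RE.match(name)
        if m:
            index[(m.group("platform"), int(m.group("bbnum")))].append(name)
    return index

def match_test_runs(test_runs, index):
    """Pair each dump record (a dict with 'platform' and 'bbnum' keys,
    an assumed shape) with its file, if any. Runs without files, files
    without runs, and duplicate platform/bbnum records sharing one file
    are all expected, per the discussion above."""
    matched, unmatched = [], []
    for run in test_runs:
        files = index.get((run["platform"], run["bbnum"]), [])
        (matched if files else unmatched).append((run, files))
    return matched, unmatched
```

The duplicate-record case simply shows up as one file shared by several matched records, which is consistent with the overriding behaviour described above.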

The distribution of these is quite random. They start matching around
test_run #10,000, and from then on they match sometimes and sometimes not.

What I'm doing, is the following:

    1. If there is a file that matches this test_run: Parse the file, and
    return the tests in the file as the input list. I am not considering
    'skipped' tests, because it seems that they are skipped because they can't
    be run.

You should consider skipped tests, at least for now. Your logic that they are skipped because they can't be run is generally correct; unfortunately, MTR first produces the *full* list of tests to run, and determines whether a test can be run or not at a later stage, when it starts running the tests. Your tool will receive the initial test list, and I'm not sure it's realistic to rewrite MTR so that it takes into account the limitations that cause tests to be skipped before creating the list.
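As a sketch, a parser that keeps skipped tests in the input list could look like this. The result-line format encoded in the regex is an assumption; adjust it to whatever the actual report files contain:

```python
import re

# Assumed shape of MTR result lines, e.g.
#   main.select                [ pass ]    171
#   innodb.create-index        [ skipped ] ...
# Adjust the regex if the real report files differ.
RESULT_RE = re.compile(
    r"^(?P<suite>[\w-]+)\.(?P<test>[\w-]+)\s+\[\s*(?P<status>pass|fail|skipped)\s*\]")

def parse_input_list(lines, include_skipped=True):
    """Extract the list of tests a run attempted.

    include_skipped defaults to True: MTR builds the full list first and
    only decides later what to skip, so skipped tests belong in the
    input list the tool will actually receive."""
    tests = []
    for line in lines:
        m = RESULT_RE.match(line.strip())
        if m and (include_skipped or m.group("status") != "skipped"):
            tests.append("%s.%s" % (m.group("suite"), m.group("test")))
    return tests
```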

    2. If there is no file matching test_run: Consider ALL known tests as
    being in the input list.

I need to think about it.
Possibly it's better to skip a test run altogether if there is no input list for it. It would definitely be best if there were 5K (or whatever slice you are currently using) continuous test runs with input lists; if it turns out that there are lists for some branches but not for others, you can skip those branches entirely.
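Picking such a continuous slice is a one-pass scan; a minimal sketch, with has_input_list as a hypothetical predicate (in practice it would check whether an output file exists for the run's platform/bbnum):

```python
def usable_slice(test_runs, has_input_list, target=5000):
    """Return (up to `target` runs of) the longest continuous stretch
    of test runs that all have input lists."""
    best, current = [], []
    for run in test_runs:
        if has_input_list(run):
            current.append(run)
            if len(current) > len(best):
                best = list(current)
        else:
            current = []  # gap: the stretch is broken
    return best[-target:]
```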

I would like to get some of your feedback on a couple of things:

    - I would still like to define some structure for the interfaces, even
    if a bit loose.

If you mean the separation into a core module and a wrapper, it should go like this.

The core module should take as parameters
- list of tests to choose from,
- size of the running set (%),
- branch/platform (if we use them in the end),
and produce a new list of tests of the size of the running set.

The wrapper module should
- read the list of tests from the outside world (for now, from a file),
- receive branch/platform as command-line options,
- have the running set size set as an easily changeable constant or as a configuration parameter,

and return the list of tests -- let's say, for now, in the form of <test suite>.<test name>, blank-separated, e.g.
main.select innodb.create-index ...
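A rough Python skeleton of that split might look as follows; the names and the trivial placeholder "ranking" are illustrative only:

```python
import argparse

RUNNING_SET_PCT = 50  # easily changeable constant; could move to a config file

def choose_tests(candidate_tests, running_set_pct, branch=None, platform=None):
    """Core module: take the candidate list (and optionally branch and
    platform) and return a running set of the requested relative size.

    The real core would rank tests by predicted failure relevance; the
    placeholder below just keeps the head of the list."""
    size = max(1, round(len(candidate_tests) * running_set_pct / 100))
    return candidate_tests[:size]

def main(argv=None):
    """Wrapper module: read the test list from a file, branch/platform
    from command-line options, and print the chosen tests as
    blank-separated <suite>.<test> names."""
    parser = argparse.ArgumentParser()
    parser.add_argument("test_list_file")
    parser.add_argument("--branch")
    parser.add_argument("--platform")
    args = parser.parse_args(argv)
    with open(args.test_list_file) as f:
        tests = f.read().split()
    print(" ".join(choose_tests(tests, RUNNING_SET_PCT,
                                args.branch, args.platform)))
```

With the usual `if __name__ == "__main__": main()` guard added, it would be invoked e.g. as `python wrapper.py tests.txt --branch 10.0`.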

    - You mentioned earlier that rather than a specific running_set, you
    wanted to use a percentage. We can work like this.

Yes, we should. If you look at those input files, you can see that the number of tests run varies considerably. Grep the files for 'Completed: All ' (this will exclude unsuccessful runs where test execution just stopped for whatever reason), and you'll see that there are runs with 3.5K tests, with 1.5K tests, and with 150 tests... So any constant running_set size you choose will be meaningless for one run or another.
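A toy illustration of that grep, using made-up sample files rather than the real output:

```shell
# Only runs that finished contain the 'Completed: All ...' line, so
# grep separates them from aborted runs.
dir=$(mktemp -d)
printf 'Completed: All 3500 tests were successful.\n' > "$dir/run-a.txt"
printf 'test execution stopped before completion\n'   > "$dir/run-b.txt"
grep -l 'Completed: All ' "$dir"/*.txt   # lists only run-a.txt
```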

Of course, we can get smart with the percentage and, let's say, not apply it to the smallest test runs (or make it flexible), but for now a single percentage will do.
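One possible shape for that flexibility, sketched as a small helper; the min_size floor is a hypothetical refinement, not something we've agreed on:

```python
def running_set_size(total_tests, pct, min_size=None):
    """Turn the running-set percentage into a concrete number of tests.

    min_size is a hypothetical floor for small runs ("make it
    flexible"); a plain percentage is the default behaviour."""
    size = max(1, round(total_tests * pct / 100))
    if min_size is not None:
        size = min(total_tests, max(size, min_size))
    return size
```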

    - Do you have any feedback on points 1 and 2 regarding the handling of
    the input test lists?

And one more thing:

    - I have not incorporated test variant into the data, but I'll spend
    some time thinking about how to do this.

It can be difficult, so it would be better to analyze the data first and see if it makes any (useful) difference.


That's it for now.


On Wed, Jul 16, 2014 at 1:10 AM, Pablo Estrada <polecito.em@xxxxxxxxx>

Hi Elena,
A small progress report: I was able to quickly make the changes related to
selecting code changes to measure correlations with test failures. Recall
is still around 80% with a running set of 300 and short prediction stages. I
can focus now on the input file list, since I believe this will make
results more realistic and, I expect, help push recall further up.

Can you please upload the report files from MTR, so that I can include the
logic of an input test list?

Also, since I am going to incorporate this logic, it might be good to
define (even if just roughly) the "core module" and the "wrapper module"
that you mentioned earlier, rather than just incorporating the list and
making the code that I have now even more bloated with mixed-up
functionality. What do you think?


On Tue, Jul 15, 2014 at 2:18 PM, Pablo Estrada <polecito.em@xxxxxxxxx>

Hello Elena,

Can you give a rough estimate of the ratio of failures missed due to being
low in the priority queue vs. those that were not in the queue at all?

I sent this information in a previous email, here:

Also, once again, I would like you to start using an incoming test list
as the initial point of your test-set generation. It must be done sooner or
later; I already explained why earlier. And while it's not difficult to
implement even after the end of your project, it might affect the results
considerably, so we need to know whether it makes them better or worse, and
adjust the algorithm accordingly.

You are right. I understand that this information is not fully available
for all the test_runs, so can you upload as much of it as possible, going
as far back as you can? I can parse these files and adjust the program to
work with them. I think this should significantly improve results; it might
even push my current strategy from promising results to attractive ones.

There are several options which change the way the tests are executed;
e.g. tests can be run in a "normal" mode, in PS-protocol mode, with
valgrind, or with the embedded server. And it might well be that some tests
always fail, e.g., with valgrind, but almost never fail otherwise.
Information about these options is partially available in test_run.info,
but it would require some parsing. It would be perfect if you could analyze
the existing data to understand whether using this information can affect
your results before spending time on actual code changes.
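One cheap way to do that analysis before touching the real code, assuming test_run.info is free text that mentions the options when they are enabled (the option keywords below are guesses that would need checking against the actual data):

```python
from collections import defaultdict

# Assumed option keywords; the real test_run.info values need verifying.
OPTIONS = ("valgrind", "ps-protocol", "embedded")

def run_options(info):
    """Crude extraction: which known options does the info text mention?"""
    info = info.lower()
    return frozenset(opt for opt in OPTIONS if opt in info)

def failures_by_options(failures):
    """failures: iterable of (test_name, info_text) pairs for failed runs.

    Returns {test: {option_set: count}}, enough to eyeball whether some
    tests fail only under particular options (e.g. only with valgrind)."""
    counts = defaultdict(lambda: defaultdict(int))
    for test, info in failures:
        counts[test][run_options(info)] += 1
    return counts
```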

I will keep this in consideration, but for now I will focus on these two
main things:

    - Improving precision of selecting code changes to estimate
    correlation with test failures
    - Adding the use of an incoming test list

When we are trying to watch all code changes and find correlations with
test failures, if it's done well, it should actually provide immediate
gains; however, it's very difficult to do right, since there is way too
much noise in the statistical data to get a reliable picture. So, while it
will be nice if you get it to work (since you already started doing it),
don't take it as a defeat if you eventually find that it doesn't work very
well.

Well, actually, this is the only big difference between the original
strategy, which used just a weighted average of failures, and the new
strategy, which performs *significantly better* in longer testing
scenarios. It has been working for a few weeks, and is up on GitHub.

Either way, as I said before, I will, from today, focus on improving
precision of selecting code changes to estimate correlation with test