Re: [GSoC] Optimize mysql-test-runs - Results of new strategy


Hello Elena,
It took me a while to figure out how the files and the test_runs
correspond to each other, and there may still be some hard-to-solve
inconsistencies between them: in a few cases it is not easy to determine
automatically which file corresponds to which test_run (for example, when
there are more platform+build test_runs than files). Excluding those
cases, yes, there are about 28k files that can be matched to test_runs
appropriately.

The distribution of these matches is quite irregular. They start around
test_run #10,000, and from there on some test_runs have a matching file
and some do not.

What I'm doing is the following:

   1. If there is a file that matches this test_run: Parse the file, and
   return the tests in the file as the input list. I am not considering
   'skipped' tests, because it seems that they are skipped because they can't
   be run.
   2. If there is no file matching test_run: Consider ALL known tests as
   being in the input list.
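
To make points 1 and 2 concrete, here is a minimal sketch in Python. It
assumes the matching file has already been parsed into (test_name, status)
pairs; the function name, the dictionary layout, the status string
'skipped', and the test names are placeholders for illustration, not the
actual code or the actual file format.

```python
def input_test_list(test_run_id, parsed_reports, all_known_tests):
    """Return the set of tests treated as the input list for one test_run.

    parsed_reports is a hypothetical dict mapping a test_run id to the
    list of (test_name, status) pairs parsed from its matching file.
    """
    report = parsed_reports.get(test_run_id)
    if report is None:
        # Point 2: no file matches this test_run, so consider ALL known tests.
        return set(all_known_tests)
    # Point 1: use the tests from the file, leaving out the skipped ones.
    return {name for name, status in report if status != "skipped"}


# Made-up example data (hypothetical ids and test names).
parsed_reports = {
    10001: [
        ("main.alias", "pass"),
        ("main.innodb", "skipped"),
        ("rpl.rpl_row", "fail"),
    ],
}
all_known_tests = ["main.alias", "main.innodb", "rpl.rpl_row", "main.view"]

with_file = input_test_list(10001, parsed_reports, all_known_tests)
without_file = input_test_list(42, parsed_reports, all_known_tests)
```

With this data, test_run 10001 gets only the two non-skipped tests from its
file, while test_run 42 (no file) falls back to the full list of known tests.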

I would like to get some of your feedback on a couple of things:

   - I would still like to define some structure for the interfaces -even
   if a bit loose.
   - You mentioned earlier that rather than a specific running_set, you
   wanted to use a percentage. We can work like this.
   - Do you have any feedback on points 1 and 2 regarding the handling of
   the input test lists?
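
One possible reading of the percentage idea, as a rough sketch: the
running set size is derived from the size of the input list instead of
being a fixed number. The 'relevance' scores, the rounding-up rule, and the
function names are assumptions made for illustration only.

```python
import math


def running_set_size(input_list, percentage):
    """Number of tests to run: a percentage of the input list, rounded up."""
    if not input_list:
        return 0
    # Assumption: always run at least one test when the input list is non-empty.
    return max(1, math.ceil(len(input_list) * percentage / 100))


def select_running_set(input_list, relevance, percentage):
    """Pick the top-ranked tests from the input list by (hypothetical) relevance."""
    size = running_set_size(input_list, percentage)
    # Tests without a score are treated as having relevance 0.0.
    ranked = sorted(input_list, key=lambda t: relevance.get(t, 0.0), reverse=True)
    return ranked[:size]


# Made-up example: run 50% of a four-test input list.
chosen = select_running_set(
    ["a", "b", "c", "d"],
    {"a": 0.1, "b": 0.9, "c": 0.5},
    50,
)
```

Here 50% of four tests gives a running set of two, so the two tests with
the highest scores are chosen.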

And one more thing:

   - I have not incorporated test variant into the data, but I'll spend
   some time thinking about how to do this.

That's it for now.


On Wed, Jul 16, 2014 at 1:10 AM, Pablo Estrada <polecito.em@xxxxxxxxx>
wrote:

> Hi Elena,
> A small progress report: I was able to quickly make the changes related to
> selecting code changes to measure correlations with test failures. Recall
> is still around 80% with a running set of 300 and short prediction stages. I
> can focus now on the input file list, since I believe this will make
> results more realistic, and (I expect) help push recall a bit further up.
> Can you please upload the report files from MTR, so that I can include the
> logic of an input test list?
> Also, since I am going to incorporate this logic, it might be good to
> define (even if just roughly) the "core module" and the "wrapper module"
> that you had mentioned earlier, rather than just incorporating the list,
> and making the code that I have now even more bloated with mixed up
> functionalities. What do you think?
> Regards
> Pablo
> On Tue, Jul 15, 2014 at 2:18 PM, Pablo Estrada <polecito.em@xxxxxxxxx>
> wrote:
>> Hello Elena,
>>> Can you give a raw estimation of a ratio of failures missed due to being
>>> low in the priority queue vs those that were not in the queue at all?
>> I sent this information in a previous email, here:
>> https://lists.launchpad.net/maria-developers/msg07482.html
>>> Also, once again, I would like you to start using an incoming test list
>>> as an initial point of your test set generation. It must be done sooner or
>>> later, I already explained earlier why; and while it's not difficult to
>>> implement even after the end of your project, it might affect the result
>>> considerably, so we need to know if it makes it better or worse, and adjust
>>> the algorithm accordingly.
>> You are right. I understand that this information is not fully available
>> for all the test_runs, so can you upload the information going back as much
>> as possible? I can parse these files and adjust the program to work with
>> this. I will get on to work with this, I think this should significantly
>> improve results. I think, it might even push my current strategy from
>> promising results into attractive ones.
>>> There are several options which change the way the tests are executed;
>>> e.g. tests can be run in a "normal" mode, or in PS protocol mode, or with
>>> valgrind, or with embedded server. And it might well be that some tests
>>> always fail e.g. with valgrind, but almost never fail otherwise.
>>> Information about these options is partially available in test_run.info,
>>> but it would require some parsing. It would be perfect if you could analyze
>>> the existing data to understand whether using it can affect your results
>>> before spending time on actual code changes.
>> I will keep this in consideration, but for now I will focus on these two
>> main things:
>>    - Improving precision of selecting code changes to estimate
>>    correlation with test failures
>>    - Adding the use of an incoming test list
>>> When we are trying to watch all code changes and find correlation with
>>> test failures, if it's done well, it should actually provide immediate
>>> gain; however, it's very difficult to do it right, there is way too much
>>> noise in the statistical data to get a reliable picture. So, while it will
>>> be nice if you get it work (since you already started doing it), don't take
>>> it as a defeat if you eventually find out that it doesn't work very well.
>> Well, actually, this is the only big difference between the original
>> strategy using just a weighted average of failures; and the new strategy,
>> which performs *significantly better* in longer testing settings. It has
>> been working for a few weeks, and is up on github.
>> Either way, as I said before, I will, from today, focus on improving
>> precision of selecting code changes to estimate correlation with test
>> failures.
>> Regards
>> Pablo

