maria-developers team mailing list archive

Thread
Date

Re: [GSoC] Optimize mysql-test-runs - Results of new strategy

To: Pablo Estrada <polecito.em@xxxxxxxxx>
From: Elena Stepanova <elenst@xxxxxxxxxxxxxxxx>
Date: Tue, 15 Jul 2014 01:14:19 +0400
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <CABDWuamLsaoKyOL4vTW7q4L34T_Pkf845SrbsnKzJy=CbpyQSA@mail.gmail.com>
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0

Hi Pablo,

Can you give a raw estimation of a ratio of failures missed due to beinglow in the priority queue vs those that were not in the queue at all?

If you can't, don't waste time on getting this information, but sinceyou have already analyzed the data, I thought you might have an answerright away.

Also, once again, I would like you to start using an incoming test listas an initial point of your test set generation. It must be done sooneror later, I already explained earlier why; and while it's not difficultto implement even after the end of your project, it might affect theresult considerably, so we need to know if it makes it better or worse,and adjust the algorithm accordingly.You don't need to create an actual interface with MTR, any form of thetest list will do, as soon as it's the correct list.

Also, there is another dimension in the test suite which I think youdidn't take into account before, and which might be useful, especiallyif you run out of ideas.There are several options which change the way the tests are executed;e.g. tests can be run in a "normal" mode, or in PS protocol mode, orwith valgrind, or with embedded server. And it might well be that sometests always fail e.g. with valgrind, but almost never fail otherwise.Information about these options is partially available in test_run.info,but it would require some parsing. It would be perfect if you couldanalyze the existing data to understand whether using it can affect yourresults before spending time on actual code changes.



A couple more comments inline.


On 14.07.2014 19:37, Pablo Estrada wrote:

Hello everyone,
I spent these days observing some of the missed failures, and trying to
tune the strategy to improve recall. I have been trying to look at how the
relevance of missed failures was not enough, and what would be a good tune
up to the strategy to prevent these problems. In the current strategy, the
relevance of a test depends mostly on one of the three factors:


    1. A test failed in the previous run. These tests have a high relevance,
    usually around 1, and tend to be at the very front of the priority queue.
    These tests are the high bars on 1,2,3 and 4.

This is expected. It happens because usually it takes some time for adeveloper to notice a new failure, to fix it and push the fix; in themeantime, several more runs are executed.These failures are indeed easy to predict. Unfortunately, they are notthe most interesting ones, but as we agreed, we won't categorizefailures as "more important" and "less important". In any case, it'sgood to have them caught.

    2. Files related to this test were changed. These tests have a rather
    low relevance, but tend to be at the beginning of the priority queue. Some
    of the missed failures come from here. We might be able to avoid missing
    these tests by being more precise when checking relationship between
    changed files and failed tests.

Are you talking about test files (.test and .result, and maybe .inc), orany code changes?


There might be a difference based on real-life practices.

If a test/result file gets changed, it's almost certain that the personwho changed it actually ran the test and made sure it passes. It canstill happen that it fails on a different platform, but it's ratherrare. So, using this information is not expected to provide bigshort-term gain. However, it is still very important to take it intoaccount, because tests that get changed are the most "modern" ones, andhence should be run. Luckily, it's easy enough to do.

When we are trying to watch all code changes and find correlation withtest failures, if it's done well, it should actually provide immediategain; however, it's very difficult to do it right, there is way too muchnoise in the statistical data to get a reliable picture. So, while itwill be nice if you get it work (since you already started doing it),don't take it as a defeat if you eventually find out that it doesn'twork very well.

    3. No files were changed, but the weighted failure rate is higher for
    this test. These tests tend to have low relevance. Some of the missed
    failures are here. They usually are tests that failed too long ago, and
    have become irrelevant (and since no files were changed, their relevance
    pales compared to tests that failed more recently). These tests are very
    hard to predict. Randomization can be helpful *in practice*, but with
    the data that we have now, randomization does not improve recall very much
    - just makes it vary a tiny bit, up and down. I can go further on why I
    think this would be a good measure in practice, but doesn't work for our
    data.

If you are using your former "standard" method (where no branch/platforminformation is taken into account), and if you only randomize the emptytail of the queue, then randomization should make no difference at all:you'll have more failures in the queue than the target size of the testset, so there is simply no room for randomization. Or, maybe you aretalking about some other kind of randomization, I'll look at the code(please make sure it's up-to-date in git).



Regards,
/E


So here is what I think can be done to try to improve recall:

    1. Tests that failed in the previous run: These are quite fine. No need
    to improve this metric.
    2. Become more precise when assessing which file changes correspond to
    which test_run. Right now, we take EVERY test change that happened between
    previous_run and next_run. That includes files that went to other branches.
    Instead of doing this, I plan to use fie_changes that are related to
    test_runs through the
    changes-sourcestamp-buildset-buildrequests-builds-test_run chain of
    relationships. I still have not analyzed this data, but I believe it should
    be workable. *I will work on this the next few days.*
    3. Weighted failure rate and randomization are an interesting option,
    but I believe this would be more useful in practice; and so it would
    require an extra phase in the project, and time is limited. (We would
    require a period of comparing predictions with results in buildbot). I am
    definitely willing to consider working on this, but I guess now we should
    focus on the Aug 16 deadline.

Again, if anyone sees any 'area of opportunity', or has any advice, it's
all welcome.
Regards
Pablo

On Sun, Jun 29, 2014 at 7:25 PM, Pablo Estrada <polecito.em@xxxxxxxxx>
wrote:

Hi Elena,
It's on a new branch in the same repository, you can see it here:
https://github.com/pabloem/Kokiri/tree/file_correlation

I changed the whole simulator.py file. I made sure to comment the header
of every function, but don't have too many in-line comments. You can let me
know if you need more clarifications.

Regards
Pablo


On Sun, Jun 29, 2014 at 6:57 PM, Elena Stepanova <elenst@xxxxxxxxxxxxxxxx>
wrote:

Hi Pablo,


On 29.06.2014 6:25, Pablo Estrada wrote:

Hi Elena and all,
I guess I should admit that my excitement was a bit too much; but also
I'm
definitely not 'jumping' into this strategy. As I said, I am trying to
use
the lessons learned from all the experiments to make the best
predictions.

That being said, a strong point about the new strategy is that rather
than
purely use failure rate to predict failure rate, it uses more data to try
to make predictions - and it experiences more consistency of prediction.
On
the 3k-training and 2k-predicting simulations its advantage is not so
apparent (they fare similarly, with the 'standard' strategy being the
best
one), but it becomes more evident with longer predicting.

I ran tests with 20k-training rounds and 20k-prediction rounds, and the
new
strategy fared a lot better. I have attached charts with comparisons of
both of them. We can observe that with a running set of 500, the original
algorithm had a very nice almost 95% recall in shorter tests, but it
falls
to less than 50% with longer testing (And it must be a lot lower if we
average the last couple of thousand runs, rathen the the 20k simulation
runs together)


Okay, thanks, it looks much more convincing.
Indeed, as we already discussed before, the problem with the previous
strategy or implementation is that the recall deteriorates quickly after
you stop using complete results as the learning material, and start taking
into account only simulated results (which is what would be happening in
real life). If the new method helps to solve this problem, it's worth
looking into.

You mentioned before that you pushed the new code, where is it located?
I'd like to look at it before making any further conclusions.

Regards,
Elena

Since the goal of the project is to provide consistent long-term test
optimization, we would want to take all we can learn from the new
strategy
- and, improve the consistency of the recall over long-term simulation.

Nevertheless, I agree that there are important lessons in the original
strategy, particularly that >90% recall ion shorter prediction periods.
That's why I'm still tuning and testing.

Again, all advice and observations are welcome.
Hope everyone is having a nice weekend.
Pablo


On Sun, Jun 29, 2014 at 12:53 AM, Elena Stepanova <
elenst@xxxxxxxxxxxxxxxx>
wrote:

  Hi Pablo,


Could you please explain why you are considering the new results being
better? I don't see any obvious improvement.

As I understand from the defaults, previously you were running tests
with
2000 training rounds and 3000 simulation rounds, and you've already had
~70% on 300 runs and ~80% on 500 runs, see your email of June 19,
no_options_simulation.jpg.

Now you have switched the limits, you are running with 3000 training and
2000 simulation rounds. It makes a big difference, if you re-run tests
with
the old algorithm with the new limits, you'll get +10% easily, thus RS
300
will be around the same 80%, and RS 500 should be even higher, pushing
90%,
while now you have barely 85%.

Before jumping onto the new algorithm, please provide the comparison of
the old and new approach with equal pre-conditions and parameters.

Thanks,
Elena



On 28.06.2014 6:44, Pablo Estrada wrote:

  Hi all,

well, as I said, I have incorporated a very simple weighted failure
rate
into the strategy, and I have found quite encouraging results. The
recall
looks better than earlier tests. I am attaching two charts with data
compiled from runs with 3000 training rounds and 2000 simulation (5000
test
runs analyzed in total):

      - The recall by running set size (As shown, it reaches 80% with
300
      tests)
      - The index of failure in the priority queue (running set: 500,
training

      3000, simulation 2000)

It is interesting to look at chart number 2:
The first 10 or so places have a very high count of found failures.
These
most likely come from repeated failures (tests that failed in the
previous
run and were caught in the next one). The next ones have a skew to the
right, and these come from the file-change model.

I am glad of these new results : ). I have a couple new ideas to try to
push the recall a bit further up, but I wanted to show the progress
first.
Also, I will do a thorough code review before any new changes, to make
sure
that the results are valid. Interestingly enough, in this new strategy
the
code is simpler.
Also, I will run a test with a more long term period (20,000 training,
20,000 simulation), to see if the recall degrades as time passes and we
miss more failures.

Regards!
Pablo


On Fri, Jun 27, 2014 at 4:48 PM, Pablo Estrada <polecito.em@xxxxxxxxx>
wrote:

   Hello everyone,

I took the last couple of days working on a new strategy to calculate
the
relevance of a test. The results are not sufficient by themselves,
but I
believe they point to an interesting direction. This strategy uses
that
rate of co-occurrence of events to estimate the relevance of a test,
and
the events that it uses are the following:

      - File editions since last run
      - Test failure in last run


The strategy has also two stages:

      1. Training stage
      2. Executing stage


In the training stage, it goes through the available data, and does
the
following:

      - If test A failed:
      - It counts and stores all the files that were edited since the
last

      test_run (the last test_run depends on BRANCH, PLATFORM, and
other
factors)
      - If test A failed also in the previous test run, it also counts
that


In the executing stage, the training algorithm is still applied, but
the
decision of whether a test runs is based on its relevance, the
relevance
is
calculated as the sum of the following:

      - The percentage of times a test has failed in two subsequent

      test_runs, multiplied by whether the test failed in the previous
run
(if
      the test didn't fail in the previous run, this quantity is 0)
      - For each file that was edited since the last test_run, the

      percentage of times that the test has failed after this file was
edited

(The explanation is a bit clumsy, I can clear it up if you wish so)
The results have not been too bad, nor too good. With a running set of
200
tests, a training phase of 3000 test runs, and an executing stage of
2000
test runs, I have achieved recall of 0.50. It's not too great, nor too
bad.

Nonetheless, while running tests, I found something interesting:

      - I removed the first factor of the relevance. I decided to not
care

      about whether a test failed in the previous test run. I was only
using the
      file-change factor. Naturally, the recall decreased, from 0.50 to
0.39 (the
      decrease was not too big)... and the distribution of failed
tests in
the
      priority queue had a good skew towards the front of the queue
(so it
seems
      that the files help somewhat, to indicate the likelihood of a
failure). I
      attached this chart.

An interesting problem that I encountered was that about 50% of the
test_runs don't have any file changes nor test failures, and so the
relevance of all tests is zero. Here is where the original strategy (a
weighted average of failures) could be useful, so that even if we
don't
have any information to guess which tests to run, we just go ahead and
run
the ones that have failed the most, recently.

I will work on mixing up both strategies a bit in the next few days,
and
see what comes of that.

By the way, I pushed the code to github. The code is completely
different,
so may be better to wait until I have new results soon.

Regards!
Pablo

Follow ups

Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Pablo Estrada, 2014-07-15

References

[GSoC] Optimize mysql-test-runs - Results of new strategy
From: Pablo Estrada, 2014-06-27
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Pablo Estrada, 2014-06-28
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Elena Stepanova, 2014-06-28
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Pablo Estrada, 2014-06-29
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Elena Stepanova, 2014-06-29
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Pablo Estrada, 2014-06-29
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Pablo Estrada, 2014-07-14