maria-developers team mailing list archive

Thread
Date

Re: [GSoC] Optimize mysql-test-runs - Results of new strategy

To: Pablo Estrada <polecito.em@xxxxxxxxx>
From: Elena Stepanova <elenst@xxxxxxxxxxxxxxxx>
Date: Thu, 31 Jul 2014 02:55:18 +0400
Cc: maria-developers <maria-developers@xxxxxxxxxxxxxxxxxxx>
In-reply-to: <CABDWuamwjiz_Fi6_0M6OL4gcnyfrmzSaUZ7pz9Mc7A1MeDjTXA@mail.gmail.com>
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.6.0

Hi Pablo,

I've been looking into the current code, experimenting with it a bit,and also thinking how we can incorporate it into our process better.While I'm not done with it yet, I have some thoughts and concerns, so Iwill share them and you could consider them meanwhile.

First, about integration. I agree it makes much more sense to storemetrics rather than calculate them each time from scratch. As anadditional bonus, it becomes unimportant whether we make buildbot storelists of all executed tests as we initially discussed. If the toolstores metrics, it only needs to receive the list of tests at runtime,which seems much less invasive.


That is, the flow should be _roughly_ like this:

- buildbot starts MTR;
- MTR collects tests to be run;

- MTR calls the predicting part of the tool (Pythia), passes the list oftests and maybe some other information to it;- Pythia reads existing metrics, creates the list of tests to run andreturns it back to MTR;

- MTR attempts to run tests;

- MTR calls the calculating part of the tool (Statistician), passes thelist of executed tests and the list of failures to it;

- Statistician updates the metrics and writes them to the database

So, instead of modifying buildbot to write full lists of testssomewhere, what we need now is the Statistician which could work in thelearning mode. We create the table(s) for it -- either empty orpopulated with whatever we can calculate now -- and let it startlearning. Later when Pythia is ready, and Statistician collects certainamount of data, prediction can start.

Another bright side of storing the metrics is that in this case we don'treally care about finding the minimal size of the learning set, since itwon't affect performance anyhow.




Now, about the actual algorithm.

I chose the original strategy to look at, since it appeared to be betterin your experiments, and I doubt we'll have time to tune the new one tothat level.

One thing that I want to see there is fully developed platform mode. Isee that mode option is still there, so it should not be difficult. Iactually did it myself while experimenting, but since I only made hastyand crude changes, I don't expect them to be useful.

The point is, we've never actually evaluated it carefully enough. Whatyou did before in a non-standard mode was only calculate *metrics* perthe mode definer ("label", in your new design). But you still used theglobal counts for learning set, max count and such. So, an averagelearning set for a platform would be not 3000, but ~150, etc.

It would have been reasonable if we really had to go through all testruns every time, but it's not so. In fact, if you are doing predictionfor lets say 'labrador', you will only look at the data related to thisbuilder (platform), and take 3000 runs from there only.

After some thinking, I don't care much about branch and mixed modeanymore. Tuning per branch seems a good idea in theory, but it onlymakes sense with granulation per major version, which would be toocomplicated. Otherwise, it's counter-productive to differentiate letssay 5.5 tree and 5.5-<developer name> tree, it will only cause loss ofinformation.Earlier it had some value since we attempted to run totally irrelevanttests on a branch that couldn't possibly run them (e.g. tests for anengine which isn't even there). But after switching to the lists oftests to run, this problem is gone.

But the platform approach has a potential. If applied properly, itshould provide both precision and diversity, both of which should helpto achieve better results. So, it would be great to have it in the codeand in the schema ready for use, so we can re-evaluate it closer to theend.

Secondly, I think we've missed some important factors in recallcalculation. It is now both overly optimistic and overly pessimistic,and I'm not sure they outweigh each other.



Over-optimistic part:

When you calculate metrics, you use all test runs up to the current one.In reality, it's impossible. What really happens is:


- a new push arrived, revision N;
- test run 1: builder1 started tests for revision N;
- test run 2: builder2 started tests for revision N;
- test run 3: builder3 started tests for revision N;
- test X failed in test run 1;
- test X failed in test run 2;
- test X failed in test run 3;
- test run 1 finised;
- test run 2 finised;
- test run 3 finised;

It doesn't matter in which order they fail/finish; the problem is, whenbuilder2 starts, it doesn't have information about builder1 results, andbuilder3 doesn't know anything about the first two. So, the metric fortest X could not be increased yet.

But in your current calculation, it is. So, naturally, if we happen tocatch the failure on builder1, the metric raises dramatically, and thefailure will be definitely caught on builders 2 and 3.

It is especially important now, when you use incoming lists, and therunning sets might be not identical for builders 1-3 even in standard mode.

It might also be another reason why the platform mode was losing: whenyou calculate metrics per platform, test runs are serialized by default.

It shouldn't matter if we decide to start our real-life learning cyclefrom scratch, because it will necessarily use only available data. Butyou should consider that currently it affects your measurements.



Over-pessimistic part:

It is similar to the previous one, but look at the same problem from adifferent angle. Suppose the push broke test X, and the test startedfailing on all builders (platforms). So, you have 20 failures, one pertest run, for the same push. Now, suppose you caught it on one platformbut not on others. Your statistics will still show 19 failures missed vs1 failure caught, and recall will be dreadful (~0.05). But in fact, thegoal is achieved: the failure has been caught for this push. It doesn'treally matter whether you catch it 1 time or 20 times. So, recall hereshould be 1.

It should mainly affect per-platform approach, but probably the standardone can also suffer if running sets are not identical for all builders.




Finally, a couple of small details.

I wonder if it's because of different versions or anything, but thisdidn't work for me:


exp = re.compile('([^, ]+) ?([^ ]*)? *.*\[ (fail|disabled|pass|skipped) \]')

It would give me an error. I had to modify it this way:

exp = re.compile('([^, ]+) ?([^ ]*) *.*\[ (fail|disabled|pass|skipped) \]')

From what I see, it should be the same. If you agree, please make thesame change (or somehow else get rid of the error).

Also, it appears that csv/test_fail_history.csv is the old file. Ireplaced it with csv/fails_ptest_run.csv in the code. It doesn't matterfor the final version, but might be important for experiments.

Finally, I checked some of discrepancies in test lists that the toolreports. They are of different nature, but I don't think it's worthspending time on it. I would just skip all files that cause any kind ofconfusion for the script. It brings their number down to ~28,000 whichshould still be enough.



That's all for the moment. As I said, I'm still looking.

Regards,
Elena


On 27.07.2014 20:17, Pablo Estrada wrote:

Hello Elena,
I am very sorry about that. The trees were left a bit messy with the
changes. I have pushed fixes for that just now. The file where you can
start now is basic_testcase.py. Before starting you should decompress
csv/direct_file_changes.tar.gz
into csv/direct_file_changes.csv, and update the directory that contains
the input_test_lists in basic_testcase.py.

Regarding your previous email, the way the project works now is as follows:

1. Learning cycle. Populate information about tests.
2. Make predictions
3. Update results - in memory
4. Repeat from step 2

In this way, the project takes several minutes running 7000 rounds. The
'standard' strategy takes about 20 minutes, and the 'new' one takes about
25.

When I think about the project I expect it to work different than this. In
real life (in buildbot), I believe the project would work by storing the
*test_info* data structure into a file, or into the database, and loading
it into memory every test_run, as follows:

1. Load data into memory (from database, or a file)
2. Make predictions
3. Update results - in memory
4. Save data (to database or a file)

Steps 2 and 3 are the same in both cases. It takes from 0.05 to 0.35
seconds to do each round of prediction and update of results (depending on
the length of the input list, and number of modified files for the 'new'
strategy). If we make it work like this, then we just need to add up the
time it would take to load up the data structure (and file_changes for the
'new' strategy). This should amount to less than a couple of seconds.

I can gather more detailed data regarding time if necessary. Let me know.

Regards
Pablo



On Sun, Jul 27, 2014 at 6:16 PM, Elena Stepanova <elenst@xxxxxxxxxxxxxxxx>
wrote:

Hi Pablo,

Thanks for the update, I'm looking into it.

There is one more important factor to choose which strategy to put the
further effort on. Do they perform similarly time-wise?

I mean, you now ran the same sets of tests on both strategies. Did it take
approximately the same time? And in case you measured it, what about 3000 +
1 rounds, which is closer to the real-life test case?

And what absolute time does one round take? I realize it depends on the
machine and other things, but roughly -- is it seconds, or minutes, or tens
of minutes?

We should constantly watch it, because the whole point is to reduce test
execution time; but the test execution time will include using the tool, so
if it turns out that it takes as much time as we later save on tests, doing
it makes little sense.

Regards,
Elena




On 27.07.2014 11:51, Pablo Estrada wrote:

Hello Elena,
Concluding with the results of the recent experimentation, here is the
available information:
I have ported the basic code for the 'original' strategy into the
core-wrapper architecture, and uploaded it to the 'master' branch.
Now both strategies can be tested equivalently.
Branch: master <https://github.com/pabloem/Kokiri> - Original strategy,

using exponential decay. The performance increased a little bit after
incorporating randomizing of the end of the queue.
Branch: core-wrapper_architecture
<https://github.com/pabloem/Kokiri/tree/core-wrapper_architecture> -
'New'

strategy using co occurrence between file changes and failures to
calculate
relevance.

I think they are both reasonably useful strategies. My theory is that the
'original' strategy performs better with the input_test lists is that we
now know which tests ran, and so only the relevance of tests which ran is
affected (whereas previously, all tests were having their relevance
reduced). The tests were run with *3000 rounds of training* and *7000
rounds of prediction*.


I think that now the most reasonable option would be to gather data for a
longer period, just to be sure that the performance of the 'original'
strategy holds for the long term. We already discussed that it would be
desirable that buildbot incorporated functionality to keep track of which
tests were run, or considered to run (since buildbot already parses the
output of MTR, the changes should be quite quick, but I understand that
being a production system, extreme care must be had in the changes and the
design).

Finally, I fixed the chart comparing the results, sorry about the
confusion
yesterday.


Let me know what you think, and how you'd like to proceed now. : )
Regards

Pablo



On Sat, Jul 26, 2014 at 8:26 PM, Pablo Estrada <polecito.em@xxxxxxxxx>
wrote:

  Hi Elena,

I just ran the tests comparing both strategies.
To my surprise, according to the tests, the results from the 'original'
strategy are a lot higher that the 'new' strategy. The difference in
results might come from one of many possibilities, but I feel it's the
following:

Using the lists of run tests allows the relevance of a test to decrease
only if it is considered to run and it runs. That way, tests with high
relevance that would run, but were not in the list, don't run and thus
are
able to be hit their failures later on, rather than losing relevance.

I will have charts in a few hours, and I will review the code more
deeply,
to make sure that the results are accurate. For now I can inform you that
for a 50% size of the running set, the 'original' strategy, with no
randomization, time factor or edit factor achieved a recall of 0.90 in
the
tests that I ran.

Regards
Pablo


On Thu, Jul 24, 2014 at 8:18 PM, Pablo Estrada <polecito.em@xxxxxxxxx>
wrote:

  Hi Elena,


On Thu, Jul 24, 2014 at 8:06 PM, Elena Stepanova <
elenst@xxxxxxxxxxxxxxxx

wrote:


  Hi Pablo,


Okay, thanks for the update.

As I understand, the last two graphs were for the new strategy taking
into account all edited files, no branch/platform, no time factor?

- Yes, new strategy. Using 'co-occurrence' of code file edits and
failures. Also a weighted average of failures.
- No time factor.
- No branch/platform scores are kept. The data for the tests is the
same,
no matter platform.
- But when calculating relevance, we use the failures occurred in the
last run as parameter. The last run does depend of branch and platform.


  Also, if it's not too long and if it's possible with your current code,

can you run the old strategy on the same exact data, learning/running
set,
and input files, so that we could clearly see the difference?

I have not incorporated the logic for input file list for the old
strategy, but I will work on it, and it should be ready by tomorrow,
hopefully.


  I suppose your new tree does not include the input lists? Are you using

the raw log files, or have you pre-processed them and made clean
lists? If
you are using the raw files, did you rename them?

It does not include them.

I am using the raw files. I included a tiny shell (downlaod_files.sh)
that you can execute to download and decompress the files in the
directory
where the program will look by default.
Also, I forgot to change it when uploading, but in basic_testcase.py,
you
would need to erase the file_dir parameter passed to s.wrapper(), so
that
the program defaults in looking for the files.

Regards
Pablo

Follow ups

Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Pablo Estrada, 2014-08-03

References

[GSoC] Optimize mysql-test-runs - Results of new strategy
From: Pablo Estrada, 2014-06-27
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Pablo Estrada, 2014-07-15
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Pablo Estrada, 2014-07-17
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Pablo Estrada, 2014-07-21
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Elena Stepanova, 2014-07-21
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Pablo Estrada, 2014-07-23
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Elena Stepanova, 2014-07-23
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Pablo Estrada, 2014-07-24
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Pablo Estrada, 2014-07-24
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Elena Stepanova, 2014-07-24
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Pablo Estrada, 2014-07-24
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Pablo Estrada, 2014-07-26
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Pablo Estrada, 2014-07-27
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Elena Stepanova, 2014-07-27
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy
From: Pablo Estrada, 2014-07-27