
maria-developers team mailing list archive

Re: [GSoC] Optimize mysql-test-runs - Results of new strategy


Hello Elena,
I am very sorry about that. The trees were left a bit messy after the
changes; I have just pushed fixes for that. You can now start from
basic_testcase.py. Before starting, you should decompress the file
changes data into csv/direct_file_changes.csv, and update the directory
that contains the input_test_lists in basic_testcase.py.

Regarding your previous email, the way the project works now is as follows:

1. Learning cycle. Populate information about tests.
2. Make predictions
3. Update results - in memory
4. Repeat from step 2
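
The cycle above can be sketched roughly as follows. This is only an illustration, not the code in the repository: the function names, the layout of a run (a dict with 'input_list' and 'failed'), and the use of a plain failure count as relevance are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sketch: relevance is a plain failure count per test name.

def learn(test_info, training_runs):
    """Step 1: populate information about tests from historical runs."""
    for run in training_runs:
        for test in run["failed"]:
            test_info[test] += 1.0

def predict(test_info, input_list, running_set_size):
    """Step 2: choose the most relevant tests among those considered to run."""
    ranked = sorted(input_list, key=lambda t: test_info[t], reverse=True)
    return ranked[:running_set_size]

def update(test_info, chosen, failed):
    """Step 3: update results in memory, only for tests that actually ran."""
    for test in chosen:
        if test in failed:
            test_info[test] += 1.0

def simulate(training_runs, prediction_runs, running_set_size):
    """Run the whole cycle and return the fraction of failures caught."""
    test_info = defaultdict(float)
    learn(test_info, training_runs)
    caught = total = 0
    for run in prediction_runs:  # step 4: repeat from step 2
        chosen = predict(test_info, run["input_list"], running_set_size)
        failed = set(run["failed"])
        caught += len(failed & set(chosen))
        total += len(failed)
        update(test_info, chosen, failed)
    return caught / total if total else 0.0
```

Everything stays in memory between rounds here, which is why a full simulation only pays the learning cost once.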

Run this way, the project takes several minutes to complete 7000 rounds. The
'standard' strategy takes about 20 minutes, and the 'new' one takes about

When I think about the project, I expect it to work differently from this.
In real life (in buildbot), I believe the project would work by storing the
*test_info* data structure in a file, or in the database, and loading
it into memory on every test_run, as follows:

1. Load data into memory (from database, or a file)
2. Make predictions
3. Update results - in memory
4. Save data (to database or a file)

Steps 2 and 3 are the same in both cases. It takes from 0.05 to 0.35
seconds to do each round of prediction and update of results (depending on
the length of the input list, and number of modified files for the 'new'
strategy). If we make it work like this, then we just need to add up the
time it would take to load up the data structure (and file_changes for the
'new' strategy). This should amount to less than a couple of seconds.
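
To measure that overhead, one round of the buildbot-style flow could be timed along these lines. This is a minimal sketch under assumptions: test_info is stored as a JSON file, relevance is a plain failure count, and the names load_test_info, save_test_info and run_once are made up for the example.

```python
import json
import os
import time

def load_test_info(path):
    """Step 1: load data into memory; start empty if nothing was saved yet."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)

def save_test_info(path, test_info):
    """Step 4: save data back to the file."""
    with open(path, "w") as f:
        json.dump(test_info, f)

def run_once(path, input_list, failed, running_set_size):
    """Do one load/predict/update/save round and time each phase in seconds."""
    timings = {}

    start = time.perf_counter()
    test_info = load_test_info(path)
    timings["load"] = time.perf_counter() - start

    start = time.perf_counter()
    # Step 2: rank the tests considered to run by stored relevance.
    ranked = sorted(input_list, key=lambda t: test_info.get(t, 0.0), reverse=True)
    chosen = ranked[:running_set_size]
    # Step 3: update results in memory for the tests that ran and failed.
    for test in chosen:
        if test in failed:
            test_info[test] = test_info.get(test, 0.0) + 1.0
    timings["predict_and_update"] = time.perf_counter() - start

    start = time.perf_counter()
    save_test_info(path, test_info)
    timings["save"] = time.perf_counter() - start
    return chosen, timings
```

The total cost per test_run is then the sum of the three timings, so the load and save phases are the only additions to the 0.05 to 0.35 seconds quoted above.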

I can gather more detailed data regarding time if necessary. Let me know.


On Sun, Jul 27, 2014 at 6:16 PM, Elena Stepanova <elenst@xxxxxxxxxxxxxxxx>

> Hi Pablo,
> Thanks for the update, I'm looking into it.
> There is one more important factor to choose which strategy to put the
> further effort on. Do they perform similarly time-wise?
> I mean, you now ran the same sets of tests on both strategies. Did it take
> approximately the same time? And in case you measured it, what about 3000 +
> 1 rounds, which is closer to the real-life test case?
> And what absolute time does one round take? I realize it depends on the
> machine and other things, but roughly -- is it seconds, or minutes, or tens
> of minutes?
> We should constantly watch it, because the whole point is to reduce test
> execution time; but the test execution time will include using the tool, so
> if it turns out that it takes as much time as we later save on tests, doing
> it makes little sense.
> Regards,
> Elena
> On 27.07.2014 11:51, Pablo Estrada wrote:
>> Hello Elena,
>> Concluding with the results of the recent experimentation, here is the
>> available information:
>> I have ported the basic code for the 'original' strategy into the
>> core-wrapper architecture, and uploaded it to the 'master' branch.
>> Now both strategies can be tested equivalently.
>> Branch: master <https://github.com/pabloem/Kokiri> - Original strategy,
>> using exponential decay. The performance increased a little bit after
>> incorporating randomizing of the end of the queue.
>> Branch: core-wrapper_architecture
>> <https://github.com/pabloem/Kokiri/tree/core-wrapper_architecture> -
>> 'New' strategy using co-occurrence between file changes and failures to
>> calculate relevance.
>> I think they are both reasonably useful strategies. My theory for why the
>> 'original' strategy performs better with the input_test lists is that we
>> now know which tests ran, and so only the relevance of tests which ran is
>> affected (whereas previously, all tests were having their relevance
>> reduced). The tests were run with *3000 rounds of training* and *7000
>> rounds of prediction*.
>> I think that now the most reasonable option would be to gather data for a
>> longer period, just to be sure that the performance of the 'original'
>> strategy holds for the long term. We already discussed that it would be
>> desirable for buildbot to incorporate functionality to keep track of which
>> tests were run, or considered to run (since buildbot already parses the
>> output of MTR, the changes should be quite quick, but I understand that,
>> this being a production system, extreme care must be taken with the
>> changes and the design).
>> Finally, I fixed the chart comparing the results, sorry about the
>> confusion yesterday.
>>
>> Let me know what you think, and how you'd like to proceed now. : )
>> Regards
>> Pablo
>> On Sat, Jul 26, 2014 at 8:26 PM, Pablo Estrada <polecito.em@xxxxxxxxx>
>> wrote:
>>  Hi Elena,
>>> I just ran the tests comparing both strategies.
>>> To my surprise, according to the tests, the results from the 'original'
>>> strategy are a lot higher than those from the 'new' strategy. The
>>> difference in results might come from one of many possibilities, but I
>>> feel it's the following:
>>> Using the lists of run tests allows the relevance of a test to decrease
>>> only if it is considered to run and it runs. That way, tests with high
>>> relevance that would run, but were not in the list, don't run and thus
>>> are able to hit their failures later on, rather than losing relevance.
>>> I will have charts in a few hours, and I will review the code more
>>> deeply,
>>> to make sure that the results are accurate. For now I can inform you that
>>> for a 50% size of the running set, the 'original' strategy, with no
>>> randomization, time factor or edit factor achieved a recall of 0.90 in
>>> the
>>> tests that I ran.
>>> Regards
>>> Pablo
>>> On Thu, Jul 24, 2014 at 8:18 PM, Pablo Estrada <polecito.em@xxxxxxxxx>
>>> wrote:
>>>  Hi Elena,
>>>> On Thu, Jul 24, 2014 at 8:06 PM, Elena Stepanova <
>>>> elenst@xxxxxxxxxxxxxxxx
>>>>> wrote:
>>>>  Hi Pablo,
>>>>> Okay, thanks for the update.
>>>>> As I understand, the last two graphs were for the new strategy taking
>>>>> into account all edited files, no branch/platform, no time factor?
>>>> - Yes, new strategy. Using 'co-occurrence' of code file edits and
>>>> failures. Also a weighted average of failures.
>>>> - No time factor.
>>>> - No branch/platform scores are kept. The data for the tests is the
>>>> same,
>>>> no matter platform.
>>>> - But when calculating relevance, we use the failures that occurred in
>>>> the last run as a parameter. The last run does depend on branch and
>>>> platform.
>>>>  Also, if it's not too long and if it's possible with your current code,
>>>>> can you run the old strategy on the same exact data, learning/running
>>>>> set,
>>>>> and input files, so that we could clearly see the difference?
>>>> I have not incorporated the logic for input file list for the old
>>>> strategy, but I will work on it, and it should be ready by tomorrow,
>>>> hopefully.
>>>>  I suppose your new tree does not include the input lists? Are you using
>>>>> the raw log files, or have you pre-processed them and made clean
>>>>> lists? If
>>>>> you are using the raw files, did you rename them?
>>>> It does not include them.
>>>> I am using the raw files. I included a tiny shell script
>>>> (downlaod_files.sh) that you can execute to download and decompress the
>>>> files into the directory where the program will look by default.
>>>> Also, I forgot to change it when uploading, but in basic_testcase.py you
>>>> would need to remove the file_dir parameter passed to s.wrapper(), so
>>>> that the program falls back to its default location for the files.
>>>> Regards
>>>> Pablo
