
maria-developers team mailing list archive

Re: [GSoC] Optimize mysql-test-runs - Results of new strategy


Hi Elena,
It's on a new branch in the same repository, you can see it here:

I changed the whole simulator.py file. I made sure to comment the header of
every function, but there aren't many in-line comments. Let me know if you
need more clarification.


On Sun, Jun 29, 2014 at 6:57 PM, Elena Stepanova <elenst@xxxxxxxxxxxxxxxx>
wrote:

> Hi Pablo,
> On 29.06.2014 6:25, Pablo Estrada wrote:
>> Hi Elena and all,
>> I guess I should admit that my excitement was a bit too much; but also I'm
>> definitely not 'jumping' into this strategy. As I said, I am trying to use
>> the lessons learned from all the experiments to make the best predictions.
>> That being said, a strong point of the new strategy is that rather than
>> using failure rate alone to predict future failures, it uses more data to
>> make its predictions, and those predictions are more consistent. On the
>> 3k-training and 2k-predicting simulations its advantage is not so
>> apparent (they fare similarly, with the 'standard' strategy being the
>> best one), but it becomes more evident with longer prediction periods.
>> I ran tests with 20k-training rounds and 20k-prediction rounds, and the
>> new
>> strategy fared a lot better. I have attached charts with comparisons of
>> both of them. We can observe that with a running set of 500, the original
>> algorithm had a very nice almost 95% recall in shorter tests, but it falls
>> to less than 50% with longer testing (and it would be a lot lower if we
>> averaged only the last couple of thousand runs, rather than all 20k
>> simulation runs together).
> Okay, thanks, it looks much more convincing.
> Indeed, as we already discussed before, the problem with the previous
> strategy or implementation is that the recall deteriorates quickly after
> you stop using complete results as the learning material, and start taking
> into account only simulated results (which is what would be happening in
> real life). If the new method helps to solve this problem, it's worth
> looking into.
> You mentioned before that you pushed the new code, where is it located?
> I'd like to look at it before making any further conclusions.
> Regards,
> Elena
>> Since the goal of the project is to provide consistent long-term test
>> optimization, we want to take all we can learn from the new strategy and
>> improve the consistency of the recall over long-term simulation.
>> Nevertheless, I agree that there are important lessons in the original
>> strategy, particularly that >90% recall in shorter prediction periods.
>> That's why I'm still tuning and testing.
>> Again, all advice and observations are welcome.
>> Hope everyone is having a nice weekend.
>> Pablo
>> On Sun, Jun 29, 2014 at 12:53 AM, Elena Stepanova <
>> elenst@xxxxxxxxxxxxxxxx>
>> wrote:
>>> Hi Pablo,
>>> Could you please explain why you consider the new results to be better?
>>> I don't see any obvious improvement.
>>> As I understand from the defaults, previously you were running tests with
>>> 2000 training rounds and 3000 simulation rounds, and you've already had
>>> ~70% on 300 runs and ~80% on 500 runs, see your email of June 19,
>>> no_options_simulation.jpg.
>>> Now you have switched the limits: you are running with 3000 training and
>>> 2000 simulation rounds. It makes a big difference; if you re-run tests
>>> with the old algorithm and the new limits, you'll easily get +10%, so RS
>>> 300 will be around the same 80%, and RS 500 should be even higher,
>>> pushing 90%, while now you have barely 85%.
>>> Before jumping onto the new algorithm, please provide the comparison of
>>> the old and new approach with equal pre-conditions and parameters.
>>> Thanks,
>>> Elena
>>> On 28.06.2014 6:44, Pablo Estrada wrote:
>>>> Hi all,
>>>> Well, as I said, I have incorporated a very simple weighted failure
>>>> rate into the strategy, and I have found quite encouraging results. The
>>>> recall looks better than in earlier tests. I am attaching two charts
>>>> with data compiled from runs with 3000 training rounds and 2000
>>>> simulation rounds (5000 test runs analyzed in total):
>>>>      - The recall by running set size (as shown, it reaches 80% with
>>>>        300 tests)
>>>>      - The index of failure in the priority queue (running set 500,
>>>>        training 3000, simulation 2000)
>>>> It is interesting to look at chart number 2:
>>>> The first 10 or so places have a very high count of found failures.
>>>> These most likely come from repeated failures (tests that failed in the
>>>> previous run and were caught in the next one). The next ones have a
>>>> skew to the right, and these come from the file-change model.
>>>> I am glad about these new results :). I have a couple of new ideas to
>>>> try to push the recall a bit further up, but I wanted to show the
>>>> progress first.
>>>> Also, I will do a thorough code review before any new changes, to make
>>>> sure that the results are valid. Interestingly enough, in this new
>>>> strategy the code is simpler.
>>>> Also, I will run a test with a more long term period (20,000 training,
>>>> 20,000 simulation), to see if the recall degrades as time passes and we
>>>> miss more failures.
>>>> Regards!
>>>> Pablo
>>>> On Fri, Jun 27, 2014 at 4:48 PM, Pablo Estrada <polecito.em@xxxxxxxxx>
>>>> wrote:
>>>>> Hello everyone,
>>>>> I spent the last couple of days working on a new strategy to calculate
>>>>> the relevance of a test. The results are not sufficient by themselves,
>>>>> but I believe they point in an interesting direction. This strategy
>>>>> uses the rate of co-occurrence of events to estimate the relevance of
>>>>> a test, and the events that it uses are the following:
>>>>>      - File edits since the last run
>>>>>      - Test failure in the last run
>>>>> The strategy has also two stages:
>>>>>      1. Training stage
>>>>>      2. Executing stage
>>>>> In the training stage, it goes through the available data, and does the
>>>>> following:
>>>>>      - If test A failed:
>>>>>        - It counts and stores all the files that were edited since the
>>>>>          last test_run (the last test_run depends on BRANCH, PLATFORM,
>>>>>          and other factors)
>>>>>        - If test A also failed in the previous test run, it counts
>>>>>          that as well
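A minimal Python sketch of that counting step, assuming hypothetical counter names (`co_failures`, `repeat_failures`, `failure_counts`, and the function signature are mine for illustration, not the actual simulator.py structures):

```python
from collections import defaultdict

# Hypothetical counters; the real simulator.py may structure these differently.
co_failures = defaultdict(int)      # (test, file) -> failures after that file was edited
repeat_failures = defaultdict(int)  # test -> failures in two consecutive runs
failure_counts = defaultdict(int)   # test -> total observed failures

def train_on_run(failed_tests, edited_files, previous_failed):
    """Update the co-occurrence counters with one test run's results."""
    for test in failed_tests:
        failure_counts[test] += 1
        # Count each edited-file / failure co-occurrence.
        for f in edited_files:
            co_failures[(test, f)] += 1
        # Count a repeated failure if the test also failed last run.
        if test in previous_failed:
            repeat_failures[test] += 1
```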
>>>>> In the executing stage, the training algorithm is still applied, but
>>>>> the decision of whether a test runs is based on its relevance. The
>>>>> relevance is calculated as the sum of the following:
>>>>>      - The percentage of times the test has failed in two subsequent
>>>>>        test_runs, multiplied by whether the test failed in the previous
>>>>>        run (if the test didn't fail in the previous run, this quantity
>>>>>        is 0)
>>>>>      - For each file that was edited since the last test_run, the
>>>>>        percentage of times that the test has failed after this file
>>>>>        was edited
>>>>> (The explanation is a bit clumsy; I can clear it up if you wish)
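In code, that sum could be sketched roughly as follows. The counter dictionaries and the choice of denominators here are illustrative assumptions, not the real simulator.py implementation:

```python
def relevance(test, edited_files, failed_last_run,
              repeat_failures, failure_counts, co_failures, edit_counts):
    """Relevance = repeat-failure rate (only if the test failed last run)
    plus, for each edited file, the test's failure rate after that file's
    edits. All dictionaries are assumed to have been built during training."""
    score = 0.0
    # First term: fraction of the test's failures that were repeats,
    # counted only when the test failed in the previous run.
    if failed_last_run and failure_counts.get(test, 0) > 0:
        score += repeat_failures.get(test, 0) / failure_counts[test]
    # Second term: for each edited file, how often this test failed
    # after an edit to that file.
    for f in edited_files:
        if edit_counts.get(f, 0) > 0:
            score += co_failures.get((test, f), 0) / edit_counts[f]
    return score
```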
>>>>> The results have been neither too good nor too bad. With a running set
>>>>> of 200 tests, a training phase of 3000 test runs, and an executing
>>>>> stage of 2000 test runs, I achieved a recall of 0.50.
>>>>> Nonetheless, while running tests, I found something interesting:
>>>>>      - I removed the first factor of the relevance: I decided not to
>>>>>        care about whether a test failed in the previous test run, and
>>>>>        used only the file-change factor. Naturally, the recall
>>>>>        decreased, from 0.50 to 0.39 (not too big a decrease)... and the
>>>>>        distribution of failed tests in the priority queue had a good
>>>>>        skew towards the front of the queue (so it seems that the files
>>>>>        help somewhat to indicate the likelihood of a failure). I
>>>>>        attached this chart.
>>>>> An interesting problem that I encountered was that about 50% of the
>>>>> test_runs don't have any file changes or test failures, so the
>>>>> relevance of all tests is zero. Here is where the original strategy (a
>>>>> weighted average of failures) could be useful: even if we don't have
>>>>> any information to guess which tests to run, we just go ahead and run
>>>>> the ones that have failed the most recently.
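One way to sketch that fallback, assuming an exponentially decayed failure rate as the "weighted average of failures" (the decay value, function names, and tie-breaking scheme are my own, not the project's code):

```python
def weighted_failure_rate(failure_history, decay=0.9):
    """Exponentially decayed failure rate: recent failures weigh more.
    failure_history is a list of booleans, oldest first."""
    rate = 0.0
    for failed in failure_history:
        rate = decay * rate + (1.0 if failed else 0.0)
    return rate

def pick_tests(tests, relevance_scores, histories, running_set_size):
    """Prefer the co-occurrence relevance; when it is zero for everything,
    the decayed failure rate breaks the tie and drives the ordering."""
    def key(t):
        return (relevance_scores.get(t, 0.0),
                weighted_failure_rate(histories.get(t, [])))
    return sorted(tests, key=key, reverse=True)[:running_set_size]
```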
>>>>> I will work on mixing up both strategies a bit in the next few days,
>>>>> and
>>>>> see what comes of that.
>>>>> By the way, I pushed the code to GitHub. The code is completely
>>>>> different, so it may be better to wait until I have new results.
>>>>> Regards!
>>>>> Pablo
