
maria-developers team mailing list archive

Re: [GSoC] Optimize mysql-test-runs - Results of new strategy


Hi Elena and all,
I guess I should admit that my excitement was a bit too much; but also I'm
definitely not 'jumping' into this strategy. As I said, I am trying to use
the lessons learned from all the experiments to make the best predictions.

That being said, a strong point of the new strategy is that, rather than
using only the past failure rate to predict failure rate, it uses more data
to make its predictions, and those predictions are more consistent. On
the 3k-training and 2k-predicting simulations its advantage is not so
apparent (they fare similarly, with the 'standard' strategy being the best
one), but it becomes more evident with longer prediction periods.

I ran tests with 20k training rounds and 20k prediction rounds, and the new
strategy fared a lot better. I have attached charts comparing the two
strategies. We can observe that, with a running set of 500, the original
algorithm had a very nice, almost 95% recall in the shorter tests, but it
falls to less than 50% with longer testing. (And it must be a lot lower if
we average only the last couple of thousand runs, rather than all of the 20k
simulation runs together.)
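
To illustrate the parenthetical above, here is a minimal sketch of how recall over a whole simulation can differ from recall over only its last runs. The function name, the per-run tuple layout, and the numbers are all made up for illustration; they are not taken from the project code or from the attached charts.

```python
def overall_and_trailing_recall(per_run, window):
    """Compare recall over all runs with recall over the last `window` runs.

    per_run: list of (caught_failures, total_failures), one tuple per test run.
    Recall is pooled: total caught failures divided by total failures.
    """
    def pooled(runs):
        caught = sum(c for c, _ in runs)
        total = sum(t for _, t in runs)
        return caught / total if total else 0.0

    return pooled(per_run), pooled(per_run[-window:])


# Made-up data: recall is high in early runs and degrades in later ones.
runs = [(9, 10)] * 4 + [(3, 10)] * 4
overall, trailing = overall_and_trailing_recall(runs, window=4)
print(f"overall recall: {overall:.2f}, last-4-runs recall: {trailing:.2f}")
```

With data shaped like this, the pooled figure hides how far recall has dropped by the end of the simulation, which is the effect described above.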

Since the goal of the project is to provide consistent long-term test
optimization, we would want to take all we can learn from the new strategy
and improve the consistency of the recall over long-term simulation.

Nevertheless, I agree that there are important lessons in the original
strategy, particularly its >90% recall in shorter prediction periods.
That's why I'm still tuning and testing.

Again, all advice and observations are welcome.
Hope everyone is having a nice weekend.

On Sun, Jun 29, 2014 at 12:53 AM, Elena Stepanova <elenst@xxxxxxxxxxxxxxxx>
wrote:

> Hi Pablo,
> Could you please explain why you consider the new results to be
> better? I don't see any obvious improvement.
> As I understand from the defaults, previously you were running tests with
> 2000 training rounds and 3000 simulation rounds, and you've already had
> ~70% on 300 runs and ~80% on 500 runs, see your email of June 19,
> no_options_simulation.jpg.
> Now you have switched the limits: you are running with 3000 training and
> 2000 simulation rounds. It makes a big difference. If you re-run tests with
> the old algorithm with the new limits, you'll get +10% easily, thus RS 300
> will be around the same 80%, and RS 500 should be even higher, pushing 90%,
> while now you have barely 85%.
> Before jumping onto the new algorithm, please provide the comparison of
> the old and new approach with equal pre-conditions and parameters.
> Thanks,
> Elena
> On 28.06.2014 6:44, Pablo Estrada wrote:
>> Hi all,
>> well, as I said, I have incorporated a very simple weighted failure rate
>> into the strategy, and I have found quite encouraging results. The recall
>> looks better than earlier tests. I am attaching two charts with data
>> compiled from runs with 3000 training rounds and 2000 simulation rounds
>> (5000 test runs analyzed in total):
>>     - The recall by running set size (as shown, it reaches 80% with 300
>>     tests)
>>     - The index of failure in the priority queue (running set: 500,
>>     training 3000, simulation 2000)
>> It is interesting to look at chart number 2:
>> The first 10 or so places have a very high count of found failures. These
>> most likely come from repeated failures (tests that failed in the previous
>> run and were caught in the next one). The next ones have a skew to the
>> right, and these come from the file-change model.
>> I am glad of these new results : ). I have a couple new ideas to try to
>> push the recall a bit further up, but I wanted to show the progress first.
>> Also, I will do a thorough code review before any new changes, to make
>> sure that the results are valid. Interestingly enough, in this new
>> strategy the code is simpler.
>> Also, I will run a test over a longer-term period (20,000 training,
>> 20,000 simulation), to see if the recall degrades as time passes and we
>> miss more failures.
>> Regards!
>> Pablo
>> On Fri, Jun 27, 2014 at 4:48 PM, Pablo Estrada <polecito.em@xxxxxxxxx>
>> wrote:
>>> Hello everyone,
>>> I spent the last couple of days working on a new strategy to calculate the
>>> relevance of a test. The results are not sufficient by themselves, but I
>>> believe they point to an interesting direction. This strategy uses the
>>> rate of co-occurrence of events to estimate the relevance of a test, and
>>> the events that it uses are the following:
>>>     - File edits since last run
>>>     - Test failure in last run
>>> The strategy also has two stages:
>>>     1. Training stage
>>>     2. Executing stage
>>> In the training stage, it goes through the available data, and does the
>>> following:
>>>     - If test A failed:
>>>         - It counts and stores all the files that were edited since the
>>>         last test_run (the last test_run depends on BRANCH, PLATFORM,
>>>         and other factors)
>>>         - If test A also failed in the previous test run, it also counts
>>>         that
>>> In the executing stage, the training algorithm is still applied, but the
>>> decision of whether a test runs is based on its relevance. The relevance
>>> is calculated as the sum of the following:
>>>     - The percentage of times a test has failed in two consecutive
>>>     test_runs, multiplied by whether the test failed in the previous run
>>>     (if the test didn't fail in the previous run, this quantity is 0)
>>>     - For each file that was edited since the last test_run, the
>>>     percentage of times that the test has failed after this file was
>>>     edited
>>> (The explanation is a bit clumsy; I can clear it up if you wish.)
>>> The results have not been too bad, nor too good. With a running set of
>>> 200 tests, a training phase of 3000 test runs, and an executing stage of
>>> 2000 test runs, I have achieved recall of 0.50. It's not too great, nor
>>> too bad.
>>> Nonetheless, while running tests, I found something interesting:
>>>     - I removed the first factor of the relevance. I decided to not care
>>>     about whether a test failed in the previous test run. I was only
>>>     using the file-change factor. Naturally, the recall decreased, from
>>>     0.50 to 0.39 (the decrease was not too big)... and the distribution
>>>     of failed tests in the priority queue had a good skew towards the
>>>     front of the queue (so it seems that the files help somewhat to
>>>     indicate the likelihood of a failure). I attached this chart.
>>> An interesting problem that I encountered was that about 50% of the
>>> test_runs don't have any file changes nor test failures, and so the
>>> relevance of all tests is zero. Here is where the original strategy (a
>>> weighted average of failures) could be useful, so that even if we don't
>>> have any information to guess which tests to run, we just go ahead and
>>> run the ones that have failed the most in recent runs.
>>> I will work on mixing up both strategies a bit in the next few days, and
>>> see what comes of that.
>>> By the way, I pushed the code to github. The code is completely
>>> different, so it may be better to wait until I have new results, which
>>> should be soon.
>>> Regards!
>>> Pablo
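
The relevance calculation described in the quoted June 27 message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that was pushed to github: the class, method, test, and file names are hypothetical, and the denominators are a guess (the repeat-failure factor is divided by the number of failures of the test, and the file factor by the number of test_runs in which the file was edited), since the message does not spell them out.

```python
from collections import defaultdict


class RelevanceModel:
    """Co-occurrence counts between file edits, previous failures and failures."""

    def __init__(self):
        self.fail_count = defaultdict(int)       # test -> failures seen
        self.repeat_count = defaultdict(int)     # test -> failures right after a failure
        self.edit_count = defaultdict(int)       # file -> test_runs that included an edit
        self.fail_after_edit = defaultdict(int)  # (test, file) -> failures after an edit

    def train(self, edited_files, failed_tests, prev_failed_tests):
        """Update the counts with the outcome of one test_run."""
        for f in edited_files:
            self.edit_count[f] += 1
        for t in failed_tests:
            self.fail_count[t] += 1
            if t in prev_failed_tests:
                self.repeat_count[t] += 1
            for f in edited_files:
                self.fail_after_edit[(t, f)] += 1

    def relevance(self, test, edited_files, prev_failed_tests):
        """Sum of the repeat-failure factor and the per-file factors."""
        score = 0.0
        fails = self.fail_count.get(test, 0)
        # Assumed denominator: all failures of this test seen so far.
        if test in prev_failed_tests and fails:
            score += self.repeat_count.get(test, 0) / fails
        for f in edited_files:
            edits = self.edit_count.get(f, 0)
            # Assumed denominator: test_runs in which this file was edited.
            if edits:
                score += self.fail_after_edit.get((test, f), 0) / edits
        return score


def running_set(model, tests, edited_files, prev_failed_tests, size):
    """Return the `size` most relevant tests (ties keep their input order)."""
    ranked = sorted(
        tests,
        key=lambda t: model.relevance(t, edited_files, prev_failed_tests),
        reverse=True,
    )
    return ranked[:size]


# Made-up history of three test_runs that all edited the same file.
model = RelevanceModel()
model.train({"file_x.cc"}, {"test_a"}, set())
model.train({"file_x.cc"}, {"test_a"}, {"test_a"})
model.train({"file_x.cc"}, set(), {"test_a"})

print(model.relevance("test_a", {"file_x.cc"}, {"test_a"}))
print(running_set(model, ["test_b", "test_a"], {"file_x.cc"}, set(), size=1))
```

When a test_run has no file changes and no previous failures, every test scores zero in this sketch, which is the situation the message describes as affecting about 50% of the test_runs.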

Attachment: 3k2k_strat_comparison.png
Description: PNG image

Attachment: 20k20k_strat_comparison.png
Description: PNG image
