Hi Elena and all,

I guess I should admit that my excitement was a bit too much; but I'm also definitely not 'jumping' into this strategy. As I said, I am trying to use the lessons learned from all the experiments to make the best predictions.

That being said, a strong point of the new strategy is that rather than using past failure rate alone to predict future failures, it uses more data to make its predictions - and its predictions are more consistent. On the 3k-training/2k-prediction simulations its advantage is not so apparent (the strategies fare similarly, with the 'standard' strategy being the best one), but it becomes more evident with longer prediction periods. I ran tests with 20k training rounds and 20k prediction rounds, and the new strategy fared a lot better. I have attached charts comparing the two. We can observe that with a running set of 500, the original algorithm had a very nice recall of almost 95% in the shorter tests, but it falls to less than 50% with longer testing (and it must be a lot lower if we average only the last couple of thousand runs rather than all 20k simulation runs together).

Since the goal of the project is to provide consistent long-term test optimization, we want to take everything we can learn from the new strategy and improve the consistency of the recall over long-term simulations. Nevertheless, I agree that there are important lessons in the original strategy, particularly its >90% recall on shorter prediction periods. That's why I'm still tuning and testing.

Again, all advice and observations are welcome. Hope everyone is having a nice weekend.

Pablo

On Sun, Jun 29, 2014 at 12:53 AM, Elena Stepanova <elenst@xxxxxxxxxxxxxxxx> wrote:

> Hi Pablo,
>
> Could you please explain why you consider the new results to be better? I don't see any obvious improvement.
>
> As I understand from the defaults, previously you were running tests with 2000 training rounds and 3000 simulation rounds, and you already had ~70% on 300 runs and ~80% on 500 runs; see your email of June 19, no_options_simulation.jpg.
>
> Now you have switched the limits: you are running with 3000 training and 2000 simulation rounds. It makes a big difference. If you re-run the old algorithm with the new limits, you'll get +10% easily, so RS 300 will be around the same 80%, and RS 500 should be even higher, pushing 90%, while now you have barely 85%.
>
> Before jumping onto the new algorithm, please provide a comparison of the old and new approaches with equal preconditions and parameters.
>
> Thanks,
> Elena
>
>
> On 28.06.2014 6:44, Pablo Estrada wrote:
>
>> Hi all,
>> Well, as I said, I have incorporated a very simple weighted failure rate into the strategy, and I have found quite encouraging results. The recall looks better than in earlier tests. I am attaching two charts with data compiled from runs with 3000 training rounds and 2000 simulation rounds (5000 test runs analyzed in total):
>>
>> - The recall by running set size (as shown, it reaches 80% with 300 tests)
>> - The index of failure in the priority queue (running set 500, training 3000, simulation 2000)
>>
>> It is interesting to look at chart number 2: the first 10 or so places have a very high count of found failures. These most likely come from repeated failures (tests that failed in the previous run and were caught in the next one). The next ones have a skew to the right, and these come from the file-change model.
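>>
>> To be concrete about what these charts measure: for each simulated test_run I rank the tests by relevance into the priority queue, take the top N as the running set, and count a failure as caught when the failing test is inside that set; recall is then caught failures over all failures seen during the simulation. Roughly, in Python (an illustrative sketch with made-up names and interfaces, not the actual code):
>>
>>     def simulate(test_runs, model, running_set_size):
>>         """Measure recall for one relevance strategy.
>>
>>         `model` is any strategy object exposing relevance(test, run) and
>>         update(run); the interface and the run attributes are hypothetical.
>>         """
>>         caught = 0
>>         total = 0
>>         for run in test_runs:
>>             # Priority queue: every known test, highest relevance first.
>>             ranked = sorted(run.all_tests,
>>                             key=lambda t: model.relevance(t, run),
>>                             reverse=True)
>>             running_set = set(ranked[:running_set_size])
>>             for test in run.failed_tests:
>>                 total += 1
>>                 if test in running_set:
>>                     caught += 1
>>             # The training algorithm keeps being applied during simulation.
>>             model.update(run)
>>         return caught / total if total else 1.0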
>>
>> I am glad of these new results : ). I have a couple of new ideas to try to push the recall a bit further up, but I wanted to show the progress first. Also, I will do a thorough code review before any new changes, to make sure that the results are valid. Interestingly enough, the code of this new strategy is simpler.
>>
>> I will also run a longer-term test (20,000 training rounds, 20,000 simulation rounds) to see whether the recall degrades as time passes and we miss more failures.
>>
>> Regards!
>> Pablo
>>
>>
>> On Fri, Jun 27, 2014 at 4:48 PM, Pablo Estrada <polecito.em@xxxxxxxxx> wrote:
>>
>>> Hello everyone,
>>> I spent the last couple of days working on a new strategy to calculate the relevance of a test. The results are not sufficient by themselves, but I believe they point in an interesting direction. This strategy uses the rate of co-occurrence of events to estimate the relevance of a test, and the events that it uses are the following:
>>>
>>> - File edits since the last run
>>> - Test failure in the last run
>>>
>>> The strategy also has two stages:
>>>
>>> 1. Training stage
>>> 2. Executing stage
>>>
>>> In the training stage, it goes through the available data and does the following:
>>>
>>> - If test A failed:
>>>   - It counts and stores all the files that were edited since the last test_run (what counts as the last test_run depends on BRANCH, PLATFORM, and other factors)
>>>   - If test A also failed in the previous test_run, it counts that as well
>>>
>>> In the executing stage, the training algorithm is still applied, but the decision of whether a test runs is based on its relevance. The relevance is calculated as the sum of the following:
>>>
>>> - The percentage of times the test has failed in two subsequent test_runs, multiplied by whether the test failed in the previous run (if the test didn't fail in the previous run, this quantity is 0)
>>> - For each file that was edited since the last test_run, the percentage of times that the test has failed after this file was edited
>>>
>>> (The explanation is a bit clumsy; I can clear it up if you wish.)
>>>
>>> The results have been neither too good nor too bad: with a running set of 200 tests, a training phase of 3000 test runs, and an executing stage of 2000 test runs, I achieved a recall of 0.50.
>>>
>>> Nonetheless, while running tests, I found something interesting:
>>>
>>> - I removed the first factor of the relevance, ignoring whether a test failed in the previous test run and using only the file-change factor. Naturally, the recall decreased, from 0.50 to 0.39 (the decrease was not too big)... and the distribution of failed tests in the priority queue had a good skew towards the front of the queue, so it seems that the files do help somewhat to indicate the likelihood of a failure. I attached this chart.
>>>
>>> An interesting problem that I encountered is that about 50% of the test_runs have neither file changes nor test failures, so the relevance of all tests is zero. Here is where the original strategy (a weighted average of failures) could be useful: even if we have no information to guess which tests to run, we just go ahead and run the ones that have failed the most in recent runs.
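>>>
>>> In code, the relevance calculation is roughly the following (an illustrative Python sketch: the names, the data layout, and the exact denominators of the two percentages are my simplifications, not the real implementation):
>>>
>>>     from collections import defaultdict
>>>
>>>     class CoOccurrenceRelevance:
>>>         def __init__(self):
>>>             self.fail_count = defaultdict(int)    # test -> times it failed
>>>             self.repeat_count = defaultdict(int)  # test -> failures right after a failure
>>>             self.edit_count = defaultdict(int)    # file -> times it was edited
>>>             self.cofail_count = defaultdict(int)  # (test, file) -> failures after an edit
>>>
>>>         def train(self, failed_tests, failed_in_previous_run, edited_files):
>>>             # Process one test_run's worth of training data.
>>>             for f in edited_files:
>>>                 self.edit_count[f] += 1
>>>             for test in failed_tests:
>>>                 self.fail_count[test] += 1
>>>                 if test in failed_in_previous_run:
>>>                     self.repeat_count[test] += 1
>>>                 for f in edited_files:
>>>                     self.cofail_count[(test, f)] += 1
>>>
>>>         def relevance(self, test, failed_in_previous_run, edited_files):
>>>             score = 0.0
>>>             if test in failed_in_previous_run and self.fail_count[test]:
>>>                 # How often a failure of this test was a repeated failure.
>>>                 score += self.repeat_count[test] / self.fail_count[test]
>>>             for f in edited_files:
>>>                 if self.edit_count[f]:
>>>                     # How often this test failed when this file had been edited.
>>>                     score += self.cofail_count[(test, f)] / self.edit_count[f]
>>>             return score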
>>>
>>> I will work on mixing the two strategies a bit over the next few days, and see what comes of that.
>>>
>>> By the way, I pushed the code to GitHub. The code is completely different, so it may be better to wait until I have new results before reviewing it.
>>>
>>> Regards!
>>> Pablo
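>>>
>>> P.S. For concreteness, one simple way the two strategies could be mixed (purely an illustrative sketch; the exponential decay, the names, and the history interface are assumptions, not the code I pushed):
>>>
>>>     def weighted_failure_rate(past_outcomes, decay=0.9):
>>>         # past_outcomes: one boolean per historical test_run, oldest first;
>>>         # True means the test failed. Recent runs weigh more than old ones.
>>>         score, total, weight = 0.0, 0.0, 1.0
>>>         for failed in reversed(past_outcomes):
>>>             score += weight * failed
>>>             total += weight
>>>             weight *= decay
>>>         return score / total if total else 0.0
>>>
>>>     def combined_relevance(model, test, failed_in_previous_run,
>>>                            edited_files, outcome_history):
>>>         # Use the co-occurrence relevance when there is some signal; otherwise
>>>         # fall back to the recency-weighted failure rate, so that a test_run
>>>         # with no file changes and no recent failures still runs the tests
>>>         # that have failed the most recently.
>>>         score = model.relevance(test, failed_in_previous_run, edited_files)
>>>         return score if score > 0 else weighted_failure_rate(outcome_history)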
Attachment:
3k2k_strat_comparison.png
Description: PNG image
Attachment:
20k20k_strat_comparison.png
Description: PNG image