maria-developers team mailing list archive


Re: [GSoC] Optimize mysql-test-runs - Setback

To: Sergei Golubchik <serg@xxxxxxxxxxx>
From: Elena Stepanova <elenst@xxxxxxxxxxxxxxxx>
Date: Mon, 16 Jun 2014 12:12:19 +0400
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <20140616065725.GA12337@meddwl.fritz.box>
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0

Hi Sergei,

On 16.06.2014 10:57, Sergei Golubchik wrote:

Hi, Elena!

Just one comment:

On Jun 16, Elena Stepanova wrote:

4. Failed tests vs executed tests

Further, as I understand you only calculate the metrics for tests which
were either edited, or failed at least once; and thus, only such tests
can ever make it to a corresponding queue. Not only does it create a
bubble, but it also makes the comparison of modes faulty, and the whole
simulation less efficient.


About the bubble. Why is it bad? Because it decreases the recall - there
are test failures (namely, outside of the bubble) that we'll never see.

But because the whole purpose of this task is to optimize for a *high
recall* in a short testing time, everything that makes recall worse
needs to be analyzed.

I mean, this is important - the bubble isn't bad in itself, it's only
bad because it reduces the recall. If no strategy to break this bubble
will help to improve the recall - we shouldn't break it at all!

Right, and I want to see a proof that it really does *not* improve
recall, because I think it should. Currently we think that our recall is
a function of a running set, and we say -- okay, after N=100 it flattens
and doesn't improve much further. But it might well be that it flattens
simply because the queue doesn't get filled -- of course there will be
no difference between N=100 and N=500 if the queue is less than 100
anyway.
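
A minimal sketch of the effect described above, under stated
assumptions: the function name, the test names and the data are
hypothetical and are not taken from Pablo's simulation. It only shows
that when the queue holds fewer tests than the running set size, recall
cannot differ between N=100 and N=500.

```python
def recall(queue, failed, n):
    """Fraction of failed tests caught by running the first n queued tests."""
    if not failed:
        return 1.0
    # Slicing stops at len(queue), so a short queue caps the running set.
    running_set = set(queue[:n])
    return len(running_set & failed) / len(failed)


# Hypothetical data: only four tests have a metric, so the queue has four entries.
queue = ["t1", "t2", "t3", "t4"]
# "t9" never failed before and was never edited, so it is outside the queue.
failed = {"t2", "t9"}

recall_at_100 = recall(queue, failed, 100)
recall_at_500 = recall(queue, failed, 500)
print(recall_at_100, recall_at_500)
```

Both values are equal here because the same four tests are run in both
cases, not because a larger running set would not help.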

Then again, if recall is close to 100% either way, it might not be
important, but

a) I doubt it is. As Pablo said, the previous results were not accurate,
and from what I saw, after we remove dependencies between simulation
runs, we should be somewhere below 50% with the mixed mode on N=500.

b) Unless I'm missing something, the bubble becomes critical if we add,
let's say, a new platform, because it does not allow us to choose tests
which never failed on this platform, so the queue will be empty and the
platform won't be tested at all, at least until some tests get edited
(assuming we use the editing factor).

In any case, the experiments currently provide results different from
what we think they do. If we want to compare the "full queue" effect
with the "non-full queue", let's make it another parameter.


On the other hand, perhaps you, Elena, think that missing a new test
failure in one of the test files that wasn't touched by a particular
revision - that missing such a failure is worse than missing some other
test failure? Because that's a regression and so on. If this is
the case, Pablo needs to change the fitness function he's optimizing.
Recall assigns equal weights to all test failures, missing one failure
is equally bad for all tests. If some failures are worse than others, a
different fitness function, uhm, let's call it "weighted recall", could
be used to adequately map your expectations into the model.
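
As an illustration only, a "weighted recall" of this kind could be
sketched as below. The function name, the weight values and the test
names are assumptions made for the example; the thread does not define
them. With all weights equal to 1 it reduces to plain recall.

```python
def weighted_recall(caught, failed, weight=None):
    """Weighted share of failures that were caught.

    weight maps a test name to its importance; missing tests count as 1.0.
    """
    weight = weight or {}
    total = sum(weight.get(t, 1.0) for t in failed)
    if total == 0:
        return 1.0
    hit = sum(weight.get(t, 1.0) for t in failed if t in caught)
    return hit / total


# Hypothetical example: "b" is a failure in a test file the revision did
# not touch, and we (arbitrarily) decide it is three times as bad to miss.
failed = {"a", "b"}
caught = {"a"}

plain = weighted_recall(caught, failed)
weighted = weighted_recall(caught, failed, {"b": 3.0})
print(plain, weighted)
```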

No, I wasn't thinking about it. I'm still staying within the same
model, where all failures have equal weights.

On the contrary, my notes regarding "first time failures" vs "sporadic
failures" were supposed to say that we don't need to do anything
specific about sporadic failures, if they are caught then they are
caught, if not then not. Sorry if it wasn't clear.

I do however think that abandoning a test forever because it hasn't
failed for a long time is the wrong thing to do, but tools for dealing
with it are already in the model -- the time factor and the editing
factor. They have been there from the beginning, they just need to be
tuned (editing needs to be fixed, and possibly the time coefficient
changed if the current value doesn't provide good results -- that's
something to experiment with).


Again - if you think that optimizing the model doesn't do what we want
it to do, the way to fix it is not to add artificial heuristics and
rules into it, but to modify the model.

It means that even though you set the running set to 500, in fact
you'll only run 20 tests at most. It's not desirable -- if we say we
can afford running 500 tests, we'd rather run 500 than 20, even if
some of them never failed before. This will also help us break the bubble.


Same as above. The model optimizes the recall as a function of test time
(ideally, that is). If it shows that running 20 tests produces the same
recall as running 500 tests - it should run 20 tests. Indeed, why should
it run more if it doesn't improve the recall?

Same as above, I think it will improve the recall, most likely even
substantially, and at the very least we need to see the difference so we
can make an informed decision about it.


Although I expect that running 500 tests *will* improve the recall, of
course, even if only marginally.

Anyway, my whole point is - let's stay within the model and improve the
fitness function (which is recall at the moment). It's the only way to
see quantitatively what every strategy gives and whether it should be used
at all.


It is still the same model.

The core of the model was to make recall a function of cutoff, right? So
let's try it first, let's make it a real cutoff and see the results.
Not filling the queue completely (or in some cases having it empty) is
an optimization over the initial model, which improves the execution
time (marginally) but affects recall (even if only marginally). It can
be considered, but the results should be compared to the basic ones.

And if, let's say, we decide that N=100 (or N=10%) is the best cutoff
value, and then find out that by not filling the queue completely we
lose even 1% in recall, we might want to stay with the full queue. What
is the time difference between running 50 tests and 100 tests? Almost
nothing, especially compared to what we spend on preparation of the
tests. So, if running 100 tests instead of 50 adds 1% to recall, and
also helps to solve the problem of never-running tests, I'd say it's
better to stay within the initial model.


Regards,
Elena


That said, Pablo should try to do something about the bubble, I suppose.
E.g. run more tests and randomize the tail? And see whether it helps to
improve the recall.
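
One possible reading of "run more tests and randomize the tail", as a
sketch only: keep the prioritized queue at the head of the running set
and fill the remaining slots up to N with randomly chosen tests that
have no metric yet. The function and parameter names are hypothetical,
and uniform sampling is an assumption, not something the thread
specifies.

```python
import random


def fill_running_set(queue, all_tests, n, rng=None):
    """Return up to n tests: the queue head first, then a random tail."""
    if rng is None:
        rng = random.Random()
    chosen = list(queue[:n])
    missing = n - len(chosen)
    if missing > 0:
        in_set = set(chosen)
        # Candidates are tests that never made it into the queue.
        tail = [t for t in all_tests if t not in in_set]
        chosen += rng.sample(tail, min(missing, len(tail)))
    return chosen


# Hypothetical data: ten known tests, only two of them have a metric.
all_tests = [f"t{i}" for i in range(10)]
queue = ["t3", "t7"]

running_set = fill_running_set(queue, all_tests, 5, random.Random(0))
print(running_set)
```

The simulation could then be run with and without the random tail, so
that the "full queue" effect becomes one more parameter to compare.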

Regards,
Sergei

Follow ups

Re: [GSoC] Optimize mysql-test-runs - Setback
From: Sergei Golubchik, 2014-06-16

References

[GSoC] Optimize mysql-test-runs - Setback
From: Pablo Estrada, 2014-06-12
Re: [GSoC] Optimize mysql-test-runs - Setback
From: Elena Stepanova, 2014-06-12
Re: [GSoC] Optimize mysql-test-runs - Setback
From: Pablo Estrada, 2014-06-13
Re: [GSoC] Optimize mysql-test-runs - Setback
From: Elena Stepanova, 2014-06-16
Re: [GSoC] Optimize mysql-test-runs - Setback
From: Sergei Golubchik, 2014-06-16