maria-developers team mailing list archive
Re: [GSoC] Optimize mysql-test-runs - Setback
On 16.06.2014 10:57, Sergei Golubchik wrote:
Just one comment:
On Jun 16, Elena Stepanova wrote:
4. Failed tests vs executed tests
Further, as I understand you only calculate the metrics for tests which
were either edited, or failed at least once; and thus, only such tests
can ever make it into a corresponding queue. Not only does it create a
bubble, but it also makes the comparison of modes faulty, and the whole
simulation less efficient.
About the bubble. Why is it bad? Because it decreases the recall - there
are test failures (namely, outside of the bubble) that we'll never see.
But because the whole purpose of this task is to optimize for a *high
recall* in a short testing time, everything that makes recall worse
needs to be analyzed.
I mean, this is important - the bubble isn't bad in itself, it's only
bad because it reduces the recall. If no strategy for breaking this bubble
helps to improve the recall - we shouldn't break it at all!
Right, and I want to see a proof that it really does *not* improve
recall, because I think it should. Currently we think that our recall is
a function of a running set, and we say -- okay, after N=100 it flattens
and doesn't improve much further. But it might well be that it flattens
simply because the queue doesn't get filled -- of course there will be
no difference between N=100 and N=500 if the queue is less than 100 anyway.
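The effect described above can be reproduced with a small sketch. The
function name and the test data are hypothetical and not taken from the
actual simulation script; the only assumption is that the running set is
the first N entries of the queue.

```python
def recall_at_cutoff(queue, failed, n):
    """Fraction of failed tests caught when running the first n queue entries."""
    if not failed:
        return 1.0
    # The slice is never longer than the queue itself, so a large n
    # changes nothing once the queue is exhausted.
    running_set = queue[:n]
    caught = sum(1 for test in running_set if test in failed)
    return caught / len(failed)


# Hypothetical data: only 4 tests ever made it into the queue.
queue = ["t1", "t2", "t3", "t4"]
failed = {"t2", "t9"}  # t9 never failed before, so it is outside the bubble

recalls = {n: recall_at_cutoff(queue, failed, n) for n in (2, 100, 500)}
```

With a queue this short, N=100 and N=500 select exactly the same tests,
so the recall curve looks flat regardless of how useful a larger running
set would be.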
Then again, if recall is close to 100% either way, it might not be
a) I doubt it is. As Pablo said, the previous results were not accurate,
and from what I saw after we remove dependencies between simulation
runs, we should be somewhere below 50% with the mixed mode on N=500.
b) Unless I'm missing something, the bubble becomes critical if we add,
let's say, a new platform, because it does not allow choosing tests which
never failed on this platform, and the queue will be empty and the
platform won't be tested at all, at least until some tests get edited
(assuming we use the editing factor).
In any case, now the experiments provide results different from what we
think they do. If we want to compare the "full queue" effect with the
"non-full queue", let's make it another parameter.
On the other hand, perhaps you, Elena, think that missing a new test
failure in one of the test files that wasn't touched by a particular
revision - that missing such a failure is worse than missing some other
test failure? Because that's a regression and so on. If this is
the case, Pablo needs to change the fitness function he's optimizing.
Recall assigns equal weights to all test failures, missing one failure
is equally bad for all tests. If some failures are worse than others, a
different fitness function, uhm, let's call it "weighted recall", could
be used to adequately map your expectations into the model.
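As a rough illustration of the difference, a "weighted recall" could
look like the sketch below. The function and the weights are
hypothetical; nothing like this exists in the current model, which uses
plain recall.

```python
def weighted_recall(caught, failed, weights=None):
    """Recall where each failed test contributes its weight.

    With weights=None every failure weighs 1.0, which gives plain recall.
    """
    weights = weights or {}
    total = sum(weights.get(test, 1.0) for test in failed)
    if total == 0:
        return 1.0
    hit = sum(weights.get(test, 1.0) for test in failed if test in caught)
    return hit / total


# Hypothetical example: "b" is a first-time failure, weighted 3x.
failed = {"a", "b"}
caught = {"a"}
plain = weighted_recall(caught, failed)
weighted = weighted_recall(caught, failed, {"a": 1.0, "b": 3.0})
```

Missing the heavier failure costs more under the weighted function
(0.25) than under plain recall (0.5), which is the kind of preference
such a fitness function would encode.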
No, I wasn't thinking about it. I'm still staying within the same
model, where all failures have equal weights.
On the contrary, my notes regarding "first time failures" vs "sporadic
failures" were supposed to say that we don't need to do anything
specific about sporadic failures, if they are caught then they are
caught, if not then not. Sorry if it wasn't clear.
I do however think that abandoning a test forever because it hasn't
failed for a long time is the wrong thing to do, but tools for dealing
with it are already in the model -- time factor and editing factor, and
they have been there from the beginning, they just need to be tuned
(editing needs to be fixed, and possibly time coefficient to be changed
if the current value doesn't provide good results -- that's something to
Again - if you think that optimizing the model doesn't do what we want
it to do, the way to fix it is not to add artificial heuristics and
rules into it, but to modify the model.
It means that even though you set the running set to 500, in fact
you'll only run 20 tests at most. It's not desirable -- if we say we
can afford running 500 tests, we'd rather run 500 than 20, even if
some of them never failed before. This will also help us break the bubble
Same as above. The model optimizes the recall as a function of test time
(ideally, that is). If it shows that running 20 tests produces the same
recall as running 500 tests - it should run 20 tests. Indeed, why should
it run more if it doesn't improve the recall?
Same as above, I think it will improve the recall, and most likely even
substantially, and at the very least we need to see the difference so we
can make an informed decision about it.
Although I expect that running 500 tests *will* improve the recall, of
course, even if only marginally.
Anyway, my whole point is - let's stay within the model and improve the
fitness function (which is recall at the moment). It's the only way to
see quantitatively what every strategy gives and whether it should be used
It is still the same model.
The core of the model was to make recall a function of cutoff, right? So
let's try it first, let's make it a real cutoff and see the results.
Not filling the queue completely (or in some cases having it empty) is
an optimization over the initial model, which improves the execution time
(marginally) but affects recall (even if only marginally). It can be
considered, but the results should be compared to the basic ones.
And if, let's say, we decide that N=100 (or N=10%) is the best cutoff
value, and then find out that by not filling the queue completely we
lose even 1% in recall, we might want to stay with the full queue. What
is the time difference between running 50 tests and 100 tests? Almost
nothing, especially compared to what we spend on preparation of the
tests. So, if 100 tests vs 50 tests add 1% to recall, and also help to
solve the problem of never-running-tests, I'd say it's better to stay
within the initial model.
That said, Pablo should try to do something about the bubble, I suppose.
E.g. run more tests and randomize the tail? And see whether it helps to
improve the recall.
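One possible reading of "run more tests and randomize the tail" is
sketched below. The function name, its arguments, and the uniform random
choice are assumptions made for illustration, not what the simulation
currently does.

```python
import random


def build_running_set(ranked_queue, all_tests, n, rng=None):
    """Take up to n ranked tests, then pad to n with randomly chosen others.

    ranked_queue: tests ordered by the model's metric (assumed duplicate-free).
    all_tests:    every test that could be run on this platform.
    """
    rng = rng or random.Random()
    running = list(ranked_queue[:n])
    missing = n - len(running)
    if missing > 0:
        chosen = set(running)
        # Tests the model has no opinion about, i.e. outside the bubble.
        tail = [test for test in all_tests if test not in chosen]
        rng.shuffle(tail)
        running.extend(tail[:missing])
    return running


# Hypothetical usage: only 2 tests are ranked, but we can afford 5.
all_tests = [f"t{i}" for i in range(10)]
running_set = build_running_set(["t3", "t7"], all_tests, 5, random.Random(0))
```

The ranked tests keep their place at the head of the running set, and
the remaining slots give never-failed tests a chance to run, so the
recall with and without the random tail can be compared directly.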