maria-developers team mailing list archive
Re: [GSoC] Optimize mysql-test-runs - Setback
Just one comment:
On Jun 16, Elena Stepanova wrote:
> 4. Failed tests vs executed tests
> Further, as I understand you only calculate the metrics for tests which
> were either edited, or failed at least once; and thus, only such tests
> can ever make it to a corresponding queue. Not only does it create a
> bubble, but it also makes the comparison of modes faulty, and the whole
> simulation less efficient.
About the bubble. Why is it bad? Because it decreases the recall - there
are test failures (namely, outside of the bubble) that we'll never see.
But because the whole purpose of this task is to optimize for a *high
recall* in a short testing time, everything that makes recall worse
needs to be analyzed.
I mean, this is important - the bubble isn't bad in itself, it's only
bad because it reduces the recall. If no strategy for breaking this
bubble improves the recall - we shouldn't break it at all!
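Just to make sure we're talking about the same metric - recall here is
the fraction of actual failures that the chosen running set catches. A
minimal sketch (function and variable names are mine, not from Pablo's
actual code):

```python
def recall(running_set, failed_tests):
    """Fraction of actual failures caught by the running set."""
    failed = set(failed_tests)
    if not failed:
        return 1.0  # nothing failed, nothing to miss
    caught = failed & set(running_set)
    return len(caught) / len(failed)

# A run where 3 tests failed and the running set caught 2 of them:
print(recall(["t1", "t2", "t5"], ["t2", "t5", "t9"]))  # 2/3
```

Failures outside the bubble simply never end up in `running_set`, which
is exactly how the bubble drags this number down.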
On the other hand, perhaps you, Elena, think that missing a new test
failure in a test file that wasn't touched by a particular revision is
worse than missing some other test failure - because that's a
regression and so on. If this is the case, Pablo needs to change the
fitness function he's optimizing.
Recall assigns equal weight to all test failures: missing one failure
is equally bad, whatever the test. If some failures are worse than
others, a different fitness function, uhm, let's call it "weighted
recall", could be used to adequately map your expectations into the model.
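Such a "weighted recall" is a one-line generalization of the sketch
above - each missed failure costs its weight instead of costing 1. The
weighting scheme below is purely illustrative:

```python
def weighted_recall(running_set, failed_tests, weight):
    """Like recall, but each failure contributes its weight.
    `weight` maps a test name to its importance (e.g. higher
    for failures in files the revision didn't touch)."""
    running = set(running_set)
    total = sum(weight(t) for t in failed_tests)
    if total == 0:
        return 1.0  # nothing failed, nothing to miss
    caught = sum(weight(t) for t in failed_tests if t in running)
    return caught / total

# Hypothetical scheme: missing an "outside the bubble" failure counts double.
w = lambda t: 2.0 if t.startswith("outside") else 1.0
print(weighted_recall(["inside1"], ["inside1", "outside1"], w))  # 1/3
```

With all weights equal to 1 this reduces to plain recall, so the two
fitness functions stay directly comparable.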
Again - if you think that optimizing the model doesn't do what we want
it to do, the way to fix it is not to add artificial heuristics and
rules into it, but to modify the model.
> It means that even though you set the running set to 500, in fact
> you'll only run 20 tests at most. It's not desirable -- if we say we
> can afford running 500 tests, we'd rather run 500 than 20, even if
> some of them never failed before. This will also help us break the bubble
Same as above. The model optimizes the recall as a function of test time
(ideally, that is). If it shows that running 20 tests produces the same
recall as running 500 tests - it should run 20 tests. Indeed, why should
it run more if it doesn't improve the recall?
Although I expect that running 500 tests *will* improve the recall, of
course, even if only marginally.
Anyway, my whole point is - let's stay within the model and improve the
fitness function (which is recall at the moment). It's the only way to
see quantitatively what every strategy gives and whether it should be used.
That said, Pablo should try to do something about the bubble, I suppose.
E.g. run more tests and randomize the tail? And see whether it helps to
improve the recall.
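For "randomize the tail" I'd imagine something like this (a sketch of
the idea only, not Pablo's code): take the prioritized tests first,
then pad the rest of the running set with a random sample of everything
else, so tests outside the bubble occasionally get a chance to run:

```python
import random

def build_running_set(priority_queue, all_tests, size, rng=random):
    """Take as many high-priority tests as fit, then pad the
    remainder with a random sample of the other tests."""
    chosen = list(priority_queue[:size])
    rest = [t for t in all_tests if t not in set(chosen)]
    pad = size - len(chosen)
    if pad > 0:
        chosen += rng.sample(rest, min(pad, len(rest)))
    return chosen

queue = ["t3", "t7"]                     # e.g. recently failed or edited
universe = [f"t{i}" for i in range(20)]  # all available tests
print(build_running_set(queue, universe, 5))  # t3, t7 plus 3 random tests
```

Whether the random padding actually pays off is then just a matter of
comparing the measured recall with and without it.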