Re: [GSoC] Accepted student ready to work : )
Hi, Pablo!
On May 21, Pablo Estrada wrote:
> Hello Sergei and all,
> First of all, I'll explain quickly the terms that I was using:
>
> - *test_suite, test suite, test case* - When I say test suite or test
> case, I am referring to a single test file. For instance '
> *pbxt.group_min_max*'. They are the ones that fail, and whose failures
> we want to attempt to predict.
may I suggest distinguishing between a test *suite* and a test *case*?
the latter is usually one test file, but a suite (for mtr) is a
directory with many test files. Like "main", "pbxt", etc.
> - *test_run, test run* - When I use this term, I refer to an entry in
> the *test_run* table of the database. A test run is a set of
> *test_suites* that run together at a certain time.
>
> I now have a basic script in place to run the simulations. I have tried to
> keep the code clear, and I will upload a repository to github soon.
> I have already run simulations on the data. The simulations used 2000
> test_runs as training data, and then attempted to predict behavior on the
> following 3000 test_runs. Of course, a wider spectrum of data might
> be needed to truly assess the algorithm.
>
> I used four different ways to calculate a 'relevancy index' for a test:
>
> 1. Keep a relevancy index by test case
> 2. Keep a relevancy index by test case by platform
> 3. Keep a relevancy index by test case by branch
> 4. Keep a relevancy index by test case by branch by platform (mixed)
>
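For illustration, a minimal sketch of how a relevancy index could be kept at each of those four granularities. The decay constant and data layout here are assumptions, not the actual simulation code:

from collections import defaultdict

DECAY = 0.95  # assumed exponential-decay factor; the real value is a tuning choice

# One relevancy table per keying scheme:
#   1. test case              2. test case + platform
#   3. test case + branch     4. test case + branch + platform (mixed)
relevancy = {
    "case": defaultdict(float),
    "platform": defaultdict(float),
    "branch": defaultdict(float),
    "mixed": defaultdict(float),
}

def update(test, platform, branch, failed):
    """Decay each matching index, then bump it when the test fails."""
    keys = {
        "case": test,
        "platform": (test, platform),
        "branch": (test, branch),
        "mixed": (test, branch, platform),
    }
    for scheme, key in keys.items():
        table = relevancy[scheme]
        table[key] = table[key] * DECAY + (1.0 if failed else 0.0)

With tables like these, the tests with the highest index for the branch/platform of an incoming test_run would be the ones selected to run.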
> I graphed the results. The graph is attached. As can be seen from the
> graph, the platform and the mixed model proved to be the best for recall.
> I feel the results were quite similar to what Sergei encountered.
Right.
> I have not run the tests on a larger set of data (the data dump that I have
> available contains 200,000 test_runs, so in theory I could test the
> algorithm with all this data)... I want to consider a couple of
> things before moving on to larger-scale testing:
>
> I feel that there is a bit of a potential fallacy in the model that I'm
> following. Here's why:
> The problem that I find in the model is that we don't know a priori when a
> test will fail for the first time. Strictly speaking, in the model, until a
> test fails for the first time, it never starts running at all. In
> the implementation that I made, I am using the first failure of each test
> to start giving it a relevancy index (so the test would have to fail before
> it even qualifies to run).
> This results in a really high recall rate because it is natural that if a
> test fails once, it might fail again pretty soon after. So although we
> missed the first failure, we still consider that we didn't miss it, and
> based on it we catch the two or three failures that come right after.
> This inflates the recall rate for 'subsequent' failures, but it is not very
> helpful when trying to catch failures that are not part of a trend... I
> feel this is not realistic.
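A toy illustration of that effect (the failure history below is made up): if the first observed failure only "activates" the test and is never counted as missed, the measured recall looks perfect even though one failure was in fact never caught.

# Hypothetical failure history of one test over consecutive test_runs
# (True = failed, False = passed):
history = [True, True, True, False, True]

caught = 0
counted = 0        # failures the current model counts against recall
tracked = False    # becomes True once the first failure has been seen
for failed in history:
    if not tracked:
        if failed:
            tracked = True  # first failure starts the relevancy index,
                            # but is neither caught nor counted as missed
        continue
    if failed:
        counted += 1
        caught += 1         # the test is 'relevant' now, so it gets caught

print(caught / counted)       # 1.0  -- the inflated recall
print(caught / sum(history))  # 0.75 -- recall if the first failure counts as missed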
>
> Here are the changes that I'd like to incorporate into the model:
>
>    1. The failure rate should stay, and should still be measured with
>    exponential decay or a weighted average (see the sketch after this list)
> 2. Include a new measure that increases relevancy: Time since last run.
> The relevancy index should have a component that makes the test more
> relevant the longer it spends not running
> 1. A problem with this is that *test suites* that might have stopped
> being used will stay and compete for resources, although in reality they
> would not be relevant anymore
> 3. Include also correlation. I still don't have a great idea of how
> correlation will be considered, but it's something like this:
> 1. The data contains the list of test_runs where each test_suite has
> failed. If two test suites have failed together a certain percentage of
>    times (>30%?), then when test A fails, the relevancy index of test B also
>    goes up... and when test A runs without failing, the relevancy index of
>    test B goes down too.
> 2. Using only the times that tests fail together seems like a good
> heuristic, without having to calculate the total correlation of all the
> history of all the combinations of tests.
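A rough sketch of how the three components above might combine into one relevancy index. All the constants (decay factor, idle weight, the 30% threshold, the size of the correlation bump) and function names are illustrative assumptions, not a finished design:

from collections import defaultdict
from itertools import combinations

DECAY = 0.95          # assumed decay factor for the failure-rate component
IDLE_WEIGHT = 0.01    # assumed weight for "time since last run"
CORR_DELTA = 0.5      # assumed bump/penalty propagated along co-failure links
CORR_THRESHOLD = 0.3  # tests failing together in >30% of their failures are linked

failure_rate = defaultdict(float)  # component 1: exponentially decayed failure rate
last_run = defaultdict(int)        # run counter of the last time each test ran
failures = defaultdict(int)        # test -> total failures seen so far
co_failures = defaultdict(int)     # (test_a, test_b) -> times they failed together

def relevancy(test, current_run):
    """Component 1 plus component 2: a boost for time spent not running."""
    return failure_rate[test] + IDLE_WEIGHT * (current_run - last_run[test])

def correlated_with(test):
    """Component 3: tests that failed together with `test` often enough."""
    linked = []
    for (a, b), together in co_failures.items():
        if test in (a, b) and together / max(failures[test], 1) > CORR_THRESHOLD:
            linked.append(b if test == a else a)
    return linked

def record_run(results, current_run):
    """results: dict mapping test name -> True if it failed in this test_run."""
    failed_tests = sorted(t for t, failed in results.items() if failed)
    for a, b in combinations(failed_tests, 2):
        co_failures[(a, b)] += 1
    for test, failed in results.items():
        failure_rate[test] = failure_rate[test] * DECAY + (1.0 if failed else 0.0)
        last_run[test] = current_run
        if failed:
            failures[test] += 1
        # propagate the result to correlated tests: up on a failure, down on a pass
        for other in correlated_with(test):
            failure_rate[other] += CORR_DELTA if failed else -CORR_DELTA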
>
> If these measures were to be incorporated, a couple of changes would also
> have to be considered:
>
>    1. Failures that are *not spotted* on a *test_run* might be *able to be
>    spotted* on the *next* two or three or *N* test_runs? What do you think?
> 2. Considering these measures, probably *recall* will be *negatively
> affected*, but I feel that the model would be *more realistic*.
I don't think you should introduce artificial limitations that make the
recall worse, because they "look realistic".
You can make it realistic instead of just looking realistic - simply pretend
that your code is already running on buildbot and limits the number of
tests to run. So, if a test didn't run, you don't have any failure
information about it.
And then you only need to do what improves recall, nothing else :)
(of course, to calculate the recall you need to use all failures,
even for tests that you didn't run)
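In other words, something along these lines (a sketch with assumed helper names; select_tests and update stand in for whatever strategy is being evaluated):

def simulate(test_runs, budget, select_tests, update):
    """Replay historical test_runs as if the strategy were live on buildbot.

    test_runs    -- iterable of dicts mapping test name -> True if it failed
    budget       -- how many tests the strategy may run per test_run
    select_tests -- hypothetical: picks `budget` tests out of the candidates
    update       -- hypothetical: feeds one observed result back to the strategy
    """
    caught = total_failures = 0
    for results in test_runs:
        chosen = set(select_tests(results.keys(), budget))
        for test, failed in results.items():
            if failed:
                total_failures += 1     # every real failure counts toward recall
                if test in chosen:
                    caught += 1         # but only a chosen test can catch it
            if test in chosen:
                update(test, failed)    # tests that didn't run give no information
    return caught / total_failures if total_failures else 1.0

The recall this returns then reflects exactly what buildbot would have caught under the given budget.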
> Any input on my new suggestions? If all seems okay, I will proceed on to
> try to implement these.
> Also, I will soon upload the information so far to github. Can I also
> upload queries made to the database? Or are these private?
You mean the data tables? I think they're all public, they don't have
anything one couldn't get from http://buildbot.askmonty.org/
Regards,
Sergei