maria-developers team mailing list archive

Re: [GSoC] Accepted student ready to work : )

 

Hi, Pablo!

On May 21, Pablo Estrada wrote:
> Hello Sergei and all,
> First of all, I'll explain quickly the terms that I was using:
> 
>    - *test_suite, test suite, test case* - When I say test suite or test
>    case, I am referring to a single test file, for instance
>    *pbxt.group_min_max*. These are the ones that fail, and whose failures
>    we want to attempt to predict.

May I suggest distinguishing between a test *suite* and a test *case*?
The latter is usually a single test file, while a suite (for mtr) is a
directory with many test files, like "main", "pbxt", etc.

>    - *test_run, test run* - When I use this term, I refer to an entry in
>    the *test_run* table of the database. A test run is a set of
>    *test_suites* that run together at a certain time.
> 
> I now have a basic script in place to run the simulations. I have tried to
> keep the code clear, and I will upload it to a GitHub repository soon.
> I have already run simulations on the data. The simulations used 2000
> test_runs as training data, and then attempted to predict behavior on the
> following 3000 test_runs. Of course, a wider spectrum of data might be
> needed to truly assess the algorithm.
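For illustration, recall over a train/predict split like the one described
above could be measured along these lines. This is only a minimal sketch,
not the actual script: each test_run is assumed to be just the set of test
cases that failed in it, and relevant_tests stands for whichever prediction
strategy is being evaluated.

    def recall_over_window(test_runs, relevant_tests, n_train=2000, n_eval=3000):
        """test_runs: chronologically ordered list of sets of failed tests."""
        history = list(test_runs[:n_train])                # training window
        caught = total = 0
        for failures in test_runs[n_train:n_train + n_eval]:
            predicted = relevant_tests(history)            # tests we would run
            caught += len(failures & predicted)            # failures we catch
            total += len(failures)                         # all actual failures
            history.append(failures)                       # keep learning
        return caught / total if total else 1.0            # recall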
> 
> I used four different ways to calculate a 'relevancy index' for a test:
> 
>    1. Keep a relevancy index by test case
>    2. Keep a relevancy index by test case by platform
>    3. Keep a relevancy index by test case by branch
>    4. Keep a relevancy index by test case by branch by platform (mixed)
> 
> I graphed the results; the graph is attached. As can be seen from the
> graph, the per-platform model and the mixed model proved to be the best
> for recall. I feel the results were quite similar to what Sergei
> encountered.

Right.
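For reference, here is a minimal sketch (an assumption about the
implementation, not the actual GSoC code) of how the four strategies quoted
above can share one structure: an exponentially decayed failure rate stored
per key, where the key is the test case alone or the test case combined
with platform and/or branch. The decay factor is made up.

    from collections import defaultdict

    DECAY = 0.95   # assumed decay factor: older failures weigh less

    class RelevancyIndex:
        """Decayed failure rate, keyed per strategy 1-4 above."""

        def __init__(self, use_platform=False, use_branch=False):
            self.use_platform = use_platform
            self.use_branch = use_branch
            self.index = defaultdict(float)

        def _key(self, test, platform, branch):
            return (test,
                    platform if self.use_platform else None,
                    branch if self.use_branch else None)

        def record(self, test, platform, branch, failed):
            k = self._key(test, platform, branch)
            # decay the old value, then add 1 if the test failed in this run
            self.index[k] = DECAY * self.index[k] + (1.0 if failed else 0.0)

        def relevancy(self, test, platform, branch):
            return self.index[self._key(test, platform, branch)]

Strategy 1 is RelevancyIndex(), strategy 2 is RelevancyIndex(use_platform=True),
strategy 3 is RelevancyIndex(use_branch=True), and the mixed strategy 4 sets both.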

> I have not run the tests on a larger set of data (the data dump that I have
> available contains 200,000 test_runs, so in theory I could test the
> algorithm with all of it)... but I want to consider a couple of things
> before moving on to larger-scale testing:
> 
> I feel that there is a potential fallacy in the model that I'm following.
> Here's why:
> The problem I find in the model is that we don't know a priori when a test
> will fail for the first time. Strictly speaking, in the model, a test that
> never fails never starts running at all. In my implementation, I use the
> first failure of each test to start giving it a relevancy index (so a test
> has to fail before it even qualifies to run).
> This results in a really high recall rate because it is natural that if a
> test fails once, it will likely fail again soon after. So although we might
> have missed the first failure, we still consider that we didn't miss it,
> and based on it we will catch the two or three failures that come right
> after. This inflates the recall rate for 'subsequent' failures, but it is
> not very helpful when trying to catch failures that are not part of a
> trend... I feel this is not realistic.
> 
> Here are the changes that I'd like to incorporate into the model (see the
> sketch after this list):
> 
>    1. The failure rate should stay, and should still be measured with
>    exponential decay or a weighted average.
>    2. Include a new measure that increases relevancy: time since last run.
>    The relevancy index should have a component that makes a test more
>    relevant the longer it goes without running.
>       1. A problem with this is that *test suites* that have stopped being
>       used will stay and compete for resources, although in reality they
>       would not be relevant anymore.
>    3. Also include correlation. I still don't have a great idea of how
>    correlation will be considered, but it's something like this:
>       1. The data contains the list of test_runs where each test_suite has
>       failed. If two test suites have failed together a certain percentage
>       of the time (>30%?), then when test A fails, the relevancy index of
>       test B also goes up... and when test A runs without failing, the
>       relevancy index of test B goes down too.
>       2. Using only the times that tests fail together seems like a good
>       heuristic, as it avoids calculating the total correlation over the
>       whole history of all combinations of tests.
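One way the three proposed components might fit together - purely a sketch
under assumed weights (DECAY, STALENESS_WEIGHT and CO_FAIL_BONUS are
invented; only the >30% co-failure threshold comes from the proposal above),
and only the upward correlation adjustment is shown:

    import itertools
    from collections import defaultdict

    DECAY = 0.95              # assumed decay for the failure-rate component
    STALENESS_WEIGHT = 0.01   # assumed weight for "time since last run"
    CO_FAIL_THRESHOLD = 0.3   # the ">30%" co-failure rate mentioned above
    CO_FAIL_BONUS = 0.5       # assumed boost when a correlated test fails

    fail_rate = defaultdict(float)    # component 1: decayed failure rate
    last_run = defaultdict(int)       # run number in which a test last ran
    co_failures = defaultdict(int)    # times a pair of tests failed together
    failure_count = defaultdict(int)  # total failures per test

    def update(run_no, tests_run, tests_failed):
        """Fold one test_run into the statistics."""
        for t in tests_run:
            failed = 1.0 if t in tests_failed else 0.0
            fail_rate[t] = DECAY * fail_rate[t] + failed
            last_run[t] = run_no
        for t in tests_failed:
            failure_count[t] += 1
        for a, b in itertools.combinations(sorted(tests_failed), 2):
            co_failures[(a, b)] += 1

    def relevancy(test, run_no, recently_failed):
        """Components 1 + 2 + 3: failure rate, staleness, co-failure boost."""
        score = fail_rate[test]
        score += STALENESS_WEIGHT * (run_no - last_run[test])   # component 2
        for other in recently_failed:                           # component 3
            pair = tuple(sorted((test, other)))
            if (failure_count[other] and
                    co_failures[pair] / failure_count[other] > CO_FAIL_THRESHOLD):
                score += CO_FAIL_BONUS
        return score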
> 
> If these measures were to be incorporated, a couple of changes would also
> have to be considered:
> 
>    1. Failures that are *not spotted* on a test_run might be able to be
>    spotted on the *next* two or three or *N* test_runs? What do you think?
>    2. Considering these measures, *recall* will probably be *negatively
>    affected*, but I feel the model would be *more realistic*.

I don't think you should introduce artificial limitations that make the
recall worse just because they "look realistic".

You can make it realistic instead of merely looking realistic - simply
pretend that your code is already running on buildbot and limiting the
number of tests to run. So, if a test didn't run, you don't have any
failure information about it.

And then you only need to do what improves recall, nothing else :)

(of course, to calculate the recall you need to use all failures,
even for tests that you didn't run)
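To make that concrete, a simulation loop in this spirit might look like the
following sketch (assumed interfaces, not buildbot code): only failures of
the tests actually chosen to run are fed back into the statistics, but
recall is still computed against every failure in the data.

    def simulate(test_runs, score, update, limit=500):
        """test_runs: list of sets of actually-failed tests (assumed shape).
        score(test) and update(test, failed) belong to whichever strategy is
        being evaluated; `limit` is the per-run budget of tests to run."""
        all_tests = set().union(*test_runs)
        caught = total = 0
        for failures in test_runs:
            # run only the `limit` most relevant tests, as buildbot would
            chosen = set(sorted(all_tests, key=score, reverse=True)[:limit])
            observed = failures & chosen       # the only failures we can see
            caught += len(observed)
            total += len(failures)             # recall counts *all* failures
            for t in chosen:
                update(t, t in observed)       # learn only from tests we ran
        return caught / total if total else 1.0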

> Any input on my new suggestions? If all seems okay, I will proceed to
> implement them.
> Also, I will soon upload my work so far to GitHub. Can I also upload the
> queries made to the database? Or are these private?

You mean the data tables? I think they're all public; they don't have
anything one couldn't get from http://buildbot.askmonty.org/

Regards,
Sergei


