maria-developers team mailing list archive

Thread
Date

Re: [GSoC] Accepted student ready to work : )

To: Pablo Estrada <polecito.em@xxxxxxxxx>, Sergei Golubchik <serg@xxxxxxxxxxx>
From: Elena Stepanova <elenst@xxxxxxxxxxxxxxxx>
Date: Thu, 22 May 2014 11:12:44 +0400
Cc: maria-developers@xxxxxxxxxxxxxxxxxxx
In-reply-to: <CABDWuanxPxsHxVuaZm=TT_gx6MqYENXyQN72a42teL7tXqCxkQ@mail.gmail.com>
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Thunderbird/24.4.0

Hi Pablo,

Thanks for the update, it looks like a great start.

Please see some comments inline, and I suppose Sergei will provide hisown if he has something to add or correct. Most controversial points ofmine I marked as "ATTN Sergei:".



On 21.05.2014 8:06, Pablo Estrada wrote:

Hello Sergei and all,
First of all, I'll explain quickly the terms that I was using:

    - *test_suite, test suite, test case* - When I say test suite or test
    case, I am referring to a single test file. For instance '
    *pbxt.group_min_max*'. They are the ones that fail, and whose failures
    we want to attempt to predict.
    - *test_run, test run* - When I use this term, I refer to an entry in
    the *test_run* table of the database. A test run is a set of
    *test_suites* that run together at a certain time.

Since your algorithm is expected to be used with MTR-based tests, Idon't think it's a good idea to redefine concepts that have veryparticular meaning in MTR universe, it's going to be very confusing forpeople who use it later.

Also, these concepts in MTR are close enough to the formal standarddefinitions, which is yet another reason to stick with them.

Test suite in MTR is a set of test cases under the same directory. Inyour example, pbxt is a test suite. (Also, all MTR test cases arecollectively called "MTR test suite", but it's irrelevant to yourpurposes).

Test case in MTR is what is represented by individual X.test/X.resultfiles and, optionally, some other related files. In your examplegroup_min_max is a test case from pbxt test suite.


I suggest to stay with the terminology, for clarity.


I have in place now a basic script to do the simulations. I have tried to
keep the code clear, and I will upload a repository to github soon.
I have already run simulations on the data. The simulations used 2000
test_runs as training data, and then attempted to predict behavior on the
following 3000 test_runs. Of course, maybe a wider spectrum of data might
be needed to truly asses the algorithm.

I used four different ways to calculate a 'relevancy index' for a test:

    1. Keep a relevancy index by test case
    2. Keep a relevancy index by test case by platform
    3. Keep a relevancy index by test case by branch
    4. Keep a relevancy index by test case by branch by platform (mixed)

I graphed the results. The graph is attached. As can be seen from the
graph, the platform and the mixed model proved to be the best for recall.
I feel the results were quite similar to what Sergei encountered.


The results look as expected, which is good.

Logically they are probably wrong -- given the nature of the tests, thebranch should be a more important factor than the platform. But it's notyour algorithm that fails, it's the data that is full offalse-positives, which are impossible to separate from the realfailures, and which indeed tend to be platform-dependent.

But even on an ideal data set the mixed approach should still be mostefficient, so it should be okay to use it even if some day we fix allthe broken tests and collect reliable data.


I have not run the tests on a larger set of data (the data dump that I have
available contains 200,000 test_runs, so in theory I could test the
algorithm with all this data)... I feel that I want to consider a couple
things before going on to big testing:

I feel that there is a bit of a potential fallacy in the model that I'm
following. Here's why:
The problem that I find in the model is that we don't know a-priori when a
test will fail for the first time. Strictly speaking, in the model, if a
test doesn't fail for the first time, it never starts running at all. In
the implementation that I made, I am using the first failure of each test
to start giving it a relevancy test (so the test would have to fail before
it even qualifies to run).
This results in a really high recall rate because it is natural that if a
test fails once, it might fail pretty soon after, so although we might have
missed the first failure, we still consider that we didn't miss it, and
based on it we will catch the two or three failures that come right after.
This inflates the recall rate of 'subsequent' failures, but it is not very
helpful when trying to catch failures that are not part of a trend... I
feel this is not realistic.

I don't quite get this part. What does it mean, "missed the firstfailure"? Missed how/why?


*ATTN Sergei:*

Anyway, the problem seems to originate from the fact that we do not havea complete list of tests that had ever been run, which you mentioned inthe previous email, and which we should look into fixing. If that'sdone, you won't have to rely on the first failure to start calculatingrelevancy, you will do it from the first test run.


Here are changes that I'd like to incorporate to the model:

    1. The failure rate should stay, and should still be measured with
    exponential decay or weighted average
    2. Include a new measure that increases relevancy: Time since last run.
    The relevancy index should have a component that makes the test more
    relevant the longer it spends not running


I agree with the idea, but have doubts about the criteria.

I think you should measure not the time, but the number of test runsthat happened since the last time the test was run (it would be evenbetter if we could count the number of revisions, but that's probablynot easy).The reason is that some branches are very active, while others can beextremely slow. So, with the same time-based coefficient the relevancyof a test can strike between two consequent test runs just because theyhappened with a month interval, but will be changing too slowly on abranch which has a dozen of commits a day.

       1. A problem with this is that *test suites* that might have stopped
       being used will stay and compete for resources, although in reality they
       would not be relevant anymore


*ATTN Sergei:*

I think in any case we'll have to rely on the fact that your script willchoose tests not from the whole universe of tests, but from an initiallist that MTR produces for this particular test run. That is, it will gosomething like that:

- test run is started in buildbot;

- MTR collects test cases to run, according to the startup parameters,as it always does;

- the list is passed to your script;

- the script filters it according to the algorithm that you developed,keeps only a small portion of the initial list, and passes it back to MTR;

- MTR runs the requested tests.

That is, you do exclusion of tests rather than inclusion.

This will solve two problems:

- first test run: when a new test is added, only MTR knows about it,buildbot doesn't; so, when MTR passes to you a test that you knownothing about (and assuming that we do have a list of all executed testsin buildbot), you'll know it's a new test and will act accordingly;- abandoned tests: MTR just won't pass them to your script, so it won'ttake them into account.

    3. Include also correlation. I still don't have a great idea of how
    correlation will be considered, but it's something like this:
       1. The data contains the list of test_runs where each test_suite has
       failed. If two test suites have failed together a certain percentage of
       times (>30%?), then when test A fails, the relevancy test of test B also
       goes up... and when test A runs without failing, the relevancy
test of test
       B goes down too.


We'll need to see how it goes.

In real life correlation of this kind does exist, but I'd say much moreoften related failures happen due to some environmental problems, so thepresumed correlation will be fake.

       2. Using only the times that tests fail together seems like a good
       heuristic, without having to calculate the total correlation of all the
       history of all the combinations of tests.

If these measures were to be incorporated, a couple of changes would also
have to be considered:

    1. Failures that are* not spotted* *on a test_run* might be *able to be
    spotted *on the *next* two or three or *N test_runs*? What do you think?


Yes, it can happen and does happen often.

It is not intended -- MTR tests are supposed to be deterministic, ifthere is a reason to fail, they should always fail. Unfortunately, itdoes not always work like that, and failures, including real ones, canappear sporadically. I don't think we need to put a lot of effort intodealing with it, but certainly a test should not be excluded after justone successful run, there should be some N>1. I'd say 5-10 would bereasonable.

    2. Considering these measures, probably *recall* will be *negatively
    affected*, but I feel that the model would be *more realistic*.

Any input on my new suggestions? If all seems okay, I will proceed on to
try to implement these.
Also, I will soon upload the information so far to github. Can I also
upload queries made to the database? Or are these private?

I saw your other email, but didn't look into the repo yet, will do soon-- I didn't want to delay this email even further.


*ATTN Sergei*

I don't think there is any private information in the queries (or in thedata, for that matter -- it all can be seen via public web interfaceanyway).


Regards,
Elena


Regards
Pablo


On Wed, May 7, 2014 at 7:41 PM, Sergei Golubchik <serg@xxxxxxxxxxx> wrote:

Hi, Pablo!

On May 07, Pablo Estrada wrote:


So here's what I'll do for the simulations:

*1. Calculating the: "Relevancy index"* for a test, I have considered two
simple options so far:

    - *Exponential decay*: The relevancy index of a test is the *sum over
    each failure* of( *exp((FailureTime - CurrentTime)/DecayRate))*. It
    decreases exponentially as time passes, and increases if the test

fails.

       - DecayRate is
       - i.e. If TestA failed at days 5 and 7, and now is day 9, RI will
       be (exp(5-9)+exp(7-9)) = (exp(-4)+exp(-2)).
       - The unit to measure time is just seconds in UNIX_TIMESTAMP
    - *Weighted moving average*: The relevancy index of a test is:

*R[now] =

    R[now-1]*alpha + fail*(1-alpha)*, where fail is 1 if the test failed

in

    this run, and 0 if it did not fail. The value is between 1 and 0. It
    decreases slowly if a test runs without failing, and it increases

slowly if

    the test fails.
       - 0 < alpha < 1 (Initially set at 0.95 for testing).
       - i.e. If TestB failed for the first time in the last run, and

again

       in this run: R[t] = 1*0.95 + 1*0.5 = 1
       - If test B ran once more and did not fail, then: R[t+1] = 1*0.95 +
       0*0.5 = 0.95
       - The *advantage of this method* is that it doesn't have to look at
       the whole history every time it's calculated (unlike the

exponential decay

       method)


you don't need to look at the whole history for the exponential decay.
Because it is

   exp((FailureTime - CurrentTime)/DecayRate)

You simply have

   R[t] = exp(FailureTime/DecayRate) / exp(t/DecayRate)
   R[t+1] = R[t] / exp(1/DecayRate)   (if there was no failure)

       - Much like TCP protocol (

http://www.cl.cam.ac.uk/~jac22/books/mm/book/node153.html)


Regarding the *Relevancy Index*, it can be calculated grouping test

results

in many ways: *Roughly* using test_name+variation, or *more granularly*

by

*including* *branch* and *platform*. I'll add some thoughts regarding

these

options at the bottom of the email.


I've tested these options earlier, you may want to try them all too and
see which one delivers better results.

*2. *To* run the simulation*, I'll gather data from the first few

thousands

of test_run entries, and then start simulating results. Here's what I'll

do:


    1. *Gather data *first few thousands of test_run entries (i.e. 4
    thousand)
    2. After N thousand test_runs, I'll go through the test_run entries

*one

    by one*, and using the data gathered to that point, I will select

'*running

    sets*' of *100* *test suites* to run on each test_run entry. (The

number

    can be adjusted)


Absolutely, it should be. This is the main parameter we can tune, after
all. The larger your running set is, the better will be the recall.

May be not now but later, but it would be very useful to see these
graphs, recall as a function of the running set size. It's important to
know whether by increasing the running set by 10% we get 1% recall
increase of 70% recall increase (as you've seen, there's a region when
recall increases very fast as the running set grows).

    3. If in this *test_run* entry, the list of *failed tests* contains
    tests that are *NOT part* of the *running set*, the failure will be
    ignored, and so the information of this failure will be lost (not

used as

    part of the relevancy index). *(See Comment 2)*
    4. If the set of *failed tests *in the *test_run* entry intersect with
    the *running_set*, this is better *recall*. This information will be
    used to continue calculating the *relevancy index*.


Could you explain the terminology you're using?
What is a "test suite" and what is a "test run"?
How will you calculate the "recall"?

According to the results obtained from the simulations, we can adjust the
algorithm (i.e. to consider *relevancy index by* *platform* and *branch*,
etc.)

Comments about the *relevancy index:*

    - The methods to calculate the relevancy index are very simple. There
    are some other useful metrics that could be incorporated
       - *Time since last run. *With the current methods, if a*
test*completely *stops
       running*, it only* becomes less relevant with time*, and so even if
       it could expose defects, it doesn't get to run because its
relevancy index
       is just going down. Incorporating a function that* increases the
       relevancy index* as the *time since the last run* *increases* can
       help solve this issue. I believe this measure will be useful.


Right. I will not comment on this measure now.
But I absolutely agree that this is an issue that must be solved.

       - *Correlation between test failures*. If two tests tend to fail
       together, is it better to just run one of them? Incorporating
this measure
       seems difficult, but it is on the table, in case we should

consider it.

Agree. Taking correlations into account looks very promising, but it
does seem to be difficult :)

    - As you might have seen, I decided to not consider any data concerned
    with *code changes*. I'll work like this and see if the results are
    satisfactory.


Right!

Comments regarding *buildbot infrasturcture:*
These comments are out of the scope of this project, but it would be very
desirable features for the buildbot infrastructure.

    - Unfortunately, given the data available in the database, it is NOT
    possible to know *which tests ran* on each *test_run*. This

information

    would be very useful, as it would help estimate the *exact failure
rate*of a test. I didn't look into the code, but it seems that *class
    MtrLogObserver*(

http://buildbot.sourcearchive.com/documentation/0.8.3p1-1/mtrlogobserver_8py_source.html
)

contains
    most of the infrastructure necessary to just add one or two more

tables (

    *test_suite*, and *test_suite_test_run*), some code, and start keeping
    track of this information.
    - Another problem with the data available in the database is that it

is

    not possible to know *how many test suites exist*. It is only possible
    to estimate *how many different test suites have failed*. This would
    also be helpful information.
    - Actually, this information would be useful not only for this

project,

    but in general for book-keeping of the development of MariaDB.


Regards,
Sergei

Follow ups

Re: [GSoC] Accepted student ready to work : )
From: Sergei Golubchik, 2014-05-22

References

[GSoC] Accepted student ready to work : )
From: Pablo Estrada, 2014-04-22
Re: [GSoC] Accepted student ready to work : )
From: Sergei Golubchik, 2014-04-22
Re: [GSoC] Accepted student ready to work : )
From: Pablo Estrada, 2014-04-23
Re: [GSoC] Accepted student ready to work : )
From: Pablo Estrada, 2014-04-27
Re: [GSoC] Accepted student ready to work : )
From: Sergei Golubchik, 2014-04-28
Re: [GSoC] Accepted student ready to work : )
From: Kristian Nielsen, 2014-04-28
Re: [GSoC] Accepted student ready to work : )
From: Sergei Golubchik, 2014-04-28
Re: [GSoC] Accepted student ready to work : )
From: Pablo Estrada, 2014-05-07
Re: [GSoC] Accepted student ready to work : )
From: Sergei Golubchik, 2014-05-07
Re: [GSoC] Accepted student ready to work : )
From: Pablo Estrada, 2014-05-21