maria-developers team mailing list archive

Re: [GSoC] Accepted student ready to work : )

 

Here's the repository with the code I have in place so far. It's pretty
simple stuff:
https://github.com/pabloem/Kokiri


On Wed, May 21, 2014 at 1:06 PM, Pablo Estrada <polecito.em@xxxxxxxxx> wrote:

> Hello Sergei and all,
> First of all, I'll explain quickly the terms that I was using:
>
>    - *test_suite, test suite, test case* - When I say test suite or test
>    case, I am referring to a single test file. For instance '
>    *pbxt.group_min_max*'. They are the ones that fail, and whose failures
>    we want to attempt to predict.
>    - *test_run, test run* - When I use this term, I refer to an entry in
>    the *test_run* table of the database. A test run is a set of
>    *test_suites* that run together at a certain time.
>
> I have in place now a basic script to do the simulations. I have tried to
> keep the code clear, and I will upload a repository to github soon.
> I have already run simulations on the data. The simulations used 2000
> test_runs as training data, and then attempted to predict behavior on the
> following 3000 test_runs. Of course, a wider spectrum of data may be
> needed to truly assess the algorithm.
>
> I used four different ways to calculate a 'relevancy index' for a test:
>
>    1. Keep a relevancy index by test case
>    2. Keep a relevancy index by test case by platform
>    3. Keep a relevancy index by test case by branch
>    4. Keep a relevancy index by test case by branch by platform (mixed)
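>
> A rough sketch of how these four keying schemes could look (the field
> names test_name, platform and branch are assumptions, not the exact
> column names in the database):
>
>   # Hypothetical sketch: one relevancy index per key, for each scheme.
>   def make_key(failure, mode):
>       """Return the dictionary key under which a relevancy index is kept."""
>       if mode == 'test':
>           return failure['test_name']
>       if mode == 'platform':
>           return (failure['test_name'], failure['platform'])
>       if mode == 'branch':
>           return (failure['test_name'], failure['branch'])
>       if mode == 'mixed':
>           return (failure['test_name'], failure['branch'], failure['platform'])
>       raise ValueError(mode)
>
>   relevancy = {}  # key -> relevancy index, updated as test_runs are replayed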
>
> I graphed the results (graph attached). As can be seen from the graph,
> the per-platform and the mixed models proved to be the best for recall.
> I feel the results were quite similar to what Sergei encountered.
>
> I have not run the tests on a larger set of data (the data dump that I
> have available contains 200,000 test_runs, so in theory I could test the
> algorithm with all of it)... I want to consider a couple of things before
> moving on to large-scale testing:
>
> I feel there is a potential fallacy in the model I'm following. Here's
> why: we don't know a priori when a test will fail for the first time.
> Strictly speaking, in the model, a test that has never failed never
> starts running at all. In my implementation, I use the first failure of
> each test to start giving it a relevancy index (so a test has to fail
> before it even qualifies to run).
> This results in a really high recall rate: if a test fails once, it is
> likely to fail again soon after, so although we missed that first
> failure, we do not count it as missed, and based on it we catch the two
> or three failures that come right after.
> This inflates the recall rate for 'subsequent' failures, but it does not
> help catch failures that are not part of a trend... I feel this is not
> realistic.
>
> Here are the changes that I'd like to incorporate into the model:
>
>    1. The failure rate should stay, and should still be measured with
>    exponential decay or weighted average
>    2. Include a new measure that increases relevancy: time since last
>    run. The relevancy index should have a component that makes a test
>    more relevant the longer it goes without running
>       1. A problem with this is that *test suites* that have stopped
>       being used will stay around and compete for resources, although in
>       reality they are not relevant anymore
>    3. Also include correlation. I still don't have a clear idea of how
>    correlation will be considered, but it's something like this:
>       1. The data contains the list of test_runs where each test_suite
>       has failed. If two test suites have failed together a certain
>       percentage of times (>30%?), then when test A fails, the relevancy
>       index of test B also goes up... and when test A runs without
>       failing, the relevancy index of test B goes down too.
>       2. Counting only the times that tests fail together seems like a
>       good heuristic; it avoids calculating the full correlation over the
>       whole history of all combinations of tests.
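>
> A rough sketch of how points 1-3 could fit together (all names, the
> 0.95/0.05 weights and the 30% co-failure threshold are just placeholders
> for illustration):
>
>   ALPHA = 0.95          # weighted-average decay, point 1
>   IDLE_BOOST = 0.01     # per-run boost for tests that were not scheduled, point 2
>   CO_FAIL_RATIO = 0.30  # co-failure threshold for correlation, point 3
>
>   relevancy = {}        # test -> relevancy index
>   co_failures = {}      # (test_a, test_b) -> runs in which both failed
>   failures = {}         # test -> total failures seen so far
>
>   def update(scheduled, failed):
>       """Update the indices after one test_run ('scheduled' ran, 'failed' failed)."""
>       for test in scheduled:
>           hit = 1.0 if test in failed else 0.0
>           relevancy[test] = relevancy.get(test, 0.0) * ALPHA + hit * (1 - ALPHA)
>       for test in relevancy:
>           if test not in scheduled:      # point 2: idle tests slowly gain relevancy
>               relevancy[test] += IDLE_BOOST
>       for a in failed:                   # point 3: co-failure bookkeeping
>           failures[a] = failures.get(a, 0) + 1
>           for b in failed:
>               if a != b:
>                   co_failures[(a, b)] = co_failures.get((a, b), 0) + 1
>       for (a, b), n in co_failures.items():
>           if a in failed and n / failures[a] > CO_FAIL_RATIO:
>               # correlated tests become more relevant when their partner fails
>               relevancy[b] = relevancy.get(b, 0.0) * ALPHA + 1.0 * (1 - ALPHA)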
>
> If these measures were to be incorporated, a couple of changes would also
> have to be considered:
>
>    1. Failures that are *not spotted* on a *test_run* might still be
>    *spotted* on the *next* two or three or *N test_runs*? What do you
>    think?
>    2. Considering these measures, *recall* will probably be *negatively
>    affected*, but I feel that the model would be *more realistic*.
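>
> As a sketch, point 1 could be scored like this (N and the names are
> placeholders, not a decided design):
>
>   N = 3
>
>   def caught_within_window(test, run_index, running_sets):
>       """running_sets[i] is the set of tests scheduled for test_run i.
>       A failure at run_index counts as caught if the test is scheduled
>       in that run or in one of the next N runs."""
>       window = running_sets[run_index : run_index + N + 1]
>       return any(test in running_set for running_set in window)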
>
> Any input on my new suggestions? If all seems okay, I will proceed to
> try to implement these.
> Also, I will soon upload the information so far to github. Can I also
> upload queries made to the database? Or are these private?
>
> Regards
> Pablo
>
>
> On Wed, May 7, 2014 at 7:41 PM, Sergei Golubchik <serg@xxxxxxxxxxx> wrote:
>
>> Hi, Pablo!
>>
>> On May 07, Pablo Estrada wrote:
>> >
>> > So here's what I'll do for the simulations:
>> >
>> > *1. Calculating the "Relevancy index"* for a test, I have considered
>> > two simple options so far:
>> >
>> >    - *Exponential decay*: The relevancy index of a test is the *sum over
>> >    each failure* of( *exp((FailureTime - CurrentTime)/DecayRate))*. It
>> >    decreases exponentially as time passes, and increases if the test
>> fails.
>> >       - DecayRate is a tunable constant that controls how quickly old
>> >       failures stop contributing
>> >       - i.e. If TestA failed at days 5 and 7, and now is day 9, RI will
>> >       be (exp(5-9)+exp(7-9)) = (exp(-4)+exp(-2)).
>> >       - The unit to measure time is just seconds in UNIX_TIMESTAMP
>> >    - *Weighted moving average*: The relevancy index of a test is
>> >    R[now] = R[now-1]*alpha + fail*(1-alpha), where fail is 1 if the
>> >    test failed in this run, and 0 if it did not fail. The value stays
>> >    between 0 and 1. It decreases slowly if a test runs without failing,
>> >    and it increases slowly if the test fails.
>> >       - 0 < alpha < 1 (Initially set at 0.95 for testing).
>> >       - i.e. If TestB has failed often enough that R[t-1] = 1, and it
>> >       fails again in this run: R[t] = 1*0.95 + 1*0.05 = 1
>> >       - If TestB then runs once more and does not fail: R[t+1] =
>> >       1*0.95 + 0*0.05 = 0.95
>> >       - The *advantage of this method* is that it doesn't have to look
>> at
>> >       the whole history every time it's calculated (unlike the
>> exponential decay
>> >       method)
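>> >
>> > As a minimal sketch of the weighted moving average (using the alpha
>> > value above):
>> >
>> >   ALPHA = 0.95
>> >
>> >   def update_wma(previous, failed):
>> >       """R[now] = R[now-1]*alpha + fail*(1-alpha), with fail in {0, 1}."""
>> >       return previous * ALPHA + (1.0 if failed else 0.0) * (1 - ALPHA)
>> >
>> >   r = 1.0                    # a test that has been failing steadily
>> >   r = update_wma(r, True)    # fails again:          1*0.95 + 1*0.05 = 1
>> >   r = update_wma(r, False)   # runs without failing: 1*0.95 + 0*0.05 = 0.95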
>>
>> you don't need to look at the whole history for the exponential decay.
>> Because it is
>>
>>   exp((FailureTime - CurrentTime)/DecayRate)
>>
>> You simply have
>>
>>   R[t] = exp(FailureTime/DecayRate) / exp(t/DecayRate)
>>   R[t+1] = R[t] / exp(1/DecayRate)   (if there was no failure)
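>>
>> In code, that incremental form could look like this (just a sketch, with
>> DecayRate in the same unit as the timestamps):
>>
>>   import math
>>
>>   DECAY_RATE = 1.0
>>
>>   def step(r, dt, failed):
>>       """Advance the decayed sum by dt time units, then add 1 on a failure."""
>>       r *= math.exp(-dt / DECAY_RATE)
>>       if failed:
>>           r += 1.0
>>       return r
>>
>>   # Reproduces the earlier example: failures at day 5 and day 7, queried at day 9
>>   r = step(0.0, 5, True)    # 1
>>   r = step(r, 2, True)      # exp(-2) + 1
>>   r = step(r, 2, False)     # exp(-4) + exp(-2)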
>>
>> >       - Much like TCP protocol (
>> http://www.cl.cam.ac.uk/~jac22/books/mm/book/node153.html)
>> >
>> > Regarding the *Relevancy Index*, it can be calculated by grouping test
>> > results in many ways: *roughly*, using test_name+variation, or *more
>> > granularly* by
>> > *including* *branch* and *platform*. I'll add some thoughts regarding
>> these
>> > options at the bottom of the email.
>>
>> I've tested these options earlier; you may want to try them all too and
>> see which one delivers the best results.
>>
>> > *2. To run the simulation*, I'll gather data from the first few
>> > thousand test_run entries, and then start simulating results. Here's
>> > what I'll do:
>> >
>> >    1. *Gather data* from the first few thousand test_run entries (e.g.
>> >    4 thousand)
>> >    2. After N thousand test_runs, I'll go through the test_run entries
>> *one
>> >    by one*, and using the data gathered to that point, I will select
>> '*running
>> >    sets*' of *100* *test suites* to run on each test_run entry. (The
>> number
>> >    can be adjusted)
>>
>> Absolutely, it should be. This is the main parameter we can tune, after
>> all. The larger your running set is, the better the recall will be.
>> Maybe not now, but later it would be very useful to see these graphs:
>> recall as a function of the running set size. It's important to know
>> whether by increasing the running set by 10% we get a 1% recall increase
>> or a 70% recall increase (as you've seen, there's a region where recall
>> increases very fast as the running set grows).
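>>
>> A small sketch of that sweep, assuming the per-run rankings are kept
>> around during the simulation:
>>
>>   def recall_for_size(history, size):
>>       """history: list of (ranked_tests, failed_tests) pairs per test_run,
>>       where ranked_tests is sorted by decreasing relevancy index and
>>       failed_tests is a set."""
>>       caught = total = 0
>>       for ranked_tests, failed_tests in history:
>>           running_set = set(ranked_tests[:size])
>>           caught += len(failed_tests & running_set)
>>           total += len(failed_tests)
>>       return caught / total if total else 0.0
>>
>>   # e.g. for size in (50, 100, 200, 400): print(size, recall_for_size(history, size))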
>>
>> >    3. If in this *test_run* entry, the list of *failed tests* contains
>> >    tests that are *NOT part* of the *running set*, the failure will be
>> >    ignored, and so the information of this failure will be lost (not
>> used as
>> >    part of the relevancy index). *(See Comment 2)*
>> >    4. If the set of *failed tests* in the *test_run* entry intersects
>> >    with the *running_set*, those failures count toward the *recall*.
>> >    This information will be used to continue calculating the *relevancy
>> >    index*.
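>> >
>> > A rough sketch of steps 2-4 for a single test_run entry (the names are
>> > placeholders, and the relevancy update itself is left out):
>> >
>> >   RUNNING_SET_SIZE = 100
>> >
>> >   def simulate_one_run(relevancy, failed_tests):
>> >       """Schedule the top-ranked tests, then only 'see' the failures
>> >       (given as a set) that fall inside the running set."""
>> >       ranked = sorted(relevancy, key=relevancy.get, reverse=True)
>> >       running_set = set(ranked[:RUNNING_SET_SIZE])
>> >       caught = failed_tests & running_set    # step 4: counts toward recall
>> >       missed = failed_tests - running_set    # step 3: ignored, information lost
>> >       return running_set, caught, missed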
>>
>> Could you explain the terminology you're using?
>> What is a "test suite" and what is a "test run"?
>> How will you calculate the "recall"?
>>
>> > According to the results obtained from the simulations, we can adjust
>> the
>> > algorithm (i.e. to consider *relevancy index by* *platform* and
>> *branch*,
>> > etc.)
>> >
>> > Comments about the *relevancy index:*
>> >
>> >    - The methods to calculate the relevancy index are very simple. There
>> >    are some other useful metrics that could be incorporated
>> >       - *Time since last run*. With the current methods, if a test
>> >       completely *stops running*, it only *becomes less relevant with
>> >       time*, and so even if it could expose defects, it doesn't get to
>> >       run because its relevancy index is just going down.
>> >       Incorporating a function that *increases the relevancy index* as
>> >       the *time since the last run* increases can help solve this
>> >       issue. I believe this measure will be useful.
>>
>> Right. I will not comment on this measure now.
>> But I absolutely agree that this is an issue that must be solved.
>>
>> >       - *Correlation between test failures*. If two tests tend to fail
>> >       together, is it better to just run one of them? Incorporating
>> > this measure
>> >       seems difficult, but it is on the table, in case we should
>> consider it.
>>
>> Agree. Taking correlations into account looks very promising, but it
>> does seem to be difficult :)
>>
>> >    - As you might have seen, I decided to not consider any data
>> concerned
>> >    with *code changes*. I'll work like this and see if the results are
>> >    satisfactory.
>>
>> Right!
>>
>> > Comments regarding *buildbot infrastructure*:
>> > These comments are out of the scope of this project, but they describe
>> > very desirable features for the buildbot infrastructure.
>> >
>> >    - Unfortunately, given the data available in the database, it is NOT
>> >    possible to know *which tests ran* on each *test_run*. This
>> >    information would be very useful, as it would help estimate the
>> >    *exact failure rate* of a test. I didn't look into the code, but it
>> >    seems that *class MtrLogObserver* (
>> >    http://buildbot.sourcearchive.com/documentation/0.8.3p1-1/mtrlogobserver_8py_source.html
>> >    ) contains most of the infrastructure necessary to just add one or
>> >    two more tables (*test_suite* and *test_suite_test_run*), some code,
>> >    and start keeping track of this information (a rough sketch of such
>> >    tables follows this list).
>> >    - Another problem with the data available in the database is that it
>> is
>> >    not possible to know *how many test suites exist*. It is only
>> possible
>> >    to estimate *how many different test suites have failed*. This would
>> >    also be helpful information.
>> >    - Actually, this information would be useful not only for this
>> project,
>> >    but in general for book-keeping of the development of MariaDB.
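>> >
>> > As a purely hypothetical illustration of those two extra tables (the
>> > column names are made up, and SQLite stands in for buildbot's real
>> > database):
>> >
>> >   import sqlite3
>> >
>> >   SCHEMA = """
>> >   CREATE TABLE test_suite (
>> >       id   INTEGER PRIMARY KEY,
>> >       name TEXT NOT NULL UNIQUE          -- e.g. 'pbxt.group_min_max'
>> >   );
>> >   CREATE TABLE test_suite_test_run (
>> >       test_suite_id INTEGER REFERENCES test_suite(id),
>> >       test_run_id   INTEGER,             -- the existing test_run table's id
>> >       PRIMARY KEY (test_suite_id, test_run_id)
>> >   );
>> >   """
>> >
>> >   conn = sqlite3.connect(':memory:')     # stand-in database for the sketch
>> >   conn.executescript(SCHEMA)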
>>
>> Regards,
>> Sergei
>>
>
>
