
maria-developers team mailing list archive

Re: [GSoC] Optimize mysql-test-runs - Results of new strategy

 

Hi Elena,
I fixed up the repositories with updated versions of the queries, as well
as instructions in the README on how to generate them.

Now I am looking a bit at the buildbot code, to try to suggest some
design ideas for adding the statistician and the pythia into the
MTR-related classes.

Regards
Pablo


On Sun, Aug 3, 2014 at 10:51 PM, Pablo Estrada <polecito.em@xxxxxxxxx>
wrote:

> Hi Elena,
>
>> First, about integration. I agree it makes much more sense to store
>> metrics rather than calculate them each time from scratch. As an additional
>> bonus, it becomes unimportant whether we make buildbot store lists of all
>> executed tests as we initially discussed. If the tool stores metrics, it
>> only needs to receive the list of tests at runtime, which seems much less
>> invasive.
>>
>
> I see your point. I still think there is a benefit to having a list of run
> tests: if we want to run simulations or analyze other algorithms with the
> data, then rather than just storing the test_info dictionary, having the
> list of tests that ran would be more useful, as we could look at which
> tests actually ran, rather than only at their relevance. Of course, if we
> commit to one algorithm, we don't need any extra information; but if we
> want more flexibility, then storing more information would be useful. A
> rough sketch of the two options follows below.
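>
> Very roughly, the two options could look like this (the names and value
> shapes are only illustrative, not the actual schema):
>
>     # Option A: keep only the derived metrics per test (the test_info dictionary)
>     test_info = {'main.select': {'relevance': 0.42, 'last_failure': 1713}}
>
>     # Option B: additionally keep which tests actually ran in each test_run,
>     # so later simulations can replay other algorithms over the raw history
>     tests_run = {1713: ['main.select', 'innodb.alter_table']}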
>
> Nonetheless, I do understand that the change may be invasive, and it would
> go into an important production system, so it is reasonable to want to
> avoid it. I just want to point out advantages and disadvantages, to not
> dismiss it completely.
>
>
>> One thing that I want to see there is a fully developed platform mode. I
>> see that the mode option is still there, so it should not be difficult. I
>> actually did it myself while experimenting, but since I only made hasty
>> and crude changes, I don't expect them to be useful.
>>
>
> I'm not sure what code you are referring to. Can you be more specific
> about what seems to be missing? I might have missed something when
> migrating from the previous architecture...
>
> Of the code that's definitely not there, there are a couple of things that
> could be added:
> 1. When we calculate the relevance of a test on a platform for which we
> have no data, we might want to set the relevance to 0, or we might want to
> derive a default relevance from other platforms (an average, the
> 'standard' relevance, etc.). Currently, it is just set to 0 (see the
> sketch after this list).
>
> 2. We might also, just in case, want to keep the 'standard' queue for when
> we don't have the data for this platform (related to the previous point).
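>
> A minimal sketch of what the fallback in point 1 could look like, assuming
> relevance is kept per test and per platform in a nested dictionary (the
> names here are illustrative, not the current code):
>
>     def platform_relevance(relevance, test, platform):
>         """Return the stored relevance for (test, platform), falling back
>         to the average over platforms that do have data (0.0 if none)."""
>         by_platform = relevance.get(test, {})
>         if platform in by_platform:
>             return by_platform[platform]
>         known = list(by_platform.values())
>         return sum(known) / float(len(known)) if known else 0.0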
>
>
>> It doesn't matter in which order they fail/finish; the problem is, when
>> builder2 starts, it doesn't have information about builder1 results, and
>> builder3 doesn't know anything about the first two. So, the metric for test
>> X could not be increased yet.
>>
>> But in your current calculation, it is. So, naturally, if we happen to
>> catch the failure on builder1, the metric rises dramatically, and the
>> failure will definitely be caught on builders 2 and 3.
>>
>> This is especially important now, when you use incoming lists, and the
>> running sets might not be identical for builders 1-3 even in standard mode.
>>
>
> Right, I see your point. Although, even if test_run 1 catches the error,
> test_run 2, despite using the same data, might not catch the same errors
> if its running set is such that they are pushed out due to lower
> relevance. The effect might not be too big, but it definitely has the
> potential to affect the results.
>
> Over-pessimistic part:
>>
>> It is similar to the previous one, but looks at the same problem from a
>> different angle. Suppose the push broke test X, and the test started
>> failing on all builders (platforms). So, you have 20 failures, one per test
>> run, for the same push. Now, suppose you caught it on one platform but not
>> on others. Your statistics will still show 19 failures missed vs 1 failure
>> caught, and recall will be dreadful (~0.05). But in fact, the goal is
>> achieved: the failure has been caught for this push. It doesn't really
>> matter whether you catch it 1 time or 20 times. So, recall here should be 1.
>>
>> It should mainly affect the per-platform approach, but probably the
>> standard one can also suffer if running sets are not identical for all
>> builders.
>>
>
> Right. It seems that solving these two issues is non-trivial (the test_run
> table does not contain the duration of the test_run, or anything similar
> we could use). But we can keep these issues in mind.
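>
> For the over-pessimistic part at least, the recall measurement itself
> could probably be adjusted without extra columns. A rough sketch, assuming
> each observed failure can be tagged with (push, test) and whether any
> builder caught it (the names are illustrative only):
>
>     def per_push_recall(failures):
>         """failures: iterable of (push_id, test_name, caught) tuples.
>         A (push, test) pair counts as caught if any builder caught it."""
>         seen, caught = set(), set()
>         for push, test, was_caught in failures:
>             seen.add((push, test))
>             if was_caught:
>                 caught.add((push, test))
>         return len(caught) / float(len(seen)) if seen else 1.0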
>
>
>>
>> Finally, a couple of small details.
>>
>> I wonder if it's because of different versions or anything, but this
>> didn't work for me:
>>
>> exp = re.compile('([^, ]+) ?([^ ]*)? *.*\[ (fail|disabled|pass|skipped) \]')
>>
>> It would give me an error. I had to modify it this way:
>>
>> exp = re.compile('([^, ]+) ?([^ ]*) *.*\[ (fail|disabled|pass|skipped) \]')
>>
>> From what I see, it should be the same. If you agree, please make the
>> same change (or otherwise get rid of the error).
>>
>
> I guess it's a version issue. I fixed it.
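>
> For reference, this is the corrected form compiled and matched against a
> sample result line (the sample line below is only illustrative of MTR
> output, not taken from a real log):
>
>     import re
>     exp = re.compile(r'([^, ]+) ?([^ ]*) *.*\[ (fail|disabled|pass|skipped) \]')
>     m = exp.match("main.select 'innodb' w2 [ pass ]")
>     if m:
>         print(m.groups())  # ('main.select', "'innodb'", 'pass')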
>
>
>>
>>
>> Also, it appears that csv/test_fail_history.csv is the old file. I
>> replaced it with csv/fails_ptest_run.csv in the code. It doesn't matter for
>> the final version, but might be important for experiments.
>>
>
> In the code we should be using *test_fail_history_inv.csv*. That is the
> updated file with ascending test_run id. I will add instructions for
> creating and using these files to the README.
>
> Regards
> Pablo
>
