
Re: [GSoC] Optimize mysql-test-runs - Results of new strategy


Hi Elena,

> First, about integration. I agree it makes much more sense to store
> metrics rather than calculate them each time from scratch. As an additional
> bonus, it becomes unimportant whether we make buildbot store lists of all
> executed tests as we initially discussed. If the tool stores metrics, it
> only needs to receive the list of tests at runtime, which seems much less
> invasive.

I see your point. I still think there is a benefit to having a list of run
tests: if we want to run simulations or analyze other algorithms with the
data, then having the list of tests that ran would be more useful than just
storing the test_info dictionary, as we could look at which tests actually
ran, rather than only at their relevance. Of course, if we commit to one
algorithm, then we don't need any extra information; but if we want more
flexibility, then storing more information would be worthwhile.
Nonetheless, I do understand that the change may be invasive, and it would
go into an important production system, so it is reasonable to want to
avoid it. I just want to point out the advantages and disadvantages, so
that the idea isn't dismissed completely.

> One thing that I want to see there is fully developed platform mode. I see
> that mode option is still there, so it should not be difficult. I actually
> did it myself while experimenting, but since I only made hasty and crude
> changes, I don't expect them to be useful.

I'm not sure what code you are referring to. Can you be more specific about
what seems to be missing? I might have missed something when migrating from
the previous architecture...

Of the code that's definitely not there, there are a couple of things that
could be added:
1. When we calculate the relevance of a test on a platform for which we
have no data, we could either set the relevance to 0 or derive a default
relevance from the other platforms (an average, the 'standard' relevance,
etc.). Currently, it is just set to 0 (see the sketch after this list).

2. We might also, just in case, want to keep the 'standard' queue as a
fallback for when we don't have the data for this platform (related to the
previous point).
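
For illustration, here is a minimal sketch of the fallback from point 1,
assuming a hypothetical test_info layout of test name -> platform ->
relevance (the tool's real dictionary may be shaped differently):

DEFAULT_RELEVANCE = 0.0

def platform_relevance(test_info, test_name, platform):
    by_platform = test_info.get(test_name, {})
    if platform in by_platform:
        return by_platform[platform]
    # Point 1: with no data for this platform, derive a default from
    # the other platforms (here an average; the 'standard' metric
    # would be another option).
    if by_platform:
        return sum(by_platform.values()) / len(by_platform)
    # Current behavior: no data at all -> relevance 0.
    return DEFAULT_RELEVANCE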

> It doesn't matter in which order they fail/finish; the problem is, when
> builder2 starts, it doesn't have information about builder1 results, and
> builder3 doesn't know anything about the first two. So, the metric for test
> X could not be increased yet.
> But in your current calculation, it is. So, naturally, if we happen to
> catch the failure on builder1, the metric raises dramatically, and the
> failure will be definitely caught on builders 2 and 3.
> It is especially important now, when you use incoming lists, and the
> running sets might be not identical for builders 1-3 even in standard mode.

Right, I see your point. Although, even if test_run 1 catches the error,
test_run 2, despite using the same data, might not catch the same errors if
the incoming running set pushes them out due to lower relevance. The effect
might not be too big, but it definitely has the potential to affect the
results.

Over-pessimistic part:
> It is similar to the previous one, but look at the same problem from a
> different angle. Suppose the push broke test X, and the test started
> failing on all builders (platforms). So, you have 20 failures, one per test
> run, for the same push. Now, suppose you caught it on one platform but not
> on others. Your statistics will still show 19 failures missed vs 1 failure
> caught, and recall will be dreadful (~0.05). But in fact, the goal is
> achieved: the failure has been caught for this push. It doesn't really
> matter whether you catch it 1 time or 20 times. So, recall here should be 1.
> It should mainly affect per-platform approach, but probably the standard
> one can also suffer if running sets are not identical for all builders.

Right. It seems that solving these two issues is non-trivial (the test_run
table does not contain the duration of the test_run, or anything else that
would tell us which runs overlapped). But we can keep these issues in mind.
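
For the second issue, a hedged sketch of the per-push recall you describe,
where a broken test counts as caught if at least one builder in the push
caught it; the (push_id, test_name, caught) input format is an assumption
for illustration, not the tool's actual data layout:

def per_push_recall(failures):
    caught_by_push = {}
    for push_id, test_name, caught in failures:
        key = (push_id, test_name)
        caught_by_push[key] = caught_by_push.get(key, False) or caught
    if not caught_by_push:
        return 1.0
    # One catch out of 20 test runs still yields recall 1.0 for that
    # (push, test) pair, instead of the over-pessimistic ~0.05.
    return sum(caught_by_push.values()) / float(len(caught_by_push))

For example, per_push_recall([(1, 'X', True), (1, 'X', False)]) returns
1.0: the failure of test X was caught at least once for push 1.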

> Finally, a couple of small details.
> I wonder if it's because of different versions or anything, but this
> didn't work for me:
> exp = re.compile('([^, ]+) ?([^ ]*)? *.*\[ (fail|disabled|pass|skipped) \]')
> It would give me an error. I had to modify it this way:
> exp = re.compile('([^, ]+) ?([^ ]*) *.*\[ (fail|disabled|pass|skipped) \]')
> From what I see, it should be the same. If you agree, please make the same
> change (or somehow else get rid of the error).

I guess it's a version issue. I fixed it.
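
For the record, a small self-contained check of the modified expression;
the sample line is only an assumption about the log format:

import re

# The outer '?' on the second group was redundant, since '[^ ]*' already
# matches the empty string, so dropping it does not change which lines
# match. A raw string avoids the '\[' escape warning on newer Pythons.
exp = re.compile(r'([^, ]+) ?([^ ]*) *.*\[ (fail|disabled|pass|skipped) \]')

sample = "main.alias 'innodb'                      [ pass ]"
print(exp.search(sample).groups())  # ('main.alias', "'innodb'", 'pass')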

> Also, it appears that csv/test_fail_history.csv is the old file. I
> replaced it with csv/fails_ptest_run.csv in the code. It doesn't matter for
> the final version, but might be important for experiments.

In the code we should be using *test_fail_history_inv.csv*; that is the
updated file, with ascending test_run ids. I will add instructions for
creating and using these files to the readme.

