maria-developers team mailing list archive

Re: [GSoC] Optimize mysql-test-runs - Results of new strategy


Hello Elena and all,
First, addressing the previous email:

Looking at the dump, I see it can also happen that the dump contains
> several records for a pair platform/bbnum. I am not sure why it happens, I
> think it shouldn't, might be a bug in buildbot and/or configuration, or
> environmental problems. Anyway, due to the way we store output files, they
> can well override each other in this case, thus for several platform/bbnum
> record you will have only one file. I suppose that's what was hard to
> resolve, sorry about that.

No worries ;). There are several cases where platform and build number
are the same; in those cases the system just numbers the output files
sequentially. These files seem to correspond temporally with the test runs
(*test_1-stdio belongs to the first test_run of the same platform/build
number, and so on). Unfortunately, in some cases there are more test_runs
in the dump than files available, which makes it impossible to be sure
which file belongs to which test_run.

> You should consider skipped tests, at least for now. Your logic that they
> are skipped because they can't be run is generally correct; unfortunately,
> MTR first produces the *full* list of tests to run, and determines whether
> a test can be run or not on a later stage, when it starts running the
> tests. Your tool will receive the initial test list, and I'm not sure it's
> realistic to re-write MTR so that it takes into account limitations that
> cause skipping tests before creating the list.

I see. Okay then, duly noted.

Possibly it's better to skip a test run altogether if there is no input
> list for it; it would be definitely the best if there were 5K (or whatever
> slice you are currently using) of continuous test runs with input lists; if
> it so happens that there are lists for some branches but not others, you
> can skip the branch entirely.

This doesn't seem like a good option: recall drops significantly. The
test_runs that have a corresponding file don't follow any particular
pattern, and tend to be separated by long gaps, so the information
becomes stale and, apparently, not useful.

> The core module should take as parameters
> - list of tests to choose from,
> - size of the running set (%),
> - branch/platform (if we use them in the end),
> and produce a new list of tests of the size of the running set.
> The wrapper module should
> - read the list of tests from the outside world (for now, from a file),
> - receive branch/platform as command-line options,
> - have the running set size set as an easily changeable constant or as a
> configuration parameter,
> and return the list of tests -- let's say for now, in the form of <test
> suite>.<test name>, blank-separated, e.g.
> main.select innodb.create-index ...
I am almost done 'translating' the code into a solution that divides it
into 'core' and 'wrapper' modules. There are a few bugs that I still
haven't figured out, but I believe I can iron those out pretty soon. I
will also use a percentage rather than a fixed running_set size.
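To make sure we agree on the interface, here is a minimal sketch of the core/wrapper split described in your email. All names (`choose_tests`, `wrapper`, `running_set_pct`) are my own placeholders, not the actual code, and the priority metric is stubbed out:

```python
def choose_tests(candidate_tests, running_set_pct, branch=None, platform=None):
    """Core: return a subset of the candidate tests sized by running_set_pct.

    Placeholder ranking only -- keeps input order; the real core would rank
    by failure history / file correlations before truncating.
    """
    k = max(1, int(len(candidate_tests) * running_set_pct / 100))
    return candidate_tests[:k]

def wrapper(test_list_file, branch, platform, running_set_pct=30):
    """Wrapper: read the test list from a file, call the core, and print
    the result in the agreed format: <suite>.<name>, blank-separated."""
    with open(test_list_file) as f:
        tests = f.read().split()
    chosen = choose_tests(tests, running_set_pct, branch, platform)
    print(" ".join(chosen))
```

The point of the split is that the core stays free of I/O and configuration, so it can be unit-tested against historical data directly.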

Now, regarding the state of the project (and the recall numbers that I am
able to achieve so far), here are some observations:

   - Unfortunately, I am running out of ideas for improving recall. I
   tried tuning some parameters, giving more weight to some factors than
   others, etc., but I still wasn't able to push recall beyond ~87% with
   the strategy that uses file correlations. From what I've seen, some
   failures are just extremely hard to predict.
   - The strategy that uses only a weighted average of the failure
   frequency achieves a higher recall, but for a shorter time. The recall
   decays quickly afterwards. I may try to add some file-correlations to this
   strategy, to see if the recall can be sustained for a longer term.
   - There is one problem that I see regarding the data and a potential
   real-world deployment of the program: by verifying recall against the
   historical data, we run the risk of overfitting, so the results obtained
   by comparing against the historical data and the results a real-world
   deployment would obtain are potentially different. One way to address
   this issue would be to modify the buildbot to gather more data over a
   longer term.
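For concreteness, here is one possible reading of the "weighted average of failure frequency" strategy and of the recall metric I keep referring to. This is purely an illustrative sketch under my own assumptions (the decay factor and function names are invented), not the project's actual implementation:

```python
def failure_priority(history, decay=0.95):
    """Score tests by an exponentially decayed failure count.

    `history` is a list of test runs, oldest first; each run is the set of
    tests that failed in it. Older failures are aged by `decay` each run,
    so recent failures dominate the score.
    """
    scores = {}
    for run_failures in history:
        scores = {t: s * decay for t, s in scores.items()}  # age old scores
        for test in run_failures:
            scores[test] = scores.get(test, 0.0) + 1.0
    return scores

def recall(predicted, actually_failed):
    """Fraction of the tests that actually failed that we chose to run."""
    if not actually_failed:
        return 1.0
    return len(set(predicted) & set(actually_failed)) / len(actually_failed)
```

With this kind of scoring, a test that stops failing decays toward zero, which is exactly the behavior behind the "recall decays quickly afterwards" observation: the metric forgets, so the running set must keep being refreshed with new failure data.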

So having said that, I am looking for some advice in the following regards:

   - I will try to take a step back from the new strategy, and see how I
   can adapt the original strategy to prevent the recall function from
   declining so sharply with time.
   - I will also spend some time shaping the codebase so that it fits the
   model that we need for the implementation better. I will upload code
   soon. All suggestions are welcome.
   - Nonetheless, I feel that more data would allow us to improve the
   algorithm greatly. Would it be possible to add logging to the buildbot
   that allows for more precise data collection? A slower, more iterative
   process, working closer to the buildbot and doing more detailed data
   collection, might deliver better results. (I understand that this would
   probably affect the time scope of the project.)

Let me know what you think about my suggestions.
