Re: [GSoC] Optimize mysql-test-runs - Setback

 

Hi Pablo,


On 23.06.2014 17:29, Pablo Estrada wrote:
Hello Elena,
Here's what I have to report:

    1. I removed the getLogger call from every function. It did improve
    performance significantly.

That's nice.

    2. I also made sure that only the priority queues that concern us are
    built. This did not improve performance much.

It's good that you removed unnecessary parts anyway. With bigger data sets it might make a bit of difference -- not big, but at some point everything might count.

    3. Here are my results with 50,000 test runs with randomization,
    test_edit_factor and time_factor. They are not much better. (Should I run
    them without randomization or other options?)

Not just yet; let's analyze the results first.

    4. I started with concepts and a bit of code for a new strategy. I am
    still open to work with the current codebase.

I think it's premature to switch to a new strategy. Before doing that, we need to be clear why the results of the current strategy are not satisfactory. I have some hypothetical ideas about it, but confirming or ruling them out requires more experimentation.


Here are results using 47,000 test_run cycles as training, and another
3,000 for predictions. They don't improve much. My theory is that this
happens because they are really linear: They only leverage information from
very recent runs, and older information becomes irrelevant quickly.

This is an interesting result. The number of training cycles must matter; at the very least, it defines which part of the algorithm is working (or failing). The lack of difference in recall actually means a lot.


For simplicity, let's assume for the rest of this email that we are talking about the standard mode without any factors, unless specified otherwise.


There are two reasons for missing a failure:

1) the test is not in the queue at all (it never failed before, index is -1);
2) the test is in the queue, but too far from the head (index > running_set).

When you are running your simulation close to the head of the history (2000/5000), you only consider as many failures as happened in those first cycles, so your queue is rather shallow, even though it's still bigger than the running set. So, most likely, the main part of the 'misses' is caused by reason (1) -- the tests are not in the queue at all. There isn't much you can do about that, as long as you only use previously failed tests.

The time factor will also be of no help here, because the queue just doesn't contain the needed test.

The test editing factor is important: even if it doesn't improve recall short-term, it should extend the queue.


But when you are running the simulation deeper in the history, your queue contains many more tests to choose from, and you have more material to work with. If that currently doesn't help to improve recall, it means that more 'misses' happen for reason (2): the queues are not properly sorted/filtered. This can and needs to be improved within the current strategy.
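
To see which of the two reasons dominates, it may be worth counting them separately during the simulation. Here is a minimal sketch, assuming the queue is available as a list of test names ordered by priority and failed_tests is the list of failures of the simulated run (the names are illustrative, not the ones from your script):

    def classify_misses(queue, running_set_size, failed_tests):
        """Split missed failures into reason (1), not in the queue at all,
        and reason (2), in the queue but below the running-set cut-off."""
        position = {test: i for i, test in enumerate(queue)}  # head of queue = 0
        not_in_queue, too_deep = [], []
        for test in failed_tests:
            idx = position.get(test, -1)
            if idx == -1:
                not_in_queue.append(test)      # reason (1)
            elif idx >= running_set_size:
                too_deep.append(test)          # reason (2)
        return not_in_queue, too_deep

Logging the two counts per simulated run would show directly whether going deeper into the history shifts the misses from (1) to (2).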

Unfortunately, the deeper into the history we go, the more time the script takes, which should also be taken into account -- we can't afford 30 minutes of script time in each test run; it would make the whole idea pointless.

Here your last results become really important. If it doesn't matter whether you calculate metrics based on 2,000 runs or 40,000 runs, maybe we won't need to apply the complete logic to the whole dataset. Instead, we can very quickly populate the queue with test names and minimal initial metric values, and only do the full calculation for the learning set.
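
In code, the idea could look roughly like this (a sketch only; full_history, update_metrics and the queue layout are placeholders for whatever your script already has):

    MINIMAL_INITIAL_METRIC = 0.0   # placeholder starting priority

    def build_queue(full_history, learning_window, update_metrics):
        """Two-phase queue population.

        full_history    -- all test runs, oldest first, each with .failed_tests
        learning_window -- number of most recent runs to process in full
        update_metrics  -- the existing per-run metric calculation
        """
        queue = {}

        # Phase 1: cheap pass over the whole history, only to register names.
        for test_run in full_history:
            for test in test_run.failed_tests:
                queue.setdefault(test, MINIMAL_INITIAL_METRIC)

        # Phase 2: full metric calculation, limited to the learning window.
        for test_run in full_history[-learning_window:]:
            update_metrics(queue, test_run)

        return queue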


Now, about improving sorting/filtering.

For one, I once again suggest using an *incoming test set* as the starting point for choosing tests. This is important both for a proper evaluation of the current strategy and for real-life use.

Here is what happens now:

- a test run ran some 3000 tests, and 10 of them failed;
- your queue contains 1500 failed tests, somehow arranged -- these are tests from all possible MTR configurations;
- you take the "first" 500 tests and compare them with the list of failures in the simulated test run;
- let's say you get 6 intersections, so your recall is 0.6.

The problem here is that the queue contains *any* tests, many of which this test run didn't use at all, so they couldn't possibly have failed. It can contain pbxt tests which are already gone; or unit tests which are only run in special MTR runs; or plugin tests for a plugin which is not present in this build; and so on.

So, while your resulting queue contains 500 tests, there are, let's say, only 200 tests which were *really* run in that test run. If you had taken that into account, your recall of 0.6 would have been achieved not for RS 500, but for RS 200, which is better.

Or, if while populating the queue you had ignored irrelevant tests, the relevant ones would end up much closer to the head of the queue, and would probably make it into the running set of 500; thus you would have "caught" more failures with RS 500, and recall would have been better.

It will be even more important in real-life use, because when we want to pick N% of tests, we don't want just *any* tests: each MTR run is configured to run a certain logical subset of the whole MTR suite, and we want to choose only from that subset.

In real life, MTR creates the complete list of tests at the beginning. It should be easy enough to pass it to your script.
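
Once that list is available, the filtering itself is trivial, something along these lines (helper and argument names are made up, just to show the idea):

    def pick_running_set(queue, incoming_tests, running_set_size):
        """Choose the running set only from tests MTR is actually going to run.

        queue            -- test names ordered by priority, highest first
        incoming_tests   -- set of test names from MTR's initial list
        running_set_size -- e.g. 100 or 500
        """
        relevant = [test for test in queue if test in incoming_tests]
        return relevant[:running_set_size]

Recall should then also be computed against the failures of that same test run, so that the comparison stays consistent.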

For your experiments, while the test lists are not in the database, they can easily enough be extracted from test logs, which I can upload for you for a certain number of test runs. However, in order to do that, you need to start using the end of your data set (the most recent test runs), because we might not have the old logs. It's an easy thing to do: you will just need to skip the first len(test_history) - max_limit runs.

You'll need to send me the range of test run IDs for which you need the logs.

The logs look like this: http://buildbot.askmonty.org/buildbot/builders/kvm-bintar-quantal-x86/builds/586/steps/test/logs/stdio

That is, they are text files which are easy enough to parse. You will need to choose lines which contain [ pass ] or [ fail ] or [ skipped ] or [ disabled ] (yes, skipped and disabled too, because they will be on the list that MTR initially creates).
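
A minimal parser for such logs could look like the sketch below; the regular expression is an assumption about the line layout, so please verify it against a few real logs, as builders may format the lines slightly differently:

    import re

    # Lines of interest look roughly like:
    #   main.alias 'innodb'                      w2 [ pass ]    357
    RESULT_RE = re.compile(r'^\s*(\S+\.\S+).*\[ (pass|fail|skipped|disabled) \]')

    def tests_in_log(path):
        """Return the set of test names MTR listed in a buildbot stdio log."""
        tests = set()
        with open(path, errors='replace') as log:
            for line in log:
                match = RESULT_RE.search(line)
                if match:
                    tests.add(match.group(1))
        return tests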




Further, before you start rethinking the strategy of *choosing* tests, you should analyze why the current one isn't working.

Did you try to see the dynamics of recall within a single experiment?

I mean, you go through 2000 training runs where you take into account all available information and calculate metrics based on it.

Then, you run 3000 simulation sets where you calculate recall and re-calculate metrics, but now you only take into account information which would be available if the test run used the simulation strategy. This is the right thing to do; but did you check what happens with recall over these 3000 runs?

What I expect is that recall is very good at the beginning of the simulation set, because you use the full and recent data, so it will be close to 1. But then it will begin to deteriorate.
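
If the per-run numbers turn out too noisy to eyeball, a simple moving average over the 3,000 simulated runs should make the trend visible. A sketch, assuming you collect a (caught, failed) pair for each simulated run:

    def recall_trend(per_run_counts, window=100):
        """per_run_counts -- list of (caught, failed) pairs, one per simulated run.
        Returns the average recall over each consecutive window of runs."""
        recalls = [caught / failed for caught, failed in per_run_counts if failed]
        return [sum(recalls[i:i + window]) / len(recalls[i:i + window])
                for i in range(0, len(recalls), window)]

If the averages drop noticeably from the first window to the last, that confirms the deterioration and shows roughly how fast it happens.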

If so, the real question is not how to improve the metrics and queuing algorithms, but how to preserve the accuracy.

That's where the strategy might need some adjustments. I don't have ready-to-use suggestions; we need to understand how exactly it deteriorates. I'll look into it more.



I started coding a bit of a new approach, looking at the correlation between
events since the last test run and test failures. So, for instance:

    - Correlation between files changed and tests failed
    - Correlation between failures in the last test run and the new one
    (tests that fail several times subsequently are more relevant)

I just started, so this is only the idea, and there is not much code in
place. I believe I can code most of it in less than a week.

Of course, I am still spending time thinking about how to improve the current
strategy, and am definitely open to any kind of advice. Please let me know
your thoughts.

See the above. I'm afraid that making the correlation between code changes and test failures work accurately might take much longer than the initial coding, so I'd rather we focus on analyzing and improving the functionality we already have. That said, if you already have results, then of course by all means share them; let's see how promising it looks.


Regards,
Elena



Regards
Pablo


On Mon, Jun 23, 2014 at 1:08 AM, Pablo Estrada <polecito.em@xxxxxxxxx>
wrote:

Hi Elena,
I ran these tests using the time factor but not the test edit factor.
I will make the code changes and run the test on a bigger scale then.
I will take a serious look through the code to try to squeeze out as much
performance as possible as well : )

Regards
Pablo
On Jun 23, 2014 1:01 AM, "Elena Stepanova" <elenst@xxxxxxxxxxxxxxxx>
wrote:

Hi Pablo,

Thanks for the update.
I'm looking into it right now, but meanwhile I have one quick suggestion.

Currently your experiments are being run on a small part of the historical data (5% or so). From all I see, you can't afford to run on a bigger share even if you want to, because the script is slow. Since it's obvious that you will need to run it many more times before we achieve the results we hope for, it's worth investing a little bit of time in performance.

For starters, please remove logger initialization from internal
functions. Now you call getLogger from a couple of functions, including the
one calculating the metric, which means that it's called literally millions
of times even on a small part of the data set.

Instead, make the logger a member of the simulator class and initialize it once, e.g. in __init__. I expect you'll gain quite a lot from this no-cost change.
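
For illustration only (class and method names are made up, not the ones from your script):

    import logging

    class Simulator:
        def __init__(self):
            # Created once per simulator instead of once per metric calculation.
            self.logger = logging.getLogger(self.__class__.__name__)

        def calculate_metric(self, test_name):
            self.logger.debug("recalculating metric for %s", test_name)
            # ... the actual metric calculation goes here ...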

If it becomes faster, please run the same tests with e.g. ~50% of the data (learning set 47,000, max_count 50,000), or less if it's still not fast enough. No need to run all run_set values; do, for example, 100 and 500. It's interesting to see whether using the deeper history makes an essential difference; I expect it might, but I'm not sure.

Please also indicate which parameters the experiments were run with
(editing and timing factors).

Regards,
Elena


On 22.06.2014 18:13, Pablo Estrada wrote:

Hello everyone,
I ran the tests with randomization on Standard and Mixed mode, and here are the results.
1. Standard does not experience variation - the queue is always long enough.
2. Mixed does experience some variation - actually, the number of tests run changes dramatically, but I forgot to add the data in the chart. I can report it too, but yes, the difference is large.
3. In any case, the results are still not quite satisfactory, so we can think back to what I had mentioned earlier: How should we change our paradigm to try to improve our chances?

Regards
Pablo


On Fri, Jun 20, 2014 at 7:45 PM, Pablo Estrada <polecito.em@xxxxxxxxx>
wrote:

  I have pushed my latest version of the code, and here is a test run that ran on this version of the code. It is quite different from the original expectation, so I'm taking a close look at the code for bugs, and will run another simulation ASAP (I'll use less data to make it faster).


On Thu, Jun 19, 2014 at 5:16 PM, Elena Stepanova <
elenst@xxxxxxxxxxxxxxxx>
wrote:

  Hi Pablo,

I'll send a more detailed reply later, just a couple of quick
comments/questions now.

To your question



  I'm just not quite sure what you mean with this example:
mysql-test/plugin/example/mtr/t

In this example, what is the test name? And what exactly is the path? (./mysql-test/...) or (./something/mysql-test/...)? I tried to look at some of the test result files but I couldn't find one certain example of this pattern (meaning that I'm not sure what would be a real instance of it). Can you be more specific please?


  I meant that if you look into the folder <tree>/mysql-test/suite/mtr/t/, you'll see an example of what I described as "The result file can live not only in /r dir, but also in /t dir, together with the test file":

ls mysql-test/suite/mtr/t/
combs.combinations
combs.inc
inc.inc
newcomb.result
newcomb.test
proxy.inc
self.result
self.test
simple,c2,s1.rdiff
simple.combinations
simple.result
simple,s2,c2.rdiff
simple,s2.result
simple.test
single.result
single.test
source.result
source.test
test2.result
test2.test
testsh.result
testsh.test

As far as I remember, your matching algorithm didn't cover that.



   Here are the results. They are both a bit counterintuitive, and a bit strange.


Have you already done anything regarding (not) populating the queue completely? I did expect that with the current logic, after adding full cleanup between simulations, the more restrictive configuration would have lower recall, because it generally runs far fewer tests.

It would be interesting to somehow indicate in the results how many tests were *actually* run. But if you don't have this information, please don't re-run the full set just for the sake of it; maybe run only one running set for standard/platform/branch/mixed, and let us see the results. No need to spend time on graphs for that, a text form will be ok.

Either way, please push the current code; I'd like to see it before I come up with any suggestions about the next big moves.

Regards,
Elena







