
maria-developers team mailing list archive

Re: [GSoC] Optimize mysql-test-runs - Setback

 

Hello everyone,
Thank you, Elena; I really appreciate that you took the time to do such a
thorough review of the code. My answers to your comments are inline, and
there is a conclusion at the end : )


1. Structure

   - When I started coding, I had no idea how the project would fit into the
   whole system, so, after asking around a bit, I wrote a script much like
   Sergei's example: only with the simulations and statistics in mind, not
   the implementation phase. I tried to make it a bit more practical to use,
   but I never intended this code to be part of the implementation, only the
   lessons learned from it.
   - If you think it's a better idea, we can define exactly what the
   interfaces of my project should be, so that I can write something with a
   view towards the implementation rather than a purely research-oriented
   script.
   - Regarding the multi-mode logic: I decided to keep the logic from all
   modes because it adds only a constant overhead (it doesn't significantly
   affect the complexity of the original algorithm), and at first I was
   considering all failures, not only those from tests that ran. I can
   easily add code to run only the mode of interest.

2. Cleanup
>
> if you call several simulations from run_basic_simulations.py, only the
> very first one will use the correct data and get real results. All
> consequent ones will use the data modified by the previous ones, and the
> results will be totally irrelevant.
>

Yup, this is a huge mistake. When I implemented the series of simulations, I
completely forgot about the metrics and thought no extra cleanup was
necessary; I was wrong. This is the first thing I'll fix.
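
To be concrete, the fix amounts to something like this; the names here
(simulator.metrics, simulator.run and so on) are just placeholders for
whatever the script ends up exposing, not the current interface:

import copy

def run_simulations(simulator, configurations):
    # Snapshot the untouched state once, before any simulation runs.
    pristine_metrics = copy.deepcopy(simulator.metrics)
    results = []
    for config in configurations:
        # Restore the snapshot so no run ever sees metrics modified by a
        # previous run in the series.
        simulator.metrics = copy.deepcopy(pristine_metrics)
        results.append(simulator.run(config))
    return results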

3. Modes
>
> These flaws should be easy to fix by doing proper cleanup before each
> simulation. But there are also other fragments of code where, for example,
> logic for 'standard' mode is supposed to be always run and is relied upon,
> even if the desired mode is different.
>
> In fact, you build all queues every time. It would be an understandable
> trade-off to save the time on simulations, but you re-run them separately
> anyway, and only return the requested queue.
>

I decided to keep the logic of all modes because it does not affect the
complexity of the algorithm... and yes, I relied on the standard mode
because it always runs. This is not so much a mistake as a 'cheat' that I
was aware of but indulged in. It is easy to fix, and I don't think it is a
serious problem. If I were coding with a view towards the implementation, I
would make sure that only the strictly necessary code runs.
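
To illustrate what I mean by "only the strictly necessary code": build just
the queue for the requested mode, along these lines (the mode names and the
failure-count metric are simplified placeholders, not the real ones):

def build_queue(mode, failure_history):
    # Each mode only changes the key we aggregate failures by.
    key_by_mode = {
        'standard': lambda f: f['test_name'],
        'platform': lambda f: (f['test_name'], f['platform']),
        'branch':   lambda f: (f['test_name'], f['branch']),
    }
    key_of = key_by_mode[mode]

    metrics = {}
    for failure in failure_history:
        key = key_of(failure)
        metrics[key] = metrics.get(key, 0) + 1  # stand-in metric

    # Highest metric first; nothing is computed for the other modes.
    return sorted(metrics, key=metrics.get, reverse=True)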

4. Failed tests vs executed tests
> - if we say we can afford running 500 tests, we'd rather run 500 than 20,
> even if some of them never failed before. This will also help us break the
> bubble, especially if we randomize the "tail" (tests with the minimal
> priority that we add to fill the queue). If some of them fail, they'll get
> a proper metric and will migrate to the meaningful part of the queue.
>

I don't understand what you mean by 'bubble', but I see your point. You are
right. I will work on making sure that we always run RUNNING_SET tests,
even when the priority queue contains fewer than that.


> To populate the queue, you don't really need the information which tests
> had ever been run; you only need to know which ones MTR *wants* to run, if
> the running set is unlimited. If we assume that it passes the list to you,
> and you iterate through it, you can use your metrics for tests that failed
> or were edited before, and a default minimal metric for other tests. Then,
> if the calculated tests are not enough to fill the queue, you'll randomly
> choose from the rest. It won't completely solve the problem of tests that
> never failed and were never edited, but at least it will make it less
> critical.
>

This brings us back to the question of research-only vs.
preparing-for-implementation; yes, this can be done.
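
Roughly, this is what I have in mind; the default metric and the shape of
the interface are placeholders, and RUNNING_SET simply becomes the size
parameter:

import random

DEFAULT_METRIC = 0.0  # minimal priority for tests with no failure/edit history

def choose_running_set(wanted_tests, metrics, running_set_size):
    # Every test MTR wants to run gets a priority: its real metric if it
    # has failed or been edited before, the default minimal one otherwise.
    prioritized = [(metrics.get(test, DEFAULT_METRIC), test)
                   for test in wanted_tests]

    # Shuffle before sorting: the sort is stable, so ties (in particular
    # the whole "tail" of tests with the default metric) keep the random
    # order, which gives never-failed tests a chance to earn a real metric.
    random.shuffle(prioritized)
    prioritized.sort(key=lambda pair: pair[0], reverse=True)

    return [test for _, test in prioritized[:running_set_size]]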

6. Full / non-full simulation mode
>
> I couldn't understand what the *non*-full simulation mode is for, can you
> explain this?
>

This is from the beginning of the implementation, when I was still
considering all failures, even from tests that were not supposed to run. I
can remove it right away.


7. Matching logic (get_test_file_change_history)
>
> The logic where you are trying to match result file names to test names is
> not quite correct. There are some highlights:
>
> There can also be subsuites. Consider the example:
> ./mysql-test/suite/engines/iuds/r/delete_decimal.result
>
> The result file can live not only in /r dir, but also in /t dir, together
> with the test file. It's not cool, but it happens, see for example
> mysql-test/suite/mtr/t/
>
> Here are some other possible patterns for engine/plugin suites:
> ./storage/tokudb/mysql-test/suite/tokudb/r/rows-32m-1.result
> ./storage/innodb_plugin/mysql-test/innodb.result
> Also, in release builds they can be in mysql-test/plugin folder:
> mysql-test/plugin/example/mtr/t
>

When I coded that logic, I used the website that you had suggested and
thought that I had covered all the possibilities. I see that I didn't; I
will work on that too.
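
As a starting point, I am thinking of something along these lines; the
patterns are certainly not exhaustive yet, and the suite names they produce
for the odd cases may still need adjusting:

import re

RESULT_PATTERNS = [
    # .../mysql-test/suite/<suite>[/<subsuite>]/(r|t)/<name>.result
    re.compile(r'mysql-test/suite/(?P<suite>.+?)/[rt]/(?P<name>[^/]+)\.result$'),
    # mysql-test/plugin/<plugin>/<suite>/(r|t)/<name>.result (release builds)
    re.compile(r'mysql-test/plugin/[^/]+/(?P<suite>[^/]+)/[rt]/(?P<name>[^/]+)\.result$'),
    # .../mysql-test/<name>.result (e.g. storage/innodb_plugin)
    re.compile(r'mysql-test/(?P<name>[^/]+)\.result$'),
]

def result_file_to_test_name(path):
    # Map a .result file path to a 'suite.testname' string, or None.
    for pattern in RESULT_PATTERNS:
        match = pattern.search(path)
        if match:
            suite = match.groupdict().get('suite') or 'main'
            return '%s.%s' % (suite, match.group('name'))
    return None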


>
> Be aware that the logic where you compare branch names doesn't currently
> work as expected. Your list of "fail branches" consists of clean names
> only, e.g. "10.0", while row[BRANCH] can be like
> "lp:~maria-captains/maria/10.0". I'm not sure yet why it is sometimes
> stored this way, but it is.


Yes, I am aware of this, but when I looked at the data there were very,
very few cases like this, so I decided to just ignore them. All the other
cases (more than 27 thousand) already have clean names. I believe the
difference is negligible, but if you think otherwise, I can add the logic
to normalize these few cases; a trivial version is sketched after the query
output below.

MariaDB [kokiri_apr23]> select branch, count(*) from changes
    where branch like 'lp%' group by 1 order by 2;
+------------------------------------------------------+----------+
| branch                                               | count(*) |
+------------------------------------------------------+----------+
| lp:~maria-captains/maria/5.3-knielsen                |        1 |
| lp:~maria-captains/maria/5.5                         |        1 |
| lp:~maria-captains/maria/mariadb-5.1-mwl132          |        1 |
| lp:~maria-captains/maria/mariadb-5.1-mwl163          |        1 |
| lp:~maria-captains/maria/maria-5.1-table-elimination |        3 |
| lp:~maria-captains/maria/10.0-FusionIO               |        7 |
| lp:~maria-captains/maria/5.1-converting              |        9 |
| lp:~maria-captains/maria/10.0-connect                |       15 |
| lp:~maria-captains/maria/10.0                        |       45 |
+------------------------------------------------------+----------+
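
If we do decide to handle them, I think the normalization can be as simple
as stripping everything up to the last '/':

def clean_branch_name(branch):
    # 'lp:~maria-captains/maria/10.0' -> '10.0'; names that are already
    # clean, like '10.0', are returned unchanged.
    if branch.startswith('lp:'):
        return branch.rsplit('/', 1)[-1]
    return branch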

> Let's get back to it (both of them) after the logic with dependent
> simulations is fixed, after that we'll review it and see why it doesn't
> work if it still doesn't. Right now any effect that file editing might have
> is rather coincidental, possibly the other one is also broken.
>

Yup!


>> - Humans are probably a lot better at predicting first failures than
>> the current strategy.
>
> This is true, unfortunately it's a full time job which we can't afford to
> waste a human resource on.
>

I was not suggesting hiring a person for this : ). What I meant is that a
developer would be much better at choosing how to test their own changes
than a script that only accounts for recent failures. Developers can easily
leverage that knowledge on their own, along with their understanding of the
code changes. My point was that, with all this in mind, I need to improve
the algorithm.

In conclusion, this is what I will work on now:

   - Add cleanup for every run - I just added (and pushed) this. Right now
   I am running tests with this logic.
   - Get new results, then graph and report them ASAP.
   - Add randomized running of tests that have not run or been modified - I
   will make this optional.
   - Fix the logic that converts test result file names to test names.

Questions to consider:

   - Should we define precisely what the implementation should look like,
   so that I can code the core, the wrapper, etc.? Or should I focus on
   finishing the experiments and the 'research' first?
   - I believe the logic that compares branch names should be good enough,
   but if you think it needs to be 100% correct, I can do that. What do you
   think?

Regards
Pablo
