

Re: [GSoC] Optimize mysql-test-runs - Results of new strategy

Hello Elena,
You raise good points. I have just rewritten the save_state and load_state
functions. Now they work with a MySQL database and a table that looks like
this:

create table kokiri_data (
    dict varchar(20),
    labels varchar(200),
    value varchar(100),
    primary key (dict, labels)
);

Since I wanted to store several dicts in the database, I decided to try this
format. The 'dict' field indicates which dictionary the data belongs to
('upd_count', 'pred_count' or 'test_info'). The 'labels' field contains the
space-separated list of labels that identify the entry within the dictionary
(for a more detailed explanation, check the README and the code). The 'value'
field contains the value of the datum (count of runs, relevance, etc.).

Since the labels are space-separated, this assumes we are not using the
mixed mode. If we use mixed mode, we may change the separator (, or & or %
or $ are good alternatives).
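
To illustrate, here is a minimal sketch of how save_state could flatten the
in-memory dicts into rows of that table (the flatten() helper and the use of
a DB-API connection such as one from MySQLdb.connect are assumptions made for
the example; only the dict names come from the actual code):

def flatten(d, path=()):
    # Yield (labels_tuple, leaf_value) pairs from a nested dict.
    for key, val in d.items():
        if isinstance(val, dict):
            for item in flatten(val, path + (str(key),)):
                yield item
        else:
            yield path + (str(key),), val

def save_state(conn, upd_count, pred_count, test_info):
    # conn is an open MySQL connection (e.g. from MySQLdb.connect(...)).
    # Flatten the three state dicts into (dict, labels, value) rows.
    cur = conn.cursor()
    for name, data in (('upd_count', upd_count),
                       ('pred_count', pred_count),
                       ('test_info', test_info)):
        for labels, value in flatten(data):
            cur.execute("REPLACE INTO kokiri_data (dict, labels, value) "
                        "VALUES (%s, %s, %s)",
                        (name, ' '.join(labels), str(value)))
    conn.commit()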

Let me know what you think about this strategy for storing the data in the
database. I felt it was the simplest one, while still allowing some querying
on the database (like loading only one metric or one 'unit'
(platform/branch/mix), etc.). It may also allow storing several configurations
if necessary.
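
For example, loading one dict for a single unit would then be a single query.
This is only a sketch; it assumes the unit name is stored as the first
space-separated label, which may not match the final layout:

def load_unit(conn, dict_name, unit):
    # Fetch only the rows of one dict that belong to one unit
    # (platform/branch/mix), assuming the unit is the first label.
    cur = conn.cursor()
    cur.execute("SELECT labels, value FROM kokiri_data "
                "WHERE dict = %s AND labels LIKE %s",
                (dict_name, unit + ' %'))
    return dict((tuple(labels.split(' ')), value)
                for labels, value in cur.fetchall())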

Regards
Pablo


On Sat, Aug 9, 2014 at 8:26 AM, Elena Stepanova <elenst@xxxxxxxxxxxxxxxx>
wrote:

> Hi Pablo,
>
> Thanks for the update. Couple of comments inline.
>
>
> On 08.08.2014 18:17, Pablo Estrada wrote:
>
>> Hello Elena,
>> I just pushed a commit with the following changes:
>>
>> 1. Added an internal counter to the kokiri class, and a function to expose
>> it. This function can show how many update-result runs and prediction runs
>> have been run in total, or per unit (a unit being a platform, a branch or
>> a mix of both). Using this counter, one can decide to add logic for extra
>> learning rounds for new platforms (I added it to the wrapper class as an
>> example).
>>
>> 2. Added functions to load and store status in temporary storage. They
>> are very simple - they only serialize to a JSON file, but they can be
>> easily modified to fit the requirements of the implementation. I can add
>> this in the README. If you'd like me to add the capability to connect to
>> a database and store the data in a table, I can do that too (I think it
>>
>
> Yes, I think we'll have to have it stored in the database.
> Chances are, the scripts will run on buildbot slaves rather than on the
> master, so storing data in a file just won't do any good.
>
>
>  would be easiest to store the dicts as json data in text fields). Let me
>> know if you'd prefer that.
>>
>
> I don't like the idea of storing the entire dicts as json. It doesn't seem
> to be justified by... well... anything, except for saving a tiny bit of
> time on writing queries. But that's a one-time effort, while this way we
> won't be able to [easily] join the statistical data with, let's say,
> existing buildbot tables; and it generally won't be efficient and easy to
> read.
>
> Besides, keep in mind that for real use, if, let's say, we are running in
> 'platform' mode, for each call we don't need the whole dict, we only need
> the part of the dict which relates to this platform, and possibly the
> standard one. So, there is really no point in loading the other 20
> platforms' data, which you will almost inevitably do if you store it in a
> single json.
>
> The real (not json-ed) data structure seems quite suitable for SQL, so it
> makes sense to store it as such.
>
> If you think it will take you long to do that, it's not critical: just
> create an example interface for connecting to a database and running *some*
> queries to store/read the data, and we'll tune it later.
>
> Regards,
> Elena
>
>
>
>> By the way, these functions allow the two parts of the algorithm to be
>> called separately, e.g.:
>>
>> Predicting phase (can be done depending on counts of training rounds per
>> platform, etc.):
>> 1. Create kokiri instance
>> 2. Load status (call load_status)
>> 3. Input test list, get smaller output
>> 4. Eliminate instance from memory (no need to save state since nothing
>> changes until results are updated)
>>
>> Training phase:
>> 1. Create kokiri instance
>> 2. Load status (call load_status)
>> 3. Feed new information
>> 4. Save status (call save_status)
>> 5. Eliminate instance from memory
>>
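
In code, the two phases above might look roughly like this (predict() and
feed() are placeholder names; only kokiri, load_status and save_status come
from the description):

def predicting_phase(incoming_tests, unit):
    k = kokiri()                            # 1. create instance
    k.load_status()                         # 2. load state from storage
    return k.predict(incoming_tests, unit)  # 3. smaller running set
    # 4. instance is discarded; nothing changed, so no save is needed

def training_phase(results, unit):
    k = kokiri()                            # 1. create instance
    k.load_status()                         # 2. load state from storage
    k.feed(results, unit)                   # 3. feed the new result information
    k.save_status()                         # 4. persist the updated state
                                            # 5. instance is discarded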
>> I added tests for the new features to the wrapper. Both features seem to be
>> working okay. Of course, with more learning rounds for new platforms the
>> platform mode improves a bit, but not too dramatically, from what I've seen.
>> I'll test it a bit more.
>>
>> I will also add these features to the file_change_correlations branch, and
>> document everything in the README file.
>>
>> Regards
>> Pablo
>>
>>
>> On Wed, Aug 6, 2014 at 8:04 PM, Elena Stepanova <elenst@xxxxxxxxxxxxxxxx>
>> wrote:
>>
>>  (sorry, forgot the list in my reply, resending)
>>>
>>> Hi Pablo,
>>>
>>>
>>>
>>> On 03.08.2014 17:51, Pablo Estrada wrote:
>>>
>>>> Hi Elena,
>>>>
>>>>
>>>>> One thing that I want to see there is fully developed platform mode. I
>>>>> see that mode option is still there, so it should not be difficult. I
>>>>> actually did it myself while experimenting, but since I only made hasty
>>>>> and crude changes, I don't expect them to be useful.
>>>>>
>>>> I'm not sure what code you are referring to. Can you be more specific on
>>>> what seems to be missing? I might have missed something when migrating
>>>> from the previous architecture...
>>>>
>>>
>>> I was mainly referring to the learning stage. Currently, the learning
>>> stage is "global". You go through X test runs, collect data, distribute it
>>> between platform-specific queues, and from test run X+1 you start
>>> predicting based on whatever platform-specific data you have at the moment.
>>>
>>> But this is bound to cause rather sporadic quality of prediction, because
>>> it could happen that out of 3000 learning runs, 1000 belong to platform A,
>>> while platform B only had 100, and platform C was introduced later, after
>>> your learning cycle. So, for platform B the statistical data will be very
>>> limited, and for platform C there will be none -- you will simply start
>>> randomizing tests from the very beginning (or using data from other
>>> platforms as you suggest below, which is still not quite the same as a
>>> pure platform-specific approach).
>>>
>>> It seems more reasonable, if the platform-specific mode is used, to do
>>> learning per platform too. It is not just about current investigation
>>> activity, but about the real-life implementation too.
>>>
>>> Let's suppose tomorrow we start collecting the data and calculating the
>>> metrics.
>>> Some platforms will run more often than others, so let's say in 2 weeks you
>>> will have X test runs on these platforms and can start predicting for them;
>>> while other platforms will run less frequently, and it will take 1 month to
>>> collect the same amount of data.
>>> And 2 months later there will be Ubuntu Utopic Unicorn, which will have no
>>> statistical data at all, and it will be cruel to jump into predicting there
>>> right away.
>>>
>>> It sounds more complicated than it is; in fact, pretty much all you need to
>>> add to your algorithm is to make 'count' in your run_simulation a dict
>>> rather than a constant.
>>>
>>> So, I imagine that when you store your metrics after a test run, you will
>>> also store the number of test runs per platform, and only start predicting
>>> for this particular platform when the count for it reaches the configured
>>> number.
>>>
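
A minimal illustration of that suggestion, with placeholder names rather than
the actual run_simulation variables:

def ready_to_predict(runs_per_platform, platform, learning_runs_required):
    # Start predicting for a platform only once its own count of fully-run
    # (learning) test runs has reached the configured number.
    return runs_per_platform.get(platform, 0) >= learning_runs_required

# After every fully-run (learning) test run:
#   runs_per_platform[platform] = runs_per_platform.get(platform, 0) + 1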
>>>
>>>> Of the code that's definitely not there, there are a couple things that
>>>> could be added:
>>>> 1. When we calculate the relevance of a test on a given platform, we might
>>>> want to set the relevance to 0, or we might want to derive a default
>>>> relevance from other platforms (an average, the 'standard', etc.).
>>>> Currently, it's just set to 0.
>>>>
>>>
>>> I think you could combine this idea with what was described above. While
>>> it makes sense to run *some* full learning cycles on a new platform, it
>>> does not have to be thousands, especially since some non-LTS platforms
>>> come
>>> and go awfully fast. So, we run these not-too-many cycles, get clean
>>> platform-specific data, and if necessary enrich it with the other
>>> platforms' data.
>>>
>>>
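
A rough sketch of that fallback, deriving a default relevance as an average
over the platforms that already have data (names are illustrative):

def default_relevance(relevance_by_platform, test_name, platform):
    # Average the test's relevance over the other platforms; fall back to 0
    # when no other platform has data for this test yet.
    known = [rel[test_name] for plat, rel in relevance_by_platform.items()
             if plat != platform and test_name in rel]
    return sum(known) / len(known) if known else 0.0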
>>>
>>>> 2. We might also, just in case, want to keep the 'standard' queue for when
>>>> we don't have the data for this platform (related to the previous point).
>>>>
>>>
>>> If we do what's described above, we should always have data for the
>>> platform.
>>> But if you mean calculating and storing the standard metrics, then yes --
>>> since we are going to store the values rather than re-calculate them
>>> every
>>> time, there is no reason to be greedy about it. It might even make sense
>>> to
>>> calculate both metrics that you developed, too. Who knows, maybe one day
>>> we'll find out that the other one gives us better results.
>>>
>>>
>>>>
>>>>> It doesn't matter in which order they fail/finish; the problem is, when
>>>>> builder2 starts, it doesn't have information about builder1 results, and
>>>>> builder3 doesn't know anything about the first two. So, the metric for
>>>>> test X could not be increased yet.
>>>>>
>>>>> But in your current calculation, it is. So, naturally, if we happen to
>>>>> catch the failure on builder1, the metric rises dramatically, and the
>>>>> failure will definitely be caught on builders 2 and 3.
>>>>>
>>>>> It is especially important now, when you use incoming lists, and the
>>>>> running sets might not be identical for builders 1-3 even in standard
>>>>> mode.
>>>>>
>>>>
>>>> Right, I see your point. Although if test_run 1 caught the error,
>>>> test_run 2, even though it would be using the same data, might not catch
>>>> the same errors if the running set is such that they are pushed out due
>>>> to lower relevance. The effect might not be too big, but it definitely
>>>> has the potential to affect the results.
>>>>
>>>> Over-pessimistic part:
>>>>
>>>>>
>>>>> It is similar to the previous one, but look at the same problem from a
>>>>> different angle. Suppose the push broke test X, and the test started
>>>>> failing on all builders (platforms). So, you have 20 failures, one per
>>>>> test run, for the same push. Now, suppose you caught it on one platform
>>>>> but not on others. Your statistics will still show 19 failures missed vs
>>>>> 1 failure caught, and recall will be dreadful (~0.05). But in fact, the
>>>>> goal is achieved: the failure has been caught for this push. It doesn't
>>>>> really matter whether you catch it 1 time or 20 times. So, recall here
>>>>> should be 1.
>>>
>>>>
>>>>> It should mainly affect the per-platform approach, but probably the
>>>>> standard one can also suffer if running sets are not identical for all
>>>>> builders.
>>>>>
>>>>>
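
To make the point about recall concrete, it could be computed per push rather
than per test run, roughly like this (a sketch, not the script's actual
metric):

def per_push_recall(failures_by_push, caught_failures):
    # A push counts as caught if at least one of its failures was caught on
    # any builder; recall = caught pushes / pushes that had failures.
    relevant = [p for p, fails in failures_by_push.items() if fails]
    if not relevant:
        return 1.0
    caught = sum(1 for p in relevant
                 if any(f in caught_failures for f in failures_by_push[p]))
    return float(caught) / len(relevant)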
>>>> Right. It seems that solving these two issues is non-trivial (the test_run
>>>> table does not contain the duration of the test_run, or anything like
>>>> that). But we can keep these issues in mind.
>>>>
>>>
>>> Right. At this point it doesn't even make sense to solve them -- in a
>>> real-life application, the first one will be gone naturally, just because
>>> there will be no data from unfinished test runs.
>>>
>>> The second one only affects the recall calculation, in other words -- the
>>> evaluation of the algorithm. It is interesting from a theoretical point of
>>> view, but not critical for the real-life application.
>>>
>>>
>>>> I fixed up the repositories with updated versions of the queries, as well
>>>> as instructions in the README on how to generate them.
>>>>
>>>> Now I am looking a bit at the buildbot code, just to try to suggest some
>>>> design ideas for adding the statistician and the pythia into the MTR
>>>> related classes.
>>>>
>>>
>>>
>>> As you know, we have the soft pencil-down in a few days, and the hard one
>>> a week later. At this point, there isn't much reason to keep frantically
>>> improving the algorithm (which is never perfect), so you are right not to
>>> plan on it.
>>>
>>> In the remaining time I suggest to
>>>
>>> - address the points above;
>>> - make sure that everything that should be configurable is configurable
>>> (algorithm, mode, learning set, db connection details; see the sketch
>>> after this list);
>>> - create structures to store the metrics, and for reading them from /
>>> writing them to the database;
>>> - make sure the predicting and the calculating part can be called
>>> separately;
>>> - update documentation, clean up logging and code in general.
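
For the configuration point in the list above, even a simple dict (or a small
config file read into one) covering those options would do; the values here
are placeholders:

CONFIG = {
    'algorithm': 'standard',   # which metric/algorithm to use
    'mode': 'platform',        # e.g. 'standard', 'platform', 'branch', 'mixed'
    'learning_runs': 3000,     # size of the learning set before predicting
    'db': {                    # database connection details
        'host': 'localhost',
        'user': 'user',
        'passwd': '...',
        'db': 'test_results',
    },
}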
>>>
>>> As long as we have these two parts easily callable, we will find a place
>>> in buildbot/MTR to put them, so don't waste too much time on it.
>>>
>>> Regards,
>>> Elena
>>>
>>>
>>>
>>>> Regards
>>>> Pablo
>>>>
>>>>
>>>
>>
