
maria-developers team mailing list archive

Re: [GSoC] Optimize mysql-test-runs - Results of new strategy

 

Hello Elena and all,
I have submitted the concluding commit to the project, with a very short
'RESULTS' file that briefly explains the project, the different strategies,
and the results. It includes a chart with updated results for both
strategies and different modes. If you think I should add anything else,
please let me know.
Here it is:
https://github.com/pabloem/Kokiri/blob/master/RESULTS.md

Thank you very much.
Regards

Pablo

On 8/13/14, Elena Stepanova <elenst@xxxxxxxxxxxxxxxx> wrote:
> Hi Pablo,
>
> On 10.08.2014 9:31, Pablo Estrada wrote:
>> Hello Elena,
>> You raise good points. I have just rewritten the save_state and load_state
>> functions. Now they work with a MySQL database and a table that looks like
>> this:
>>
>> create table kokiri_data  ( dict varchar(20), labels varchar(200), value
>> varchar(100), primary key (dict,labels));
>>
>> Since I wanted to store many dicts in the database, I decided to try this
>> format. The 'dict' field indicates the dictionary that the data belongs to
>> ('upd_count', 'pred_count' or 'test_info'). The 'labels' field contains the
>> space-separated list of labels in the dictionary (for a more detailed
>> explanation, check the README and the code). The 'value' field contains the
>> value of the datum (count of runs, relevance, etc.).
>>
>> Since the labels are space-separated, this assumes we are not using the
>> mixed mode. If we use mixed mode, we may change the separator (',', '&',
>> '%' or '$' are good alternatives).
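>>
>> For illustration, a minimal sketch of how save_state/load_state could map
>> one of the dicts onto kokiri_data rows (assuming the MySQL Connector/Python
>> driver; the helper names and connection details are hypothetical, and note
>> that values come back as strings):
>>
>>     import mysql.connector  # assumed driver; any DB-API module works similarly
>>
>>     def _flatten(d, prefix=()):
>>         # Turn a nested dict into (label_tuple, leaf_value) pairs.
>>         for key, val in d.items():
>>             if isinstance(val, dict):
>>                 for item in _flatten(val, prefix + (str(key),)):
>>                     yield item
>>             else:
>>                 yield prefix + (str(key),), val
>>
>>     def save_dict(conn, name, data, sep=' '):
>>         # One row per leaf value, e.g. ('test_info', 'platform1 testA', '0.83').
>>         rows = [(name, sep.join(labels), str(value))
>>                 for labels, value in _flatten(data)]
>>         cur = conn.cursor()
>>         cur.executemany(
>>             "REPLACE INTO kokiri_data (dict, labels, value) VALUES (%s, %s, %s)",
>>             rows)
>>         conn.commit()
>>
>>     def load_dict(conn, name, sep=' '):
>>         # Rebuild the nested dict from the stored rows (values as strings).
>>         cur = conn.cursor()
>>         cur.execute("SELECT labels, value FROM kokiri_data WHERE dict = %s",
>>                     (name,))
>>         result = {}
>>         for labels, value in cur.fetchall():
>>             parts = labels.split(sep)
>>             node = result
>>             for part in parts[:-1]:
>>                 node = node.setdefault(part, {})
>>             node[parts[-1]] = value
>>         return result
>>
>>     conn = mysql.connector.connect(user='kokiri', database='kokiri')
>>     upd_count = load_dict(conn, 'upd_count')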
>>
>> Let me know what you think about this strategy for storing into the
>> database. I felt it was the simplest one, while still allowing some
>> querying on the database (like loading only one metric or one 'unit'
>> (platform/branch/mix), etc.). It may also allow storing many configurations
>> if necessary.
>
> Okay, let's have it this way. We can change it later if we want to.
>
> In the remaining time, you can do the cleanup, check the documentation, and
> maybe run some last clean experiments with the existing data and
> different parameters (modes, metrics, etc.), to have the statistical
> results with the latest code, which we'll use later to decide on the
> final configuration.
>
> Regards,
> Elena
>
>>
>> Regards
>> Pablo
>>
>>
>> On Sat, Aug 9, 2014 at 8:26 AM, Elena Stepanova <elenst@xxxxxxxxxxxxxxxx>
>> wrote:
>>
>>> Hi Pablo,
>>>
>>> Thanks for the update. Couple of comments inline.
>>>
>>>
>>> On 08.08.2014 18:17, Pablo Estrada wrote:
>>>
>>>> Hello Elena,
>>>> I just pushed a commit, with the following changes:
>>>>
>>>> 1. Added an internal counter to the kokiri class, and a function to
>>>> expose it. This function can show how many result-update runs and
>>>> prediction runs have been performed in total, or per unit (a unit being a
>>>> platform, a branch or a mix of both). Using this counter, one can decide
>>>> to add logic for extra learning rounds for new platforms (I added it to
>>>> the wrapper class as an example).
>>>>
>>>> 2. Added functions to load and store status into temporary storage. They
>>>> are very simple - they only serialize to a JSON file, but they can be
>>>> easily modified to fit the requirements of the implementation. I can add
>>>> this in the README. If you'd like me to add the capability to connect to
>>>> a database and store the data in a table, I can do that too (I think it
>>>>
>>>
>>> Yes, I think we'll have to have it stored in the database.
>>> Chances are, the scripts will run on buildbot slaves rather than on the
>>> master, so storing data in a file just won't do any good.
>>>
>>>
>>>> would be easiest to store the dicts as json data in text fields). Let me
>>>> know if you'd prefer that.
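>>>>
>>>> For reference, a rough sketch of what the JSON-file serialization from
>>>> point 2 might amount to (the file name is assumed, and the dicts are
>>>> assumed to have string keys):
>>>>
>>>>     import json
>>>>
>>>>     STATE_FILE = 'kokiri_state.json'  # assumed location
>>>>
>>>>     def save_status(upd_count, pred_count, test_info):
>>>>         # Serialize the three state dicts into one JSON document.
>>>>         with open(STATE_FILE, 'w') as f:
>>>>             json.dump({'upd_count': upd_count,
>>>>                        'pred_count': pred_count,
>>>>                        'test_info': test_info}, f)
>>>>
>>>>     def load_status():
>>>>         # Return the three dicts, falling back to empty ones on a first run.
>>>>         try:
>>>>             with open(STATE_FILE) as f:
>>>>                 state = json.load(f)
>>>>         except IOError:
>>>>             state = {}
>>>>         return (state.get('upd_count', {}),
>>>>                 state.get('pred_count', {}),
>>>>                 state.get('test_info', {}))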
>>>>
>>>
>>> I don't like the idea of storing the entire dicts as json. It doesn't seem
>>> to be justified by... well... anything, except for saving a tiny bit of
>>> time on writing queries. But that's a one-time effort, while this way we
>>> won't be able to [easily] join the statistical data with, let's say,
>>> existing buildbot tables; and it generally won't be efficient and easy to
>>> read.
>>>
>>> Besides, keep in mind that for real use, if, let's say, we are running in
>>> 'platform' mode, for each call we don't need the whole dict, we only need
>>> the part of the dict which relates to this platform, and possibly the
>>> standard one. So, there is really no point in loading the other 20
>>> platforms' data, which you will almost inevitably do if you store it in a
>>> single json.
>>>
>>> The real (not json-ed) data structure seems quite suitable for SQL, so
>>> it
>>> makes sense to store it as such.
>>>
>>> If you think it will take you long to do that, it's not critical: just
>>> create an example interface for connecting to a database and running
>>> *some*
>>> queries to store/read the data, and we'll tune it later.
>>>
>>> Regards,
>>> Elena
>>>
>>>
>>>
>>>> By the way, these functions allow the two parts of the algorithm to be
>>>> called separately (a rough code sketch follows the two lists), e.g.:
>>>>
>>>> Predicting phase (can be done depending on the count of training rounds
>>>> for the platform, etc.):
>>>> 1. Create kokiri instance
>>>> 2. Load status (call load_status)
>>>> 3. Input test list, get smaller output
>>>> 4. Eliminate instance from memory (no need to save state since nothing
>>>> changes until results are updated)
>>>>
>>>> Training phase:
>>>> 1. Create kokiri instance
>>>> 2. Load status (call load_status)
>>>> 3. Feed new information
>>>> 4. Save status (call save_status)
>>>> 5. Eliminate instance from memory
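>>>>
>>>> Here is that rough sketch of a caller driving both phases; the constructor
>>>> arguments and the predict/feed method names are hypothetical, only
>>>> load_status and save_status are the function names mentioned above:
>>>>
>>>>     from kokiri import kokiri  # import path assumed
>>>>
>>>>     def predicting_phase(test_list, platform):
>>>>         k = kokiri(mode='platform')        # hypothetical constructor args
>>>>         k.load_status()                    # restore state from storage
>>>>         # Hypothetical method: input test list, get smaller output.
>>>>         reduced = k.predict(test_list, platform)
>>>>         return reduced                     # nothing changed, so no save
>>>>
>>>>     def training_phase(run_results, platform):
>>>>         k = kokiri(mode='platform')
>>>>         k.load_status()
>>>>         k.feed_results(run_results, platform)  # hypothetical: feed new info
>>>>         k.save_status()                        # persist the updated state
>>>>         # the instance can simply go out of scope afterwards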
>>>>
>>>> I added tests that check the new features to the wrapper. Both features
>>>> seem to be working okay. Of course, the more prediction rounds for new
>>>> platforms, the more the platform mode improves, though not too
>>>> dramatically, from what I've seen. I'll test it a bit more.
>>>>
>>>> I will also add these features to the file_change_correlations branch,
>>>> and
>>>> document everything in the README file.
>>>>
>>>> Regards
>>>> Pablo
>>>>
>>>>
>>>> On Wed, Aug 6, 2014 at 8:04 PM, Elena Stepanova
>>>> <elenst@xxxxxxxxxxxxxxxx>
>>>> wrote:
>>>>
>>>>> (sorry, forgot the list in my reply, resending)
>>>>>
>>>>> Hi Pablo,
>>>>>
>>>>>
>>>>>
>>>>> On 03.08.2014 17:51, Pablo Estrada wrote:
>>>>>
>>>>>> Hi Elena,
>>>>>>
>>>>>>
>>>>>>> One thing that I want to see there is fully developed platform mode. I
>>>>>>> see that mode option is still there, so it should not be difficult. I
>>>>>>> actually did it myself while experimenting, but since I only made hasty
>>>>>>> and crude changes, I don't expect them to be useful.
>>>>>>>
>>>>>> I'm not sure what code you are referring to. Can you be more specific on
>>>>>> what seems to be missing? I might have missed something when migrating
>>>>>> from the previous architecture...
>>>>>>
>>>>>
>>>>> I was mainly referring to the learning stage. Currently, the learning
>>>>> stage is "global". You go through X test runs, collect data,
>>>>> distribute
>>>>> it
>>>>> between platform-specific queues, and from X+1 test run you start
>>>>> predicting based on whatever platform-specific data you have at the
>>>>> moment.
>>>>>
>>>>> But this is bound to cause rather sporadic quality of prediction,
>>>>> because
>>>>> it could happen that out of 3000 learning runs, 1000 belongs to
>>>>> platform
>>>>> A,
>>>>> while platform B only had 100, and platform C was introduced later,
>>>>> after
>>>>> your learning cycle. So, for platform B the statistical data will be
>>>>> very
>>>>> limited, and for platform C there will be none -- you will simply
>>>>> start
>>>>> randomizing tests from the very beginning (or using data from other
>>>>> platforms as you suggest below, which is still not quite the same as a
>>>>> pure platform-specific approach).
>>>>>
>>>>> It seems more reasonable, if the platform-specific mode is used, to do
>>>>> learning per platform too. It is not just about current investigation
>>>>> activity, but about the real-life implementation too.
>>>>>
>>>>> Let's suppose tomorrow we start collecting the data and calculating the
>>>>> metrics.
>>>>> Some platforms will run more often than others, so let's say in 2 weeks
>>>>> you
>>>>> will have X test runs on these platforms so you can start predicting
>>>>> for
>>>>> them; while other platforms will run less frequently, and it will take
>>>>> 1
>>>>> month to collect the same amount of data.
>>>>> And 2 months later there will be Ubuntu Utopic Unicorn which will have
>>>>> no
>>>>> statistical data at all, and it will be cruel to jump into predicting
>>>>> there
>>>>> right away, without any statistical data at all.
>>>>>
>>>>> It sounds more complicated than it is; in fact, pretty much all you need
>>>>> to add to your algorithm is making 'count' in your run_simulation a dict
>>>>> rather than a constant.
>>>>>
>>>>> So, I imagine that when you store your metrics after a test run, you will
>>>>> also store the number of test runs per platform, and only start
>>>>> predicting for this particular platform when the count for it reaches the
>>>>> configured number.
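>>>>>
>>>>> A minimal sketch of that idea (the constant and function names are just
>>>>> illustrative):
>>>>>
>>>>>     # Per-platform learning counters instead of one global count.
>>>>>     LEARNING_RUNS = 3000   # configured number of learning runs per platform
>>>>>
>>>>>     run_count = {}         # platform -> test runs seen so far
>>>>>
>>>>>     def record_test_run(platform):
>>>>>         run_count[platform] = run_count.get(platform, 0) + 1
>>>>>
>>>>>     def ready_to_predict(platform):
>>>>>         # Keep learning (run the full test list) until this platform has
>>>>>         # enough history; only then start predicting for it.
>>>>>         return run_count.get(platform, 0) >= LEARNING_RUNS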
>>>>>
>>>>>
>>>>>> Of the code that's definitely not there, there are a couple of things
>>>>>> that could be added:
>>>>>> 1. When we calculate the relevance of a test on a given platform, we
>>>>>> might want to set the relevance to 0, or we might want to derive a
>>>>>> default relevance from other platforms (an average, the 'standard',
>>>>>> etc.). Currently, it's just set to 0.
>>>>>>
>>>>>
>>>>> I think you could combine this idea with what was described above.
>>>>> While
>>>>> it makes sense to run *some* full learning cycles on a new platform,
>>>>> it
>>>>> does not have to be thousands, especially since some non-LTS platforms
>>>>> come
>>>>> and go awfully fast. So, we run these not-too-many cycles, get clean
>>>>> platform-specific data, and if necessary enrich it with the other
>>>>> platforms' data.
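>>>>>
>>>>> As a sketch of that fallback (the structure of test_info and the function
>>>>> name are hypothetical), the default relevance for a test with no data on
>>>>> the current platform could be an average over the platforms that do have
>>>>> data, or the 'standard' value:
>>>>>
>>>>>     def default_relevance(test_info, test_name, platform):
>>>>>         # test_info is assumed to look like {platform: {test_name: relevance}}.
>>>>>         known = [metrics[test_name]
>>>>>                  for plat, metrics in test_info.items()
>>>>>                  if plat != platform and test_name in metrics]
>>>>>         if not known:
>>>>>             return 0.0                  # current behaviour: no data anywhere
>>>>>         return sum(known) / len(known)  # or test_info['standard'][test_name]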
>>>>>
>>>>>
>>>>>
>>>>>> 2. We might also, just in case, want to keep the 'standard' queue for
>>>>>> when we don't have the data for this platform (related to the previous
>>>>>> point).
>>>>>>
>>>>>
>>>>> If we do what's described above, we should always have data for the
>>>>> platform.
>>>>> But if you mean calculating and storing the standard metrics, then yes
>>>>> --
>>>>> since we are going to store the values rather than re-calculate them
>>>>> every
>>>>> time, there is no reason to be greedy about it. It might even make
>>>>> sense
>>>>> to
>>>>> calculate both metrics that you developed, too. Who knows, maybe one
>>>>> day
>>>>> we'll find out that the other one gives us better results.
>>>>>
>>>>>
>>>>>>
>>>>>>> It doesn't matter in which order they fail/finish; the problem is, when
>>>>>>> builder2 starts, it doesn't have information about builder1 results, and
>>>>>>> builder3 doesn't know anything about the first two. So, the metric for
>>>>>>> test X could not be increased yet.
>>>>>>>
>>>>>>> But in your current calculation, it is. So, naturally, if we happen to
>>>>>>> catch the failure on builder1, the metric rises dramatically, and the
>>>>>>> failure will definitely be caught on builders 2 and 3.
>>>>>>>
>>>>>>> It is especially important now, when you use incoming lists, and the
>>>>>>> running sets might not be identical for builders 1-3 even in standard
>>>>>>> mode.
>>>>>>
>>>>>> Right, I see your point. Although even if test_run 1 catches the error,
>>>>>> test_run 2, despite using the same data, might not catch the same errors
>>>>>> if the running set is such that they are pushed out due to lower
>>>>>> relevance. The effect might not be too big, but it definitely has the
>>>>>> potential to affect the results.
>>>>>>
>>>>>> Over-pessimistic part:
>>>>>>
>>>>>>>
>>>>>>> It is similar to the previous one, but look at the same problem from a
>>>>>>> different angle. Suppose the push broke test X, and the test started
>>>>>>> failing on all builders (platforms). So, you have 20 failures, one per
>>>>>>> test run, for the same push. Now, suppose you caught it on one platform
>>>>>>> but not on others. Your statistics will still show 19 failures missed vs
>>>>>>> 1 failure caught, and recall will be dreadful (~0.05). But in fact, the
>>>>>>> goal is achieved: the failure has been caught for this push. It doesn't
>>>>>>> really matter whether you catch it 1 time or 20 times. So, recall here
>>>>>>> should be 1.
>>>>>>>
>>>>>>> It should mainly affect the per-platform approach, but probably the
>>>>>>> standard one can also suffer if running sets are not identical for all
>>>>>>> builders.
>>>>>>>
>>>>>>>
>>>>>> Right. It seems that solving these two issues is non-trivial (the
>>>>>> test_run table does not contain the duration of the test_run, or
>>>>>> anything). But we can keep these issues in mind.
>>>>>>
>>>>>
>>>>> Right. At this point it doesn't even make sense to solve them -- in a
>>>>> real-life application, the first one will be gone naturally, just because
>>>>> there will be no data from unfinished test runs.
>>>>>
>>>>> The second one only affects recall calculation, in other words --
>>>>> evaluation of the algorithm. It is interesting from a theoretical point of
>>>>> view, but not critical for real-life application.
>>>>>
>>>>>
>>>>>> I fixed up the repositories with updated versions of the queries, as well
>>>>>> as instructions in the README on how to generate them.
>>>>>>
>>>>>> Now I am looking a bit at the buildbot code, just to try to suggest some
>>>>>> design ideas for adding the statistician and the pythia into the
>>>>>> MTR-related classes.
>>>>>>
>>>>>
>>>>>
>>>>> As you know, we have the soft pencil-down in a few days, and the hard one
>>>>> a week later. At this point, there isn't much reason to keep frantically
>>>>> improving the algorithm (which is never perfect), so you are right not to
>>>>> plan on it.
>>>>>
>>>>> In the remaining time I suggest to
>>>>>
>>>>> - address the points above;
>>>>> - make sure that everything that should be configurable is
>>>>> configurable
>>>>> (algorithm, mode, learning set, db connection details);
>>>>> - create structures to store the metrics and to read from/write to the
>>>>> database;
>>>>> - make sure the predicting and the calculating part can be called
>>>>> separately;
>>>>> - update documentation, clean up logging and code in general.
>>>>>
>>>>> As long as we have these two parts easily callable, we will find a place
>>>>> in buildbot/MTR to put them, so don't waste too much time on it.
>>>>>
>>>>> Regards,
>>>>> Elena
>>>>>
>>>>>
>>>>>
>>>>>> Regards
>>>>>> Pablo
>>>>>>
>>>>>>
>>>>>
>>>>
>>
>

