
ubuntu-phone team mailing list archive

Re: A new Image release Proposal

 

On Fri, Nov 29, 2013 at 1:01 PM, Alexander Sack <asac@xxxxxxxxxxxxx> wrote:
> On Fri, Nov 29, 2013 at 12:41 PM, Oliver Grawert <ogra@xxxxxxxxxx> wrote:
>> hi,
>> On Fr, 2013-11-29 at 11:32 +0100, Alexander Sack wrote:
>>> Hi,
>>>
>>> it seems you put a few changes up for discussion in one shot.
>>>
>>> Let's keep those separate and look at them one by one:
>>>
>>> From what I see you basically propose three main things:
>>>
>>>  1. let's increase the velocity of image production so we get 2-3 images
>>> produced in devel-proposed per day
>>>  2. make cron the technology we use to schedule and kick those images
>>> 2-3 times a day
>>>  3. increase the manual testing done before "releasing" images, and create a
>>> broader touch-release team that will include avengers and manual
>>> testers and community etc.
>>>
>>> Let me look at them one by one and then give a bullet summary of what
>>> I believe we should indeed tweak for now...
>>>
>>> On 1.
>>> ======
>>>
>>> I think 1. is and was the goal. So I think no one disagrees with the
>>> benefits of having 2-3 checkpoints a day and we should just do it.
>>> Note: it actually always was that way when I ran the landing team and
>>> during release time. I believe we still do it, but if we don't we
>>> should certainly ensure that we get back to doing this.
>> on the majority of days in the past we only had one image build per day
>> simply because there were too many landings to wait for and in the end we
>> had huge change sets that burned a lot of manpower when searching where
>> a regression comes from.
>>
>
> Let's fix that process problem first.
>

BTW, I got pointed to the fact that there is no real data to support
that there is a problem. I checked quickly for this week (the first
week with CI engine operational) and the week before the CI engine
went down. Here is the data:

proposed images produced this week:

 Mon: 1 (CI engine came back)
 Tue: 3
 Wed: 2
 Thu: 2
 Fri: 1 (another one coming)

proposed images produced the week before the CI engine went down:

   3, 3, 2, 3, 1 (engine went down halfway through the last day)

So yes, we should have continued producing images at that rate when
the engine was down and yes, we can do better at scheduling and
communicating predictable image time windows.

However, I don't see data showing a real issue with image
production when we use our smart landing team to schedule and trigger
image production.
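
As a quick sanity check on the numbers above, here is a minimal Python
sketch that totals the two weeks and computes the per-day average. The
variable and function names are made up for illustration; the counts
are copied from the lists above, and the Friday figure does not include
the image that was still pending.

```python
# Daily counts of proposed images, copied from the lists above.
# Friday of this week still had one more image pending (not counted).
this_week = {"Mon": 1, "Tue": 3, "Wed": 2, "Thu": 2, "Fri": 1}
# Week before the CI engine went down; the engine failed halfway
# through the last day.
week_before_outage = [3, 3, 2, 3, 1]


def summarize(counts):
    """Return (total, mean per day) for a list of daily image counts."""
    total = sum(counts)
    return total, total / len(counts)


this_total, this_mean = summarize(list(this_week.values()))
prev_total, prev_mean = summarize(week_before_outage)

print(f"this week:   {this_total} images, {this_mean:.1f} per day")
print(f"week before: {prev_total} images, {prev_mean:.1f} per day")
```

That gives 9 images this week against 12 the week before, so both weeks
are close to the 2-3 per day target even without counting the pending
Friday image.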


> All we need to do is to be strict about following the time windows for
> cutting images regardless of whether the image has a chance to get
> promoted or not. We haven't spelled things out like this before, so I
> am pretty confident that this discussion helped get us there.
>
>
>>>
>>> On 2.
>>> ======
>>>
>>> You are suggesting a technical solution to the problem "how and when
>>> do we cut images".
>>>
>>> I don't see why we would go for cron if we have something that is
>>> smarter - e.g. our landing process. It would be a big step back to do
>>> that. Let's be smarter :)...
>>>
>>> What we did during the final weeks of release and what we should
>>> continue to do (until we have trigger based image production) was to
>>> cut images based on a smart, individual landing plan that doesn't use
>>> a strict time approach, but rather a hybrid approach that also takes
>>> landing goals into account.
>>>
>>> For instance, every morning, landing team looks at the work to do and
>>> decides what chunks of work we would like to have in image 1,2,3...
>>> then they set themselves a hard end time to avoid that we drag on
>>> without images forever. This worked pretty well.
>>>
>>> On top we should ensure that we continue producing images also during
>>> times where landing team does not operate. That's mostly on weekends,
>>> but also might be during EUR/US nights. For those times we can use
>>> cron to compensate for the lack of available brains :0
>>>
>>
>> we should have a fixed cron schedule even if the landing team is around,
>> it is a huge pain if the change sets get bigger, how about we have one
>> or two fixed cron builds per day and still the opportunity to trigger a
>> third manual build at will. (the testing infrastructure is still highly
>> unstable and unreliable, tests need to be re-run on nearly every image
>> build, we have two persons doing this in two time zones and just started
>> to discuss a cron schedule on IRC that makes sure the images are built
>> at a time most convenient for them so we can have images ready during
>> their working hours with enough wiggle room for manually restarting the
>> individual tests that failed or were flaky)
>
> So you don't trust the landing team that they can make and communicate
> a predictable "time window schedule" for cutting images and follow
> that schedule? I totally do believe they can and will do it :)
>
> With that, I can't really see how you can still be unhappy about what
> I propose: we get the goodness of both worlds -> guaranteed frequency,
> predictability, smartness. perfect!
>
>
>>
>>
>>> On 3.
>>> ======
>>>
>>> Your proposal means very different things based on what you call
>>> "image release". So far we have used the word "promotion" to describe
>>> the act of moving a "blessed" image from a -proposed channel to a
>>> non-proposed channel. I am not sure if that's what you call "release"
>>> in your mail, but I assume so...
>> yes, i mean promoting images from -proposed to devel/trusty
>>
>>>
>>> Let's look at the channels and their purposes again:
>>>
>>>  - devel-proposed -> here all images get spit out. they are completely
>>> untested and haven't even run through automation (read: why do you
>>> want to bother big dogfooders and avengers by telling them to test
>>> this stuff)
>> because lots of regressions go out unnoticed, these images see automated
>> tests in a system that isn't very reliable yet, beyond that they get a
>> minimal smoke test (usually done by popey and me) that only covers as
>> much as we invest time ...
>> that method is not covering any regressions that only show up after a
>> while or that a manual smoketest simply didn't catch.
>>
>> we have a big community of people out there running the -proposed image
>> (I would say even more than people that actually use the devel channel),
>> we should give them a platform to be able to give us feedback and
>> participate in testing and bug triage for better regression detection.
>> locking them out by having team-only hangout meetings can't be the
>> solution to open development IMHO, let's open up to the community again
>> please.
>>
>>>  - devel -> here we put images that have gone through automation and
>>> that are ready for dogfooders to pick up
>>>  - stable -> here is where we have end users and deliver updates to
>>> end users through it.
>>>
>>> Now the consensus on target frequency of those is:
>>>
>>>  - devel-proposed == 2-3 times a day (automated testing only)
>>>  - devel == 1+ times a day (dogfooders and avengers testing with goal
>>> to drive us to next stable update)
>>>  - stable == 1-6 monthly (stable users will give even more "testing")
>>>
>>> I think that all makes sense, and doesn't really need changing?
>> given that our automated tests can't even catch any GSM and SMS issues I
>> don't see how this all "makes sense". we have people out there using
>> these images, let's get their feedback, have them help and
>> participate ...
>>
>>>
>>> What needs better organization is the testing of dogfooders and
>>> avengers of "already blessed" devel images. Here your idea about a
>>> touch-release team makes sense. So far we had delegated that to jfunk.
>>> You could help him organize a more effective avengers effort that also
>>> includes the community, so maybe talk to him.
>>
>> how does that help at all to prevent us from promoting images
>> with regressions ?
>> having the avengers test the images is nice and all and will give us a
>> good set of high level bugs but it does not at all help with the issue
>> that we need to improve the promotion process ... automation can only
>> cover a small part here, let's involve our community in our processes
>> when we can ...
>>
>> getting bugs only from the avengers for slipped regressions also means
>> there is quite a delay sometimes so the respective package/code base has
>> evolved a lot already and when we try to nail down the issue we need to
>> do archeology. I was hoping that we could win some agility back with my
>> proposal of a public facing touch-release team, our current processes
>> are very slow and add a lot of delay everywhere while not really
>> improving the quality IMHO.
>>
>>>
>>> -----
>>>
>>> OK, let's summarize what we got so far and let's do the following
>>> tweaks for now...
>>>
>>>
>>> Summary
>>> =========
>>>
>>>  1. we start producing 2 images a day until end of year at a
>>> predictable schedule (didrocks will announce that schedule after
>>> discussing internally)
>>>
>>>  2. we don't enable cron during business days. Instead we hook image
>>> kicks up to our landing process so that we get a smart, but predictable
>>> schedule
>>>     - for instance, image builds will always happen within the same
>>> time windows (e.g. image 1: 1200-1400, image 2: 1800-2000), but will
>>> also be smart about considering the landing payload so we can ensure
>>> that the critical pieces really landed etc.
>>
>> why does that matter at all ?
>>
>> if the landing is ready it enters the archive and will be automatically
>> in the next build, no matter when that build was done. the proposed
>> migration of the archive makes sure the set of packages will land
>> together (if you have packages slipping through then there is a
>> packaging bug that needs fixing)
>> I never saw (and still don't see) why image production should be tied
>> into landings at all, landings are held back in the infrastructure
>> automatically until they are complete.
>>
>>>
>>>  3. to keep the image frequency acceptable at all times, we enable
>>> cron builds during weekends and on days where landing team is not
>>> operational.
>> ...
>>>
>>>  4. ogra and team to help jfunk to organize a more vibrant avengers
>>> community around testing of images after devel promotion; this team
>>> has the goal to identify issues that would block a stable promotion
>>> and will be fed back into the landing team so they can prioritize
>>> landings with the goal to clear a new stable promotion.
>>
>> you missed my most important point of involving the community by having
>> open and regular IRC meetings.
>> testing *after* promotion ... while it is nice and generates good
>> bugs ... doesn't help at all with preventing regressions from slipping
>> into the promoted images.
>>
>> note that all images we promoted since r10 had and still have
>> regressions (not to mention that the automated test pass rate is far
>> below r10 too)
>>
>> so to summarize from my side:
>>
>> 1+2+3) let's go with a semi automatic schedule then, so we even get
>> images if all ubuntu-cdimage members that can trigger them are run down
>> by simultaneous buses on different continents on the same day ;)
>> (I still disagree that the landing team is the right team to drive image
>> stuff and I also still think it puts extra load on them that should
>> better be invested into processing more landings per day. our
>> infrastructure is designed in a way that landings are held back until
>> they are installable, there is no need to bind image builds to landings.
>> we add an artificial blocker to the process where we have had a reliable
>> and automatic one in place for years)
>>
>> 4) let's still have a more open regression testing process that involves
>> the community more to make sure regressions do not slip into promoted
>> images.
>>
>> ciao
>>         oli

