Team Leads meeting 2010-11-17
=============================
https://wiki.canonical.com/Launchpad/TeamLeadMeetings
chair: sinzui
next meeting 2010-11-24 17:00 UTC, chair thumper
"Good news and Pleasant surprises", capped at 15 minutes
Curtis Hovey
* Foundations has 150 fewer bugs. Launchpad web and registry have gained
about 25 bugs each.
Danilo Šegan
* The pofile+translate timeout was fixed.
Diogo Matsubara
* Oops summaries are readable now that deryck fixed the checkwatches bug.
Deryck Hodge
* No good news.
Francis Lacoste
* New card id field in kanban that lets us link to an lp bug.
Henning Eggers
* Rollout successful. It took less than an hour.
Jonathan Lange
* We will make an offer to a candidate for the front-end engineer and
designer position.
* Completed a series of blog posts.
Julian Edwards
* Build manager is in production and is working (mostly).
Tim Penhey
* The private xmlrpc server is load balanced. No more timeouts.
Apologies
Gary Poster
Robert Collins
Elliot Murphy
Action items
ACTION: Gary to take discussion about how to improve the staging DB
reliability to the list. Done.
ACTION: Francis and Robert to come up with a plan on automating requests
for a rollout. In discussion.
ACTION: Danilo to invite Henning for the next meeting. Done.
ACTION: TLs to assign a bug tag for all in-progress features to respective
LEPs. Done.
ACTION: Francis to send an email describing policy for triaging bugs
critical for deployment of a feature (bug-tag + high priority). Done.
Post-release retrospective
https://wiki.canonical.com/Launchpad/RolloutReports/10.11
Henning:
QA was complete by Monday morning. Continuous deployments are
an excellent incentive to do QA early. There were a few issues during
the rollout that Tom identified and fixed immediately.
Tom asked what happens when we open PQM early *and* we discover we
need to land a fix for the release.
QA state was misleading; the builder code did not work on cesium.
Julian
explained it was tested for several days and was placed under load.
The real issue is that production is configured differently from
dogfood; it has different hardware and load conditions.
Jonathan
We do not have a good QA environment for the build farm?
Julian and Francis
Dogfood is good, but it could be better. Production will always be
different, and we are always changing the code and environments.
QA can fail, and we can roll the change back. Being able to respond
quickly is very important. *When we do a release with downtime*
(a db-devel release) we cannot roll back quickly.
Jonathan
Was this less stressful because most of the changes were already
released?
Francis
Since QA is timely, there is less work for the RM to do--most of the
RM time is spent chasing QA issues.
Francis and Henning
There was a backup job running that was blocking. The job was killed.
QA issues
https://bugs.launchpad.net/launchpad-registry/+bug/676477
https://bugs.launchpad.net/launchpad-foundations/+bug/676489
https://bugs.launchpad.net/launchpad-code/+bug/676495
Sprints, events, and conferences
see https://wiki.canonical.com/Launchpad/Sprints
Robert, Gary, and Elliot are attending Cassandra.
Danilo will be away for 3 weeks.
Jonathan created https://dev.launchpad.net/BugJam
Infrastructure issues
Next meeting is November 25th.
Release Manager in the call *before* the release (Danilo)
The RM is invited to attend in week 3 and week 4.
Reminder about crisis handling policy (flacoste)
Francis
Kiko had a way to make a production incident a big deal. We have
a crisis handling policy, but we are not following it. If the policy
is broken, we must fix it.
We could have a standing TL topic to review incident reports so that
we know the policy works and that issues are being dealt with.
TLs
The new topic is accepted!
Jonathan
We need to simplify/clarify the definition of critical. Critical
for a team is not critical for all Lp.
Danilo and Francis
We need clarification when the issue is handed to someone else.
The timeline is tedious to maintain, but it is also where we lose
information during hand-offs. Maybe someone should be keeping notes
so that the people working the issue are not distracted? Is this
really worthwhile if the incident goes on for hours or days?
Tim, Julian, and Francis
The hand-off for the buildd incident was done by a community
contributor (William) because he was the most capable. Our teams are
organised by domain knowledge and we are clustered in time zones.
Danilo and Francis
The process does not scale; William does not scale.
Julian and Danilo
Julian wonders whether team rotations would help address the domain
knowledge issue. Francis is already planning something on this matter.
ACTION: Robert will propose a new definition of critical.
ACTION: Francis will revise the hand-off definition.
Incident review (flacoste)
IncidentReports/2010-11-11-LP-Bzr-Leaked-Stack-Trace
Recommendations
Don't leak confusing traceback messages to the bzr client.
Bug #675517
Remove the private XML-RPC bottleneck. That is tracked in
RT #41465. DONE
Review deployment architecture for other potential bottlenecks and
single points of failure. DONE
Revise the critical policy to cover this kind of incident.
RobertCollins to frame discussion and initiate. (action above)
Remind about and enforce the existing policies about crisis handling
and incident communication. Francis sent an email.
While it's somewhat coincidental, a shorter LOSA work queue (e.g. if
there were more LOSAs so items go through more quickly) would have
meant we never had this crisis: capacity/robustness planning led to
ticket 41465 being filed (and prioritised because we saw things
correlated with request volume). (The initial filing talks about
'being serviced by the entire server farm'.)
The devs looking at the xmlrpc performance prior to the meltdown could
have been a lot clearer about what they were seeing, and done a risk
analysis that would have turned up the bzr connection.
Tim
Without access to logs it was hard to see what was happening. We could
not see that the servers were really under load.
ACTION: Francis to investigate visibility into the length of request
queues. The queue lengths could be graphed to show when the app
servers are being overloaded.
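As an illustration only (not a description of Launchpad's actual
monitoring stack), here is a minimal Python sketch of the kind of
visibility that action item is about: sample a queue-length metric on
a timer and log timestamped values that any graphing tool can plot.
get_request_queue_length() is a hypothetical placeholder.

    import csv
    import time

    def get_request_queue_length():
        """Placeholder: return the number of requests waiting on the
        app server. Wire this up to the real metric source."""
        raise NotImplementedError("hook up to the real appserver metric")

    def sample_queue_length(path="request-queue.csv", interval=60):
        """Append one (timestamp, queue length) row per interval so the
        series can be graphed to spot overload."""
        with open(path, "a", newline="") as f:
            writer = csv.writer(f)
            while True:
                writer.writerow([int(time.time()),
                                 get_request_queue_length()])
                f.flush()
                time.sleep(interval)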
--
__Curtis C. Hovey_________
http://launchpad.net/