Team Leads meeting 2010-11-17
=============================
https://wiki.canonical.com/Launchpad/TeamLeadMeetings
chair: sinzui
next meeting 2010-11-24 17:00 UTC, chair thumper
"Good news and Pleasant surprises", capped at 15 minutes
Curtis Hovey
* Foundations has 150 fewer bugs. Launchpad web and registry have gained
about 25 bugs each.
Danilo Šegan
* The pofile+translate timeout was fixed.
Diogo Matsubara
* Oops summaries are readable now that deryck fixed the checkwatches bug.
Deryck Hodge
* No good news.
Francis Lacoste
* New card id field in kanban that lets us link to an lp bug.
Henning Eggers
* Rollout successful. It took less than an hour.
Jonathan Lange
* We will make an offer to a candidate for the front-end engineer and
designer position.
* Completed a series of blog posts.
Julian Edwards
* Build manager is in production and is working (mostly).
Tim Penhey
* The private xmlrpc server is load balanced. No more timeouts.
Apologies
Gary Poster
Robert Collins
Elliot Murphy
Action items
ACTION: Gary to take discussion about how to improve the staging DB
reliability to the list. Done.
ACTION: Francis and Robert to come up with a plan on automating requests
for a rollout. In discussion.
ACTION: Danilo to invite Henning for the next meeting. Done.
ACTION: TLs to assign a bug tag for all in-progress features to respective
LEPs. Done.
ACTION: Francis to send an email describing policy for triaging bugs
critical for deployment of a feature (bug-tag + high priority). Done.
Post-release retrospective
https://wiki.canonical.com/Launchpad/RolloutReports/10.11
Henning:
QA was complete by Monday morning. Continuous deployments are
an excellent incentive to do QA early. There were a few issues during
the rollout that Tom identified and fixed immediately.
Tom asked what happens when we open PQM early *and* we discover we
need to land a fix for the release.
QA state was misleading; the builder code did not work on cesium.
Julian
explained it was tested for several days and was placed under load.
The real issue is that production is configured differently from
dogfood; it has different hardware and load conditions.
Jonathan
We do not have a good QA environment for the build farm?
Julian and Francis
Dogfood is good, but it could be better. Production will always be
different, and we are always changing the code and environments.
QA can fail, and we can roll the change back. Being able to respond
quickly is very important. *When we do a release with downtime*
(a db-devel release) we cannot roll back quickly.
Jonathan
Was this less stressful because most of the changes were already
released?
Francis
Since QA is timely, there is less work for the RM to do--most of the
RM time is spent chasing QA issues.
Francis and Henning
There was a backup job running that was blocking. The job was killed.
QA issues
https://bugs.launchpad.net/launchpad-registry/+bug/676477
https://bugs.launchpad.net/launchpad-foundations/+bug/676489
https://bugs.launchpad.net/launchpad-code/+bug/676495
Sprints, events, and conferences
see https://wiki.canonical.com/Launchpad/Sprints
Robert, Gary, and Elliot are attending Cassandra.
Danilo will be away for 3 weeks.
Jonathan created https://dev.launchpad.net/BugJam
Infrastructure issues
Next meeting is November 25th.
Release Manager in the call *before* the release (Danilo)
The RM is invited to attend in week 3 and week 4.
Reminder about crisis handling policy (flacoste)
Francis
Kiko had a way to make a production incident a big deal. We have
a crisis handling policy, but we are not following it. If the policy
is broken, we must fix it.
We could have a standing TL topic to review incident reports so that
we know the policy works and that issues are being dealt with.
TLs
The new topic is accepted!
Jonathan
We need to simplify/clarify the definition of critical. Critical
for a team is not critical for all Lp.
Danilo and Francis
We need clarification when the issue is handed to someone else.
The timeline is tedious to maintain, but it is also where we lose
information during hand-offs. Maybe someone should be keeping notes
so that the people working the issue are not distracted? Is this
really worthwhile if the incident goes on for hours or days?
Tim, Julian, and Francis
The hand-off for the buildd incident was done by a community
contributor (William) because he was the most capable. Our teams are
organised by domain knowledge and we are clustered in time zones.
Danilo and Francis
The process does not scale; William does not scale.
Julian and Danilo
Julian wonders whether team rotations would help address the domain
knowledge issue. Francis is already planning something on this matter.
ACTION: Robert will propose a new definition of critical.
ACTION: Francis will revise the hand-off definition.
Incident review (flacoste)
IncidentReports/2010-11-11-LP-Bzr-Leaked-Stack-Trace
Recommendations
Don't leak confusing traceback messages to the bzr client.
Bug #675517
Remove the private XML-RPC bottleneck. That is tracked in
RT #41465. DONE
Review deployment architecture for other potential bottlenecks and
single points of failure. DONE
Revise the critical policy to cover this kind of incident.
RobertCollins to frame discussion and initiate. (action above)
Remind about and enforce the existing policies about crisis handling
and incident communication. Francis sent an email.
While it's somewhat coincidental, a shorter LOSA work queue (e.g. if
there were more LOSAs so items go through more quickly) would have
meant we never had this crisis: capacity/robustness planning led to
ticket 41465 being filed (and prioritised because we saw things
correlated with request volume). (The initial filing talks about
'being serviced by the entire server farm'.)
The devs looking at the xmlrpc performance prior to the meltdown could
have been a lot clearer about what they were seeing, and done a risk
analysis that would have turned up the bzr connection.
Tim
Without access to logs it was hard to see what was happening. We could
not see that the servers were really under load.
ACTION: Francis to investigate visibility into the length of request
queues. The queue lengths could be graphed to show when the app
servers are being overloaded.
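As an illustration only (not a description of Launchpad's actual
monitoring stack), here is a minimal Python sketch of the kind of
visibility that action item is about: sample a queue-length metric on
a timer and log timestamped values that any graphing tool can plot.
get_request_queue_length() is a hypothetical placeholder.

    import csv
    import time

    def get_request_queue_length():
        """Placeholder: return the number of requests waiting on the
        app server. Wire this up to the real metric source."""
        raise NotImplementedError("hook up to the real appserver metric")

    def sample_queue_length(path="request-queue.csv", interval=60):
        """Append one (timestamp, queue length) row per interval so the
        series can be graphed to spot overload."""
        with open(path, "a", newline="") as f:
            writer = csv.writer(f)
            while True:
                writer.writerow([int(time.time()),
                                 get_request_queue_length()])
                f.flush()
                time.sleep(interval)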
--
__Curtis C. Hovey_________
http://launchpad.net/