Message #03878
architecture review progress
I suspect some of my recent emails have seemed to be jumping all over
the place - and on the surface that is so. However, they are all tied
together at a lower layer. I don't want to cause confusion, so it's
time to tie them together and try to share the patterns I'm seeing as
I review our architecture during this early bootstrap period. I'm
largely seeing things on demand as issues crop up, but nevertheless
I'm getting (I think) decent coverage.
The brilliant:
- many folk have come up to me and said words roughly equivalent to
'wow, I thought I was alone in caring about performance/downtime/etc'.
With the number of folk who want to really have us shine in this
area, I have /no/ doubt that we'll achieve it. Launchpad is no worse
off than bzr was back before performance was made into a key
development metric, and like bzr, I expect a rapid improvement in
Launchpad as we start to assess things more critically.
- Launchpad /is/ very functional and does many things its users want.
So much so that users want to add more and more things to Launchpad :)
- this was a common theme at the Epic. I look forward to making our
system so good that we can rapidly serve these user requests.
The good:
- we have some very powerful diagnostic tools, and they are improving.
- much of our system has a solid scaling and availability story: we
only have ~ 5 action items to get no-downtime upgrades, and only one
of them needs non-trivial development.
- Our code base is really very approachable; for all that it's of a
fairly decent size, the chains to find the causes of issues are pretty
shallow.
The bad:
- we have immensely strong coupling occurring in the system. Recently
observed pain points:
* The DB uses triggers, which make the ORM <-> DB layer more fragile
and less direct (10 hours of testfix due to a Storm bug that is only
possible with triggers)
* It is non-trivial to do out-of-transaction events: actions are
very tightly coupled to their context, either in the DB or in the
webapp. The jobs systems are of sufficiently high friction that they
aren't the first tool developers reach for, and so they aren't
immediately useful
* Actioning a configuration change takes approximately 2 hours,
unless the stock process is bypassed, in which case it only takes 15
minutes!
- related to the coupling story, we are missing fairly standard
infrastructure for an internet-scale system: a queuing system (Jeroen
is a great person to talk to about rabbit, given his MQSeries
experience); high-relevance searching; a system status dashboard;
automated rollouts; write scalability [e.g. sharding/partitioning];
callbacks to user code. Many of these are coming, or having their
requirements assessed at the moment. (A minimal sketch of queue-based
decoupling follows this list.)
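To make the decoupling point concrete, here is a minimal sketch of
publishing a domain event to rabbit so that the follow-on work happens
outside the web request's DB transaction. This is not existing
Launchpad code - the queue name, event shape and use of the pika
client are all illustrative assumptions:

import json
import pika

# Publish a domain event to RabbitMQ so a worker can act on it later,
# outside the web request's DB transaction. Names are illustrative.
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="lp.events", durable=True)

def publish_event(event_type, payload):
    # Fire-and-forget: the webapp commits its own transaction and moves
    # on; a consumer handles the event with no coupling to the request.
    channel.basic_publish(
        exchange="",
        routing_key="lp.events",
        body=json.dumps({"type": event_type, "payload": payload}),
        properties=pika.BasicProperties(delivery_mode=2),  # persistent
    )

publish_event("bug.status_changed",
              {"bug_id": 1, "new_status": "Fix Released"})
connection.close()

The point is the shape rather than the library: a publish call this
cheap is what would make a queue the first tool people reach for,
rather than the higher-friction jobs systems.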
The ugly:
- we have really high friction around making changes, which leads both
to not doing small tweaks and to big changes which are high risk, which
leads to... more friction. The new merge and deployment stories will
help a lot, but also, I think we need to really just make it easy to
improve things. Curtis gave a great lightning talk at the Epic
covering how small changes led to him doing the most bug fixes per
month-long cycle: we should all do more of that.
- We have interlinked performance problems; the DB is a choke point
for writes, and we write a lot - enough that when a backup goes wrong,
we get a timeout spike on lpnet and edge, because we have little
headroom. Queries that take 6000ms on staging (when in cache) take
14000ms on the prod slaves, and 24000ms or more on the prod main:
we're running into contention - we have so much load that things are
slower just because of the load. And we have operations that take
seconds to complete, which adds to the load. Further, because things
are slow, it's very hard to spot new slowdowns, because slow is the
normal situation. (A sketch of measuring query time across
environments follows this list.)
- we have pages for which the minimum time to complete is more than 5
seconds on the server. Server render time is not a great surrogate for
user experience - there are many things which can go wrong when
delivering content to users; however, great server render times are a
necessary condition for a great user experience.
- we have baked-in scalability issues in some areas, which will
require time to track down and fix. I'm going to put the design
guidelines I proposed at the Epic online next week, and after that
start working on scaling/performance guidelines as a specific
subtopic.
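As an aside on spotting slowdowns: one low-tech way to see whether a
query has regressed is to ask PostgreSQL itself to time it in each
environment. A sketch only - the DSNs and the query below are made up,
not real Launchpad connection details:

import psycopg2

QUERY = "SELECT count(*) FROM bugtask WHERE status = %s"

def timed_explain(dsn, query, params):
    # Ask the server for its own measurement of the query's runtime.
    conn = psycopg2.connect(dsn)
    try:
        cur = conn.cursor()
        cur.execute("EXPLAIN ANALYZE " + query, params)
        return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()

for env, dsn in [("staging", "dbname=launchpad_staging"),
                 ("prod-slave", "dbname=launchpad_prod host=slave1")]:
    plan = timed_explain(dsn, QUERY, ("Triaged",))
    # The final plan line reports the server-side runtime in ms.
    print(env, plan[-1])

A regression then shows up as a jump in that per-environment number,
even when 'normal' is already slow.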
I hope the above all ties together well; the emails about different
bits of the system I've been sending out have been largely driven by
specific scaling issues I've uncovered as I dig into the search
performance / relevance story. The specific things I'm suggesting
changes to are things where Launchpad is slow *because* of how we've
solved engineering / design challenges, rather than because of the
sheer number of users we have. As I said at the Epic, if we don't
focus, we'll churn and have a hard time doing anything; however, when
multiple interlocking causes prevent a problem from being solved, we
will need to spread out and solve them: like stop-the-line in LEAN,
the first *really fast, scalable* thing in a system is the hardest
(excluding +opstats, ok?).
Right now, my personal focus is on three things, with no well-defined
priority between them:
- lowering the hard timeout [ensuring we don't have requests hogging
resources, failing faster when we fail, and giving us a backstop to
prevent creeping slowness; a minimal sketch of such a backstop follows
this list]
- search performance [one of the key pages that fails a lot and is
blocking the hard timeout lowering is searching]
- our development story [the slower we iterate, the slower we
improve]. This includes the RFWTAD QA/deployment story, the new
landing system Gary proposed, test suite overhead, etc.
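On the hard timeout item, a statement-timeout backstop at the
database level is one way to guarantee a request fails fast rather
than hogging resources. A minimal sketch, assuming a PostgreSQL
backend driven through psycopg2 - the 5000ms figure and the DSN are
illustrative, not the actual Launchpad configuration:

import psycopg2
import psycopg2.extensions

HARD_TIMEOUT_MS = 5000  # illustrative value

def begin_request(conn, timeout_ms=HARD_TIMEOUT_MS):
    # Any statement exceeding the limit is cancelled by the server.
    cur = conn.cursor()
    cur.execute("SET statement_timeout = %s", (timeout_ms,))

conn = psycopg2.connect("dbname=launchpad_dev")  # illustrative DSN
begin_request(conn)
try:
    cur = conn.cursor()
    cur.execute("SELECT pg_sleep(10)")  # deliberately exceeds the limit
except psycopg2.extensions.QueryCanceled:
    # Fail fast: roll back and render a timeout page / file an OOPS
    # instead of letting the request keep hold of resources.
    conn.rollback()

Lowering HARD_TIMEOUT_MS over time is then the backstop against
creeping slowness mentioned above.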
Of course, I have a fourth thing, which is more important than those
three: helping you guys solve problems in design or implementation;
I've done a bit of this so far, and I'm keen to do more.
Cheers,
Rob