← Back to team overview

launchpad-dev team mailing list archive

Performance Tuesday: deep-narrow or shallow-wide but never deep-wide (real version)

 

Bad email client! bad!. (I sent this half-edited before. Sorry!)

Jeroen or Stuart are probably the best folk to ask about this topic :)
However on performance Tuesday this week I ran into a really clear
example...

The subject refers to how to get fast queries out of postgres: ask
queries that consult a lot of data from one table, or consult little
data from many tables, but never a lot of data from many tables.

In bug 787294 I ran into something which may be a case of this. I also
had a full-day
headache which led to some interesting analysis mistakes ;).

For instance, this humdinger candidate query plan:
16:47 #launchpad-dev: < lifeless>  Nested Loop
(cost=2.76..17359385328490.43 rows=247357031677542 width=4)

Anyhow, in this bug we're querying for all the bugs that the team is
related to (assigned, subscribed, created, commented or structurally
subscribed) to which have patches. The count of patches needed is 102.

The current live plan has an estimated cost of 800K:
Aggregate (cost=816866.73..816866.74 rows=1 width=0)
  -> Nested Loop (cost=705926.62..816816.29 rows=20177 width=0)
        -> HashAggregate (cost=705926.62..706128.39 rows=20177 width=280)
              -> Append (cost=0.00..704564.67 rows=20177 width=280)

with the bulk of the cost coming in from the structural subscription subplan:
                 -> Nested Loop (cost=43392.43..702903.59 rows=20168 width=280)
                          Join Filter: ((public.bugtask.product =
structuralsubscription.product) OR (public.bugtask.productseries =
structuralsubscription.productseries) OR ((public.product.project =
structuralsubscription.project) AND (public.bugtask.product =
public.product.id)) OR ((public.bugtask.distribution =
structuralsubscription.distribution) AND
((public.bugtask.sourcepackagename =
structuralsubscription.sourcepackagename) OR
(structuralsubscription.sourcepackagename IS NULL))) OR
(public.bugtask.distroseries = structuralsubscription.distroseries) OR
(public.bugtask.milestone = structuralsubscription.milestone))
...

I ran the original query on qastaging, and stopped it after 45 minutes.

Now, qastaging is missing an index that production and staging has, so
I moved over to staging to test.

I successively refined the queries I ran to exercise smaller and
smaller variables, trying to find what triggered the bad plan. (45
minutes is -much- longer than just doing a raw read of all the data
could explain).

This query:SELECT distinct BugTask.bug FROM BugTask inner join bug on
bug.id=bugtask.bug LEFT JOIN Product ON BugTask.product = Product.id
AND Product.active inner join structuralsubscription ss on (
ss.subscriber = 343381
 and(
        BugTask.product = ss.product
        OR BugTask.productseries = ss.productseries
        OR Product.project = ss.project
        OR BugTask.distribution = ss.distribution
        AND (BugTask.sourcepackagename = ss.sourcepackagename
             OR ss.sourcepackagename IS NULL)
        OR BugTask.distroseries = ss.distroseries
        OR BugTask.milestone = ss.milestone)
)
        WHERE ((BugTask.status = 10) OR (BugTask.status = 15) OR
(BugTask.status = 20) OR (BugTask.status = 21) OR (BugTask.status =
22) OR (BugTask.status = 25))
          AND Bug.duplicateof IS NULL
          AND Bug.latest_patch_uploaded IS NOT NULL
          AND Bug.private = FALSE;

which checks all the structural subscriptions has an estimated cost of 110K:
 HashAggregate (cost=112166.83..112397.51 rows=23068 width=4)
and runs in 88000 ms.

This query:
SELECT distinct BugTask.bug FROM BugTask inner join bug on
bug.id=bugtask.bug LEFT JOIN Product ON BugTask.product = Product.id
AND Product.active inner join structuralsubscription ss on (
ss.subscriber = 343381
 and(
        BugTask.distribution = ss.distribution
        AND (BugTask.sourcepackagename = ss.sourcepackagename
             OR ss.sourcepackagename IS NULL)

        )
)
        WHERE ((BugTask.status = 10) OR (BugTask.status = 15) OR
(BugTask.status = 20) OR (BugTask.status = 21) OR (BugTask.status =
22) OR (BugTask.status = 25))
          AND Bug.duplicateof IS NULL
          AND Bug.latest_patch_uploaded IS NOT NULL
          AND Bug.private = FALSE;
has an estimated cost of 130K:
 HashAggregate  (cost=131351.38..133944.19 rows=259281 width=4)
and runs in 900ms.

The only difference is checking additional fields in the structural
subscription.

Now, the full check is triggering a pathological behaviour in
postgresql - it may even be a bug. The differing join filter (perhaps
due to a statistics triggered thing) changes the plan from walking the
subscribed bugtargets (slow case) to walking bugs with patches and
filtering down to subscriptions (faster case). AIUI this is what
happens with deep-wide queries: the planner eventually decides we'll
see all rows and starts walking from the smallest table, which is
utterly wrong when we'll reenter large related tables: and the server
team has 150 structural subscriptions.

So the question became, how to express the query such that postgresql
won't be so confused.

This query (https://bugs.launchpad.net/launchpad/+bug/787294/comments/11)
does that:

with ss as (select * from structuralsubscription where subscriber=343381)
SELECT distinct BugTask.bug FROM BugTask inner join bug on
bug.id=bugtask.bug LEFT JOIN Product ON BugTask.product = Product.id
AND Product.active left join ss ss1 on BugTask.product = ss1.product
left join ss ss2 on BugTask.productseries = ss2.productseries left
join ss ss3 on Product.project = ss3.project left join ss ss4 on
BugTask.distribution = ss4.distribution
         AND (BugTask.sourcepackagename = ss4.sourcepackagename
              OR ss4.sourcepackagename IS NULL) left join ss ss5 on
BugTask.distroseries = ss5.distroseries left join ss ss6 on
BugTask.milestone = ss6.milestone
         WHERE ((BugTask.status = 10) OR (BugTask.status = 15) OR
(BugTask.status = 20) OR (BugTask.status = 21) OR (BugTask.status =
22) OR (BugTask.status = 25))
           AND Bug.duplicateof IS NULL
           AND Bug.latest_patch_uploaded IS NOT NULL
           AND Bug.private = FALSE
          AND (Bugtask.product IS NULL
                OR Product.active = TRUE)
AND null_count(ARRAY[ss1.id, ss2.id, ss3.id, ss4.id, ss5.id, ss6.id]) < 6;

This has a nearly identical cost:

 HashAggregate  (cost=112243.53..112313.50 rows=6997 width=4)
But runs quickly enough:

Time: 1130.509 ms

The change to a number of shallow joins avoided the confusion and the
result runs much faster... for this case.

Moral of the story: sql is hard :)

-Rob


Follow ups