dolfin team mailing list archive - Message #15574
Re: Results: Parallel speedup
To: dolfin-dev@xxxxxxxxxx
From: Niclas Jansson <njansson@xxxxxx>
Date: Tue, 22 Sep 2009 08:45:49 +0200
In-reply-to: <20090922061803.GA14434@olorin> (Anders Logg's message of "Tue, 22 Sep 2009 08:18:03 +0200")
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.2 (gnu/linux)
Anders Logg <logg@xxxxxxxxx> writes:
> On Tue, Sep 22, 2009 at 08:11:27AM +0200, Niclas Jansson wrote:
>> Matthew Knepley <knepley@xxxxxxxxx> writes:
>>
>> > On Mon, Sep 21, 2009 at 2:37 PM, Anders Logg <logg@xxxxxxxxx> wrote:
>> >
>> > Johan and I have set up a benchmark for parallel speedup in
>> >
>> > bench/fem/speedup
>> >
>> > Here are some preliminary results:
>> >
>> > Processes | Assemble   Assemble + solve
>> > ----------------------------------------
>> >         1 |  1             1
>> >         2 |  1.4351        4.0785
>> >         4 |  2.3763        6.9076
>> >         8 |  3.7458        9.4648
>> >        16 |  6.3143       19.369
>> >        32 |  7.6207       33.699
>> >
>> > These numbers are very very strange for a number of reasons:
>> >
>> > 1) Assemble should scale almost perfectly. Something is wrong here.
>> >
>> > 2) Solve should scale like a matvec, which should not be this good,
>> > especially on a cluster with a slow network. I would expect 85% or so.
>> >
>> > 3) If any of these are dual core, then it really does not make sense since
>> > it should be bandwidth limited.
>> >
>> > Matt
>> >
>>
>> So true, these numbers are very strange. I usually get 6-7 times speedup
>> for the icns solver in Unicorn on a crappy bus-based Intel 2 x quad-core machine.
>>
>> A quick look at the code: is the mesh only 64 x 64? That could (and does)
>> explain the poor assembly performance on 32 processes (^-^)
>
> It's 64 x 64 x 64 (3D). What would be a reasonable size?
>
Ok, my bad, but maybe 128 x 128 x 128 or even 256 x 256 x 256 would perform better.
I usually choose the size such that each process gets a local mesh comparable to a
reasonably expensive serial job (when all processes are in use).
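
In practice that rule of thumb looks something like the sketch below. It is plain
MPI/C++ rather than DOLFIN code, and the target of 200000 cells per process, as well
as the estimate of 6*n^3 tetrahedra for an n x n x n unit cube mesh, are just
illustrative assumptions:

  #include <mpi.h>
  #include <cmath>
  #include <cstdio>

  int main(int argc, char* argv[])
  {
    MPI_Init(&argc, &argv);

    int num_processes = 1;
    MPI_Comm_size(MPI_COMM_WORLD, &num_processes);

    // Target local problem size: roughly what a reasonably expensive
    // serial job would use on a single process (illustrative number).
    const double target_cells_per_process = 200000.0;

    // An n x n x n unit cube mesh has about 6*n^3 tetrahedra, so pick n
    // such that 6*n^3 divided by the number of processes hits the target.
    const double total_cells = target_cells_per_process * num_processes;
    const int n = static_cast<int>(std::ceil(std::pow(total_cells / 6.0, 1.0 / 3.0)));

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
      std::printf("Use a %d x %d x %d mesh for %d processes\n",
                  n, n, n, num_processes);

    MPI_Finalize();
    return 0;
  }

With 32 processes and that (made up) target this lands around n = 103, i.e. somewhere
between the 64 and 128 cases discussed above.
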
Niclas
>> Also, I think the timing is done in the wrong way. Without barriers, it
>> would never measure the true parallel runtime.
>>
>> MPI_Barrier
>> MPI_Wtime
>> number crunching
>> MPI_Barrier
>> MPI_Wtime
>>
>> (Well, assemble is more or less an implicit barrier due to apply(), but I
>> don't think the solvers have any kind of implicit barrier)
>
> I thought there were implicit barriers in both assemble (apply) and
> the solver, but adding barriers would not hurt.
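
Just to spell out the barrier pattern I sketched above, a minimal MPI/C++ version
would look roughly like this; number_crunching() is only a placeholder for whatever
assemble/solve calls are being benchmarked, not an actual DOLFIN function:

  #include <mpi.h>
  #include <cstdio>

  // Placeholder for the work being timed (assembly, solve, ...).
  static void number_crunching()
  {
    // e.g. assembly and linear solve would go here
  }

  int main(int argc, char* argv[])
  {
    MPI_Init(&argc, &argv);

    // Make sure all processes have arrived before starting the clock.
    MPI_Barrier(MPI_COMM_WORLD);
    const double t0 = MPI_Wtime();

    number_crunching();

    // Wait for the slowest process before stopping the clock, so the
    // interval t1 - t0 is the parallel runtime of the whole job rather
    // than the runtime of the local work on this process.
    MPI_Barrier(MPI_COMM_WORLD);
    const double t1 = MPI_Wtime();

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
      std::printf("elapsed: %g s\n", t1 - t0);

    MPI_Finalize();
    return 0;
  }
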
>
> --
> Anders
> _______________________________________________
> DOLFIN-dev mailing list
> DOLFIN-dev@xxxxxxxxxx
> http://www.fenics.org/mailman/listinfo/dolfin-dev