dolfin team mailing list archive - Message #15574
Re: Results: Parallel speedup
To: dolfin-dev@xxxxxxxxxx
From: Niclas Jansson <njansson@xxxxxx>
Date: Tue, 22 Sep 2009 08:45:49 +0200
In-reply-to: <20090922061803.GA14434@olorin> (Anders Logg's message of "Tue, 22 Sep 2009 08:18:03 +0200")
User-agent: Gnus/5.11 (Gnus v5.11) Emacs/22.2 (gnu/linux)
Anders Logg <logg@xxxxxxxxx> writes:
> On Tue, Sep 22, 2009 at 08:11:27AM +0200, Niclas Jansson wrote:
>> Matthew Knepley <knepley@xxxxxxxxx> writes:
>>
>> > On Mon, Sep 21, 2009 at 2:37 PM, Anders Logg <logg@xxxxxxxxx> wrote:
>> >
>> > Johan and I have set up a benchmark for parallel speedup in
>> >
>> > bench/fem/speedup
>> >
>> > Here are some preliminary results:
>> >
>> > Processes | Assemble   Assemble + solve
>> > ----------------------------------------
>> >         1 |  1             1
>> >         2 |  1.4351        4.0785
>> >         4 |  2.3763        6.9076
>> >         8 |  3.7458        9.4648
>> >        16 |  6.3143       19.369
>> >        32 |  7.6207       33.699
>> >
>> > These numbers are very very strange for a number of reasons:
>> >
>> > 1) Assemble should scale almost perfectly. Something is wrong here.
>> >
>> > 2) Solve should scale like a matvec, which should not be this good,
>> > especially on a cluster with a slow network. I would expect 85% or so.
>> >
>> > 3) If any of these are dual core, then it really does not make sense since
>> > it should be bandwidth limited.
>> >
>> > Matt
>> >
>>
>> So true, these numbers are very strange. I usually get 6-7 times speedup
>> for the icns solver in Unicorn on a crappy bus-based Intel 2 x quad-core machine.
>>
>> A quick look at the code: is the mesh only 64 x 64? That could (and does)
>> explain the poor assembly performance on 32 processes (^-^)
>
> It's 64 x 64 x 64 (3D). What would be a reasonable size?
>
Ok, my bad, but maybe 128 x 128 x 128 or even 256 x 256 x 256 would perform better.
I usually choose the size such that each process gets a local mesh comparable to a
reasonably expensive serial job (when all processes are in use).
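
In practice that rule of thumb looks something like the sketch below. It is plain
MPI/C++ rather than DOLFIN code, and the target of 200000 cells per process, as well
as the estimate of 6*n^3 tetrahedra for an n x n x n unit cube mesh, are just
illustrative assumptions:

  #include <mpi.h>
  #include <cmath>
  #include <cstdio>

  int main(int argc, char* argv[])
  {
    MPI_Init(&argc, &argv);

    int num_processes = 1;
    MPI_Comm_size(MPI_COMM_WORLD, &num_processes);

    // Target local problem size: roughly what a reasonably expensive
    // serial job would use on a single process (illustrative number).
    const double target_cells_per_process = 200000.0;

    // An n x n x n unit cube mesh has about 6*n^3 tetrahedra, so pick n
    // such that 6*n^3 divided by the number of processes hits the target.
    const double total_cells = target_cells_per_process * num_processes;
    const int n = static_cast<int>(std::ceil(std::pow(total_cells / 6.0, 1.0 / 3.0)));

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
      std::printf("Use a %d x %d x %d mesh for %d processes\n",
                  n, n, n, num_processes);

    MPI_Finalize();
    return 0;
  }

With 32 processes and that (made up) target this lands around n = 103, i.e. somewhere
between the 64 and 128 cases discussed above.
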
Niclas
>> Also, I think the timing is done in the wrong way. Without barriers, it
>> would never measure the true parallel runtime.
>>
>> MPI_Barrier
>> MPI_Wtime
>> number crunching
>> MPI_Barrier
>> MPI_Wtime
>>
>> (Well, assemble is more or less an implicit barrier due to apply(), but I
>> don't think the solvers have any kind of implicit barrier)
>
> I thought there were implicit barriers in both assemble (apply) and
> the solver, but adding barriers would not hurt.
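
Just to spell out the barrier pattern I sketched above, a minimal MPI/C++ version
would look roughly like this; number_crunching() is only a placeholder for whatever
assemble/solve calls are being benchmarked, not an actual DOLFIN function:

  #include <mpi.h>
  #include <cstdio>

  // Placeholder for the work being timed (assembly, solve, ...).
  static void number_crunching()
  {
    // e.g. assembly and linear solve would go here
  }

  int main(int argc, char* argv[])
  {
    MPI_Init(&argc, &argv);

    // Make sure all processes have arrived before starting the clock.
    MPI_Barrier(MPI_COMM_WORLD);
    const double t0 = MPI_Wtime();

    number_crunching();

    // Wait for the slowest process before stopping the clock, so the
    // interval t1 - t0 is the parallel runtime of the whole job rather
    // than the runtime of the local work on this process.
    MPI_Barrier(MPI_COMM_WORLD);
    const double t1 = MPI_Wtime();

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
      std::printf("elapsed: %g s\n", t1 - t0);

    MPI_Finalize();
    return 0;
  }
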
>
> --
> Anders
> _______________________________________________
> DOLFIN-dev mailing list
> DOLFIN-dev@xxxxxxxxxx
> http://www.fenics.org/mailman/listinfo/dolfin-dev