On Tue, Sep 22, 2009 at 08:11:27AM +0200, Niclas Jansson wrote:
Matthew Knepley <knepley@xxxxxxxxx> writes:
On Mon, Sep 21, 2009 at 2:37 PM, Anders Logg <logg@xxxxxxxxx> wrote:
Johan and I have set up a benchmark for parallel speedup in
bench/fem/speedup. Here are some preliminary results (speedup relative
to one process):
  Processes | Assemble | Assemble + solve
  ----------+----------+-----------------
          1 |   1      |    1
          2 |   1.4351 |    4.0785
          4 |   2.3763 |    6.9076
          8 |   3.7458 |    9.4648
         16 |   6.3143 |   19.369
         32 |   7.6207 |   33.699
These numbers are very very strange for a number of reasons:
1) Assemble should scale almost perfectly. Something is wrong here.
2) Solve should scale like a matvec, which should not be this good,
especially on a cluster with a slow network. I would expect 85% or so.
3) If any of these are dual core, then it really does not make sense since
it should be bandwidth limited.
Matt
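For reference, converting the reported speedups into parallel efficiencies (speedup divided by number of processes) makes the oddity concrete: assembly drops to roughly 24% efficiency on 32 processes, while assemble + solve comes out super-linear at roughly 105%. A small throwaway C++ sketch, using only the numbers from the table above:

  #include <cstdio>

  int main()
  {
    // Speedup numbers copied from the table above
    const int    np[]       = {1, 2, 4, 8, 16, 32};
    const double assemble[] = {1, 1.4351, 2.3763, 3.7458, 6.3143, 7.6207};
    const double total[]    = {1, 4.0785, 6.9076, 9.4648, 19.369, 33.699};

    std::printf("  P  eff(assemble)  eff(assemble+solve)\n");
    for (int i = 0; i < 6; ++i)
      std::printf("%3d  %13.2f  %19.2f\n",
                  np[i], assemble[i]/np[i], total[i]/np[i]);

    return 0;
  }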
So true, these numbers are very strange. I usually get 6-7 times speedup
for the icns solver in unicorn on a crappy Intel bus-based 2 x quad core.
From a quick look at the code, is the mesh only 64 x 64? That could (and
probably does) explain the poor assembly performance on 32 processes (^-^)
It's 64 x 64 x 64 (3D). What would be a reasonable size?
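If a larger problem is wanted, bumping the resolution should be a one-line change. A minimal sketch, assuming the C++ UnitCube class of the DOLFIN API from around that time; 128 is just an illustrative value, not a recommendation from this thread:

  #include <iostream>
  #include <dolfin.h>

  using namespace dolfin;

  int main()
  {
    // 128^3 is only an illustrative size; the benchmark currently uses 64^3
    UnitCube mesh(128, 128, 128);
    std::cout << "Number of cells: " << mesh.num_cells() << std::endl;
    return 0;
  }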
Also, I think the timing is done in the wrong way. Without barriers, it
would never measure the true parallel runtime.
  MPI_Barrier(MPI_COMM_WORLD);
  double t0 = MPI_Wtime();
  /* number crunching */
  MPI_Barrier(MPI_COMM_WORLD);
  double t1 = MPI_Wtime();
(Well, assemble is more or less an implicit barrier due to apply(), but I
don't think the solvers have any kind of implicit barrier.)
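To make that pattern concrete, here is a minimal standalone sketch of barrier-synchronized timing in plain MPI (nothing DOLFIN-specific; the loop is just a stand-in for assembly or solve):

  #include <mpi.h>
  #include <cstdio>

  int main(int argc, char* argv[])
  {
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Everyone starts the clock at the same point
    MPI_Barrier(MPI_COMM_WORLD);
    const double t0 = MPI_Wtime();

    // Stand-in for the real work (assembly, solve, ...)
    double s = 0.0;
    for (long i = 0; i < 50000000L; ++i)
      s += 1.0/(i + 1.0);

    // Wait for the slowest process before stopping the clock
    MPI_Barrier(MPI_COMM_WORLD);
    const double t1 = MPI_Wtime();

    if (rank == 0)
      std::printf("Elapsed time: %g s (s = %g)\n", t1 - t0, s);

    MPI_Finalize();
    return 0;
  }

An alternative to the trailing barrier is to reduce the per-process times with MPI_MAX, which likewise reports the slowest process, i.e. the true parallel runtime.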
I thought there were implicit barriers in both assemble (apply) and
the solver, but adding barriers would not hurt.