dolfin team mailing list archive

Re: Assembly benchmark

To: Matthew Knepley <knepley@xxxxxxxxx>
From: "Garth N. Wells" <gnw20@xxxxxxxxx>
Date: Tue, 22 Jul 2008 09:30:32 +0100
Cc: dolfin-dev@xxxxxxxxxx
Delivered-to: dolfin-dev@xxxxxxxxxx
In-reply-to: <a9f269830807211505v133b74dw93fbf50212d5bc5a@mail.gmail.com>
User-agent: Thunderbird 2.0.0.14 (X11/20080505)



Matthew Knepley wrote:

On Mon, Jul 21, 2008 at 4:48 PM, Anders Logg <logg@xxxxxxxxx> wrote:

On Mon, Jul 21, 2008 at 04:37:28PM -0500, Matthew Knepley wrote:

On Mon, Jul 21, 2008 at 4:35 PM, Anders Logg <logg@xxxxxxxxx> wrote:

On Mon, Jul 21, 2008 at 04:03:11PM -0500, Matthew Knepley wrote:

On Mon, Jul 21, 2008 at 3:55 PM, Matthew Knepley <knepley@xxxxxxxxx> wrote:

On Mon, Jul 21, 2008 at 3:50 PM, Garth N. Wells <gnw20@xxxxxxxxx> wrote:


Anders Logg wrote:

On Mon, Jul 21, 2008 at 01:48:23PM +0100, Garth N. Wells wrote:

Anders Logg wrote:

I have updated the assembly benchmark to also include MTL4; see

   bench/fem/assembly/

Here are the current results:

Assembly benchmark  |  Elasticity3D  PoissonP1  PoissonP2  PoissonP3  THStokes2D  NSEMomentum3D  StabStokes2D
-------------------------------------------------------------------------------------------------------------
uBLAS               |        9.0789    0.45645     3.8042     8.0736      14.937         9.2507        3.8455
PETSc               |        7.7758    0.42798     3.5483     7.3898      13.945         8.1632         3.258
Epetra              |        8.9516    0.45448     3.7976     8.0679      15.404         9.2341        3.8332
MTL4                |        8.9729    0.45554     3.7966     8.0759       14.94         9.2568        3.8658
Assembly            |         7.474    0.43673     3.7341     8.3793      14.633         7.6695        3.3878


I specified a maximum of 30 nonzeros per row in MTL4Matrix, and the
results change quite a bit:

Assembly benchmark  |  Elasticity3D  PoissonP1  PoissonP2  PoissonP3  THStokes2D  NSEMomentum3D  StabStokes2D
-------------------------------------------------------------------------------------------------------------
uBLAS               |        7.1881    0.32748     2.7633     5.8311      10.968         7.0735        2.8184
PETSc               |        5.7868    0.30673     2.5489     5.2344      9.8896          6.069        2.3661
MTL4                |        2.8641    0.18339     1.6628     2.6811      2.8519         3.4843       0.85029
Assembly            |        5.5564    0.30896     2.6858     5.9675      10.622         5.7144        2.4519


MTL4 is a lot faster in all cases.
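
The thread does not explain why the size hint changes the timings. One common reason for such an effect is that storage sized up front never has to grow during insertion. The sketch below is a toy Python model written for illustration, not MTL4's data structure or API: `RowStorage`, its `capacity_hint` argument, and the doubling growth policy are all assumptions. It counts how often a row has to be enlarged while entries are added.

```python
class RowStorage:
    """Toy sparse matrix: each row holds (column, value) pairs in slot lists."""

    def __init__(self, num_rows, capacity_hint):
        # Hypothetical hint: slots reserved per row before any insertion.
        self.cols = [[None] * capacity_hint for _ in range(num_rows)]
        self.vals = [[0.0] * capacity_hint for _ in range(num_rows)]
        self.used = [0] * num_rows
        self.regrowths = 0  # number of times a full row had to be enlarged

    def add(self, row, col, value):
        """Add value to entry (row, col), creating the entry if needed."""
        cols, vals, used = self.cols[row], self.vals[row], self.used[row]
        for k in range(used):
            if cols[k] == col:
                vals[k] += value  # entry exists: accumulate in place
                return
        if used == len(cols):
            # Row is full: enlarge it (doubling, at least one slot).
            extra = max(1, len(cols))
            cols.extend([None] * extra)
            vals.extend([0.0] * extra)
            self.regrowths += 1
        cols[used] = col
        vals[used] = value
        self.used[row] = used + 1

    def get(self, row, col):
        """Return the stored value, or 0.0 for an entry that was never added."""
        for k in range(self.used[row]):
            if self.cols[row][k] == col:
                return self.vals[row][k]
        return 0.0


if __name__ == "__main__":
    for hint in (0, 30):
        m = RowStorage(num_rows=1, capacity_hint=hint)
        for c in range(30):
            m.add(0, c, 1.0)
        print("hint", hint, "regrowths", m.regrowths)
```

With a hint at least as large as the number of entries in each row, the `regrowths` counter stays at zero; a real library would pay for each regrowth with allocation and copying.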

Okay, if you run KSP ex2 (Poisson 2D) and add a logging stage that
times assembly (I checked it in to petsc-dev),
then 1M unknowns takes about 1s:

  Matrix Object:
    type=seqaij, rows=1000000, cols=1000000
    total: nonzeros=4996000, allocated nonzeros=5000000
      not using I-node routines
Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 1.4997e+00  56.3%  3.8891e+08 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  2.200e+01  51.2%
 1:        Assembly: 1.1648e+00  43.7%  0.0000e+00   0.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%

I just cut the solve off. Thus all those numbers are extremely fishy.

  Matt

We shouldn't trust those numbers just yet. Some of it may be Python
overhead (calling the FFC JIT compiler etc).

Does 1M unknowns mean a unit square divided into 2x1000x1000 right
triangles?

It's FD Poisson, which gives the same sparsity and values as P1 Poisson, so
it's a 1000x1000 quadrilateral grid. This was just to time insertion.

  Matt

But this is a different problem. Since you know the sparsity pattern a
priori, you may be able to (i) not compute the sparsity pattern, (ii)


No, we only allocate correctly here.


Matt,

Is there much of a performance difference with MatSeqAIJSetPreallocation
between setting the maximum number of non-zeroes per row (PetscInt nz),
and setting the number of non-zeroes for each row (PetscInt nnz[]) when
the number of non-zeroes per row doesn't differ greatly?


Garth
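
For the ex2 case quoted above, the memory side of this question can be checked by hand. The following is a minimal Python sketch, not PETSc code: `five_point_row_counts` and `allocation_summary` are hypothetical helpers that count the nonzeros in each row of the 5-point stencil on an n x n grid and compare what a single bound (like `nz`) and exact per-row counts (like `nnz[]`) would reserve. It says nothing about timing, only about how many slots each choice allocates.

```python
def five_point_row_counts(n):
    """Nonzeros per row of the 5-point Laplacian on an n x n grid (row-major)."""
    counts = []
    for i in range(n):
        for j in range(n):
            # One diagonal entry plus one entry per existing grid neighbour.
            neighbours = (i > 0) + (i < n - 1) + (j > 0) + (j < n - 1)
            counts.append(1 + neighbours)
    return counts


def allocation_summary(n):
    """Compare slots reserved by one bound per row with exact per-row counts."""
    nnz = five_point_row_counts(n)   # exact per-row counts, in the role of nnz[]
    nz = max(nnz)                    # single upper bound, in the role of nz
    exact = sum(nnz)                 # slots reserved with per-row counts
    uniform = nz * len(nnz)          # slots reserved with one bound for all rows
    return {"rows": len(nnz), "nz": nz, "exact": exact, "uniform": uniform}


if __name__ == "__main__":
    s = allocation_summary(1000)
    print(s)
    print("unused slots with a single bound:", s["uniform"] - s["exact"])
```

For n = 1000 this gives 4996000 exact nonzeros and 5000000 slots with a single bound of 5, which matches the `nonzeros=4996000, allocated nonzeros=5000000` line in the log above. The single bound over-allocates by 4000 slots, under 0.1% of the total.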

compute the entries more efficiently, (iii) not compute the
local-to-global mapping, and (iv) insert the entries more efficiently.
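
Steps (ii) to (iv) can be illustrated for the mesh mentioned earlier in the thread, a unit square split into 2 x n x n right triangles. This is a minimal pure-Python sketch, not DOLFIN or FFC code; the function names, the vertex numbering, the choice of cell diagonal, and the dictionary used as a matrix are all assumptions made for illustration. It uses one fixed P1 element matrix, builds the local-to-global map for each cell, and adds the entries into a global structure.

```python
from collections import defaultdict

# P1 Laplace element matrix for a right triangle with equal legs,
# local vertex 0 at the right angle. In 2D it does not depend on the leg length.
ELEMENT_MATRIX = [
    [1.0, -0.5, -0.5],
    [-0.5, 0.5, 0.0],
    [-0.5, 0.0, 0.5],
]


def vertex(i, j, n):
    """Global index of grid vertex (i, j) on an (n+1) x (n+1) vertex grid."""
    return i * (n + 1) + j


def cells(n):
    """Yield the local-to-global map of each triangle, right-angle vertex first."""
    for i in range(n):
        for j in range(n):
            # Each square is split along the diagonal (i, j) -> (i+1, j+1).
            yield (vertex(i + 1, j, n), vertex(i, j, n), vertex(i + 1, j + 1, n))
            yield (vertex(i, j + 1, n), vertex(i, j, n), vertex(i + 1, j + 1, n))


def assemble(n):
    """Assemble the global P1 stiffness matrix as {(row, col): value}."""
    A = defaultdict(float)
    for l2g in cells(n):
        for a, row in enumerate(l2g):
            for b, col in enumerate(l2g):
                # Insertion step: add the local entry at its global position.
                A[(row, col)] += ELEMENT_MATRIX[a][b]
    return A


if __name__ == "__main__":
    n = 4
    A = assemble(n)
    centre = vertex(2, 2, n)
    print({col: val for (row, col), val in sorted(A.items()) if row == centre})
```

In this sketch the row of an interior vertex has 4 on the diagonal and -1 for the four axis neighbours, which are the values of the unscaled 5-point finite-difference stencil. The two couplings along the cell diagonals are also stored, with value 0, so the row holds seven entries.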


Insertion is the same and we compute the same mapping we always use.
I think you guys overcompute for the l2g.

  Matt

Our timings include all these steps + Python overhead. I'm going to
rewrite it in C++ so we can eliminate that source of uncertainty.

--
Anders


