
dolfin team mailing list archive

Re: profiling an assembly

 

On Sat 2008-05-17 16:47, Johan Hoffman wrote:
> > Yes, I got the same numbers with PETSc.  I checked and it is the same
> > problem with uBlas; I am pretty sure that searching for the elements during
> > assembly takes a very long time.  Is it possible to change the element
> > matrix A(indx) directly in uBlas?  If it is, the speedup could be
> > substantial.

I really doubt this.  Indexing the matrix this way is slow; it would be similar
to calling MatSetValue (no -s) once per entry.  Perhaps the trouble is that you
set the same value many times.  Is matrix assembly really significant compared
to solving?
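
For what it's worth, the fast path on the PETSc side is to hand over the whole
element matrix in one call rather than indexing entries one at a time; a rough
sketch (the function and variable names here are mine, not DOLFIN's):

#include <petscmat.h>

/* Sketch: scatter one n x n element matrix Ae (row-major) into the global
 * matrix at the element's global dofs with a single MatSetValues() call,
 * instead of touching entries individually. */
static PetscErrorCode AddElementMatrix(Mat A, PetscInt n, const PetscInt dofs[],
                                       const PetscScalar Ae[])
{
  return MatSetValues(A, n, dofs, n, dofs, Ae, ADD_VALUES);
}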

> > There is a MatSetOption in PETSc, MAT_USE_HASH_TABLE, that does exactly
> > what I would like.  But that option does not work with the AIJ format we
> > are using in dolfin.
> 
> Ok. Good. Why does this not work? With what matrix formats does it work?

According to the source, only the BAIJ formats.  Assuming you never eliminate
parts of a vector when enforcing boundary conditions (though you usually
should), BAIJ would probably be better.  However, when I asked about this
recently, Barry said it is unlikely to make a big difference assuming inodes are
being used in the AIJ matrix.
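
If someone wants to experiment with the block format anyway, creating a BAIJ
matrix directly looks roughly like this (the block size bs and the
30-nonzeros-per-block-row guess are placeholders of mine, not DOLFIN values):

#include <petscmat.h>

/* Sketch: sequential BAIJ matrix with n local rows and block size bs.
 * The per-block-row nonzero estimate (30) is only a placeholder. */
static PetscErrorCode CreateBlockMatrix(MPI_Comm comm, PetscInt n, PetscInt bs,
                                        Mat *A)
{
  PetscErrorCode ierr;
  ierr = MatCreate(comm, A);CHKERRQ(ierr);
  ierr = MatSetSizes(*A, n, n, PETSC_DETERMINE, PETSC_DETERMINE);CHKERRQ(ierr);
  ierr = MatSetType(*A, MATSEQBAIJ);CHKERRQ(ierr);
  ierr = MatSeqBAIJSetPreallocation(*A, bs, 30, NULL);CHKERRQ(ierr);
  return 0;
}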

> >> Still, as I vary the size of the mesh I get this performance metric
> >> virtually constant:
> >> Assembled 7.3e+05 non-zero matrix elements per second (first pass)
> >> Assembled 1.4e+06 non-zero matrix elements per second (re-assemble).

This number seems a little low: in pure PETSc codes built without debugging, I
can see 10^7/sec.  That is, a particular FD matrix with 2e7 nonzeros takes 2
seconds to assemble.  Of course, in that setting we set all the elements in a
row at once and never come back to it.  In the finite element setting, assembly
will always take longer, but for bigger systems it should still be much cheaper
than solving.  In FEM you update some of the same values multiple times: a
corner node is shared by about 6 triangles in 2D and around 20 tetrahedra in
3D, so with a low-order discretization most or all of your degrees of freedom
are shared by multiple elements and you add contributions several times per
row.  There is no way to `fix' this in the finite element framework.

How long does it take to solve the system?  Have you compared with a PETSc
built without debugging (it can make a big difference!)?  What does running
with -log_summary give?  Is the preallocation correct on the first pass?  (Run
with -info and look for the number of times malloc was called in the first
assembly pass.)
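
If parsing the -info output is awkward, the same counter can be read
programmatically after the first assembly; something along these lines (a
sketch, not DOLFIN code):

#include <petscmat.h>

/* Sketch: query the counters that -info prints.  A nonzero 'mallocs' after
 * the first assembly pass means the preallocation was too small. */
static PetscErrorCode ReportAssemblyMallocs(Mat A)
{
  MatInfo        info;
  PetscErrorCode ierr;
  ierr = MatGetInfo(A, MAT_LOCAL, &info);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_SELF, "mallocs during MatSetValues(): %g\n",
                     info.mallocs);CHKERRQ(ierr);
  return 0;
}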


Also note Andy Terrel's post regarding preallocation.  This is the same bug I
mentioned in this thread, and moving MatSetFromOptions() up fixes it.  The
trouble is that MatSetUp() (which MatSetFromOptions() calls) must run before
preallocation, otherwise the preallocation information is discarded.  When you
call MatSeqAIJSetPreallocation(), it will preallocate space for any matrix type
that inherits from SeqAIJ.  You can call MatMPIAIJSetPreallocation() at the
same point, and it will work for all matrix types that inherit from MPIAIJ.
This covers all the direct solvers people are likely to use (Mumps, Umfpack,
Spooles, SuperLU).  If block matrices are used, you can call the BAIJ versions
here as well.
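
In code, the ordering that works looks roughly like the following (the nnz,
d_nnz, and o_nnz arrays are per-row nonzero counts computed from the mesh; the
names are mine):

#include <petscmat.h>

/* Sketch: set the type from the options database first, then preallocate.
 * Preallocation calls for types other than the actual one are ignored, so it
 * is safe to issue both the Seq and MPI variants. */
static PetscErrorCode CreatePreallocated(MPI_Comm comm, PetscInt n,
                                         const PetscInt nnz[],
                                         const PetscInt d_nnz[],
                                         const PetscInt o_nnz[], Mat *A)
{
  PetscErrorCode ierr;
  ierr = MatCreate(comm, A);CHKERRQ(ierr);
  ierr = MatSetSizes(*A, n, n, PETSC_DETERMINE, PETSC_DETERMINE);CHKERRQ(ierr);
  ierr = MatSetFromOptions(*A);CHKERRQ(ierr); /* type must be set before preallocation */
  ierr = MatSeqAIJSetPreallocation(*A, 0, nnz);CHKERRQ(ierr);
  ierr = MatMPIAIJSetPreallocation(*A, 0, d_nnz, 0, o_nnz);CHKERRQ(ierr);
  return 0;
}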

Before the preallocation fix:

[0] MatAssemblyEnd_SeqAIJ(): Matrix size: 25140 X 25140; storage space: 261180 unneeded,730140 used
[0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 57708
[0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 59

After the fix:

[0] MatAssemblyEnd_SeqAIJ(): Matrix size: 25140 X 25140; storage space: 0 unneeded,730140 used
[0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
[0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 59

The time difference is around 3 orders of magnitude.


Jed
