dolfin team mailing list archive

Thread
Date

Re: profiling an assembly

To: murtazo@xxxxxxxxxxx
From: "Garth N. Wells" <gnw20@xxxxxxxxx>
Date: Sun, 18 May 2008 20:04:58 +0100
Cc: dolfin-dev@xxxxxxxxxx
Delivered-to: dolfin-dev@xxxxxxxxxx
In-reply-to: <54622.85.229.23.207.1211136371.squirrel@webmail.sys.kth.se>
User-agent: Thunderbird 2.0.0.14 (X11/20080505)



Murtazo Nazarov wrote:

On Sat, May 17, 2008 at 04:40:48PM +0200, Johan Hoffman wrote:

If the petsc sparse matrix structure works as I expect; to insert/add an
element you go to the particular row and search all non-zero entries of
that row until you find your column. So the complexity would be = #dofs
x
#row non-zero entries

With a block matrix you can optimize this by just searching for the
first
entry of the block.

I am surprised that "add" should totally dominate the assembly
algorithm.
Is this reasonable?

It's remarkable but it certainly isn't any surpise, see here for example:

  http://simula.no/research/scientific/publications/Simula.SC.46

This wasn't always the case. We've worked quite hard over the last
years to speed up the computation of the element matrix
(tabulate_tensor), to speed up the computation of the dof map
(tabulate_dofs) and to speed up the time it takes to access geometry
(the new Mesh classes). So if the one remaining step (A.add()) now
dominates it means we have been successful. It also means that to gain
more, we need to improve on A.add().


Good! It seems FFC does pretty good job, since only ~15% of the assembly
time is spent to tabulate_tensor, and tabulate_dofs. The most time ~70%
takes A.add().

But we also need to remember that

1. Solve time may dominate assemble anyway so that's where we should
optimize.

2. Assembling the action instead of the operator removes the A.add()
bottleneck.

As mentioned before, we are experimenting with iterating locally over
cells sharing common dofs and combining batches of element tensors
before inserting into the global sparse matrix row by row. Let's see
how it goes.

Some other comments:

Murtazo: It's interesting that femLego is that much faster. It would
be interesting to see exactly why it is faster. Can you take a simple
example (maybe even Poisson) and profile both femLego and DOLFIN on
the same mesh and get detailed results for tabulate_tensor,
tabulate_dofs, A.add(). If femLego is faster on A.add(), then what
linear algebra backend is it using?


Yes, the test we did it is simple 2D Poisson with unite square mesh and an
assembly in FemLego is 3 time faster, because A.add() is done in a way I
wrote in previous mails. The linear algebra package is AZTEC. Perhaps,
dolfin should be much faster than femLego if A.add() is the same, since
FFC is very fast than quadrature rule.

It would be useful if you could provide details of the test. Does ittest assembly alone? Is the mesh initialised before the timing? Whatmatrix format does FemLego use? Fast to assemble data structures areoften slow for matrix-vector multiplications.

I noticed from a previous post that you're using an old version ofDOLFIN. For benchmarking, it would be much more useful to use thedevelopment version. There have been various improvements, for exampleto the computation of the sparsity pattern.

By the way, Kristian has shown that the speed of FFC is due largely tothe various a priori optimisations. FFC can now produce tensorcontraction code and quadrature code which is similar in performance.


Garth

Murtazo: It seems you suggest we should basically store some kind of
index/pointer for where to write directly to memory when adding the
entries. This has two problems: (i) we need to store quite a few of
these indices (n^2*num_cells where n is the local space dimension),
and (ii) it requires cooperation from the linear algebra backend.
We would need to ask PETSc to tell us where in its private data
structures it inserted the entries.


Yes, maybe there is a better way to do it. If we store the global indices
of the A it will be totally A.nz()*numVertices*num_components, but still
we will have a speedup which is more important in some case.

/murtazo

Dag: A test suite/benchmark would be very good to have. We need to
have something in bench/fem/assembly that we can trust and run
regularly to make sure we don't have regressions.

Jed: You seem to know your way around PETSc, so any specific
recommendations regarding the initialization of matrices in
PETScMatrix.cpp are appreciated.

--
Anders

Dag: the first assembly should typically involve an initialization
(at least for a non-linear problem), is this part of you test? If
so, I think it is strange that the difference between the first pass
and the reassemble do not differ (with a faster reassemble).

For a PETSc matrix I do not recognize the std::lower_bound, is this
ublas
specific?

/Johan

It seems this thread got a bit derailed yesterday :(

I've done some more careful profiling:
*) Full assemble once
*) Assemble matrix 30 times without reset
in order to amortize the time for initialization.

The call graph shows that std::lower_bound is called from add:
	dolfin::GenericMatrix::add ->
	dolfin::uBlasMatrix<>:add ->
	std::lower_bound

In this assembly add+children takes 89% of the time, and

tabulate_tensor

taking roughly 9%. Full (gprof) profile attached. Murtazo is probably
right that the performance numbers are virtually the same with PETSc.

will hook it up and try (and let you know if this is not the case).

I do appreciate the complexity of inserting elements into a sparse
matrix, and I do _not_ claim to know better when it comes to the
assembler architecture.

Still, as I vary the size of the mesh I get this performance metric
virtually constant:
Assembled 7.3e+05 non-zero matrix elements per second (first pass)
Assembled 1.4e+06 non-zero matrix elements per second (re-assemble).

Is this a sensible metric? If so, is it well understood how the DOLFIN
assembler performs? If not, I could put together a test-suite for a
range of forms (2/3D, simple/mixed element, simple/complicated
expressions in the form etc).

To restate my question: How should I verify that the assembler is
performing as expected here? Am I looking at some unexpected overhead

in

this assembly (we all know how hard this can be to spot with C++)?

Thanks!
/Dag

Garth N. Wells wrote:

Anders Logg wrote:

On Fri, May 16, 2008 at 12:17:19AM +0200, Murtazo Nazarov wrote:

Hello!

I'm looking at a "suspiciously slow" assembly and would like to
determine what is going on. In general, what should one expect the
most
time-consuming step to be?

This is what my gprof looks like:

Time:
61.97%  unsigned int const* std::lower_bound
25.84%  dolfin::uBlasMatrix<...>::add
8.27%
UFC_NSEMomentum3DBilinearForm_cell_integral_0::tabulate_tensor
1.1%    dolfin::uBlasMatrix<...>::init

Where is lower_bound used? From within uBlasMatrix::add or is it in
building the sparsity pattern?

I suspect that it's either in building the sparsity pattern or
initialising the uBLAS matrix. The matrix structure is initialised by
running across rows and inserting a zero. uBLAS doesn't provide a
mechanism for initialising the underlying data structures directly

for a

  sparse matrix.

Dag: could you you run the same test using PETSc as the backend?

I got these numbers also. I understand that it is very painful in
large
computations.

I see what is a problem with adding into the stiffness matrix A.
Searching
the position of the element which needs to be added takes very long
time,
especially if you are solving big problems with thousands unknowns

and

repeating the assembling a lot of times!

If you know a good way to avoid inserting entries into a sparse

matrix

during assembly, please tell me... :-)

If the assembly is costly, you might want to try assembling the

action

of it instead and send that to a Krylov solver. Inserting into a
vector is much easier than into a sparse matrix.

One way could be finding the global indices of the matrix A once,

and

use
it in the assembly process. By this way we avoid of searching the
element
position and it makes the process significantly fast. But, there is

problem: somehow I cannot get access to the global index of cell in
the A
and change it instead of using MatSetValues (in PETSc).

I don't understand what you suggest here. We do precompute the
sparsity pattern of the matrix and use that to preallocate, but I
don't know of any other way to insert entries than MatSetValues.

I doubt insertion is the real problem, especially as Dag noted that
subsequent assembly operations take only half the time since the

matrix

is already initialised.

PETSc (and no doubt Trilinos) do offer some assembly possibilities

that

we haven't yet exploited because they require a reorganisation of the
dof map.

Garth

I am pretty sure that we may speed up the A.set() and A.get()
processes as
well by the above method.

Please explain.

I am not sure how the dofmap to get rows and cols indices of the

cells

is
implemented. We could avoid repeating this operation as well.

This is already implemented (but maybe not used). DofMap handles

this.

It wraps the generated ufc::dof_map code and may pretabulate (and
possibly reorder) the dofs.

We did some comparison with another free fem toolbox, FemLego, the
assembly process in Dolfin is 3 times slower than FemLego in 2D. I
believe
this number will increase in 3D. FemLego uses quadrature rule for
computing integrals.

Can you benchmark the various parts of the assembly to see what

causes

the slowdown:

  1. Is it tabulate_tensor?
  2. Is it tabulate_dofs?
  3. Is it A.add()?
  4. Something else?

I hope some PETSc guys will help us to do this improvements. Any

other

ideas are welcome!

We are currently experimenting with collecting and preprocessing
batches of entries before inserting into the global sparse matrix in
hope of speeding up the assembly but we don't have any results yet.

_______________________________________________
DOLFIN-dev mailing list
DOLFIN-dev@xxxxxxxxxx
http://www.fenics.org/mailman/listinfo/dolfin-dev

_______________________________________________
DOLFIN-dev mailing list
DOLFIN-dev@xxxxxxxxxx
http://www.fenics.org/mailman/listinfo/dolfin-dev


_______________________________________________
DOLFIN-dev mailing list
DOLFIN-dev@xxxxxxxxxx
http://www.fenics.org/mailman/listinfo/dolfin-dev

_______________________________________________
DOLFIN-dev mailing list
DOLFIN-dev@xxxxxxxxxx
http://www.fenics.org/mailman/listinfo/dolfin-dev



_______________________________________________
DOLFIN-dev mailing list
DOLFIN-dev@xxxxxxxxxx
http://www.fenics.org/mailman/listinfo/dolfin-dev


--
Dr Garth N Wells
Department of Engineering
University of Cambridge
Trumpington Street
Cambridge CB2 1PZ
United Kingdom

tel.   +44 1223 3 32743
fax.   +44 1223 3 32662
e-mail gnw20@xxxxxxxxx

References

profiling an assembly
From: Dag Lindbo, 2008-05-15
Re: profiling an assembly
From: Murtazo Nazarov, 2008-05-15
Re: profiling an assembly
From: Anders Logg, 2008-05-16
Re: profiling an assembly
From: Garth N. Wells, 2008-05-16
Re: profiling an assembly
From: Dag Lindbo, 2008-05-17
Re: profiling an assembly
From: Johan Hoffman, 2008-05-17
Re: profiling an assembly
From: Anders Logg, 2008-05-18
Re: profiling an assembly
From: Murtazo Nazarov, 2008-05-18