dolfin team mailing list archive

Re: profiling an assembly

Anders Logg wrote:
On Mon, May 19, 2008 at 03:44:16PM +0200, Murtazo Nazarov wrote:
Anders Logg wrote:
On Sun, May 18, 2008 at 10:55:10PM +0200, Johan Hoffman wrote:
On Sun 2008-05-18 21:54, Johan Hoffman wrote:
On Sat, May 17, 2008 at 04:40:48PM +0200, Johan Hoffman wrote:

1. Solve time may dominate assembly anyway, so that's where we should
optimize.
Yes, there may be such cases, in particular for simple forms (Laplace
equation etc.). For more complex forms with more terms and coefficients,
assembly typically dominates, from what I have seen. This is the case
for the flow problems of Murtazo, for example.
This probably depends on whether you are using a projection method. If
you are solving the saddle point problem, you can forget about assembly
time.
Well, this is not what we see. I agree that this is what you would like,
but this is not the case now. That is why we are now focusing on the
assembly bottleneck.

But optimizing the solve is all about constructing a good
preconditioner. If the operator is elliptic then AMG should work well
and you don't have to think, but if it is indefinite all bets are off.
I think we can build saddle point preconditioners now by writing some
funny-looking mixed form files, but that could be made easier.
We use a splitting approach with GMRES for the momentum equation and AMG
for the continuity equations. This appears to work fairly well. As I said,
the assembly of the momentum equation is dominating.

2. Assembling the action instead of the operator removes the A.add()
bottleneck.
True. But it may be worthwhile to put some effort into optimizing the
matrix assembly as well.
In any case, you have to form something to precondition with.
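
(A minimal sketch of what assembling the action means, with a made-up
dofmap and element matrix: y = A*x is computed cell by cell, so no
global matrix is ever formed and A.add() never runs.)

// Sketch: matrix-free action y = A*x, assembled cell by cell.
// The dofmap and the element matrix are placeholders; in DOLFIN the
// element tensor would come from the generated tabulate_tensor code.
#include <vector>
#include <cstddef>
#include <iostream>

int main()
{
  const std::size_t num_cells = 4;
  const std::size_t n = 3;   // local dofs per cell (e.g. P1 triangle)
  const std::size_t N = 6;   // global dofs (made up for this sketch)

  // Hypothetical dofmap: global dof indices for each cell.
  std::vector<std::vector<std::size_t>> dofmap = {
    {0, 1, 2}, {1, 2, 3}, {2, 3, 4}, {3, 4, 5}};

  std::vector<double> x(N, 1.0), y(N, 0.0);

  for (std::size_t c = 0; c < num_cells; ++c)
  {
    // Gather local values of x through the dofmap.
    double x_local[3], y_local[3] = {0, 0, 0};
    for (std::size_t i = 0; i < n; ++i)
      x_local[i] = x[dofmap[c][i]];

    // Apply a placeholder element matrix (would be tabulated per cell).
    for (std::size_t i = 0; i < n; ++i)
      for (std::size_t j = 0; j < n; ++j)
        y_local[i] += (i == j ? 2.0 : -0.5) * x_local[j];

    // Scatter: add the local contribution into the global vector y.
    for (std::size_t i = 0; i < n; ++i)
      y[dofmap[c][i]] += y_local[i];
  }

  for (double v : y)
    std::cout << v << "\n";
  return 0;
}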

As mentioned before, we are experimenting with iterating locally over
cells sharing common dofs and combining batches of element tensors
before inserting into the global sparse matrix row by row. Let's see
how it goes.
Yes, this is interesting. It would be very interesting to hear about the
progress.
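
(A rough sketch of this batching idea, with illustrative names: element
tensors from cells sharing dofs are accumulated into per-row buffers,
and each global row is then inserted in one call; insert_row stands in
for the backend call, e.g. a one-row MatSetValues in PETSc.)

// Sketch: accumulate a batch of element tensors keyed by global
// (row, col), then insert each completed row once instead of doing
// n^2 scattered adds per cell.
#include <map>
#include <vector>
#include <cstddef>
#include <iostream>

using RowBuffer = std::map<std::size_t, std::map<std::size_t, double>>;

void insert_row(std::size_t row, const std::map<std::size_t, double>& cols)
{
  // Placeholder for the backend's row insertion.
  std::cout << "row " << row << ": " << cols.size() << " entries\n";
}

int main()
{
  const std::size_t n = 3;
  std::vector<std::vector<std::size_t>> dofmap = {
    {0, 1, 2}, {1, 2, 3}};   // two cells sharing dofs 1 and 2
  double A_c[3][3] = {{2, -1, -1}, {-1, 2, -1}, {-1, -1, 2}};

  // Accumulate both element tensors into the per-row buffer.
  RowBuffer buffer;
  for (std::size_t c = 0; c < dofmap.size(); ++c)
    for (std::size_t i = 0; i < n; ++i)
      for (std::size_t j = 0; j < n; ++j)
        buffer[dofmap[c][i]][dofmap[c][j]] += A_c[i][j];

  // Flush the batch: one insertion per global row.
  for (const auto& [row, cols] : buffer)
    insert_row(row, cols);
  return 0;
}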

It is also interesting to understand what would optimize the insertion
for different linear algebra backends; in particular, Jed seems to have
good knowledge of PETSc. We could then build backend optimization into
the local dof-orderings etc.
I just press M-. when I'm curious :-)

I can't imagine it pays to optimize for a particular backend (it's not
PETSc anyway, rather whichever format is used by the preconditioner).
The CSR data structure is pretty common, but it will always be fastest
to insert an entire row at once. If using an intermediate hashed
structure makes this convenient, then it would help. The paper I posted
assembles the entire matrix in hashed format and then converts it to
CSR. I'll guess that a hashed cache for the assembly (flushed every few
MiB, for instance) would work at least as well as assembling the entire
thing in hashed format.
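
(A sketch of such a hashed cache, assuming a packed-integer (i, j) key:
contributions accumulate in a hash map, and when the cache grows past a
size limit it is flushed, sorted by row, toward the matrix; the printed
"add" lines stand in for the backend insertion.)

// Sketch: hashed assembly cache, flushed periodically row by row.
#include <unordered_map>
#include <map>
#include <cstdint>
#include <cstddef>
#include <iostream>

using Key = std::uint64_t;   // (i, j) packed into one integer
inline Key pack(std::uint32_t i, std::uint32_t j)
{ return (static_cast<Key>(i) << 32) | j; }

void flush(std::unordered_map<Key, double>& cache)
{
  // Sort by key so entries arrive row by row (fastest for CSR backends).
  std::map<Key, double> sorted(cache.begin(), cache.end());
  for (const auto& [key, value] : sorted)
    std::cout << "add (" << (key >> 32) << ", " << (key & 0xffffffff)
              << ") = " << value << "\n";
  cache.clear();
}

int main()
{
  const std::size_t max_entries = 1 << 18;   // "every few MiB"
  std::unordered_map<Key, double> cache;

  // Toy assembly loop: contributions of a 1D Laplacian stencil.
  for (std::uint32_t cell = 0; cell < 4; ++cell)
  {
    cache[pack(cell, cell)]         += 1.0;
    cache[pack(cell, cell + 1)]     += -1.0;
    cache[pack(cell + 1, cell)]     += -1.0;
    cache[pack(cell + 1, cell + 1)] += 1.0;
    if (cache.size() > max_entries)
      flush(cache);
  }
  flush(cache);   // final flush
  return 0;
}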
Yes, it seems that some form of hashed structure is a good candidate for
optimization. What Murtazo is referring to would be similar to hashing
the whole matrix as in the paper you posted.
The way I interpret it, they are very different. The hash would store
a mapping from (i, j) to values, while Murtazo suggests storing a
mapping from (element, i, j) to values.

Sorry, if (i, j) is already global, then (element, i, j) is equivalent to just (i, j); it means we can just map (i, j) to values. The reason I included the element is that i and j are local before the global mapping.

How is that possible?

When we assemble, we need to know for each element where to insert
each of the local n^2 entries. Then we need a mapping from
(element, i, j) to a position in the global matrix.
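
(To make the mapping concrete, a small sketch with a made-up dofmap:
the global position of local entry (i, j) on a cell is
(dofmap[cell][i], dofmap[cell][j]), giving n^2 lookups per cell.)

// Sketch: enumerate the (element, i, j) -> global position mapping.
#include <vector>
#include <cstddef>
#include <iostream>

int main()
{
  std::vector<std::vector<std::size_t>> dofmap = {
    {0, 1, 2}, {1, 2, 3}};   // two P1 triangles sharing an edge

  const std::size_t n = 3;   // local dofs per cell
  for (std::size_t cell = 0; cell < dofmap.size(); ++cell)
    for (std::size_t i = 0; i < n; ++i)
      for (std::size_t j = 0; j < n; ++j)   // n^2 entries per cell
        std::cout << "(" << cell << ", " << i << ", " << j << ") -> ("
                  << dofmap[cell][i] << ", " << dofmap[cell][j] << ")\n";
  return 0;
}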

True. Maybe I am confusing the notation. Right, the cost will be n^2 * num_cells.

murtazo

