Re: profiling an assembly

 

It seems this thread got a bit derailed yesterday :(

I've done some more careful profiling:
*) Full assembly once
*) Re-assembly of the matrix 30 times without reset,
   in order to amortize the initialization time (rough driver sketched below).
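
Roughly, the driver looks like the sketch below; the form, mesh and matrix
handling is folded into the assemble_once callable, so nothing here is the
actual DOLFIN API, and the metric reported further down is just nonzeros
divided by CPU time:

  #include <cstdio>
  #include <ctime>

  // Sketch of the timing driver.  assemble_once stands in for the DOLFIN
  // assemble call on a fixed form and mesh; nnz is the number of nonzeros
  // in the assembled matrix.
  template <class AssembleFn>
  void benchmark(AssembleFn assemble_once, double nnz)
  {
    std::clock_t t0 = std::clock();
    assemble_once();                   // first pass: also builds the sparsity pattern
    const double t_first = double(std::clock() - t0) / CLOCKS_PER_SEC;

    const int repeats = 30;
    t0 = std::clock();
    for (int i = 0; i < repeats; ++i)
      assemble_once();                 // re-assemble into the initialised matrix
    const double t_re = double(std::clock() - t0) / CLOCKS_PER_SEC / repeats;

    std::printf("Assembled %.2g nonzeros/s (first pass)\n",  nnz / t_first);
    std::printf("Assembled %.2g nonzeros/s (re-assemble)\n", nnz / t_re);
  }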

The call graph shows that std::lower_bound is called from add:
	dolfin::GenericMatrix::add ->
	dolfin::uBlasMatrix<>:add ->
	std::lower_bound
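
If I read the compressed row storage right, every add at (i, j) has to
binary-search the stored column indices of row i for j, which is where the
std::lower_bound calls come from. In plain CSR terms (my own sketch, not
the actual uBLAS internals):

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  // Plain compressed-row storage; just the idea, not the uBLAS code.
  struct CSRMatrix
  {
    std::vector<std::size_t> row_start;  // first stored entry of each row (size nrows+1)
    std::vector<std::size_t> cols;       // column index of each stored nonzero
    std::vector<double>      vals;       // value of each stored nonzero

    // Add 'value' to entry (i, j); the entry must already be in the pattern.
    void add(std::size_t i, std::size_t j, double value)
    {
      // Binary search for column j among the stored columns of row i:
      // this is the std::lower_bound that dominates the profile.
      std::vector<std::size_t>::iterator first = cols.begin() + row_start[i];
      std::vector<std::size_t>::iterator last  = cols.begin() + row_start[i + 1];
      std::vector<std::size_t>::iterator pos   = std::lower_bound(first, last, j);
      vals[pos - cols.begin()] += value;
    }
  };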

In this assembly, add (plus children) takes 89% of the time, while
tabulate_tensor takes roughly 9%. The full (gprof) profile is attached.
Murtazo is probably
right that the performance numbers are virtually the same with PETSc. I
will hook it up and try (and let you know if this is not the case).

I do appreciate the complexity of inserting elements into a sparse
matrix, and I do _not_ claim to know better when it comes to the
assembler architecture.

Still, as I vary the size of the mesh, this performance metric stays
virtually constant:
Assembled 7.3e+05 non-zero matrix elements per second (first pass)
Assembled 1.4e+06 non-zero matrix elements per second (re-assemble).

Is this a sensible metric? If so, is it well understood how the DOLFIN
assembler performs? If not, I could put together a test suite for a
range of forms (2D/3D, simple/mixed elements, simple/complicated
expressions in the form, etc.).

To restate my question: How should I verify that the assembler is
performing as expected here? Am I looking at some unexpected overhead in
this assembly (we all know how hard this can be to spot with C++)?

Thanks!
/Dag

Garth N. Wells wrote:
> 
> Anders Logg wrote:
>> On Fri, May 16, 2008 at 12:17:19AM +0200, Murtazo Nazarov wrote:
>>>> Hello!
>>>>
>>>> I'm looking at a "suspiciously slow" assembly and would like to
>>>> determine what is going on. In general, what should one expect the most
>>>> time-consuming step to be?
>>>>
>>>> This is what my gprof looks like:
>>>>
>>>> Time:
>>>> 61.97%  unsigned int const* std::lower_bound
>>>> 25.84%  dolfin::uBlasMatrix<...>::add
>>>> 8.27%   UFC_NSEMomentum3DBilinearForm_cell_integral_0::tabulate_tensor
>>>> 1.1%    dolfin::uBlasMatrix<...>::init
>> Where is lower_bound used? From within uBlasMatrix::add or is it in
>> building the sparsity pattern?
>>
> 
> I suspect that it's either in building the sparsity pattern or 
> initialising the uBLAS matrix. The matrix structure is initialised by 
> running across rows and inserting a zero. uBLAS doesn't provide a 
> mechanism for initialising the underlying data structures directly for a
> sparse matrix.
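
Just to check I follow: with Boost uBLAS directly, that initialisation would
look roughly like the sketch below (the pattern argument stands in for
DOLFIN's precomputed sparsity pattern, and A is assumed to be sized already):

  #include <boost/numeric/ublas/matrix_sparse.hpp>
  #include <cstddef>
  #include <set>
  #include <vector>

  typedef boost::numeric::ublas::compressed_matrix<double> ublas_sparse_matrix;

  // Run across the rows and insert an explicit zero for every entry of the
  // precomputed sparsity pattern, so the compressed structure is in place
  // before assembly.  pattern[i] holds the column indices of row i.
  void init_structure(ublas_sparse_matrix& A,
                      const std::vector<std::set<std::size_t> >& pattern)
  {
    for (std::size_t i = 0; i < pattern.size(); ++i)
      for (std::set<std::size_t>::const_iterator j = pattern[i].begin();
           j != pattern[i].end(); ++j)
        A(i, *j) = 0.0;   // element-wise insertion; uBLAS has no bulk initialisation
  }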
> 
> Dag: could you run the same test using PETSc as the backend?
> 
>>> I got these numbers also. I understand that it is very painful in large
>>> computations.
>>>
>>> I see what the problem is with adding into the stiffness matrix A: searching
>>> for the position of the element that needs to be added takes a very long time,
>>> especially if you are solving big problems with thousands of unknowns and
>>> repeating the assembly a lot of times!
>> If you know a good way to avoid inserting entries into a sparse matrix
>> during assembly, please tell me... :-)
>>
>> If the assembly is costly, you might want to try assembling the action
>> of it instead and send that to a Krylov solver. Inserting into a
>> vector is much easier than into a sparse matrix.
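
Right: for the action, each element vector is just scattered into the global
vector by index, so there is no search at all. Something like this generic
sketch (not the DOLFIN code; dofs holds the global dof numbers of the
current cell):

  #include <cstddef>
  #include <vector>

  // Scatter one element vector into the global vector: plain indexed
  // addition, no search through a sparsity pattern.
  void add_element_vector(std::vector<double>& b,
                          const std::vector<std::size_t>& dofs,
                          const std::vector<double>& b_element)
  {
    for (std::size_t i = 0; i < dofs.size(); ++i)
      b[dofs[i]] += b_element[i];
  }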
>>
>>> One way could be to find the global indices into the matrix A once and reuse
>>> them in the assembly process. That way we avoid searching for the element
>>> position, which makes the process significantly faster. But there is a
>>> problem: somehow I cannot get access to a cell's global entries in A and
>>> change them directly instead of using MatSetValues (in PETSc).
>> I don't understand what you suggest here. We do precompute the
>> sparsity pattern of the matrix and use that to preallocate, but I
>> don't know of any other way to insert entries than MatSetValues.
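
For reference, on the PETSc side this amounts to preallocating from the
per-row nonzero counts and then calling MatSetValues once per cell. The
calls below are real PETSc functions, but the cell loop and the
nnz_per_row data are only schematic:

  #include <petscmat.h>

  /* Preallocate from a precomputed per-row nonzero count, then add one
   * element matrix per cell with MatSetValues.  The nnz_per_row array and
   * the per-cell rows/cols/values (from the dof map and tabulate_tensor)
   * are schematic here. */
  Mat create_and_assemble(PetscInt n, PetscInt nnz_per_row[])
  {
    Mat A;
    MatCreateSeqAIJ(PETSC_COMM_SELF, n, n, 0, nnz_per_row, &A);

    /* for each cell:
         MatSetValues(A, m, rows, m, cols, values, ADD_VALUES); */

    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
    return A;
  }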
>>
> 
> I doubt insertion is the real problem, especially as Dag noted that 
> subsequent assembly operations take only half the time since the matrix 
> is already initialised.
> 
> PETSc (and no doubt Trilinos) do offer some assembly possibilities that 
> we haven't yet exploited because they require a reorganisation of the 
> dof map.
> 
> Garth
> 
>>> I am pretty sure that we could speed up the A.set() and A.get() operations
>>> as well with the above method.
>> Please explain.
>>
>>> I am not sure how the dofmap lookup of the row and column indices for each
>>> cell is implemented. We could avoid repeating this operation as well.
>> This is already implemented (but maybe not used). DofMap handles this.
>> It wraps the generated ufc::dof_map code and may pretabulate (and
>> possibly reorder) the dofs.
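
If I understand correctly, pretabulation amounts to something like the
sketch below, where the tabulate_dofs callable stands in for the generated
ufc::dof_map code and the sizes come from the mesh and the form:

  #include <cstddef>
  #include <vector>

  // Tabulate the dofs of every cell once and store them in a table, so the
  // assembly loop can look them up instead of calling the generated code
  // each time.
  template <class TabulateDofs>
  std::vector<std::vector<unsigned int> >
  pretabulate(std::size_t num_cells, std::size_t dofs_per_cell,
              TabulateDofs tabulate_dofs)
  {
    std::vector<std::vector<unsigned int> >
        table(num_cells, std::vector<unsigned int>(dofs_per_cell));
    for (std::size_t cell = 0; cell < num_cells; ++cell)
      tabulate_dofs(cell, &table[cell][0]);   // fill in the dofs of this cell
    return table;
  }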
>>
>>> We did some comparison with another free FEM toolbox, FemLego: the
>>> assembly process in DOLFIN is 3 times slower than FemLego's in 2D. I
>>> believe this gap will increase in 3D. FemLego uses a quadrature rule for
>>> computing the integrals.
>> Can you benchmark the various parts of the assembly to see what causes
>> the slowdown:
>>
>>   1. Is it tabulate_tensor?
>>   2. Is it tabulate_dofs?
>>   3. Is it A.add()?
>>   4. Something else?
>>
>>> I hope some PETSc guys will help us make these improvements. Any other
>>> ideas are welcome!
>> We are currently experimenting with collecting and preprocessing
>> batches of entries before inserting them into the global sparse matrix,
>> in the hope of speeding up the assembly, but we don't have any results yet.
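
That sounds interesting. My naive picture of such batching is the sketch
below: collect (row, col, value) triplets for a chunk of cells, sort them
and merge duplicates, and only then hand them to the matrix backend. This
is only my guess at what you mean, not the actual experiment:

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  // One pending matrix contribution.
  struct Entry
  {
    std::size_t row, col;
    double value;
  };

  // Order entries by (row, col) so duplicates become adjacent.
  bool entry_less(const Entry& a, const Entry& b)
  {
    return a.row < b.row || (a.row == b.row && a.col < b.col);
  }

  // Sort a batch of triplets collected over several cells and merge
  // duplicates, so the global sparse matrix is touched once per unique
  // entry rather than once per cell contribution.
  std::vector<Entry> merge_batch(std::vector<Entry> batch)
  {
    std::sort(batch.begin(), batch.end(), entry_less);
    std::vector<Entry> merged;
    for (std::size_t i = 0; i < batch.size(); ++i)
    {
      if (!merged.empty() && merged.back().row == batch[i].row
                          && merged.back().col == batch[i].col)
        merged.back().value += batch[i].value;
      else
        merged.push_back(batch[i]);
    }
    return merged;   // hand these to the matrix backend in one sweep
  }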
>>
> _______________________________________________
> DOLFIN-dev mailing list
> DOLFIN-dev@xxxxxxxxxx
> http://www.fenics.org/mailman/listinfo/dolfin-dev

Attachment: assembly_profile_2.gz
Description: GNU Zip compressed data

