dolfin team mailing list archive
Message #07917
Re: profiling an assembly
If the PETSc sparse matrix structure works the way I expect, then to
insert/add an element you go to the particular row and search the
non-zero entries of that row until you find your column. The complexity
would then be roughly #dofs x #non-zero entries per row.
With a block matrix you can optimize this by searching only for the
first entry of each block.
I am surprised that "add" should totally dominate the assembly algorithm.
Is this reasonable?
Dag: the first assembly should typically involve an initialization (at
least for a non-linear problem). Is this part of your test? If so, I
find it strange that the first pass and the reassembly do not differ
more (with a faster reassembly).
For a PETSc matrix I do not recognize the std::lower_bound; is this
uBLAS-specific?
/Johan
> It seems this thread got a bit derailed yesterday :(
>
> I've done some more careful profiling:
> *) Full assemble once
> *) Assemble matrix 30 times without reset
> in order to amortize the time for initialization.
>
> The call graph shows that std::lower_bound is called from add:
> dolfin::GenericMatrix::add ->
> dolfin::uBlasMatrix<>::add ->
> std::lower_bound
>
> In this assembly, add + children takes 89% of the time, with
> tabulate_tensor taking roughly 9%. Full (gprof) profile attached.
> Murtazo is probably right that the performance numbers are virtually
> the same with PETSc. I will hook it up and try (and let you know if
> this is not the case).
>
> I do appreciate the complexity of inserting elements into a sparse
> matrix, and I do _not_ claim to know better when it comes to the
> assembler architecture.
>
> Still, as I vary the size of the mesh, this performance metric stays
> virtually constant:
> Assembled 7.3e+05 non-zero matrix elements per second (first pass)
> Assembled 1.4e+06 non-zero matrix elements per second (re-assemble).
>
> Is this a sensible metric? If so, is it well understood how the DOLFIN
> assembler performs? If not, I could put together a test-suite for a
> range of forms (2/3D, simple/mixed element, simple/complicated
> expressions in the form etc).
>
> To restate my question: How should I verify that the assembler is
> performing as expected here? Am I looking at some unexpected overhead in
> this assembly (we all know how hard this can be to spot with C++)?
>
> Thanks!
> /Dag
>
> Garth N. Wells wrote:
>>
>> Anders Logg wrote:
>>> On Fri, May 16, 2008 at 12:17:19AM +0200, Murtazo Nazarov wrote:
>>>>> Hello!
>>>>>
>>>>> I'm looking at a "suspiciously slow" assembly and would like to
>>>>> determine what is going on. In general, what should one expect the
>>>>> most
>>>>> time-consuming step to be?
>>>>>
>>>>> This is what my gprof looks like:
>>>>>
>>>>> Time:
>>>>> 61.97% unsigned int const* std::lower_bound
>>>>> 25.84% dolfin::uBlasMatrix<...>::add
>>>>> 8.27%
>>>>> UFC_NSEMomentum3DBilinearForm_cell_integral_0::tabulate_tensor
>>>>> 1.1% dolfin::uBlasMatrix<...>::init
>>> Where is lower_bound used? From within uBlasMatrix::add or is it in
>>> building the sparsity pattern?
>>>
>>
>> I suspect that it's either in building the sparsity pattern or
>> initialising the uBLAS matrix. The matrix structure is initialised by
>> running across rows and inserting a zero. uBLAS doesn't provide a
>> mechanism for initialising the underlying data structures directly for a
>> sparse matrix.
>>
>> Dag: could you run the same test using PETSc as the backend?
>>
>>>> I got these numbers too. I understand that it is very painful in
>>>> large computations.
>>>>
>>>> I see what the problem is with adding into the stiffness matrix A.
>>>> Searching for the position of the element to be added takes a very
>>>> long time, especially if you are solving big problems with thousands
>>>> of unknowns and repeating the assembly many times!
>>> If you know a good way to avoid inserting entries into a sparse matrix
>>> during assembly, please tell me... :-)
>>>
>>> If the assembly is costly, you might want to try assembling the action
>>> of the operator instead and sending that to a Krylov solver. Inserting
>>> into a vector is much easier than into a sparse matrix.
>>>
>>>> One way could be to find the global indices into the matrix A once
>>>> and reuse them in the assembly process. This way we avoid searching
>>>> for the element position, which makes the process significantly
>>>> faster. But there is a problem: somehow I cannot get access to the
>>>> global index of a cell in A and change the entries directly instead
>>>> of using MatSetValues (in PETSc).
>>> I don't understand what you suggest here. We do precompute the
>>> sparsity pattern of the matrix and use that to preallocate, but I
>>> don't know of any other way to insert entries than MatSetValues.
>>>
>>
>> I doubt insertion is the real problem, especially as Dag noted that
>> subsequent assembly operations take only half the time since the matrix
>> is already initialised.
>>
>> PETSc (and no doubt Trilinos) do offer some assembly possibilities that
>> we haven't yet exploited because they require a reorganisation of the
>> dof map.
>>
>> Garth
>>
>>>> I am pretty sure that we could speed up the A.set() and A.get()
>>>> operations as well by the above method.
>>> Please explain.
>>>
>>>> I am not sure how the dof map lookup of the row and column indices
>>>> for each cell is implemented. We could avoid repeating this
>>>> operation as well.
>>> This is already implemented (but maybe not used). DofMap handles this.
>>> It wraps the generated ufc::dof_map code and may pretabulate (and
>>> possibly reorder) the dofs.
>>>
>>>> We did some comparisons with another free FEM toolbox, FemLego: the
>>>> assembly process in DOLFIN is 3 times slower than FemLego in 2D. I
>>>> believe this gap will grow in 3D. FemLego uses quadrature rules for
>>>> computing integrals.
>>> Can you benchmark the various parts of the assembly to see what causes
>>> the slowdown:
>>>
>>> 1. Is it tabulate_tensor?
>>> 2. Is it tabulate_dofs?
>>> 3. Is it A.add()?
>>> 4. Something else?
>>>
>>>> I hope some PETSc guys will help us make these improvements. Any
>>>> other ideas are welcome!
>>> We are currently experimenting with collecting and preprocessing
>>> batches of entries before inserting them into the global sparse
>>> matrix, in the hope of speeding up the assembly, but we don't have
>>> any results yet.
>>>
>> _______________________________________________
>> DOLFIN-dev mailing list
>> DOLFIN-dev@xxxxxxxxxx
>> http://www.fenics.org/mailman/listinfo/dolfin-dev