
dolfin team mailing list archive

Re: profiling an assembly

 

>> It seems this thread got a bit derailed yesterday :(
>>
>> I've done some more careful profiling:
>> *) Full assembly once
>> *) Assemble the matrix 30 times without reset,
>> in order to amortize the initialization time.
>>
>> The call graph shows that std::lower_bound is called from add:
>> 	dolfin::GenericMatrix::add ->
>> 	dolfin::uBlasMatrix<>:add ->
>> 	std::lower_bound
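
For reference, a minimal sketch of why lower_bound shows up here, assuming a
CSR-style row layout with sorted column indices similar to what uBLAS uses
internally (illustrative only, not the DOLFIN/uBLAS source):

#include <algorithm>
#include <vector>

// Add `value` at column `col` of one sparse row stored as sorted column
// indices plus values. The binary search is the std::lower_bound in the profile.
void add_to_row(std::vector<unsigned int>& cols, std::vector<double>& vals,
                unsigned int col, double value)
{
  std::vector<unsigned int>::iterator it =
    std::lower_bound(cols.begin(), cols.end(), col);
  if (it != cols.end() && *it == col)
    vals[it - cols.begin()] += value;                  // existing entry
  else
  {
    vals.insert(vals.begin() + (it - cols.begin()), value);
    cols.insert(it, col);                              // new entry: shift tail
  }
}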
>>
>> In this assembly, add + children takes 89% of the time, with
>> tabulate_tensor taking roughly 9%. The full (gprof) profile is attached.
>> Murtazo is probably right that the performance numbers are virtually the
>> same with PETSc. I will hook it up and try (and let you know if this is
>> not the case).
>>
>
> Yes, I got the same numbers with PETSc. I checked and it is the same
> problem with uBlas; I am pretty sure that searching for the elements
> during assembly takes a very long time. Is it possible to change an
> entry of the matrix, A(index), directly in uBlas? If it is, the speedup
> could be considerable.
>
> There is a MatSetOption in PETSc, MAT_USE_HASH_TABLE, which does exactly
> what I would like to have. But that option does not work with the AIJ
> format we are using in dolfin.
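
For reference, a minimal illustrative sketch of the call in question;
whether MAT_USE_HASH_TABLE is honoured for a given matrix format is exactly
what is being asked below, and the MatSetOption signature varies between
PETSc releases (newer ones take an extra flag argument):

#include <petscmat.h>

// Illustrative only: ask PETSc to use hash-table based insertion for A.
// (Reportedly this has no effect for AIJ matrices.)
void request_hash_table_insertion(Mat A)
{
  MatSetOption(A, MAT_USE_HASH_TABLE);
}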

Ok. Good. Why does this not work? With what matrix formats does it work?

/Johan

> murtazo
>
>> I do appreciate the complexity of inserting elements into a sparse
>> matrix, and I do _not_ claim to know better when it comes to the
>> assembler architecture.
>>
>> Still, as I vary the size of the mesh, this performance metric stays
>> virtually constant:
>> Assembled 7.3e+05 non-zero matrix elements per second (first pass)
>> Assembled 1.4e+06 non-zero matrix elements per second (re-assemble).
>>
>> Is this a sensible metric? If so, is it well understood how the DOLFIN
>> assembler performs? If not, I could put together a test-suite for a
>> range of forms (2/3D, simple/mixed element, simple/complicated
>> expressions in the form etc).
>>
>> To restate my question: How should I verify that the assembler is
>> performing as expected here? Am I looking at some unexpected overhead in
>> this assembly (we all know how hard this can be to spot with C++)?
>>
>> Thanks!
>> /Dag
>>
>> Garth N. Wells wrote:
>>>
>>> Anders Logg wrote:
>>>> On Fri, May 16, 2008 at 12:17:19AM +0200, Murtazo Nazarov wrote:
>>>>>> Hello!
>>>>>>
>>>>>> I'm looking at a "suspiciously slow" assembly and would like to
>>>>>> determine what is going on. In general, what should one expect the
>>>>>> most time-consuming step to be?
>>>>>>
>>>>>> This is what my gprof looks like:
>>>>>>
>>>>>> Time:
>>>>>> 61.97%  unsigned int const* std::lower_bound
>>>>>> 25.84%  dolfin::uBlasMatrix<...>::add
>>>>>>  8.27%  UFC_NSEMomentum3DBilinearForm_cell_integral_0::tabulate_tensor
>>>>>>  1.1%   dolfin::uBlasMatrix<...>::init
>>>> Where is lower_bound used? From within uBlasMatrix::add or is it in
>>>> building the sparsity pattern?
>>>>
>>>
>>> I suspect that it's either in building the sparsity pattern or
>>> initialising the uBLAS matrix. The matrix structure is initialised by
>>> running across rows and inserting a zero. uBLAS doesn't provide a
>>> mechanism for initialising the underlying data structures directly
>>> for a sparse matrix.
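
A minimal sketch of the kind of initialisation described above, assuming a
boost::numeric::ublas::compressed_matrix and a precomputed per-row pattern
(illustrative only, not the actual DOLFIN code):

#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <set>
#include <vector>

typedef boost::numeric::ublas::compressed_matrix<double> ublas_sparse_matrix;

// Walk the precomputed sparsity pattern row by row and append a zero for
// every nonzero position, so later add() only has to search, not grow.
void init_from_pattern(ublas_sparse_matrix& A,
                       const std::vector<std::set<unsigned int> >& pattern)
{
  A.resize(pattern.size(), pattern.size(), false);
  for (unsigned int i = 0; i < pattern.size(); ++i)
    for (std::set<unsigned int>::const_iterator j = pattern[i].begin();
         j != pattern[i].end(); ++j)
      A.push_back(i, *j, 0.0);   // entries appended in order: cheap
}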
>>>
>>> Dag: could you run the same test using PETSc as the backend?
>>>
>>>>> I got these numbers too. I understand that this is very painful in
>>>>> large computations.
>>>>>
>>>>> I see what the problem is with adding into the stiffness matrix A.
>>>>> Searching for the position of the element that needs to be added
>>>>> takes a very long time, especially if you are solving big problems
>>>>> with thousands of unknowns and repeating the assembly many times!
>>>> If you know a good way to avoid inserting entries into a sparse matrix
>>>> during assembly, please tell me... :-)
>>>>
>>>> If the assembly is costly, you might want to try assembling the action
>>>> of the operator instead and sending that to a Krylov solver. Inserting
>>>> into a vector is much easier than into a sparse matrix.
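
A rough sketch of what assembling the action amounts to, with hypothetical
tabulate_dofs/tabulate_tensor hooks standing in for the generated UFC code
and the dof map (not the actual DOLFIN interface):

#include <vector>

// Hypothetical hooks standing in for the generated code and the dof map.
void tabulate_dofs(std::vector<unsigned int>& dofs, unsigned int cell);
void tabulate_tensor(std::vector<double>& Ae, unsigned int cell);

// Accumulate y = A*x cell by cell without ever forming the sparse matrix A.
void assemble_action(std::vector<double>& y, const std::vector<double>& x,
                     unsigned int num_cells, unsigned int dofs_per_cell)
{
  std::vector<unsigned int> dofs(dofs_per_cell);
  std::vector<double> Ae(dofs_per_cell*dofs_per_cell), x_loc(dofs_per_cell);

  for (unsigned int c = 0; c < num_cells; ++c)
  {
    tabulate_dofs(dofs, c);
    tabulate_tensor(Ae, c);

    for (unsigned int i = 0; i < dofs_per_cell; ++i)
      x_loc[i] = x[dofs[i]];                     // gather local part of x

    for (unsigned int i = 0; i < dofs_per_cell; ++i)
    {
      double yi = 0.0;
      for (unsigned int j = 0; j < dofs_per_cell; ++j)
        yi += Ae[i*dofs_per_cell + j]*x_loc[j];  // local mat-vec product
      y[dofs[i]] += yi;                          // scatter-add into a vector
    }
  }
}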
>>>>
>>>>> One way could be to find the global indices into the matrix A once
>>>>> and reuse them in the assembly process. This way we avoid searching
>>>>> for the element positions, which makes the process significantly
>>>>> faster. But there is a problem: somehow I cannot get access to the
>>>>> global indices of a cell's entries in A and change them directly
>>>>> instead of using MatSetValues (in PETSc).
>>>> I don't understand what you suggest here. We do precompute the
>>>> sparsity pattern of the matrix and use that to preallocate, but I
>>>> don't know of any other way to insert entries than MatSetValues.
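
For completeness, a minimal sketch of the preallocate-then-MatSetValues
pattern being described; the function name, n, nnz_per_row and the per-cell
data are placeholders standing in for the precomputed sparsity pattern and
the dof map / tabulate_tensor output:

#include <petscmat.h>

// Illustrative only: preallocate from the sparsity pattern, then insert one
// element matrix with a single MatSetValues call and finish assembly.
Mat assemble_example(PetscInt n, PetscInt* nnz_per_row)
{
  Mat A;
  MatCreateSeqAIJ(PETSC_COMM_SELF, n, n, 0, nnz_per_row, &A);

  PetscInt    rows[3] = {0, 1, 2};                    // dofs of one (fake) cell
  PetscScalar Ae[9]   = {0.0};                        // its element matrix
  MatSetValues(A, 3, rows, 3, rows, Ae, ADD_VALUES);  // one call per cell

  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
  return A;
}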
>>>>
>>>
>>> I doubt insertion is the real problem, especially as Dag noted that
>>> subsequent assembly operations take only half the time since the matrix
>>> is already initialised.
>>>
>>> PETSc (and no doubt Trilinos) do offer some assembly possibilities that
>>> we haven't yet exploited because they require a reorganisation of the
>>> dof map.
>>>
>>> Garth
>>>
>>>>> I am pretty sure that we could speed up the A.set() and A.get()
>>>>> operations as well by the above method.
>>>> Please explain.
>>>>
>>>>> I am not sure how the dofmap lookup of the row and column indices
>>>>> for the cells is implemented. We could avoid repeating this
>>>>> operation as well.
>>>> This is already implemented (but maybe not used). DofMap handles this.
>>>> It wraps the generated ufc::dof_map code and may pretabulate (and
>>>> possibly reorder) the dofs.
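
A minimal sketch of what pretabulation amounts to, with hypothetical names
(not the DofMap source): the global dofs of every cell are tabulated once via
the generated ufc::dof_map and cached, so repeated assemblies only read the
table.

#include <vector>

struct CachedDofMap
{
  unsigned int dofs_per_cell;
  std::vector<unsigned int> data;   // num_cells x dofs_per_cell, one row per cell

  const unsigned int* cell_dofs(unsigned int cell) const
  { return &data[cell*dofs_per_cell]; }
};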
>>>>
>>>>> We did a comparison with another free FEM toolbox, FemLego: the
>>>>> assembly process in Dolfin is 3 times slower than FemLego in 2D. I
>>>>> believe this factor will be larger in 3D. FemLego uses quadrature
>>>>> rules for computing integrals.
>>>> Can you benchmark the various parts of the assembly to see what causes
>>>> the slowdown:
>>>>
>>>>   1. Is it tabulate_tensor?
>>>>   2. Is it tabulate_dofs?
>>>>   3. Is it A.add()?
>>>>   4. Something else?
>>>>
>>>>> I hope some PETSc guys will help us make these improvements. Any
>>>>> other ideas are welcome!
>>>> We are currently experimenting with collecting and preprocessing
>>>> batches of entries before inserting them into the global sparse
>>>> matrix, in the hope of speeding up the assembly, but we don't have
>>>> any results yet.
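
A rough sketch of the batching idea, purely illustrative (not the actual
experiment): buffer (row, col, value) triplets for a block of cells, sort and
merge duplicates, and only then hand them to the global matrix.

#include <algorithm>
#include <vector>

struct Entry { unsigned int i, j; double v; };

inline bool by_position(const Entry& a, const Entry& b)
{ return a.i < b.i || (a.i == b.i && a.j < b.j); }

// Sort the buffered entries and combine those that hit the same position,
// so each matrix location is touched once per batch instead of once per cell.
void merge_batch(std::vector<Entry>& batch)
{
  if (batch.empty())
    return;
  std::sort(batch.begin(), batch.end(), by_position);
  std::size_t k = 0;
  for (std::size_t p = 1; p < batch.size(); ++p)
  {
    if (batch[p].i == batch[k].i && batch[p].j == batch[k].j)
      batch[k].v += batch[p].v;     // same position: accumulate
    else
      batch[++k] = batch[p];        // new position: keep
  }
  batch.resize(k + 1);
  // the merged entries would now be passed to the global matrix add()
}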
>>>>



