dolfin team mailing list archive

Thread
Date

Re: profiling an assembly

To: Matthew Knepley <knepley@xxxxxxxxx>
From: Anders Logg <logg@xxxxxxxxx>
Date: Sun, 18 May 2008 21:01:02 +0200
Cc: dolfin-dev@xxxxxxxxxx
Delivered-to: dolfin-dev@xxxxxxxxxx
In-reply-to: <a9f269830805181124q357913c6v751e417870a63e40@mail.gmail.com>
Mail-followup-to: Matthew Knepley <knepley@xxxxxxxxx>, jhoffman@xxxxxxxxxx, dolfin-dev@xxxxxxxxxx
User-agent: Mutt/1.5.17+20080114 (2008-01-14)

On Sun, May 18, 2008 at 01:24:19PM -0500, Matthew Knepley wrote:
> On Sat, May 17, 2008 at 9:40 AM, Johan Hoffman <jhoffman@xxxxxxxxxx> wrote:
> > If the petsc sparse matrix structure works as I expect; to insert/add an
> > element you go to the particular row and search all non-zero entries of
> > that row until you find your column. So the complexity would be = #dofs x
> > #row non-zero entries
> 
> Not exactly. We maintain sorted columns, so the expected time for all insertions
> is smaller. The numbers that Murtazo gets do not match our own, which suggests
> to me that the wrapper is taking non-trivial time for these small matrices.
> 
>    Matt

We might be doing something stupid in the wrapper (like giving PETSc
the wrong options), but I'm pretty sure the overhead the wrapper adds
when calling MatSetValues is small.

Here's the full code for PETScMatrix::add():

void PETScMatrix::add(const real* block,
                      uint m, const uint* rows,
                      uint n, const uint* cols)
{
  dolfin_assert(A);
  MatSetValues(A,
               static_cast<int>(m),
               reinterpret_cast<int*>(const_cast<uint*>(rows)),
               static_cast<int>(n),
               reinterpret_cast<int*>(const_cast<uint*>(cols)),
               block, ADD_VALUES);
}

This should be easy to check when profiling the code. Murtazo: What's
the difference between PETScMatrix::add and MatSetValues?

-- 
Anders


> > With a block matrix you can optimize this by just searching for the first
> > entry of the block.
> >
> > I am surprised that "add" should totally dominate the assembly algorithm.
> > Is this reasonable?
> >
> > Dag: the first assembly should typically involve an initialization (at
> > least for a non-linear problem), is this part of you test? If so, I think
> > it is strange that the difference between the first pass and the
> > reassemble do not differ (with a faster reassemble).
> >
> > For a PETSc matrix I do not recognize the std::lower_bound, is this ublas
> > specific?
> >
> > /Johan
> >
> >> It seems this thread got a bit derailed yesterday :(
> >>
> >> I've done some more careful profiling:
> >> *) Full assemble once
> >> *) Assemble matrix 30 times without reset
> >> in order to amortize the time for initialization.
> >>
> >> The call graph shows that std::lower_bound is called from add:
> >>       dolfin::GenericMatrix::add ->
> >>       dolfin::uBlasMatrix<>:add ->
> >>       std::lower_bound
> >>
> >> In this assembly add+children takes 89% of the time, and tabulate_tensor
> >> taking roughly 9%. Full (gprof) profile attached. Murtazo is probably
> >> right that the performance numbers are virtually the same with PETSc. I
> >> will hook it up and try (and let you know if this is not the case).
> >>
> >> I do appreciate the complexity of inserting elements into a sparse
> >> matrix, and I do _not_ claim to know better when it comes to the
> >> assembler architecture.
> >>
> >> Still, as I vary the size of the mesh I get this performance metric
> >> virtually constant:
> >> Assembled 7.3e+05 non-zero matrix elements per second (first pass)
> >> Assembled 1.4e+06 non-zero matrix elements per second (re-assemble).
> >>
> >> Is this a sensible metric? If so, is it well understood how the DOLFIN
> >> assembler performs? If not, I could put together a test-suite for a
> >> range of forms (2/3D, simple/mixed element, simple/complicated
> >> expressions in the form etc).
> >>
> >> To restate my question: How should I verify that the assembler is
> >> performing as expected here? Am I looking at some unexpected overhead in
> >> this assembly (we all know how hard this can be to spot with C++)?
> >>
> >> Thanks!
> >> /Dag
> >>
> >> Garth N. Wells wrote:
> >>>
> >>> Anders Logg wrote:
> >>>> On Fri, May 16, 2008 at 12:17:19AM +0200, Murtazo Nazarov wrote:
> >>>>>> Hello!
> >>>>>>
> >>>>>> I'm looking at a "suspiciously slow" assembly and would like to
> >>>>>> determine what is going on. In general, what should one expect the
> >>>>>> most
> >>>>>> time-consuming step to be?
> >>>>>>
> >>>>>> This is what my gprof looks like:
> >>>>>>
> >>>>>> Time:
> >>>>>> 61.97%  unsigned int const* std::lower_bound
> >>>>>> 25.84%  dolfin::uBlasMatrix<...>::add
> >>>>>> 8.27%
> >>>>>> UFC_NSEMomentum3DBilinearForm_cell_integral_0::tabulate_tensor
> >>>>>> 1.1%    dolfin::uBlasMatrix<...>::init
> >>>> Where is lower_bound used? From within uBlasMatrix::add or is it in
> >>>> building the sparsity pattern?
> >>>>
> >>>
> >>> I suspect that it's either in building the sparsity pattern or
> >>> initialising the uBLAS matrix. The matrix structure is initialised by
> >>> running across rows and inserting a zero. uBLAS doesn't provide a
> >>> mechanism for initialising the underlying data structures directly for a
> >>>   sparse matrix.
> >>>
> >>> Dag: could you you run the same test using PETSc as the backend?
> >>>
> >>>>> I got these numbers also. I understand that it is very painful in
> >>>>> large
> >>>>> computations.
> >>>>>
> >>>>> I see what is a problem with adding into the stiffness matrix A.
> >>>>> Searching
> >>>>> the position of the element which needs to be added takes very long
> >>>>> time,
> >>>>> especially if you are solving big problems with thousands unknowns and
> >>>>> repeating the assembling a lot of times!
> >>>> If you know a good way to avoid inserting entries into a sparse matrix
> >>>> during assembly, please tell me... :-)
> >>>>
> >>>> If the assembly is costly, you might want to try assembling the action
> >>>> of it instead and send that to a Krylov solver. Inserting into a
> >>>> vector is much easier than into a sparse matrix.
> >>>>
> >>>>> One way could be finding the global indices of the matrix A once, and
> >>>>> use
> >>>>> it in the assembly process. By this way we avoid of searching the
> >>>>> element
> >>>>> position and it makes the process significantly fast. But, there is a
> >>>>> problem: somehow I cannot get access to the global index of cell in
> >>>>> the A
> >>>>> and change it instead of using MatSetValues (in PETSc).
> >>>> I don't understand what you suggest here. We do precompute the
> >>>> sparsity pattern of the matrix and use that to preallocate, but I
> >>>> don't know of any other way to insert entries than MatSetValues.
> >>>>
> >>>
> >>> I doubt insertion is the real problem, especially as Dag noted that
> >>> subsequent assembly operations take only half the time since the matrix
> >>> is already initialised.
> >>>
> >>> PETSc (and no doubt Trilinos) do offer some assembly possibilities that
> >>> we haven't yet exploited because they require a reorganisation of the
> >>> dof map.
> >>>
> >>> Garth
> >>>
> >>>>> I am pretty sure that we may speed up the A.set() and A.get()
> >>>>> processes as
> >>>>> well by the above method.
> >>>> Please explain.
> >>>>
> >>>>> I am not sure how the dofmap to get rows and cols indices of the cells
> >>>>> is
> >>>>> implemented. We could avoid repeating this operation as well.
> >>>> This is already implemented (but maybe not used). DofMap handles this.
> >>>> It wraps the generated ufc::dof_map code and may pretabulate (and
> >>>> possibly reorder) the dofs.
> >>>>
> >>>>> We did some comparison with another free fem toolbox, FemLego, the
> >>>>> assembly process in Dolfin is 3 times slower than FemLego in 2D. I
> >>>>> believe
> >>>>> this number will increase in 3D. FemLego uses quadrature rule for
> >>>>> computing integrals.
> >>>> Can you benchmark the various parts of the assembly to see what causes
> >>>> the slowdown:
> >>>>
> >>>>   1. Is it tabulate_tensor?
> >>>>   2. Is it tabulate_dofs?
> >>>>   3. Is it A.add()?
> >>>>   4. Something else?
> >>>>
> >>>>> I hope some PETSc guys will help us to do this improvements. Any other
> >>>>> ideas are welcome!
> >>>> We are currently experimenting with collecting and preprocessing
> >>>> batches of entries before inserting into the global sparse matrix in
> >>>> hope of speeding up the assembly but we don't have any results yet.
> >>>>
> >>> _______________________________________________
> >>> DOLFIN-dev mailing list
> >>> DOLFIN-dev@xxxxxxxxxx
> >>> http://www.fenics.org/mailman/listinfo/dolfin-dev
> >> _______________________________________________
> >> DOLFIN-dev mailing list
> >> DOLFIN-dev@xxxxxxxxxx
> >> http://www.fenics.org/mailman/listinfo/dolfin-dev
> >>
> >
> >
> > _______________________________________________
> > DOLFIN-dev mailing list
> > DOLFIN-dev@xxxxxxxxxx
> > http://www.fenics.org/mailman/listinfo/dolfin-dev
> >
> 
> 
>

Follow ups

Re: profiling an assembly
From: Murtazo Nazarov, 2008-05-18

References

profiling an assembly
From: Dag Lindbo, 2008-05-15
Re: profiling an assembly
From: Murtazo Nazarov, 2008-05-15
Re: profiling an assembly
From: Anders Logg, 2008-05-16
Re: profiling an assembly
From: Garth N. Wells, 2008-05-16
Re: profiling an assembly
From: Dag Lindbo, 2008-05-17
Re: profiling an assembly
From: Johan Hoffman, 2008-05-17
Re: profiling an assembly
From: Matthew Knepley, 2008-05-18