dolfin team mailing list archive: Message #21800
Re: assemble of Matrix with Real spaces slow
On Friday March 4 2011 09:23:58 Garth N. Wells wrote:
> On 04/03/11 17:11, Johan Hake wrote:
> > On Friday March 4 2011 08:48:14 Garth N. Wells wrote:
> >> On 04/03/11 16:38, Johan Hake wrote:
> >>> On Friday March 4 2011 03:29:32 Garth N. Wells wrote:
> >>>> On 03/03/11 19:48, Johan Hake wrote:
> >>>>> On Thursday March 3 2011 11:20:03 Marie E. Rognes wrote:
> >>>>>> On 03/03/2011 08:03 PM, Johan Hake wrote:
> >>>>>>> Hello!
> >>>>>>>
> >>>>>>> I am using Mixed spaces with Reals quite a lot. It turns out that
> >>>>>>> assembling forms with functions from MixedFunctionSpaces containing
> >>>>>>> Real spaces is dead slow. The time spent also increases with the
> >>>>>>> number of included Real spaces, even if none of them are included
> >>>>>>> in the form which is assembled.
> >>>>>>>
> >>>>>>> The attached test script illustrates this.
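
[The attached script is not preserved in the archive. Below is a minimal
sketch of such a timing test using the DOLFIN 0.9.x Python interface; the
mesh size, the u*v*dx form and the loop over Real-space counts are
assumptions, not the original attachment.]

from dolfin import *
import time

# Sketch only: a CG1 space is augmented with a varying number of Real
# ("R", 0) spaces, but only the CG part appears in the assembled form.
mesh = UnitCube(32, 32, 32)
V = FunctionSpace(mesh, "CG", 1)

for n_real in [0, 1, 2, 4, 8, 16]:
    if n_real == 0:
        W, u, v = V, TrialFunction(V), TestFunction(V)
    else:
        W = MixedFunctionSpace([V] + [FunctionSpace(mesh, "R", 0)
                                      for i in range(n_real)])
        u, v = TrialFunctions(W)[0], TestFunctions(W)[0]
    a = u*v*dx                     # only the CG block appears in the form
    t0 = time.time()
    A = assemble(a)
    print "With %2d global dofs | %.5g s" % (n_real, time.time() - t0)
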
> >>>>>>
> >>>>>> By replacing "CG", 1 by "R", 0 or?
> >>>>>
> >>>>> OMG!! Yes, *flush*
> >>>>>
> >>>>> That explains the memory usage :P
> >>>>>
> >>>>>>> The test script also reveals that a disproportionate amount of time
> >>>>>>> is spent in FFC generating the code. This time also increases with
> >>>>>>> the number of Real spaces included. Turning off FErari helped a bit
> >>>>>>> on this point.
> >>>>>>
> >>>>>> I can take a look on the FFC side, but not today.
> >>>>>
> >>>>> Nice!
> >>>>>
> >>>>> With the correction from Marie, the numbers now look like:
> >>>>>
> >>>>> With PETSc backend
> >>>>>
> >>>>> Tensor without Mixed space | 0.11211 0.11211 1
> >>>>> With 1 global dofs | 1.9482 1.9482 1
> >>>>> With 2 global dofs | 2.8725 2.8725 1
> >>>>> With 4 global dofs | 5.1959 5.1959 1
> >>>>> With 8 global dofs | 10.524 10.524 1
> >>>>> With 16 global dofs | 25.574 25.574 1
> >>>>>
> >>>>> With Epetra backend
> >>>>>
> >>>>> Tensor without Mixed space | 0.87544 0.87544 1
> >>>>> With 1 global dofs | 1.7089 1.7089 1
> >>>>> With 2 global dofs | 2.6868 2.6868 1
> >>>>> With 4 global dofs | 4.28 4.28 1
> >>>>> With 8 global dofs | 8.123 8.123 1
> >>>>> With 16 global dofs | 17.394 17.394 1
> >>>>>
> >>>>> Still a pretty big increase in time for just adding 16 scalar dofs to
> >>>>> a system of 274625 dofs in the first place.
> >>>>
> >>>> I have seen this big slowdown for large problems. The first issue,
> >>>> which was the computation of the sparsity pattern, has been 'resolved'
> >>>> by using boost::unordered_set. This comes at the expense of a small
> >>>> slowdown for regular problems.
> >>>>
> >>>> I also noticed that Epetra performs much better for these problems
> >>>> than PETSc does. We need to check the matrix initialisation, but it
> >>>> could ultimately be a limitation of the backends. Each matrix row
> >>>> corresponding to a global dof is full, and it may be that backends
> >>>> designed for large sparse matrices do not handle this well.
> >>>
> >>> How could inserting into the matrix be the bottleneck? In the test
> >>> script I attached I do not assemble any global dofs.
> >>
> >> I think that you'll find that it is. It will be assembling zeroes in the
> >> global dof positions.
> >
> > I guess you are right. Are the sparsity pattern and the tabulated
> > tensor based only on the MixedSpace formulation, and not on the actual
> > integral?
>
> The sparsity pattern is based on the dof map, which depends on the
> function spaces.
>
> > Is this a bug or feature?
>
> I would say it is just the natural approach. There is/was a Blueprint to
> avoid
> computing and assembling the zeroes in problems like Stokes, but I'm not
> sure that this would be worthwhile, since it would involve assembling
> non-matrix values, and most backends want to assemble dense local
> matrices into sparse global matrices.
Makes sense. I guess the bottleneck is then the insertion into the global
matrix, where your suggested approach might improve the performance.
Sounds like a pretty hard fix though...
Johan
> Garth
>
> >>>> The best approach is probably to add the entire row at once for global
> >>>> dofs. This would require a modified assembler.
> >>>>
> >>>> There is a UFC Blueprint to identify global dofs:
> >>>> https://blueprints.launchpad.net/ufc/+spec/global-dofs
> >>>>
> >>>> If we can identify global dofs, we have a better chance of dealing
> >>>> with the problem properly. This includes running in parallel with
> >>>> global dofs.
> >>>
> >>> Do you envision integrals over global dofs being separated into their
> >>> own tabulate_tensor function? Then in DOLFIN we could assemble the
> >>> whole row/column in one loop and insert it into the Matrix in one
> >>> go.
> >>
> >> No - I had in mind possibly adding only cell-based dofs to the matrix,
> >> and adding the global rows into a global vector, which is then inserted
> >> at the end (as one row) into the matrix. I'm not advocating a change to
> >> the tabulate_foo interface at this stage.
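
[A toy sketch of the insertion strategy described above, in plain numpy
rather than DOLFIN or UFC code; all sizes and names are illustrative.
Cell-based dofs are inserted per cell as usual, while contributions to a
global dof's row are accumulated in a dense vector and inserted once at
the end.]

import numpy

n_dofs = 10                    # ordinary (cell-based) dofs in a toy problem
global_dof = n_dofs            # one global (Real) dof, numbered last
A = numpy.zeros((n_dofs + 1, n_dofs + 1))  # stand-in for the backend matrix
global_row = numpy.zeros(n_dofs + 1)       # dense accumulator for the global row

cells = [(0, 1, 2, 3), (3, 4, 5, 6), (6, 7, 8, 9)]   # toy "dofmap"
for dofs in cells:
    rows = list(dofs) + [global_dof]
    A_cell = numpy.ones((len(rows), len(rows)))      # stand-in element tensor
    # cell-based block: normal per-cell insertion into the sparse matrix
    A[numpy.ix_(dofs, dofs)] += A_cell[:len(dofs), :len(dofs)]
    # global-dof row: accumulate instead of inserting cell by cell
    global_row[rows] += A_cell[len(dofs), :]
    # (the corresponding column would be handled the same way; omitted here)

A[global_dof, :] += global_row  # single insertion of the dense row at the end
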
> >
> > Ok, but would you need a separate tabulate_tensor function for the global
> > dofs?
> >
> >>> Do you think we also need to recognise global dofs in UFL to properly
> >>> flesh out these integrals?
> >>
> >> Yes. For one, to handle them properly in parallel, since global dofs do
> >> not reside at a mesh entity, and the domains are parallelised mesh-wise.
> >
> > Ok
> >
> > Johan
> >
> >> Garth
> >>
> >>> Johan
> >>>
> >>>> Garth
> >>>>
> >>>>> Johan
> >>>>>
> >>>>>> --
> >>>>>> Marie
> >>>>>>
> >>>>>>> I have not profiled any of this, I am just throwing it out there. I
> >>>>>>> do not see any difference between, for example, the Epetra and
> >>>>>>> PETSc backends, as was suggested in the fixed bug on building the
> >>>>>>> sparsity pattern with global dofs.
> >>>>>>>
> >>>>>>> My test was done on DOLFIN 0.9.9+. I haven't profiled it
> >>>>>>> yet.
> >>>>>>>
> >>>>>>> Output from summary:
> >>>>>>> Tensor without Mixed space | 0.11401 0.11401 1
> >>>>>>> With 1 global dofs | 0.40725 0.40725 1
> >>>>>>> With 2 global dofs | 0.94694 0.94694 1
> >>>>>>> With 4 global dofs | 2.763 2.763 1
> >>>>>>> With 8 global dofs | 9.6149 9.6149 1
> >>>>>>>
> >>>>>>> Also, the amount of memory used to build the sparsity pattern seems
> >>>>>>> to double for each step. The memory footprint for a 32x32x32 unit
> >>>>>>> cube with 16 global dofs was 1.6 GB(!?).
> >>>>>>>
> >>>>>>> Johan
> >>>>>>>