
dolfin team mailing list archive

Re: multi-thread assembly


On Wed, Nov 10, 2010 at 03:58:13PM +0000, Garth N. Wells wrote:
>
>
> On 10/11/10 15:47, Anders Logg wrote:
> >On Wed, Nov 10, 2010 at 02:47:30PM +0000, Garth N. Wells wrote:
> >>Nice to see multi-thread assembly being added. We should look at
> >>adding support for the multi-threaded version of SuperLU. What other
> >>multi-thread solvers are out there?
> >
> >Yes, that would be good, but I don't know which solvers are available.
> >
> >>I haven't looked at the code in great detail, but are element
> >>tensors being added to the global tensor in a thread-safe fashion?
> >>Neither PETSc nor Trilinos is thread-safe.
> >
> >Yes, they should be. That's the main point. It's a very simple algorithm
> >which just partitions the matrix row by row and makes each process
> >responsible for a chunk of rows.
>
> Would it be better to partition the mesh (using Metis) and then
> renumber dofs? That way the 'all_in_range' case would be maximised
> and the 'some_in_range' would be minimised. If the rows are
> distributed, mixed elements will be a mess because the global rows
> are far apart (using the FFC-generated dof map).

Renumbering is definitely important for getting good speedup. This
hasn't been added yet but was implemented in the prototype version
(which was stand-alone from DOLFIN).
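For reference, the row-by-row partitioning quoted above might be sketched as follows. This is only an illustration; the function name is made up and the actual DOLFIN implementation may differ:

```python
def compute_row_ranges(num_rows, num_threads):
    """Split the global rows [0, num_rows) into one contiguous
    (begin, end) chunk per thread; each thread is responsible
    only for insertions into its own chunk of rows."""
    chunk = num_rows // num_threads
    remainder = num_rows % num_threads
    ranges = []
    begin = 0
    for t in range(num_threads):
        # Spread any remainder over the first few threads
        end = begin + chunk + (1 if t < remainder else 0)
        ranges.append((begin, end))
        begin = end
    return ranges

# Example: 10 rows over 3 threads -> [(0, 4), (4, 7), (7, 10)]
```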

> >During assembly, each process
> >iterates over the entire mesh and on each cell does one of three things:
> >
> >   1. all_in_range:  tabulate_tensor as usual and add
> >   2. none_in_range: skip tabulate_tensor (continue)
> >   3. some_in_range: tabulate_tensor and insert only rows in range
> >
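A rough sketch of this three-case loop, with made-up names (cell_dofs, tabulate_tensor, A.add, A.add_row are placeholders, not the actual DOLFIN or PETSc API):

```python
def assemble_range(A, cells, cell_dofs, tabulate_tensor, row_begin, row_end):
    """Each thread walks the whole mesh but only inserts the
    global rows in its own range [row_begin, row_end)."""
    for cell in cells:
        dofs = cell_dofs(cell)
        in_range = [row_begin <= d < row_end for d in dofs]
        if not any(in_range):
            continue                       # 2. none_in_range: skip tabulate_tensor
        element_tensor = tabulate_tensor(cell)
        if all(in_range):
            A.add(element_tensor, dofs)    # 1. all_in_range: add as usual
        else:
            # 3. some_in_range: insert only the rows this thread owns
            for i, d in enumerate(dofs):
                if in_range[i]:
                    A.add_row(element_tensor[i], d, dofs)
```

Since each global row is owned by exactly one thread, no two threads ever insert into the same row, which is what makes the scheme safe without locking.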
> >Didem Unat (PhD student at UCLA/Simula) tried this in a simple
> >prototype code and got very good speedups (up to a factor 7 on an
> >eight-core machine).
>
> I hope the test wasn't for Poisson on a dolfin::UnitFoo mesh.

That would have been much better! The best speedups are seen for
"complex" forms where tabulate_tensor takes a significant portion of
the time. The best speedup was obtained for the NS nonlinear term in
3D (unit cube). For 3D elasticity the results were almost as good, a
bit worse for Poisson. The speedup was reduced slightly by going to
higher order but not much (from 7 to 6 when going to cubics).

--
Anders


> Garth
>
> >so it's just a matter of doing the same thing as
> >part of DOLFIN (which is a bit trickier since some of the data access
> >is hidden). The current implementation in DOLFIN seems to work and
> >give some small speedup but I need to do some more testing.
> >
> >>Rather than having two assembly classes, would it be worth using
> >>OpenMP instead? I experimented with OpenMP some time ago, but never
> >>added it since at the time it required a very recent version of gcc.
> >>This shouldn't be a problem now.
> >
> >I don't think this would work with OpenMP since we need to control how
> >the rows are inserted.
> >
> >If this works out and we get good speedups, we could consider
> >replacing Assembler by MulticoreAssembler. It's not that much extra
> >code and it's pretty clean. I haven't tried yet, but it should also
> >work in combination with MPI (each node has a part of the mesh and
> >does multi-core assembly).
> >


