dolfin team mailing list archive

Re: multi-thread assembly

 



On 10/11/10 16:08, Andy Ray Terrel wrote:
On Wed, Nov 10, 2010 at 9:55 AM, Garth N. Wells<gnw20@xxxxxxxxx>  wrote:


On 10/11/10 15:53, Andy Ray Terrel wrote:

On Wed, Nov 10, 2010 at 9:47 AM, Anders Logg<logg@xxxxxxxxx>    wrote:

On Wed, Nov 10, 2010 at 02:47:30PM +0000, Garth N. Wells wrote:

Nice to see multi-thread assembly being added. We should look at
adding support for the multi-threaded version of SuperLU. What other
multi-thread solvers are out there?

Yes, that would be good, but I don't know which solvers are available.

SuperLU tends to die on large problems. MUMPS is a much better option.


MUMPS is MPI-based. SuperLU has a multi-threaded version for shared memory
machines.

Garth

Yes, but you compile it to take advantage of MPI's shared-memory message passing.


That's an implementation detail of MPI - you still need to supply MUMPS with a partitioned matrix. With the multi-threaded assembly we don't have a partitioned matrix.

Garth



I haven't looked at the code in great detail, but are element
tensors being added to the global tensor in a thread-safe fashion?
Neither PETSc nor Trilinos is thread-safe.

Yes, they should be. That's the main point. It's a very simple algorithm
which just partitions the matrix row by row and makes each process
responsible for a chunk of rows. During assembly, all processes
iterate over the entire mesh and on each cell do one of three things:

  1. all_in_range:  tabulate_tensor as usual and add
  2. none_in_range: skip tabulate_tensor (continue)
  3. some_in_range: tabulate_tensor and insert only rows in range
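
The three cases above can be sketched in a few lines. This is only a
minimal illustration, not the DOLFIN implementation: it assumes a 1D mesh
of P1 cells on the unit interval, a dense list-of-lists matrix, and the
hypothetical helper names row_ranges, assemble_rows and assemble_multicore.
Python threads will not give a real speedup here; the point is only that
each thread writes to its own rows, so no two threads touch the same entry.

```python
from concurrent.futures import ThreadPoolExecutor


def tabulate_tensor(h):
    """Element stiffness matrix for a 1D P1 cell of length h (illustrative)."""
    return [[1.0 / h, -1.0 / h], [-1.0 / h, 1.0 / h]]


def row_ranges(num_rows, num_threads):
    """Split rows 0..num_rows-1 into contiguous chunks, one per thread."""
    base, extra = divmod(num_rows, num_threads)
    ranges, start = [], 0
    for t in range(num_threads):
        stop = start + base + (1 if t < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def assemble_rows(A, cells, h, row_range):
    """Iterate over the entire mesh, but only write rows inside row_range."""
    lo, hi = row_range
    for dofs in cells:
        in_range = [lo <= d < hi for d in dofs]
        if not any(in_range):
            continue  # none_in_range: skip tabulate_tensor
        Ae = tabulate_tensor(h)  # all_in_range or some_in_range
        for i, row in enumerate(dofs):
            if not in_range[i]:
                continue  # some_in_range: this row belongs to another thread
            for j, col in enumerate(dofs):
                A[row][col] += Ae[i][j]


def assemble_multicore(num_cells, num_threads):
    """Assemble the stiffness matrix with one row chunk per thread."""
    n = num_cells + 1
    h = 1.0 / num_cells
    cells = [(c, c + 1) for c in range(num_cells)]
    A = [[0.0] * n for _ in range(n)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(assemble_rows, A, cells, h, r)
                   for r in row_ranges(n, num_threads)]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return A


def assemble_serial(num_cells):
    """Reference result: a single thread owns every row."""
    return assemble_multicore(num_cells, 1)
```

Calling assemble_multicore(8, 3) should give the same matrix as
assemble_serial(8). Note the cost of the scheme: every thread loops over
all cells, and cells that straddle two row chunks are tabulated twice.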

Didem Unat (PhD student at UCLA/Simula) tried this in a simple
prototype code and got very good speedups (up to a factor of 7 on an
eight-core machine), so it's just a matter of doing the same thing as
part of DOLFIN (which is a bit trickier since some of the data access
is hidden). The current implementation in DOLFIN seems to work and
gives some small speedup, but I need to do some more testing.

Rather than having two assembly classes, would it be worth using
OpenMP instead? I experimented with OpenMP some time ago, but never
added it since at the time it required a very recent version of gcc.
This shouldn't be a problem now.

I don't think this would work with OpenMP since we need to control how
the rows are inserted.

If this works out and we get good speedups, we could consider
replacing Assembler by MulticoreAssembler. It's not that much extra
code and it's pretty clean. I haven't tried yet, but it should also
work in combination with MPI (each node has a part of the mesh and
does multi-core assembly).

--
Anders
