On 10/11/10 16:10, Anders Logg wrote:
> On Wed, Nov 10, 2010 at 03:58:13PM +0000, Garth N. Wells wrote:
>> On 10/11/10 15:47, Anders Logg wrote:
>>> On Wed, Nov 10, 2010 at 02:47:30PM +0000, Garth N. Wells wrote:
>>>>
>>>> Nice to see multi-threaded assembly being added. We should look at
>>>> adding support for the multi-threaded version of SuperLU. What
>>>> other multi-threaded solvers are out there?
>>>
>>> Yes, that would be good, but I don't know which solvers are
>>> available.
>>>
>>>> I haven't looked at the code in great detail, but are element
>>>> tensors being added to the global tensor in a thread-safe fashion?
>>>> Both PETSc and Trilinos are not thread-safe.
>>>
>>> Yes, they should be. That's the main point. It's a very simple
>>> algorithm which just partitions the matrix row by row and makes
>>> each process responsible for a chunk of rows.
>>
>> Would it be better to partition the mesh (using Metis) and then
>> renumber the dofs? That way the 'all_in_range' case would be
>> maximised and the 'some_in_range' case minimised. If the rows are
>> distributed, mixed elements will be a mess because the global rows
>> are far apart (using the FFC-generated dof map).
>
> Renumbering is definitely important for getting good speedup. This
> hasn't been added yet but was implemented in the prototype version
> (which was stand-alone from DOLFIN).
What about partitioning of the mesh?

Garth
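[Editor's note: the row-wise partitioning mentioned above, where each thread is made responsible for a contiguous chunk of global matrix rows, can be sketched in plain Python. The function name `row_ranges` and the splitting policy are illustrative assumptions, not DOLFIN code.]

```python
def row_ranges(num_rows, num_threads):
    """Split the global matrix rows into contiguous chunks, one per
    thread, as in the row-by-row partitioning described above.

    Illustrative sketch only; DOLFIN's actual partitioning may differ.
    """
    chunk = num_rows // num_threads
    remainder = num_rows % num_threads
    ranges = []
    start = 0
    for t in range(num_threads):
        # Spread any leftover rows over the first few threads so the
        # chunk sizes differ by at most one.
        end = start + chunk + (1 if t < remainder else 0)
        ranges.append((start, end))
        start = end
    return ranges
```

Because each row belongs to exactly one range, no two threads ever write to the same row of the global matrix, which is why the scheme stays safe even though PETSc and Trilinos insertion is not itself thread-safe.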
>>> During assembly, all processes iterate over the entire mesh and on
>>> each cell do one of three things:
>>>
>>>   1. all_in_range:  call tabulate_tensor as usual and add
>>>   2. none_in_range: skip tabulate_tensor (continue)
>>>   3. some_in_range: call tabulate_tensor and insert only the rows
>>>      in range
>>>
>>> Didem Unat (PhD student at UCLA/Simula) tried this in a simple
>>> prototype code and got very good speedups (up to a factor of 7 on
>>> an eight-core machine),
>>
>> I hope the test wasn't for Poisson on a dolfin::UnitFoo mesh.
>
> That would have been much better! The best speedups are seen for
> "complex" forms where tabulate_tensor takes a significant portion of
> the time. The best speedup was obtained for the Navier-Stokes
> nonlinear term in 3D (unit cube). For 3D elasticity the results were
> almost as good, and a bit worse for Poisson. The speedup was reduced
> slightly by going to higher order, but not by much (from 7 to 6 when
> going to cubics).
>
>>> so it's just a matter of doing the same thing as part of DOLFIN
>>> (which is a bit trickier since some of the data access is hidden).
>>> The current implementation in DOLFIN seems to work and gives some
>>> small speedup, but I need to do some more testing.
>>
>> Rather than having two assembly classes, would it be worth using
>> OpenMP instead? I experimented with OpenMP some time ago, but never
>> added it since at the time it required a very recent version of gcc.
>> This shouldn't be a problem now.
>>
>> Garth
>
> I don't think this would work with OpenMP since we need to control
> how the rows are inserted.
>
>>> If this works out and we get good speedups, we could consider
>>> replacing Assembler by MulticoreAssembler. It's not that much extra
>>> code and it's pretty clean. I haven't tried yet, but it should also
>>> work in combination with MPI (each node has a part of the mesh and
>>> does multi-core assembly).
>
> --
> Anders
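[Editor's note: the three-case loop described in the thread (all_in_range / none_in_range / some_in_range) can be sketched as follows. This is hedged, illustrative Python, not DOLFIN API: `assemble_chunk`, the `dofmap` mapping, the `tabulate_tensor` callback, and the dictionary-based sparse matrix are all hypothetical stand-ins for the real objects.]

```python
def assemble_chunk(cells, dofmap, tabulate_tensor, row_range, A):
    """One thread's pass over the whole mesh: every thread visits every
    cell, but inserts only rows inside its own [start, end) row range.

    cells           -- iterable of cell identifiers
    dofmap          -- dict: cell -> list of global row/column indices
    tabulate_tensor -- callable: cell -> dense element matrix (list of lists)
    row_range       -- (start, end) rows owned by this thread
    A               -- dict (row, col) -> value, standing in for the
                       global sparse matrix
    """
    start, end = row_range
    for cell in cells:
        rows = dofmap[cell]
        in_range = [start <= r < end for r in rows]
        if not any(in_range):
            # 2. none_in_range: skip tabulate_tensor entirely
            continue
        element_tensor = tabulate_tensor(cell)
        for i, r in enumerate(rows):
            # 1. all_in_range: every row passes this test, so the full
            #    element tensor is added, as in ordinary assembly.
            # 3. some_in_range: only the rows this thread owns are
            #    inserted; other threads handle the remaining rows.
            if start <= r < end:
                for j, c in enumerate(rows):
                    A[(r, c)] = A.get((r, c), 0.0) + element_tensor[i][j]
```

Since each global row is owned by exactly one thread, running `assemble_chunk` concurrently with disjoint ranges reproduces the single-threaded result without any locking around insertion. The trade-off is that cells straddling a range boundary have their element tensor computed once per owning thread, which is why renumbering dofs to maximise the all_in_range case matters for speedup.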