dolfin parmetis bug?

 

Anders Logg wrote:
On Wed, Oct 07, 2009 at 05:39:05PM +0200, Patrick Riesen wrote:
Hi, I caught up with DOLFIN 0.9.3 on my Linux workstation. Installation and
compilation went fine, and running the demos in serial seems to be OK as well.
I was trying to run the demos in parallel, but I get errors with OpenMPI
as follows (this occurred when running any demo with mpirun -np xy ./demo,
where xy is larger than 1; it did not occur for -np 1):

------------
{process output.....}

then suddenly

[vierzack01:12050] *** An error occurred in MPI_Barrier
[vierzack01:12049] *** An error occurred in MPI_Barrier
[vierzack01:12049] *** on communicator MPI_COMM_WORLD
[vierzack01:12049] *** MPI_ERR_COMM: invalid communicator
[vierzack01:12049] *** MPI_ERRORS_ARE_FATAL (goodbye)
[vierzack01:12050] *** on communicator MPI_COMM_WORLD
[vierzack01:12050] *** MPI_ERR_COMM: invalid communicator
[vierzack01:12050] *** MPI_ERRORS_ARE_FATAL (goodbye)
[vierzack01:12049] *** Process received signal ***
[vierzack01:12049] Signal: Segmentation fault (11)
[vierzack01:12049] Signal code: Address not mapped (1)
[vierzack01:12049] Failing at address: 0x4
[vierzack01:12050] *** Process received signal ***
[vierzack01:12050] Signal: Segmentation fault (11)
[vierzack01:12050] Signal code: Address not mapped (1)
[vierzack01:12050] Failing at address: 0x4
[vierzack01:12049] [ 0] /lib/libpthread.so.0 [0x7f0fd3be6410]
[vierzack01:12049] [ 1]
/home/priesen/num/openmpi/lib/libopen-rte.so.0(orte_smr_base_set_proc_state+0x34)
[0x7f0fd475c1d4]
[vierzack01:12049] [ 2]
/home/priesen/num/openmpi/lib/libmpi.so.0(ompi_mpi_finalize+0x11b)
[0x7f0fd48a8b0b]
[vierzack01:12049] [ 3]
/scratch-second/priesen/FEniCS/build/lib/libdolfin.so.0(_ZN6dolfin17SubSystemsManager12finalize_mpiEv+0x35)
[0x7f0fd7bbfb15]
[vierzack01:12049] [ 4]
/scratch-second/priesen/FEniCS/build/lib/libdolfin.so.0(_ZN6dolfin17SubSystemsManagerD1Ev+0xe)
[0x7f0fd7bbfb2e]
[vierzack01:12049] [ 5] /lib/libc.so.6(__cxa_finalize+0x6c) [0x7f0fd39cee0c]
[vierzack01:12049] [ 6]
/scratch-second/priesen/FEniCS/build/lib/libdolfin.so.0 [0x7f0fd7aa65d3]
[vierzack01:12049] *** End of error message ***
[vierzack01:12050] [ 0] /lib/libpthread.so.0 [0x7fd707916410]
[vierzack01:12050] [ 1]
/home/priesen/num/openmpi/lib/libopen-rte.so.0(orte_smr_base_set_proc_state+0x34)
[0x7fd70848c1d4]
[vierzack01:12050] [ 2]
/home/priesen/num/openmpi/lib/libmpi.so.0(ompi_mpi_finalize+0x11b)
[0x7fd7085d8b0b]
[vierzack01:12050] [ 3]
/scratch-second/priesen/FEniCS/build/lib/libdolfin.so.0(_ZN6dolfin17SubSystemsManager12finalize_mpiEv+0x35)
[0x7fd70b8efb15]
[vierzack01:12050] [ 4]
/scratch-second/priesen/FEniCS/build/lib/libdolfin.so.0(_ZN6dolfin17SubSystemsManagerD1Ev+0xe)
[0x7fd70b8efb2e]
[vierzack01:12050] [ 5] /lib/libc.so.6(__cxa_finalize+0x6c) [0x7fd7076fee0c]
[vierzack01:12050] [ 6]
/scratch-second/priesen/FEniCS/build/lib/libdolfin.so.0 [0x7fd70b7d65d3]
[vierzack01:12050] *** End of error message ***
mpirun noticed that job rank 0 with PID 12049 on node vierzack01 exited
on signal 15 (Terminated).
1 additional process aborted (not shown)
------------

Is this an OpenMPI error?
Is there a specific version of OpenMPI required for DOLFIN?
Mine is 1.2.8, and it worked up to DOLFIN 0.9.2.

No idea what goes wrong. My version of OpenMPI is 1.3.2-3ubuntu1.

Hi, so I installed OpenMPI 1.3.3 and the problem is still the same.
I tried to catch the error; here is a backtrace obtained by attaching gdb via PETSc, with DOLFIN built in debug mode:

#0  0x00007ff27dd7c07b in raise () from /lib/libc.so.6
#1  0x00007ff27dd7d84e in abort () from /lib/libc.so.6
#2  0x00007ff27f926ea8 in Petsc_MPI_AbortOnError (comm=0x7fff8a060448,
    flag=0x7fff8a060434) at init.c:142
#3  0x00007ff27ec44e0f in ompi_errhandler_invoke ()
   from /home/priesen/num/openmpi-1.3.3/lib/libmpi.so.0
#4  0x00007ff281d83714 in ParMETIS_V3_PartMeshKway ()
   from /scratch-second/priesen/FEniCS/build/lib/libdolfin.so.0
#5  0x00007ff281c8ba0b in dolfin::MeshPartitioning::compute_partition (
    cell_partition=@0x7fff8a0a8900, mesh_data=@0x7fff8a0a8970)
    at dolfin/mesh/MeshPartitioning.cpp:588
#6  0x00007ff281c8bc41 in dolfin::MeshPartitioning::partition (
    mesh=@0x7fff8a0a8b50, mesh_data=@0x7fff8a0a8970)
    at dolfin/mesh/MeshPartitioning.cpp:74
#7 0x00007ff281c6fb42 in Mesh (this=0x7fff8a0a8b50, filename=@0x7fff8a0a9cd0)
    at dolfin/mesh/Mesh.cpp:67
#8  0x0000000000429b60 in main ()


Frame 5 seems to be the interesting one, so:

(gdb) f 5
#5  0x00007ff281c8ba0b in dolfin::MeshPartitioning::compute_partition (
    cell_partition=@0x7fff8a0a8900, mesh_data=@0x7fff8a0a8970)
    at dolfin/mesh/MeshPartitioning.cpp:588
588                                &edgecut, part, &(*comm));


and then listing the surrounding source lines:

(gdb) l
583       // Call ParMETIS to partition mesh
584       ParMETIS_V3_PartMeshKway(elmdist, eptr, eind,
585                                elmwgt, &wgtflag, &numflag, &ncon,
586                                &ncommonnodes, &nparts,
587                                tpwgts, ubvec, options,
588                                &edgecut, part, &(*comm));
589       info("Partitioned mesh, edge cut is %d.", edgecut);
590
591       // Copy mesh_data
592       cell_partition.clear();


When I check the input arguments, elmwgt is a null pointer:

(gdb) p elmwgt
$4 = (int *) 0x0
(gdb) p *elmwgt
Cannot access memory at address 0x0
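
For what it's worth, my understanding of the ParMETIS 3.x interface (an assumption on my part, I have not re-checked the manual) is that elmwgt is allowed to be NULL as long as no element weights are requested, i.e. when wgtflag is 0, so the null pointer alone may be harmless. A tiny sketch of the check I have in mind, using the variable names from this frame:

// Sketch only: documents the (assumed) ParMETIS convention that a NULL
// elmwgt is legal when wgtflag says no element weights are supplied.
// Would need #include <cassert> in MeshPartitioning.cpp.
assert(elmwgt != 0 || wgtflag == 0);

If wgtflag turns out to be nonzero here, then the NULL elmwgt really would be an invalid input; if it is 0, the crash is more likely caused by one of the other arguments.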


At this point I am stuck. Please tell me what else I could check to determine what goes wrong, or maybe you already know the cause.

regards,
patrick


DOLFIN wasn't parallel before 0.9.3 so I'm not sure what you mean by
it working up to 0.9.2.

--
Anders


------------------------------------------------------------------------

_______________________________________________
FEniCS-users mailing list
FEniCS-users@xxxxxxxxxx
http://fenics.org/mailman/listinfo/fenics-users