
yade-mpi team mailing list archive

Re: deadlock fixed (?)


I found a new problem I don't understand with "-ms". :(
It doesn't occur every time; here is one way to reproduce it:
mpiexec --tag-output -n 3 ../../yade-mpi testMPI_2D_BUG_DK.py 50 50 -ms

The differences between the attached script and the one in trunk are:

With the trunk version the problem does not occur.

[1,1]<stderr>:Running script testMPI_2D_BUG_DK.py
[1,0]<stderr>:Running script testMPI_2D_BUG_DK.py
[1,2]<stderr>:Running script testMPI_2D_BUG_DK.py
[1,1]<stdout>:Worker1: triggers collider at iter 354
[1,2]<stdout>:Worker2: triggers collider at iter 354
[1,0]<stdout>:init Done in MASTER 0
[1,2]<stdout>:Worker2: triggers collider at iter 501
[1,1]<stdout>:Worker1: triggers collider at iter 501
[1,2]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:Traceback (most recent call last):
[1,2]<stderr>:  File "../../yade-mpi", line 244, in runScript
[1,2]<stderr>:    execfile(script,globals())
[1,2]<stderr>:  File "testMPI_2D_BUG_DK.py", line 114, in <module>
[1,2]<stderr>:    mp.mpirun(NSTEPS)
[1,2]<stderr>:  File "/home/yade/lib/x86_64-linux-gnu/yade-mpi/py/yade/mpy.py", line 676, in mpirun
[1,2]<stderr>:    mergeScene()
[1,2]<stderr>:  File "/home/yade/lib/x86_64-linux-gnu/yade-mpi/py/yade/mpy.py", line 423, in mergeScene
[1,2]<stderr>:    O.subD.mergeOp()
[1,2]<stderr>:RuntimeError:
[1,1]<stderr>:  File "../../yade-mpi", line 244, in runScript
[1,1]<stderr>:    execfile(script,globals())
[1,1]<stderr>:  File "testMPI_2D_BUG_DK.py", line 114, in <module>
[1,1]<stderr>:    mp.mpirun(NSTEPS)
[1,1]<stderr>:  File "/home/yade/lib/x86_64-linux-gnu/yade-mpi/py/yade/mpy.py", line 676, in mpirun
[1,1]<stderr>:    mergeScene()
[1,1]<stderr>:  File "/home/yade/lib/x86_64-linux-gnu/yade-mpi/py/yade/mpy.py", line 423, in mergeScene
[1,1]<stderr>:    O.subD.mergeOp()
[1,1]<stderr>:RuntimeError:

mpiexec has exited due to process rank 2 with PID 19994 on node dt-medXXX
exiting improperly. There are three reasons this could occur:
1. this process did not call "init" before exiting, but others in the job
did. This can cause a job to hang indefinitely while it waits for all
processes to call "init". By rule, if one process calls "init", then ALL
processes must call "init" prior to termination.
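For reference, the two collider-synchronization options discussed in the quoted reply below can be sketched in plain Python. This is only an illustration of the control flow: ranks are simulated as list entries instead of MPI processes, and none of these function names belong to yade's mpy API.

```python
# Sketch only: one boolean flag per worker subdomain (True = "I need
# collision detection at iteration N+1"), aggregated by master.

def option_delayed(worker_flags):
    """Option 1: flags travel to master with the forces at iteration N and
    the decision comes back with the positions of the NEXT master sync, so
    the global collision detection happens at N+2 (one-iteration delay)."""
    return "N+2" if any(worker_flags) else None

def option_extra_message(worker_flags):
    """Option 2: flags are sent with the 'positions' stage and master
    answers with one additional master->workers message, allowing a global
    collision detection already at N+1."""
    return "N+1" if any(worker_flags) else None
```

The trade-off described in the thread is between the extra communication step of option 2 and the possible doubled collision-detection cost of option 1.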

On Thu, 6 Jun 2019 at 11:17, Bruno Chareyre <bruno.chareyre@xxxxxxxxxxxxxxx> wrote:

> Hi,
> @François, now I understand why there is no deadlock (point 1/), thanks.
> That was difficult for me to realize, Deepak helped. :)
> About checkCollider and global barriers: *we definitely want to avoid any
> barrier*.
> The reason is: there is already a kind of barrier(*) at each iteration
> since master has to receive forces (before Newton), and send back wall
> positions (after Newton) (let's call this sequence "master sync":
> forces+Newton+positions).
> Between two master syncs all workers should run at max speed without
> waiting for another global event.
> When we send positions at iteration N we know in each SD if collision
> detection is needed at the beginning of iteration N+1. It can be
> communicated to master. Then, at least two options:
> - master will tell everyone at the next master sync. In that case global
> collision detection would be delayed by one iteration, it will occur at
> N+2. That delay is technically perfectly fine since the SDs which really
> need immediate collision detection will do it spontaneously at N+1 regardless of
> global instructions. The downside of this approach is that if only one
> subdomain is colliding at N+1, this SD will be slower and others will have
> to wait for it to finish for the next master sync. Then collision detection
> again at N+2, this would probably double the total cost of collision
> detection.
> - send yes/no to master in the "positions" stage (for the moment nothing
> is sent to master in that step) + complete master sync with an additional
> communication from master to workers.
> Side question: what's the use of engine "waitForcesRunner"? I removed it
> and it works just as well.
> (*) It's only a partial barrier since some subdomains may not interact
> with master, but we can change that to force all domains to send at least a
> yes/no to master.
> Bruno
> On Tue, 4 Jun 2019 at 16:41, François <francois.kneib@xxxxxxxxx> wrote:
>> Concerning the non blocking MPI_ISend, using MPI_Wait was not necessary
>>> with the use of a basic global barrier. I'm afraid that looping on send
>>> requests and waiting for them to complete can slow down the communications, as
>>> you force (the send) order one more time (the receive order is already
>>> forced here <https://gitlab.com/yade-dev/trunk/blob/mpi/py/mpy.py#L641>
>>> ).
>> ... but not using a global barrier allows the first threads that finished
>> their sends/recvs to start the next DEM iteration before the others, so +1 for
>> your fix; finally I don't know what's better. Anyway that's probably not
>> meaningful compared to the interaction loop timings.
>> --
>> Mailing list: https://launchpad.net/~yade-mpi
>> Post to     : yade-mpi@xxxxxxxxxxxxxxxxxxx
>> Unsubscribe : https://launchpad.net/~yade-mpi
>> More help   : https://help.launchpad.net/ListHelp

Bruno Chareyre
Associate Professor
ENSE³ - Grenoble INP
Lab. 3SR
BP 53
38041 Grenoble cedex 9
Tél : +33 4 56 52 86 21

Email too brief?
Here's why: email charter
Attached script (testMPI_2D_BUG_DK.py):

# In order for the mpy module to work, don't forget to make a symlink to the yade executable named "yadeimport.py":
# ln -s path/to/yade/yade-version path/to/yade/yadeimport.py
# Possible executions of this script
### Parallel:
# mpiexec -n 4 yade-mpi -n -x testMPIxNxM.py
# mpiexec -n 4 yade-mpi  -n -x testMPIxN.py N M # (n-1) subdomains with NxM spheres each
### Monolithic:
# yade-mpi -n -x testMPIxN.py 
# yade-mpi -n -x testMPIxN.py N M
# yade-mpi -n -x testMPIxN.py N M n
# in last line the optional argument 'n' has the same meaning as with mpiexec, i.e. total number of bodies will be (n-1)*N*M but on single core
### OpenMP:
# yade-mpi -j4 -n -x testMPIxN.py N M n
### Nested MPI x OpenMP:
# needs testing...
'''
This script simulates spheres falling on a plate using a distributed memory approach based on the mpy module.
The number of spheres assigned to one particular process (aka 'worker') is N*M; they form a regular pattern.
The master process (rank=0) has no spheres assigned; it is in charge of getting the total force on the plate.
The number of subdomains depends on argument 'n' of mpiexec. Since rank=0 is not assigned a regular subdomain, the total number of spheres is (n-1)*N*M.
'''
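As a quick sanity check of that sizing rule (plain Python, not part of the attached script):

```python
def expected_spheres(n, N, M):
    # rank 0 (master) owns no regular subdomain, so only (n-1) workers
    # contribute an N x M block of spheres each
    return (n - 1) * N * M

# the reproduction command above uses mpiexec -n 3 with N=M=50:
print(expected_spheres(3, 50, 50))  # 5000
```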


NSTEPS=1000 #turn it >0 to see time iterations, else only initialization
#NSTEPS=50 #turn it >0 to see time iterations, else only initialization
N=50; M=50; #(columns, rows) per thread

import sys

if("-ms" in sys.argv): mergeSplit=True
else: mergeSplit=False

if("-bc" in sys.argv): bodyCopy=True
else: bodyCopy=False

# Check MPI world
# This is to know if it was run with or without mpiexec (see preamble of this script)
import os
rank = os.getenv('OMPI_COMM_WORLD_RANK')
if rank is not None: #mpiexec was used
	numThreads=int(os.getenv('OMPI_COMM_WORLD_SIZE'))
else: #non-mpi execution, numThreads will still be used as multiplier for the problem size (2 => multiplier is 1)
	numThreads=2 if len(sys.argv)<4 else (int(sys.argv[3]))
	print "numThreads",numThreads
if len(sys.argv)>1: #we then assume N,M are provided as 1st and 2nd cmd line arguments
	N=int(sys.argv[1]); M=int(sys.argv[2])

############  Build a scene (we use Yade's pre-filled scene)  ############

# sequential grain colors
import colorsys
colorScale = (Vector3(colorsys.hsv_to_rgb(value*1.0/numThreads, 1, 1)) for value in range(0, numThreads))

#add spheres
for sd in range(0,numThreads-1):
	col = next(colorScale)
	ids = []
	for i in range(N):#(numThreads-1) x N x M spheres, one thread is for master and will keep only the wall, others handle spheres
		for j in range(M):
			ids.append(O.bodies.append(sphere((sd*N+i+j/30.,j,0),0.500,color=col))) #a small shift in x-positions of the rows to break symmetry
	if rank is not None:# assigning subdomain!=0 in single thread would freeze the particles (Newton skips them)
		for id in ids: O.bodies[id].subdomain = sd+1

WALL_ID=O.bodies.append(box(center=(numThreads*N*0.5,-0.5,0),extents=(2.*numThreads*N,0,2),fixed=True)) #the 'plate' referenced below; its definition was missing here, geometry is a plausible reconstruction (fixed box under all columns of spheres)


collider.verletDist = 0.5
newton.gravity=(0,-10,0) #else nothing would move
tsIdx=O.engines.index(timeStepper) #remove the automatic timestepper. Very important: we don't want subdomains to use many different timesteps...
O.engines=O.engines[0:tsIdx]+O.engines[tsIdx+1:] #actually drop it from the engine list
O.dt=0.001 #this very small timestep will make it possible to run 2000 iter without merging
#O.dt=0.1*PWaveTimeStep() #very important, we don't want subdomains to use many different timesteps...

#########  RUN  ##########
def collectTiming():
	created = os.path.isfile("collect.dat")
	f = open("collect.dat","a")
	if not created: f.write("numThreads mpi omp Nspheres N M runtime\n")
	from yade import timing
	f.write(str(numThreads)+" "+str(os.getenv('OMPI_COMM_WORLD_SIZE'))+" "+str(os.getenv('OMP_NUM_THREADS'))+" "+str(N*M*(numThreads-1))+" "+str(N)+" "+str(M)+" "+str(timing.runtime())+"\n")
	f.close()
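
The header-once pattern used by collectTiming() can be isolated and tested outside yade roughly as follows (the function name, file name, and columns here are illustrative, not part of the mpy API):

```python
import os

def append_timing(path, row, header="numThreads mpi omp Nspheres N M runtime"):
    """Append one whitespace-separated row to 'path'; write the header line
    only if the file does not exist yet (same logic as collectTiming)."""
    write_header = not os.path.isfile(path)
    with open(path, "a") as f:
        if write_header:
            f.write(header + "\n")
        f.write(" ".join(str(v) for v in row) + "\n")
```

Calling it repeatedly keeps appending data rows under a single header, so results of several runs accumulate in one table.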

if rank is None: #######  Single-core  ######
	O.run(NSTEPS,True) #run NSTEPS iterations, blocking
	from yade import timing
	print "num. bodies:",len([b for b in O.bodies]),len(O.bodies)
	print "Total force on floor=",O.forces.f(WALL_ID)[1]
else: #######  MPI  ######
	#import yade's mpi module
	from yade import mpy as mp
	# customize
	mp.ACCUMULATE_FORCES=True #trigger force summation on master's body (here WALL_ID)
	mp.ERASE_REMOTE=True #erase bodies not interacting with a given subdomain?
	mp.OPTIMIZE_COM=True #L1-optimization: pass a list of double instead of a list of states
	mp.USE_CPP_MPI=True and mp.OPTIMIZE_COM #L2-optimization: workaround python by passing a vector<double> at the c++ level
	mp.COPY_MIRROR_BODIES_WHEN_COLLIDE = bodyCopy and not mergeSplit

	mp.mpirun(NSTEPS) #the parallel equivalent of O.run() (line 114 in the traceback above)
	print "num. bodies:",len([b for b in O.bodies]),len(O.bodies)
	if rank==0:
		mp.mprint( "Total force on floor="+str(O.forces.f(WALL_ID)[1]))
	else: mp.mprint( "Partial force on floor="+str(O.forces.f(WALL_ID)[1]))
	if rank==0: O.save('mergedScene.yade')
