yade-mpi team mailing list archive

Re: deadlock fixed (?)

 

Hello,

> I found a new problem I don't understand with "-ms". :(
> It doesn't occur all the time; here is a way to reproduce it:
> mpiexec --tag-output -n 3 ../../yade-mpi testMPI_2D_BUG_DK.py 50 50 -ms
>

Hi, I have now fixed this in commit f98281dc. The issue was that during each
split after a merge, the ranks of the Subdomains (defined in C++) were being
erased, a regression introduced in commit 68e358b9.

On Thu, Jun 6, 2019 at 11:54 AM Bruno Chareyre <bruno.chareyre@xxxxxxxxxxxxxxx> wrote:

> I found a new problem I don't understand with "-ms". :(
> It doesn't occur all the time; here is a way to reproduce it:
> mpiexec --tag-output -n 3 ../../yade-mpi testMPI_2D_BUG_DK.py 50 50 -ms
>
> The differences between the attached script and the one in trunk are:
> loopOnSortedInteractions=True
> MERGE_W_INTERACTIONS=True
>
> With the trunk version the problem does not occur.
>
> [1,1]<stderr>:Running script testMPI_2D_BUG_DK.py
> [1,0]<stderr>:Running script testMPI_2D_BUG_DK.py
> [1,2]<stderr>:Running script testMPI_2D_BUG_DK.py
> [1,1]<stdout>:Worker1: triggers collider at iter 354
> [1,2]<stdout>:Worker2: triggers collider at iter 354
> [1,0]<stdout>:init Done in MASTER 0
> [1,2]<stdout>:Worker2: triggers collider at iter 501
> [1,1]<stdout>:Worker1: triggers collider at iter 501
> [1,2]<stderr>:Traceback (most recent call last):
> [1,1]<stderr>:Traceback (most recent call last):
> [1,2]<stderr>:  File "../../yade-mpi", line 244, in runScript
> [1,2]<stderr>:    execfile(script,globals())
> [1,2]<stderr>:  File "testMPI_2D_BUG_DK.py", line 114, in <module>
> [1,2]<stderr>:    mp.mpirun(NSTEPS)
> [1,2]<stderr>:  File "/home/yade/lib/x86_64-linux-gnu/yade-mpi/py/yade/mpy.py", line 676, in mpirun
> [1,2]<stderr>:    mergeScene()
> [1,1]<stderr>:  File "../../yade-mpi", line 244, in runScript
> [1,2]<stderr>:  File "/home/yade/lib/x86_64-linux-gnu/yade-mpi/py/yade/mpy.py", line 423, in mergeScene
> [1,2]<stderr>:    O.subD.mergeOp()
> [1,2]<stderr>:RuntimeError: vector::_M_default_append
> [1,1]<stderr>:    execfile(script,globals())
> [1,1]<stderr>:  File "testMPI_2D_BUG_DK.py", line 114, in <module>
> [1,1]<stderr>:    mp.mpirun(NSTEPS)
> [1,1]<stderr>:  File "/home/yade/lib/x86_64-linux-gnu/yade-mpi/py/yade/mpy.py", line 676, in mpirun
> [1,1]<stderr>:    mergeScene()
> [1,1]<stderr>:  File "/home/yade/lib/x86_64-linux-gnu/yade-mpi/py/yade/mpy.py", line 423, in mergeScene
> [1,1]<stderr>:    O.subD.mergeOp()
> [1,1]<stderr>:RuntimeError: vector::_M_default_append
> --------------------------------------------------------------------------
> mpiexec has exited due to process rank 2 with PID 19994 on node dt-medXXX
> exiting improperly. There are three reasons this could occur:
> 1. this process did not call "init" before exiting, but others in the job
> did. This can cause a job to hang indefinitely while it waits for all
> processes to call "init". By rule, if one process calls "init", then ALL
> processes must call "init" prior to termination.
>
> On Thu, 6 Jun 2019 at 11:17, Bruno Chareyre <bruno.chareyre@xxxxxxxxxxxxxxx> wrote:
>
>> Hi,
>> @François, now I understand why there is no deadlock (point 1/), thanks.
>> That was difficult for me to realize; Deepak helped. :)
>>
>> About checkCollider and global barriers: *we definitely want to avoid
>> any barrier*.
>> The reason is that there is already a kind of barrier(*) at each iteration,
>> since master has to receive forces (before Newton) and send back wall
>> positions (after Newton); let's call this forces+Newton+positions sequence
>> a "master sync".
>> Between two master syncs, all workers should run at full speed without
>> waiting for any other global event.
>> When we send positions at iteration N, each SD already knows whether
>> collision detection will be needed at the beginning of iteration N+1, so
>> this can be communicated to master. Then there are at least two options:
>> - master tells everyone at the next master sync. In that case global
>> collision detection is delayed by one iteration and occurs at N+2. That
>> delay is technically fine, since any SD that really needs immediate
>> colliding will do it spontaneously at N+1 regardless of the global
>> instruction. The downside of this approach is that if only one subdomain
>> is colliding at N+1, that SD will be slower and the others will have to
>> wait for it to finish before the next master sync; with collision
>> detection again at N+2, this would probably double the total cost of
>> collision detection.
>> - send a yes/no flag to master in the "positions" stage (for the moment
>> nothing is sent to master in that step) and complete the master sync with
>> an additional communication from master back to the workers (see the
>> sketch below).
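>>
>> A minimal sketch of that second option, assuming mpi4py-style calls (the
>> names needsCollide and masterSyncFlag are hypothetical, not the actual
>> mpy.py API):
>>
>>   from mpi4py import MPI
>>
>>   comm = MPI.COMM_WORLD
>>   rank = comm.Get_rank()
>>
>>   def masterSyncFlag(needsCollide):
>>       # Workers -> master: piggyback a yes/no flag on the "positions" stage.
>>       flags = comm.gather(needsCollide, root=0)
>>       # Master -> workers: one extra message tells every SD whether global
>>       # collision detection must run at the next iteration.
>>       return comm.bcast(any(flags) if rank == 0 else None, root=0)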
>>
>> Side question: what is the "waitForcesRunner" engine for? I removed it
>> and everything works just as well.
>>
>> (*) It's only a partial barrier since some subdomains may not interact
>> with master, but we can change that to force all domains to send at least a
>> yes/no to master.
>>
>> Bruno
>>
>> On Tue, 4 Jun 2019 at 16:41, François <francois.kneib@xxxxxxxxx> wrote:
>>
>>>> Concerning the non-blocking MPI_Isend: MPI_Wait was not necessary when
>>>> a basic global barrier was used. I'm afraid that looping on the send
>>>> requests and waiting for them to complete can slow down the
>>>> communications, as it forces the send order one more time (the receive
>>>> order is already forced here
>>>> <https://gitlab.com/yade-dev/trunk/blob/mpi/py/mpy.py#L641>).
>>>>
>>> ... but not using a global barrier allows the first threads that finish
>>> their sends/recvs to start the next DEM iteration before the others, so
>>> +1 for your fix; in the end I don't know which is better. Anyway, the
>>> difference is probably not meaningful compared to the interaction loop
>>> timings. Both variants are sketched below.
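>>>
>>> For reference, a sketch of the two variants under discussion (mpi4py-style;
>>> outgoing and incoming are hypothetical placeholders for the per-rank send
>>> and receive buffers, not actual mpy.py names):
>>>
>>>   from mpi4py import MPI
>>>
>>>   comm = MPI.COMM_WORLD
>>>
>>>   # Variant A: non-blocking sends; each rank waits only on its own
>>>   # requests, so the first ranks to finish can start the next DEM
>>>   # iteration before the others.
>>>   reqs = [comm.Isend(buf, dest=d) for d, buf in outgoing.items()]
>>>   for src in sorted(incoming):  # receive order forced, as in mpy.py#L641
>>>       comm.Recv(incoming[src], source=src)
>>>   MPI.Request.Waitall(reqs)
>>>
>>>   # Variant B: a basic global barrier; every rank waits for the slowest.
>>>   comm.Barrier()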
>
>
> --
> --
> _______________
> Bruno Chareyre
> Associate Professor
> ENSE³ - Grenoble INP
> Lab. 3SR
> BP 53
> 38041 Grenoble cedex 9
> Tél : +33 4 56 52 86 21
> ________________
>
> Email too brief?
> Here's why: email charter
> <https://marcuselliott.co.uk/wp-content/uploads/2017/04/emailCharter.jpg>
> --
> Mailing list: https://launchpad.net/~yade-mpi
> Post to     : yade-mpi@xxxxxxxxxxxxxxxxxxx
> Unsubscribe : https://launchpad.net/~yade-mpi
> More help   : https://help.launchpad.net/ListHelp
>
