
yade-mpi team mailing list archive

Re: deadlock fixed (?)

 

I probably lost track of some discussion, but I realize just now that I'm
running MPI with Python 2.7...
So we can't compile it with Py3 at the moment?
Bruno

On Thu, 6 Jun 2019 at 17:21, Bruno Chareyre <bruno.chareyre@xxxxxxxxxxxxxxx>
wrote:

> After https://gitlab.com/yade-dev/trunk/commit/1bad411b everything runs
> in PyRunner for standard iterations.
> It's not possible to merge-split with this method, since it would lead the
> scene to erase itself.
> I also made force sending interactive and avoided one collision
> detection, with no clear improvement in speed though...
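> For the record, a minimal sketch of what "running in PyRunner" means on the
> script side (the wiring and the sendRecvStates() name are only placeholders
> for illustration, not necessarily what the commit actually installs):
>
> # illustrative sketch only; assumes a standard Yade script where the
> # usual O.engines list is already defined
> from yade import mpy as mp
>
> O.engines = O.engines + [
>     # run the MPI communication step once per iteration, inside the
>     # engine loop instead of from a Python-level driver loop
>     PyRunner(command='mp.sendRecvStates()', iterPeriod=1, label='mpiComm')
> ]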
> Bruno
>
>
> On Thu, 6 Jun 2019 at 15:30, Deepak Kn <deepak.kn1990@xxxxxxxxx> wrote:
>
>> Hello,
>>
>>> I found a new problem I don't understand with "-ms". :(
>>> It doesn't occur all the time; here is a way to reproduce it:
>>> mpiexec --tag-output -n 3 ../../yade-mpi testMPI_2D_BUG_DK.py 50 50 -ms
>>>
>>
>> Hi, I have fixed this now in commit f98281dc. The issue was that, since
>> commit 68e358b9, the ranks of the Subdomains (defined in C++) were getting
>> erased at each split after a merge.
>>
>> On Thu, Jun 6, 2019 at 11:54 AM Bruno Chareyre <
>> bruno.chareyre@xxxxxxxxxxxxxxx> wrote:
>>
>>> I found a new problem I don't understand with "-ms". :(
>>> It doesn't occur all the time; here is a way to reproduce it:
>>> mpiexec --tag-output -n 3 ../../yade-mpi testMPI_2D_BUG_DK.py 50 50 -ms
>>>
>>> The differences between the attached script and the one in trunk are:
>>> loopOnSortedInteractions=True
>>> MERGE_W_INTERACTIONS=True
>>>
>>> With the trunk version the problem does not occur.
>>>
>>> [1,1]<stderr>:Running script testMPI_2D_BUG_DK.py
>>> [1,0]<stderr>:Running script testMPI_2D_BUG_DK.py
>>> [1,2]<stderr>:Running script testMPI_2D_BUG_DK.py
>>> [1,1]<stdout>:Worker1: triggers collider at iter 354
>>> [1,2]<stdout>:Worker2: triggers collider at iter 354
>>> [1,0]<stdout>:init Done in MASTER 0
>>> [1,2]<stdout>:Worker2: triggers collider at iter 501
>>> [1,1]<stdout>:Worker1: triggers collider at iter 501
>>> [1,2]<stderr>:Traceback (most recent call last):
>>> [1,1]<stderr>:Traceback (most recent call last):
>>> [1,2]<stderr>:  File "../../yade-mpi", line 244, in runScript
>>> [1,2]<stderr>:    execfile(script,globals())
>>> [1,2]<stderr>:  File "testMPI_2D_BUG_DK.py", line 114, in <module>
>>> [1,2]<stderr>:    mp.mpirun(NSTEPS)
>>> [1,2]<stderr>:  File "/home/yade/lib/x86_64-linux-gnu/yade-mpi/py/yade/mpy.py", line 676, in mpirun
>>> [1,2]<stderr>:    mergeScene()
>>> [1,1]<stderr>:  File "../../yade-mpi", line 244, in runScript
>>> [1,2]<stderr>:  File "/home/yade/lib/x86_64-linux-gnu/yade-mpi/py/yade/mpy.py", line 423, in mergeScene
>>> [1,2]<stderr>:    O.subD.mergeOp()
>>> [1,2]<stderr>:RuntimeError: vector::_M_default_append
>>> [1,1]<stderr>:    execfile(script,globals())
>>> [1,1]<stderr>:  File "testMPI_2D_BUG_DK.py", line 114, in <module>
>>> [1,1]<stderr>:    mp.mpirun(NSTEPS)
>>> [1,1]<stderr>:  File "/home/yade/lib/x86_64-linux-gnu/yade-mpi/py/yade/mpy.py", line 676, in mpirun
>>> [1,1]<stderr>:    mergeScene()
>>> [1,1]<stderr>:  File "/home/yade/lib/x86_64-linux-gnu/yade-mpi/py/yade/mpy.py", line 423, in mergeScene
>>> [1,1]<stderr>:    O.subD.mergeOp()
>>> [1,1]<stderr>:RuntimeError: vector::_M_default_append
>>> --------------------------------------------------------------------------
>>> mpiexec has exited due to process rank 2 with PID 19994 on
>>> node dt-medXXX exiting improperly. There are three reasons this could occur:
>>> 1. this process did not call "init" before exiting, but others in
>>> the job did. This can cause a job to hang indefinitely while it waits
>>> for all processes to call "init". By rule, if one process calls "init",
>>> then ALL processes must call "init" prior to termination.
>>>
>>> On Thu, 6 Jun 2019 at 11:17, Bruno Chareyre <
>>> bruno.chareyre@xxxxxxxxxxxxxxx> wrote:
>>>
>>>> Hi,
>>>> @François, now I understand why there is no deadlock (point 1/),
>>>> thanks. That was difficult for me to realize; Deepak helped. :)
>>>>
>>>> About checkCollider and global barriers: *we definitely want to avoid
>>>> any barrier*.
>>>> The reason is that there is already a kind of barrier(*) at each
>>>> iteration, since master has to receive forces (before Newton) and send
>>>> back wall positions (after Newton); let's call this
>>>> forces+Newton+positions sequence the "master sync".
>>>> Between two master syncs all workers should run at full speed without
>>>> waiting for any other global event.
>>>> When we send positions at iteration N, each SD already knows whether
>>>> collision detection will be needed at the beginning of iteration N+1, and
>>>> this can be communicated to master. Then there are at least two options
>>>> (see the sketch after this list):
>>>> - master tells everyone at the next master sync. In that case global
>>>> collision detection is delayed by one iteration and occurs at N+2. That
>>>> delay is technically perfectly fine, since any SD that really needs
>>>> immediate colliding will do it spontaneously at N+1 regardless of global
>>>> instructions. The downside of this approach is that if only one subdomain
>>>> is colliding at N+1, that SD will be slower and the others will have to
>>>> wait for it to finish at the next master sync; then collision detection
>>>> runs again at N+2, which would probably double the total cost of
>>>> collision detection.
>>>> - send a yes/no to master in the "positions" stage (for the moment
>>>> nothing is sent to master in that step) and complete the master sync with
>>>> an additional communication from master to workers.
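>>>>
>>>> For concreteness, a rough mpi4py sketch of the second option; the
>>>> function and variable names are made up for illustration and this is not
>>>> the actual mpy.py code:
>>>>
>>>> from mpi4py import MPI
>>>> comm = MPI.COMM_WORLD
>>>> rank = comm.Get_rank()
>>>>
>>>> def positionsStage(needCollide):
>>>>     # each SD piggybacks a yes/no flag on the "positions" step
>>>>     flags = comm.gather(needCollide, root=0)       # workers -> master
>>>>     # master decides: run global collision detection if any SD asked
>>>>     globalCollide = any(flags) if rank == 0 else None
>>>>     # the extra master -> workers message completing the master sync
>>>>     return comm.bcast(globalCollide, root=0)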
>>>>
>>>> Side question: what's the use of engine "waitForcesRunner"? I removed
>>>> it and it works just as well.
>>>>
>>>> (*) It's only a partial barrier since some subdomains may not interact
>>>> with master, but we can change that to force all domains to send at least a
>>>> yes/no to master.
>>>>
>>>> Bruno
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, 4 Jun 2019 at 16:41, François <francois.kneib@xxxxxxxxx> wrote:
>>>>
>>>>>> Concerning the non-blocking MPI_Isend, using MPI_Wait was not
>>>>>> necessary with a basic global barrier. I'm afraid that looping
>>>>>> on the send requests and waiting for them to complete can slow down the
>>>>>> communications, as you force the send order one more time (the receive
>>>>>> order is already forced here
>>>>>> <https://gitlab.com/yade-dev/trunk/blob/mpi/py/mpy.py#L641>).
>>>>>>
>>>>> ... but not using a global barrier lets the first threads that have
>>>>> finished their sends/recvs start the next DEM iteration before the
>>>>> others, so +1 for your fix; in the end I don't know which is better.
>>>>> Anyway, that's probably not significant compared to the interaction
>>>>> loop timings.
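>>>>>
>>>>> For reference, the two variants being compared would look roughly like
>>>>> this in mpi4py (illustrative only, not the actual mpy.py code):
>>>>>
>>>>> from mpi4py import MPI
>>>>> comm = MPI.COMM_WORLD
>>>>>
>>>>> def sendStatesWait(destRanks, payload):
>>>>>     # non-blocking sends, then wait only on our own requests: a rank
>>>>>     # that finishes early can start the next DEM iteration right away
>>>>>     reqs = [comm.isend(payload, dest=d, tag=0) for d in destRanks]
>>>>>     MPI.Request.waitall(reqs)
>>>>>
>>>>> def sendStatesBarrier(destRanks, payload):
>>>>>     # earlier scheme: non-blocking sends followed by a global barrier,
>>>>>     # relying on the matching receives being posted before the barrier;
>>>>>     # no rank moves on until the slowest one reaches the barrier
>>>>>     for d in destRanks:
>>>>>         comm.isend(payload, dest=d, tag=0)
>>>>>     comm.Barrier()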
>>>>>
>>>>
>>>
>>>
>>
>


-- 
-- 
_______________
Bruno Chareyre
Associate Professor
ENSE³ - Grenoble INP
Lab. 3SR
BP 53
38041 Grenoble cedex 9
Tél : +33 4 56 52 86 21
________________

Email too brief?
Here's why: email charter
<https://marcuselliott.co.uk/wp-content/uploads/2017/04/emailCharter.jpg>
