
yade-mpi team mailing list archive

Re: deadlock fixed (?)

 

After https://gitlab.com/yade-dev/trunk/commit/1bad411b everything runs in
PyRunner for standard iterations.
It's not possible to merge-split with this method, since it would lead the
scene to erase itself.
I also made sending forces interactive and avoided one collision
detection, though with no clear improvement in speed...
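
For reference, a minimal sketch of the PyRunner idea (the helper names below
are hypothetical placeholders, not the actual functions in py/yade/mpy.py):

from yade import mpy as mp

O.engines = O.engines + [
    # hypothetical helpers standing in for the per-iteration MPI steps
    PyRunner(command='mp.sendForcesToMaster()', iterPeriod=1, label='mpiForces'),
    PyRunner(command='mp.recvPositionsFromMaster()', iterPeriod=1, label='mpiPositions'),
]
O.run(NSTEPS, True)  # standard iterations drive the communications, no explicit loop
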
Bruno


On Thu, 6 Jun 2019 at 15:30, Deepak Kn <deepak.kn1990@xxxxxxxxx> wrote:

> Hello,
>
>> I found a new problem I don't understand with "-ms". :(
>> It doesn't occur all the time; here is a way to reproduce it:
>> mpiexec --tag-output -n 3 ../../yade-mpi testMPI_2D_BUG_DK.py 50 50 -ms
>>
>
> Hi, I have fixed this now in commit f98281dc. The issue was that, during
> each split after a merge, the ranks of the Subdomains (defined in C++)
> were getting erased, since commit 68e358b9.
>
> On Thu, Jun 6, 2019 at 11:54 AM Bruno Chareyre <
> bruno.chareyre@xxxxxxxxxxxxxxx> wrote:
>
>> I found a new problem I don't understand with "-ms". :(
>> It doesn't occur all the time; here is a way to reproduce it:
>> mpiexec --tag-output -n 3 ../../yade-mpi testMPI_2D_BUG_DK.py 50 50 -ms
>>
>> The differences between the attached script and the one in trunk are:
>> loopOnSortedInteractions=True
>> MERGE_W_INTERACTIONS=True
>>
>> With the trunk version the problem does not occur.
>>
>> [1,1]<stderr>:Running script testMPI_2D_BUG_DK.py
>> [1,0]<stderr>:Running script testMPI_2D_BUG_DK.py
>> [1,2]<stderr>:Running script testMPI_2D_BUG_DK.py
>> [1,1]<stdout>:Worker1: triggers collider at iter 354
>> [1,2]<stdout>:Worker2: triggers collider at iter 354
>> [1,0]<stdout>:init Done in MASTER 0
>> [1,2]<stdout>:Worker2: triggers collider at iter 501
>> [1,1]<stdout>:Worker1: triggers collider at iter 501
>> [1,2]<stderr>:Traceback (most recent call last):
>> [1,1]<stderr>:Traceback (most recent call last):
>> [1,2]<stderr>:  File "../../yade-mpi", line 244, in runScript
>> [1,2]<stderr>:    execfile(script,globals())
>> [1,2]<stderr>:  File "testMPI_2D_BUG_DK.py", line 114, in <module>
>> [1,2]<stderr>:    mp.mpirun(NSTEPS)
>> [1,2]<stderr>:  File "/home/yade/lib/x86_64-linux-gnu/yade-mpi/py/yade/mpy.py", line 676, in mpirun
>> [1,2]<stderr>:    mergeScene()
>> [1,1]<stderr>:  File "../../yade-mpi", line 244, in runScript
>> [1,2]<stderr>:  File "/home/yade/lib/x86_64-linux-gnu/yade-mpi/py/yade/mpy.py", line 423, in mergeScene
>> [1,2]<stderr>:    O.subD.mergeOp()
>> [1,2]<stderr>:RuntimeError: vector::_M_default_append
>> [1,1]<stderr>:    execfile(script,globals())
>> [1,1]<stderr>:  File "testMPI_2D_BUG_DK.py", line 114, in <module>
>> [1,1]<stderr>:    mp.mpirun(NSTEPS)
>> [1,1]<stderr>:  File "/home/yade/lib/x86_64-linux-gnu/yade-mpi/py/yade/mpy.py", line 676, in mpirun
>> [1,1]<stderr>:    mergeScene()
>> [1,1]<stderr>:  File "/home/yade/lib/x86_64-linux-gnu/yade-mpi/py/yade/mpy.py", line 423, in mergeScene
>> [1,1]<stderr>:    O.subD.mergeOp()
>> [1,1]<stderr>:RuntimeError: vector::_M_default_append
>> --------------------------------------------------------------------------
>> mpiexec has exited due to process rank 2 with PID 19994 on node dt-medXXX
>> exiting improperly. There are three reasons this could occur:
>> 1. this process did not call "init" before exiting, but others in the job
>> did. This can cause a job to hang indefinitely while it waits for all
>> processes to call "init". By rule, if one process calls "init", then ALL
>> processes must call "init" prior to termination.
>>
>> On Thu, 6 Jun 2019 at 11:17, Bruno Chareyre <
>> bruno.chareyre@xxxxxxxxxxxxxxx> wrote:
>>
>>> Hi,
>>> @François, now I understand why there is no deadlock (point 1/), thanks.
>>> That was difficult for me to realize, Deepak helped. :)
>>>
>>> About checkCollider and global barriers: *we definitely want to avoid
>>> any barrier*.
>>> The reason is: there is already a kind of barrier(*) at each iteration,
>>> since master has to receive forces (before Newton) and send back wall
>>> positions (after Newton) (let's call this forces+Newton+positions
>>> sequence the "master sync").
>>> Between two master syncs all workers should run at max speed without
>>> waiting for any other global event.
>>> When we send positions at iteration N, we already know in each SD whether
>>> collision detection is needed at the beginning of iteration N+1. This can
>>> be communicated to master. Then there are at least two options:
>>> - master tells everyone at the next master sync. In that case global
>>> collision detection is delayed by one iteration and occurs at N+2. That
>>> delay is technically perfectly fine, since the SDs which really need
>>> immediate collision detection will do it spontaneously at N+1 regardless
>>> of global instructions. The downside of this approach is that if only one
>>> subdomain is colliding at N+1, this SD will be slower and the others will
>>> have to wait for it to finish before the next master sync. Collision
>>> detection then runs again at N+2, which would probably double the total
>>> cost of collision detection.
>>> - send a yes/no to master in the "positions" stage (for the moment nothing
>>> is sent to master in that step) and complete the master sync with an
>>> additional communication from master to workers (a rough sketch follows
>>> below).
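>>>
>>> For illustration, a minimal mpi4py-style sketch of that second option
>>> (the flag name and its value are placeholders, not the actual mpy.py API):
>>>
>>> from mpi4py import MPI
>>>
>>> comm = MPI.COMM_WORLD
>>> rank = comm.Get_rank()
>>>
>>> # placeholder: whatever the subdomain's collider check decides for N+1
>>> needCollide = (rank % 2 == 0)
>>>
>>> # piggy-back one yes/no per worker on the "positions" stage: master
>>> # gathers the flags, decides globally, and broadcasts the decision back
>>> flags = comm.gather(needCollide, root=0)
>>> globalCollide = comm.bcast(any(flags) if rank == 0 else None, root=0)
>>>
>>> if globalCollide:
>>>     pass  # every subdomain would then trigger its collider at N+1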
>>>
>>> Side question: what's the use of engine "waitForcesRunner"? I removed it
>>> and it works just as well.
>>>
>>> (*) It's only a partial barrier since some subdomains may not interact
>>> with master, but we can change that to force all domains to send at least a
>>> yes/no to master.
>>>
>>> Bruno
>>>
>>>
>>>
>>>
>>> On Tue, 4 Jun 2019 at 16:41, François <francois.kneib@xxxxxxxxx> wrote:
>>>
>>>>> Concerning the non-blocking MPI_ISend, using MPI_Wait was not
>>>>> necessary with the use of a basic global barrier. I'm afraid that looping
>>>>> on send requests and waiting for them to complete can slow down the
>>>>> communications, as you force (the send) order one more time (the receive
>>>>> order is already forced here
>>>>> <https://gitlab.com/yade-dev/trunk/blob/mpi/py/mpy.py#L641>).
>>>>>
>>>> ... but not using a global barrier allows the first threads that
>>>> finished their sends/recvs to start the next DEM iteration before the
>>>> others, +1 for your fix, so in the end I don't know what's better. Anyway
>>>> that's probably not meaningful compared to the interaction loop timings.
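>>>>
>>>> For reference, a small mpi4py-style sketch of the two variants being
>>>> compared (the payload and tag below are placeholders):
>>>>
>>>> from mpi4py import MPI
>>>>
>>>> comm = MPI.COMM_WORLD
>>>> rank, size = comm.Get_rank(), comm.Get_size()
>>>>
>>>> payload = {"rank": rank}  # placeholder for the serialized subdomain data
>>>>
>>>> # post non-blocking sends to every other rank, then blocking receives
>>>> reqs = [comm.isend(payload, dest=d, tag=11) for d in range(size) if d != rank]
>>>> data = [comm.recv(source=s, tag=11) for s in range(size) if s != rank]
>>>>
>>>> # variant 1: loop on the send requests and wait for each to complete
>>>> for r in reqs:
>>>>     r.wait()
>>>>
>>>> # variant 2 (the alternative discussed above): a single global barrier
>>>> # comm.Barrier()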
>>>>
>>>
>>
>

-- 
_______________
Bruno Chareyre
Associate Professor
ENSE³ - Grenoble INP
Lab. 3SR
BP 53
38041 Grenoble cedex 9
Tél : +33 4 56 52 86 21
________________

Email too brief?
Here's why: email charter
<https://marcuselliott.co.uk/wp-content/uploads/2017/04/emailCharter.jpg>
