yade-mpi team mailing list archive
Re: deadlock fixed (?)
@François, now I understand why there is no deadlock (point 1/), thanks.
That was difficult for me to realize; Deepak helped. :)
About checkCollider and global barriers: *we definitely want to avoid any additional global barrier*.
The reason is: there is already a kind of barrier(*) at each iteration, since master has to receive forces (before Newton) and send back wall positions (after Newton); let's call this sequence the "master sync".
Between two master syncs all workers should run at max speed without
waiting for another global event.
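To make it concrete, here is a minimal mpi4py sketch of what I mean by "master sync" (the names are placeholders, not the actual mpy.py functions):

    # Rough sketch of one "master sync" per iteration (placeholder names, not the real mpy.py code).
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    MASTER = 0

    def master_sync(local_forces, wall_positions=None):
        # master receives forces from all workers (before Newton)...
        all_forces = comm.gather(local_forces, root=MASTER)
        if comm.rank == MASTER:
            pass  # ...Newton integration of the master-owned bodies (the walls) would happen here...
        # ...then master sends the updated wall positions back to all workers (after Newton)
        return comm.bcast(wall_positions, root=MASTER)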
When we send positions at iteration N we know, in each subdomain (SD), if collision detection is needed at the beginning of iteration N+1. It can be communicated to master. There are then at least two options:
- master tells everyone at the next master sync. In that case global
collision detection is delayed by one iteration: it occurs at N+2. That
delay is technically perfectly fine since the SDs which really need
immediate collision detection will do it spontaneously at N+1 regardless of
global instructions. The downside of this approach is that if only one
subdomain is colliding at N+1, that SD will be slower and the others will
have to wait for it to finish before the next master sync. Then collision
detection happens again at N+2, which would probably double the total cost
of collision detection.
- send yes/no to master in the "positions" stage (for the moment nothing is
sent to master in that step) + complete the master sync with an additional
communication from master to workers (see the sketch after this list).
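For the second option, here is a rough sketch of what I have in mind (placeholder code, not the actual mpy.py implementation): each worker piggy-backs a boolean on the positions stage, and master answers with the global decision so that all subdomains run the collider at the same iteration.

    # Sketch of option 2: send a "collider needed?" flag to master during the positions stage,
    # and complete the master sync with one extra master->workers message (placeholder code).
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    MASTER = 0

    def exchange_collider_flag(local_needs_collide):
        flags = comm.gather(bool(local_needs_collide), root=MASTER)  # workers -> master
        decision = any(flags) if comm.rank == MASTER else None
        return comm.bcast(decision, root=MASTER)                     # master -> workers

The same exchange could also be done with a single allreduce on a logical OR, but that would introduce one more synchronization point outside the master sync, which is exactly what we want to avoid.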
Side question: what's the use of engine "waitForcesRunner"? I removed it
and it works just as well.
(*) It's only a partial barrier since some subdomains may not interact with
master, but we can change that to force all domains to send at least a
yes/no to master.
On Tue, 4 Jun 2019 at 16:41, François <francois.kneib@xxxxxxxxx> wrote:
> Concerning the non-blocking MPI_ISend: using MPI_Wait was not necessary
> with a basic global barrier. I'm afraid that looping on send requests and
> waiting for them to complete can slow down the communications, as you force
> the (send) order one more time (the receive order is already forced here
> <https://gitlab.com/yade-dev/trunk/blob/mpi/py/mpy.py#L641>).
> ... but not using a global barrier allows the first threads that finished
> their sends/recvs to start the next DEM iteration before the others, so +1
> for your fix; in the end I don't know what's better. Anyway, that's probably
> not significant compared to the interaction loop timings.
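Coming back to the send requests: one way to avoid forcing an order on the sends would be to wait on all the requests at once instead of one by one, so completion happens in whatever order the sends finish and no global barrier is involved. A simplified sketch (not the exact mpy.py code):

    # Post all non-blocking sends, then wait on the whole set at once.
    # Waitall completes the requests in whatever order they finish,
    # so the send order is not forced and no global barrier is needed.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    def isend_all_and_wait(buffers_and_dests):
        """buffers_and_dests: list of (buffer, destination_rank) pairs."""
        requests = [comm.Isend(buf, dest=dest) for buf, dest in buffers_and_dests]
        MPI.Request.Waitall(requests)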
ENSE³ - Grenoble INP
38041 Grenoble cedex 9
Tel: +33 4 56 52 86 21
Email too brief?
Here's why: email charter