yade-mpi team mailing list archive

Re: deadlock fixed (?)

To: yade-mpi@xxxxxxxxxxxxxxxxxxx
From: François <francois.kneib@xxxxxxxxx>
Date: Thu, 6 Jun 2019 14:01:53 +0200
In-reply-to: <CANFfKpFZfiWpW6EtdAShjE01dAK0Co=EPZjHbBOO3moTiFwidw@mail.gmail.com>

>
> About checkCollider and global barriers: *we definitely want to avoid any
> barrier*.
> The reason is: there is already a kind of barrier(*) at each iteration
> since master has to receive forces (before Newton), and send back wall
> positions (after Newton) (let's call "master sync" this sequence
> forces+Newton+positions).
> Between two master syncs all workers should run at max speed without
> waiting for another global event.
>
In my opinion, additional barriers add no overhead in our case as long as
we place them *immediately after* other barriers. But yes, definitely no
barriers during forceReseter->collider->interactionLoop. I'm okay with
deleting the barrier just after updateMirrorIntersections (body-copy),
but I'm pretty sure that the computing time is the same, because we have
just made a kind of barrier with the "master sync".
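
To illustrate the point, here is a minimal timing model (not Yade code). It assumes the master sync behaves like a full barrier, which is an idealization, and the per-worker durations are made-up numbers. Under those assumptions, a second barrier placed right after the first one adds no waiting time:

```python
def barrier(arrival_times, latency=0.0):
    """Model of a barrier: every process leaves at the latest arrival time."""
    release = max(arrival_times) + latency
    return [release] * len(arrival_times)


# Hypothetical per-worker durations of forceReseter->collider->interactionLoop.
compute = [1.0, 1.4, 0.9, 2.0]

# Master sync, idealized here as a full barrier (assumption).
after_master_sync = barrier(compute)
# An extra barrier placed immediately after the master sync.
after_extra_barrier = barrier(after_master_sync)

# Time each worker spends waiting at each of the two barriers.
wait_at_sync = [r - a for r, a in zip(after_master_sync, compute)]
wait_at_extra = [r - a for r, a in zip(after_extra_barrier, after_master_sync)]

print("waiting at master sync:  ", wait_at_sync)
print("waiting at extra barrier:", wait_at_extra)
```

In this model all the waiting happens at the master sync; the extra barrier only costs its own communication latency, which is set to zero here.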

> When we send positions at iteration N we know in each SD if collision
> detection is needed at the beginning of iteration N+1. It can be
> communicated to master. Then, at least two options:
> - master will tell everyone at the next master sync. In that case global
> collision detection would be delayed by one iteration, it will occur at
> N+2. That delay is technically perfectly fine since the SDs which really
> need immediate colliding will do it spontaneously at N+1 regardless of
> global instructions. The downside of this approach is that if only one
> subdomain is colliding at N+1, this SD will be slower and others will have
> to wait for it to finish for the next master sync. Then collision detection
> again at N+2, this would probably double the total cost of collision
> detection.
> - send yes/no to master in the "positions" stage (for the moment nothing
> is sent to master in that step) + complete master sync with an additional
> communication from master to workers.
>

I would say that the second option, which looks easier and faster, is the
best. But more than that: if we consider (as I said before) that successive
barriers aren't really slower than a single barrier, option 2 is
already what we do:
- interactionLoop for everyone -> processes may not be synchronized,
- master sync, after which all processes are (at least almost) synchronized,
- checkColliderActivated, where we send yes/no to everyone; it's a real
barrier, but all processes have just been synchronized by the master sync.
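
The difference between the two options can be sketched with a toy schedule (not Yade code). The function names and the flag list are hypothetical; the only assumption is that an iteration is slow for everyone as soon as one subdomain runs collision detection, since the others wait for it at the next master sync:

```python
def option1_delayed(needs_collide):
    """Master relays the request at the next master sync, one iteration late."""
    at_n1 = list(needs_collide)                        # spontaneous, local only
    at_n2 = [any(needs_collide)] * len(needs_collide)  # global, delayed
    return [at_n1, at_n2]


def option2_immediate(needs_collide):
    """Yes/no is gathered and sent back within the same master sync."""
    decision = any(needs_collide)
    at_n1 = [decision] * len(needs_collide)            # everyone collides together
    at_n2 = [False] * len(needs_collide)
    return [at_n1, at_n2]


def slow_iterations(schedule):
    """Count iterations in which at least one subdomain collides."""
    return sum(1 for iteration in schedule if any(iteration))


# Only subdomain 1 needs collision detection at iteration N+1.
flags = [False, True, False, False]
print("option 1:", slow_iterations(option1_delayed(flags)), "slow iterations")
print("option 2:", slow_iterations(option2_immediate(flags)), "slow iteration")
```

With one subdomain asking for collision detection, option 1 gives two slow iterations (N+1 and N+2) and option 2 gives one.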


> Side question: what's the use of engine "waitForcesRunner"? I removed it
> and it works just as well.
>

It seems not to be used anymore, as O.freqs is never filled. However, I have
a small problem with this line of code
<https://gitlab.com/yade-dev/trunk/blob/mpi/py/mpy.py#L172> where the master
receives the forces (receiveForces) from the workers. Here the master waits
to receive the forces from each worker in a predefined order, and here
<https://gitlab.com/yade-dev/trunk/blob/mpi/py/mpy.py#L402> the workers
wait their turn to send them. If worker #1 is slow, they will all wait
for it, while an asynchronous communication would allow the faster workers
to already do the Newton step (and, in the worst case, wait for worker #1 later).
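
As a rough illustration (a timing model, not the mpy.py code), assume each worker's send blocks until the master posts the matching receive, and that the master receives in rank order. The ready times below are made-up numbers. With non-blocking sends (for instance isend/irecv in mpi4py), each worker could instead move on as soon as its own forces are ready:

```python
from itertools import accumulate


def release_ordered(ready_times):
    """Blocking sends, master receives in rank order.

    Worker i is released only once the master has received from every
    worker before it, i.e. at the running maximum of the ready times.
    """
    return list(accumulate(ready_times, max))


def release_async(ready_times):
    """Non-blocking sends: each worker continues as soon as it is ready."""
    return list(ready_times)


# Hypothetical times at which each worker has its forces ready;
# the first worker (worker #1) is the slow one.
ready = [5.0, 1.0, 2.0, 1.5]

ordered = release_ordered(ready)
asynchronous = release_async(ready)
idle = [o - a for o, a in zip(ordered, asynchronous)]

print("ordered release:", ordered)
print("async release:  ", asynchronous)
print("idle time saved:", idle)
```

In this model the three fast workers each lose between 3 and 4 time units waiting behind worker #1, time they could spend on the Newton step if the sends were asynchronous.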


> (*) It's only a partial barrier since some subdomains may not interact
> with master, but we can change that to force all domains to send at least a
> yes/no to master.
>
> Bruno
>

François
