yade-mpi team mailing list archive
Re: deadlock fixed (?)
On Fri, 31 May 2019 at 20:22, Deepak Kn <deepak.kn1990@xxxxxxxxx> wrote:
> I think there is one more fix to be done (from commit 7dd44a4a in mpi) :
> The bodies are sent using the non-blocking MPI_Isend, which has to be
> completed with MPI_Wait; as of now the MPI_Waits are not called, and
> there is a minor memory issue to be fixed, which François is working on.
My previous claim was wrong. I did not fix anything, I got lucky 10 times in
a row without deadlock, that's it. :)
Now it's really fixed though. Mainly by : this also fixes a memory leak
which may affect the current master branch. The leak makes some virtual
interactions impossible to erase (erase b1, then b2->intrs and
O.interactions would keep that virtual interaction forever), which was
preventing correct insertion of new interactions.
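To make the leak concrete, here is a minimal toy sketch in Python. It is not Yade's code: the `Scene` class, its `bodies` and `interactions` dictionaries and both erase methods are hypothetical stand-ins for b->intrs and O.interactions, assumed only for illustration.

```python
# Toy model only: all names are hypothetical, not Yade's actual containers.
class Scene:
    def __init__(self):
        self.bodies = {}        # body id -> set of partner ids (stand-in for b->intrs)
        self.interactions = {}  # (low id, high id) -> data (stand-in for O.interactions)

    def add_body(self, bid):
        self.bodies[bid] = set()

    def add_interaction(self, i, j, virtual=False):
        self.interactions[(min(i, j), max(i, j))] = {"virtual": virtual}
        self.bodies[i].add(j)
        self.bodies[j].add(i)

    def erase_body_leaky(self, bid):
        # Leaky variant: the body goes away, the partner's bookkeeping stays.
        del self.bodies[bid]

    def erase_body(self, bid):
        # Symmetric variant: also clear the partner's record and the global map.
        for other in self.bodies.pop(bid):
            self.bodies[other].discard(bid)
            self.interactions.pop((min(bid, other), max(bid, other)), None)


def build():
    s = Scene()
    s.add_body(1)
    s.add_body(2)
    s.add_interaction(1, 2, virtual=True)
    return s


leaky = build()
leaky.erase_body_leaky(1)   # body 2 still lists body 1 as a partner
clean = build()
clean.erase_body(1)         # nothing refers to body 1 any more
```

In the leaky case the stale entry for (1, 2) survives in both containers, which is the kind of leftover that would get in the way when a new interaction is inserted later.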
There are a few additional changes which I think are useful even though no
bug has been identified yet (namely: do not try to insert interaction b1-b2
until both b1 and b2 are inserted), and some simplifications in the
Subdomain.cpp implementation. Let me know what you think (better check my
latest version, since some changes have been reverted after finding the main
problem).
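The rule "do not insert b1-b2 until both b1 and b2 are inserted" can be sketched as follows. This is only an illustration of the idea in Python, assuming a hypothetical `InteractionStore` with a pending set; it does not reflect how Subdomain.cpp actually does it.

```python
# Toy model only: InteractionStore and its methods are hypothetical names.
class InteractionStore:
    def __init__(self):
        self.bodies = set()        # ids of bodies already inserted
        self.interactions = set()  # (low id, high id) pairs actually inserted
        self.pending = set()       # pairs waiting for a missing body

    def request_interaction(self, i, j):
        key = (min(i, j), max(i, j))
        if i in self.bodies and j in self.bodies:
            self.interactions.add(key)
        else:
            self.pending.add(key)  # defer until both ends exist

    def insert_body(self, bid):
        self.bodies.add(bid)
        # Promote every pending pair whose two bodies are now present.
        ready = {k for k in self.pending
                 if k[0] in self.bodies and k[1] in self.bodies}
        self.interactions |= ready
        self.pending -= ready


store = InteractionStore()
store.insert_body(1)
store.request_interaction(1, 2)   # body 2 is missing, so the pair is deferred
deferred = set(store.pending)
store.insert_body(2)              # now the pair can be inserted
```

Whether deferring or simply dropping the request is preferable depends on who is responsible for re-creating the interaction afterwards; the sketch only shows the deferring option.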
Main questions I had as I was reading the code:
1/ @Francois, Isn't there an obvious deadlock at  since if there is
nothing to send we don't even send an empty list?
2/ any idea why gitlab pipeline fails at the cmake stage?
3/ checkcollider is the most expensive communication according to the
script output, is it real or artificial? If it's real we can easily combine
it with another comm to remove that barrier.
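Regarding question 1/, the suspected pattern can be reproduced without MPI. The sketch below is an assumption-laden analogy, not the actual communication code: a `queue.Queue` stands in for a point-to-point message, and a receive timeout stands in for a blocking receive that would otherwise never return. The function names are hypothetical.

```python
import queue
import threading


def sender(channel, items, always_send):
    # Skipping the send when there is nothing to send leaves the peer waiting.
    if items or always_send:
        channel.put(list(items))


def receiver(channel, timeout=0.2):
    # A real blocking receive would hang; the timeout makes the hang observable.
    try:
        return channel.get(timeout=timeout)
    except queue.Empty:
        return None  # stands in for "deadlocked"


def exchange(items, always_send):
    channel = queue.Queue()
    t = threading.Thread(target=sender, args=(channel, items, always_send))
    t.start()
    result = receiver(channel)
    t.join()
    return result


skipped = exchange([], always_send=False)   # receiver never gets a message
empty = exchange([], always_send=True)      # empty list still unblocks the peer
normal = exchange([3, 4], always_send=False)
```

If the receiving side always posts a receive, sending an empty list is the simplest way to keep the two sides matched; the alternative is to let the receiver know in advance that no message is coming.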
Main observation at the moment is that interactionLoop takes >90% of the CPU
time in both mpi and serial runs. Therefore the ideas of optimizing body
container size and collision detection upon body insertion (for the moment
we re-sort everything each time we insert a new body) are somehow
p.s. François, if you reply to yade-mpi list please make sure you are not
in quarantine, this time :)
> On Thu, May 30, 2019 at 5:58 PM Bruno Chareyre <
> bruno.chareyre@xxxxxxxxxxxxxxx> wrote:
>> I am still unable to explain fully why the deadlock occurred. I can tell
>> interactions between subdomains were removed by accident here
>> (and then mirror intersections were inconsistent) but I don't know why.
>> Bruno Chareyre
>> Associate Professor
>> ENSE³ - Grenoble INP
>> Lab. 3SR
>> BP 53
>> 38041 Grenoble cedex 9
>> Tél : +33 4 56 52 86 21
>> Email too brief?
>> Here's why: email charter
>> Mailing list: https://launchpad.net/~yade-mpi
>> Post to : yade-mpi@xxxxxxxxxxxxxxxxxxx
>> Unsubscribe : https://launchpad.net/~yade-mpi
>> More help : https://help.launchpad.net/ListHelp