yade-dev team mailing list archive

parallelization rentability

Hello,

I would like to start some discussion on parallel processing within yade. I
know there is an effort towards distributed-memory parallelism using
boost::mpi. I did a small back-of-the-envelope calculation regarding the
communication issues involved.

(BTW I recommend this great presentation, which prompted these reflections:
http://people.redhat.com/dnovillo/Papers/rhs2006.pdf; likewise
http://www.llnl.gov/computing/tutorials/parallel_comp/ )

A typical DEM simulation runs many iterations per second (say 100 or more?),
unlike, say, FEM, where there are far fewer iterations, each of them more
computation-intensive. Given that nodes must be synchronized in some way
after _each_ iteration (perhaps that could be optimized, but the ways to do
it are not obvious, at least not to me), the computation/communication ratio
is quite low. The communication will add, for each iteration:

1. a constant network latency for the roundtrip; I mean the sum of latencies
at all TCP/IP levels (application, transport, network, link, physical).
2. a time linear in the amount of data transmitted over the network.

I disregard 2. for now, since (a) it can be optimized much more easily by
some sort of compression, caching, etc. (perhaps at the expense of latency)
and (b) it will probably not be as high as 1.: if we consider 40 MB/sec (on a
switched Gbit network - seems realistic to me, provided the NICs are not on a
32-bit PCI bus), we get roughly 40 kB per 1 ms.
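
For instance (the per-iteration data volume here is purely illustrative):
sending 100 kB of updated state per iteration at 40 MB/s costs

  100 kB / 40000 kB/s = 2.5 ms,

i.e. noticeably less than the latency estimated below.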

Now, to estimate 1., this page (http://lqcd.fnal.gov/trends.html) reports an
MPI latency of 5ms on an InfiniBand network. Frankly, on a switched 1Gbit UTP
network we can be pretty confident not to get even that far. The Yade-level
overhead (assuming MPI itself is already included in the numbers above) will
probably be quite significant as well, since all the data will have to be
serialized and deserialized before/after transmission. Conclusion: lucky are
those who squeeze it under 10ms.
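
If someone wants to measure the real figure on our hardware, a minimal
boost::mpi ping-pong would do (just a sketch, not Yade code; run it with
mpirun -np 2):

  #include <boost/mpi.hpp>
  #include <iostream>

  int main(int argc, char** argv){
      boost::mpi::environment env(argc, argv);
      boost::mpi::communicator world;
      const int N=1000; int token=0;
      double t0=MPI_Wtime();
      for(int i=0; i<N; i++){
          // rank 0 sends and waits for the echo; rank 1 echoes back
          if(world.rank()==0){ world.send(1,0,token); world.recv(1,0,token); }
          else               { world.recv(0,0,token); world.send(0,0,token); }
      }
      if(world.rank()==0)
          std::cout<<"mean roundtrip: "<<(MPI_Wtime()-t0)/N*1000<<" ms\n";
      return 0;
  }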

Let us see what the rentability (break-even) condition for parallelization is:

If we have 100 iterations/sec on one node (10ms/iter), then supposing
perfect scalability of the computation and 2 nodes, we have 5ms/iter of
computation + 10ms of network latency = 15ms (slower!). Generally, if n is
the number of nodes, l1 and l2 are the fixed and linear latencies and t1 is
the per-iteration time on a single node (still with perfect scalability), we
get tn = (t1+l2)/n + l1.
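
As a toy illustration of the formula (plain C++, just for playing with the
numbers; not meant as Yade code):

  // per-iteration wall time on n nodes; t1 = single-node computation time,
  // l1 = fixed latency, l2 = linear transmission time, all in milliseconds
  double tn(double t1, double l1, double l2, int n){
      return (t1+l2)/n + l1;
  }
  // e.g. tn(10,10,0,2) == 15 ms, slower than the 10 ms we started from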

Since l1 is more or less fixed, we get good scalability for big t1 (huge
simulations, hence relatively few iterations per second) and/or when the
communication cost is dominated by l2, which at least gets divided by n.
Specifically, requiring t1 > tn (otherwise the simulation is faster on a
single node) leads to (1-1/n)*t1 > l1 + l2/n. Since (1-1/n) < 1, we are
better off with a single node whenever t1 ≈ l1, which is the case for 100
iter/sec and 10ms latency.
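
Plugging in the numbers from above (t1=10ms, l1=10ms, l2≈0, n=2): the
left-hand side is (1-1/2)*10 = 5 ms, the right-hand side is 10 ms, so the
condition fails and the single node wins, as noted before.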

Remember that we supposed perfect scalability; the real rentability
condition would be, for a scalability coefficient s < 1, tn' = tn/s.
Requiring t1 > tn' then means s*t1 > (t1+l2)/n + l1, i.e.
(1-1/(n*s))*t1 > l1/s + l2/(n*s), which is accordingly more restrictive.
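
For instance with s=0.8 (a figure picked purely for illustration), the 2-node
example above takes 15/0.8 ≈ 19 ms per iteration instead of 15 ms, pushing
the break-even point further away.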

QED.

---

Given the expected rise of multi-core processors, I would rather propose to
see whether we couldn't use OpenMP for (local) parallel execution with shared
memory - it is a standard that is well supported by gcc (but also icc), the
changes to the code are not as big, and there is no communication overhead.
We could benefit from results much earlier, inserting #pragmas here and
there. We would have to try whether OpenMP can parallelize iterator loops (it
should be able to, at least for random-access iterators).
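
Something along these lines (a minimal sketch with made-up class and member
names, not the actual Yade loop; note that gcc's OpenMP currently wants a
plain integer loop variable, which is why an index is used instead of an
iterator):

  #include <vector>

  // stand-in for a body, with just the fields the loop needs
  struct Body { double force, velocity, mass; };

  void integrateVelocities(std::vector<Body>& bodies, double dt){
      // iterations are independent (each touches only bodies[i]),
      // so OpenMP can run them concurrently
      #pragma omp parallel for
      for(int i=0; i<(int)bodies.size(); i++){
          bodies[i].velocity += bodies[i].force/bodies[i].mass*dt;
      }
  }
  // compile with g++ -fopenmp; without the flag the pragma is simply ignored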

I would also expect that once a project like openSSI or openMosix becomes
really functional, its global memory address space would permit even multiple
machines, with their synchronization handled automatically by the system,
perhaps even more efficiently (?).

And, BTW, stay tuned for post-4.2 gcc; its tree vectorization code is
steadily getting better.

Enjoy Christmas,

Vaclav
