yade-dev team mailing list archive


Re: multicore speed / threads issues

To: yade-dev@xxxxxxxxxxxxxxxxxxx
From: Janek Kozicki <janek_listy@xxxxx>
Date: Sun, 2 May 2010 14:34:09 +0200
In-reply-to: <1272665403.1375.62.camel@flux>

Václav Šmilauer said:     (by the date of Sat, 01 May 2010 00:10:03 +0200)

> (sorry, sending to the list again)
> 
> > I see, how about using this one, to avoid global locking:
> > 
> > http://www.chaoticmind.net/~hcb/projects/boost.atomic/
> > 
> If you talk about InteractionContainer::drawloopmutex, then
> 
> 1. you can always shut down the 3d view to get rid of it;

I did some benchmarks. By hand, unfortunately: I typed O.run(),
waited 60 seconds, then typed O.pause(). I'm sure you can script that
in python ;)
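
A minimal sketch of how that manual procedure could be scripted. It
assumes the simulation object exposes run(), pause() and an integer
iter attribute, as yade's O does, and that run() returns immediately
while the simulation continues in the background. The function name
count_iterations and the sleep parameter are made up for illustration.

```python
import time


def count_iterations(sim, seconds=60.0, sleep=time.sleep):
    """Run `sim` for `seconds` of wall-clock time; return iterations done.

    Assumption: `sim` has run(), pause() and an integer `iter`
    attribute (like yade's O), and run() does not block.
    """
    start_iter = sim.iter
    sim.run()        # start the simulation (assumed non-blocking)
    sleep(seconds)   # let it run for the requested wall-clock time
    sim.pause()      # stop it again
    return sim.iter - start_iter


# Inside a yade session one would call, for example:
#   print(count_iterations(O, seconds=60))
```

Passing sleep as a parameter only makes the function easy to try
without waiting a full minute; in a real run the default is enough.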

I restarted yade for each run with a different OMP_NUM_THREADS value
(e.g. OMP_NUM_THREADS=1), and htop confirmed that the expected number
of processors was indeed occupied. I used a slightly modified
funnel.py for this (it has some clumps added to it). It is attached;
I don't know if I can commit such a modified funnel.py.

Now, look at my weird results (60 seconds of simulation):

OMP_NUM_THREADS    3D on     3D off
1                  5375      5498   (1 core at 100%)
2                  6340      6582   (2 cores at 95%)
3                  6977      7040   (3 cores at 95%)
4                  6624      6939   (4 cores at 100%)
5                  6240      6970   (5 cores at 95%)
8                  1267      3491   (8 cores at 100%)
12                 6018      5806   (8 cores at 45%)

With 8 threads, it is actually slower than with 1 thread.
With 12 threads (using 8 cores) the total load drops, and it is
slightly faster than with 1 thread.
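
To make the comparison easier to read, the counts can be turned into
speedup factors relative to the single-thread run. This is only
arithmetic on the numbers in the table above; the function name
speedups is made up for illustration. With these numbers the best
result is at 3 threads (roughly 1.3x), and 8 threads falls below 1x.

```python
# Counts copied from the table above: threads -> (3D on, 3D off)
results = {
    1: (5375, 5498),
    2: (6340, 6582),
    3: (6977, 7040),
    4: (6624, 6939),
    5: (6240, 6970),
    8: (1267, 3491),
    12: (6018, 5806),
}


def speedups(results, baseline=1):
    """Return threads -> (speedup with 3D on, speedup with 3D off)."""
    base_on, base_off = results[baseline]
    return {n: (on / base_on, off / base_off)
            for n, (on, off) in results.items()}


for n, (s_on, s_off) in sorted(speedups(results).items()):
    print(f"{n:2d} threads: 3D on x{s_on:.2f}, 3D off x{s_off:.2f}")
```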

You said that with a small number of bodies the speedup is not
visible. I am confirming this ;) Later I'll see how it works with
thousands of bodies.

> 2. it is not related to atomicity, right? We're modifying container from
> one thread while another thread loops over it, which invalidates the
> iterator. (Performance is more important, please no hacks to fix this).
> 
> A good and easy optimization might be to remove the mutex altogether
> when compiling without OpenGL, to avoid uselessly checking its status
> when interaction is added/deleted.

Don't worry, I don't want hacks! I'm just pondering this. Probably
you are right that atomicity isn't the main part of that.
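
For the record, here is a toy Python analogue of the problem and of
what a mutex like drawloopmutex is for. It is not yade's C++ code:
the class GuardedInteractions and its methods are hypothetical. The
draw loop holds a lock while iterating, and every insert or erase
takes the same lock, so the container cannot change under the loop.

```python
import threading


class GuardedInteractions:
    """Toy container: one lock shared by the writer and the draw loop."""

    def __init__(self):
        self._items = {}
        self._lock = threading.Lock()

    def insert(self, key, value):
        with self._lock:
            self._items[key] = value

    def erase(self, key):
        with self._lock:
            self._items.pop(key, None)

    def draw(self, render):
        # Hold the lock for the whole loop so that no insert/erase can
        # change the container while the iteration is in progress.
        with self._lock:
            for key, value in self._items.items():
                render(key, value)
```

The price is that every insert and erase touches the lock even when
nothing is being drawn, which is why dropping the mutex in builds
without OpenGL looks like a cheap win.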

On the Boost mailing list someone recommended tbb::concurrent_vector,
and Intel's Threading Building Blocks library is in Debian. But adding
another library to our dependencies isn't something to be done
hastily. It is just something to keep in mind.

My thread about that is at the top of this page:
http://lists.boost.org/Archives/boost/2010/04/index.php

But I have forgotten a lot about concurrent programming. Writing
that ThreadRunner was the peak of my skills ;)

BTW, I see that getProgress() doesn't follow the naming convention.
Apparently that was me ;) So how is it now in the rest of the code?
Do we have getVal(); setVal(val) now?

-- 
Janek Kozicki                               http://janek.kozicki.pl/  |

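from math import pi, sin, cos
from numpy import arange, linspace
from yade import pack, utils
import itertools
# Vector3 and the engine classes are expected to come from the yade script environment
# generate the curved surface - a tube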
thetas=linspace(0,2*pi,num=16,endpoint=True)
meridians=pack.revolutionSurfaceMeridians([[(3+rad*sin(th),10*rad+rad*cos(th)) for th in thetas] for rad in linspace(1,2,num=10)],linspace(0,pi,num=10))
surf=pack.sweptPolylines2gtsSurface(meridians+[[Vector3(5*sin(-th),-10+5*cos(-th),30) for th in thetas]])
O.bodies.append(pack.gtsSurface2Facets(surf))

# generate the elements
luzne_kulki=pack.SpherePack()
kostka_1=pack.SpherePack()
kostka_2=pack.SpherePack()

# loose spheres
luzne_kulki.makeCloud(Vector3(-3,-9,30),Vector3(2,-13,32),.2,rRelFuzz=.3,num=340)
O.bodies.append([utils.sphere(c,r) for c,r in luzne_kulki])

# clumps
for xyz in itertools.product(arange(0,3),arange(0,3),arange(0,3)):
	ids_spheres=O.bodies.appendClumped(pack.regularHexa(pack.inEllipsoid((-2+xyz[0]*2.2,-12+xyz[1]*2.4,34+xyz[2]*2.6),(0.5,0.7,1.1)),radius=0.15,gap=0,color=[0.5,0.7,0.3]))

# floor
O.bodies.append([
	utils.facet([[-30,-30,9],[30,-30,9],[30,30,9]],dynamic=False,color=[1,0,0]),
	utils.facet([[-30,-30,9],[-30,30,9],[30,30,9]],dynamic=False,color=[1,0,0]),
])

# simulation loop
O.engines=[
	ForceResetter(), 
	BoundDispatcher([Bo1_Sphere_Aabb(),Bo1_Facet_Aabb()]),
	InsertionSortCollider(),
	InteractionDispatchers(
		[Ig2_Sphere_Sphere_Dem3DofGeom(),Ig2_Facet_Sphere_Dem3DofGeom()],
		[Ip2_FrictMat_FrictMat_FrictPhys()],
		[Law2_Dem3DofGeom_FrictPhys_Basic()]
	),
	GravityEngine(gravity=(0,0,-9.81)),
	NewtonIntegrator(),
	VTKRecorder(recorders=['spheres','facets','colors'],fileName='/tmp/p1',realPeriod=.5)
]
O.dt=utils.PWaveTimeStep()

# done!
