dolfin team mailing list archive

Re: [HG DOLFIN] merge

 

> On Mon, Aug 17, 2009 at 11:20:08PM +0200, Johan Hake wrote:
>> On Monday 17 August 2009 19:19:40 Anders Logg wrote:
>> > On Mon, Aug 17, 2009 at 07:09:11PM +0200, DOLFIN wrote:
>> > > changeset:   6762:ca407204632a1b0430099c243c915a151b2bd941
>> > > parent:      6759:efc24a341e41e9e0c83616be4613d819fe95ccb6
>> > > user:        Anders Logg <logg@xxxxxxxxx>
>> > > date:        Mon Aug 17 19:08:56 2009 +0200
>> > > files:       site-packages/dolfin/compile_function.py
>> > > site-packages/dolfin/jit.py
>> > > description:
>> > > Make JIT compiler work in parallel. The process number is added to the
>> > > signature to create a unique signature for each process. This means that
>> > > each process will compile its own form. This may not be optimal and could
>> > > possibly be handled by Instant. On the other hand, it seems to work nicely
>> > > and might also be advantageous when processes don't share a common cache.
>> >
>> > The Poisson Python demo now runs as is without the need for first
>> > running it in serial (to handle JIT compilation):
>>
>> Did it not work before this change? I know Martin added some file locks to
>> prevent simultaneous compilations of the same module.
>
> No, it didn't work before. I get things like
>
> In instant.build_module: Path
> '/home/logg/.instant/cache/form_f38430af401fbeddb9be4091a6fcde37cef9fa35'
> already exists, but module wasn't found in cache previously. Not
> overwriting, assuming this module is valid.
> Traceback (most recent call last):
>   File "demo.py", line 23, in <module>
>     V = FunctionSpace(mesh, "CG", 1)
>   File
>   "/home/logg/scratch/src/fenics-dev/dolfin-dev/local/lib/python2.6/site-packages/dolfin/functionspace.py",
>   line 181, in __init__
>     FunctionSpaceBase.__init__(self, mesh, element)
>   File
>   "/home/logg/scratch/src/fenics-dev/dolfin-dev/local/lib/python2.6/site-packages/dolfin/functionspace.py",
>   line 43, in __init__
>     ufc_element, ufc_dofmap = jit(self._element)
>   File
>   "/home/logg/scratch/src/fenics-dev/dolfin-dev/local/lib/python2.6/site-packages/dolfin/jit.py",
>   line 67, in jit
>     return jit_compile(form, options)
>   File
>   "/home/logg/scratch/lib/fenics-dev/lib/python2.6/site-packages/ffc/jit/jit.py",
>   line 56, in jit
>     return jit_element(object, options)
>   File
>   "/home/logg/scratch/lib/fenics-dev/lib/python2.6/site-packages/ffc/jit/jit.py",
>   line 125, in jit_element
>     (compiled_form, module, form_data) = jit_form(form, options)
>   File
>   "/home/logg/scratch/lib/fenics-dev/lib/python2.6/site-packages/ffc/jit/jit.py",
>   line 102, in jit_form
>     os.unlink(signature + ".h")
>   OSError: [Errno 2] No such file or directory:
>   'form_f38430af401fbeddb9be4091a6fcde37cef9fa35.h'
>
>
> I guess the second process tries to read the generated file but
> it's not ready yet (still being generated by the first process).
>
> It would be good to handle the parallel JIT compilation as part of
> Instant, but I don't know what the best solution is.


OK, I think the parallel caching mechanism works, e.g. on bigblue, when several
similar processes start on different machines: each machine has its own local
tmp files, and only the first one to finish compiling copies the result into
the actual cache. The problem here might be that the different processes mess
with the same tmp files. I'll ask Martin.
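
The mechanism I have in mind is roughly the following (just a sketch, not the
actual Instant code; the helper and its arguments are made up): each process
builds in its own temporary directory, and only the first one to finish moves
the result into the shared cache.

import os
import shutil
import tempfile

def build_into_cache(signature, cache_dir, compile_module):
    # compile_module(build_dir) is assumed to write the generated module
    # into build_dir.
    target = os.path.join(cache_dir, signature)
    if os.path.isdir(target):
        return target  # someone already finished; reuse their module
    # Private build directory per process, created inside the cache so the
    # rename below stays on one filesystem and is atomic.
    build_dir = tempfile.mkdtemp(prefix=signature + "_tmp", dir=cache_dir)
    compile_module(build_dir)
    try:
        os.rename(build_dir, target)
    except OSError:
        shutil.rmtree(build_dir)  # another process won the race; use its result
    return target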

Kent

>
>> >   mpirun -n 4 python demo.py
>>
>> Do I have to set some environment variables to make this work? I can't get
>> it to work (probably some stupid error) :P
>
> No, nothing. It should work out of the box.
>
>> Johan
>>
>> When running the above command I get:
>>
>> ssh: connect to host hake-laptop port 22: Connection refused
>
> Can you run other processes in parallel?
>
>   mpirun -n 4 ls
>
> Maybe you need to install sshd? I didn't know it was required.
>
> --
> Anders
>
>> --------------------------------------------------------------------------
>> A daemon (pid 32065) died unexpectedly with status 255 while attempting
>> to launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>>
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>> the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> mpirun: clean termination accomplished
>>
>>
>>
> _______________________________________________
> DOLFIN-dev mailing list
> DOLFIN-dev@xxxxxxxxxx
> http://www.fenics.org/mailman/listinfo/dolfin-dev
>



