
dolfin team mailing list archive

Re: [HG DOLFIN] merge

 

On Monday 17 August 2009 23:51:39 Anders Logg wrote:
> On Mon, Aug 17, 2009 at 11:20:08PM +0200, Johan Hake wrote:
> > On Monday 17 August 2009 19:19:40 Anders Logg wrote:
> > > On Mon, Aug 17, 2009 at 07:09:11PM +0200, DOLFIN wrote:
> > > > changeset:   6762:ca407204632a1b0430099c243c915a151b2bd941
> > > > parent:      6759:efc24a341e41e9e0c83616be4613d819fe95ccb6
> > > > user:        Anders Logg <logg@xxxxxxxxx>
> > > > date:        Mon Aug 17 19:08:56 2009 +0200
> > > > files:       site-packages/dolfin/compile_function.py
> > > > site-packages/dolfin/jit.py description:
> > > > Make JIT compiler work in parallel. The process number is added to
> > > > the signature to create a unique signature for each process. This
> > > > means that each process will compile its own form. This may not be
> > > > optimal and could possibly be handled by Instant. On the other hand,
> > > > it seems to work nicely and might also be advantageous when processes
> > > > don't share a common cache.
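(For illustration only: the rank-in-signature idea amounts to something like the
sketch below. The helper name and the use of mpi4py are assumptions made for the
example, not the actual DOLFIN code.)

    # Sketch: make the JIT signature unique per MPI process, so each process
    # compiles and caches its own copy of the generated module.
    from mpi4py import MPI

    def process_local_signature(base_signature):
        # Hypothetical helper: append the process number to the signature.
        rank = MPI.COMM_WORLD.Get_rank()
        return "%s_p%d" % (base_signature, rank)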
> > >
> > > The Poisson Python demo now runs as is without the need for first
> > > running it in serial (to handle JIT compilation):
> >
> > Did it not work before this change? I know Martin added some file locks
> > to prevent simultaneous compilations of the same module.
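(For reference, that kind of file lock around a shared cache usually looks
roughly like the sketch below; the function name is made up for illustration and
this is not Instant's actual locking code.)

    # Sketch: hold an exclusive file lock while compiling into a shared cache,
    # so only one process builds a given module at a time; the others block on
    # the lock and then find the finished module in the cache.
    import fcntl

    def compile_with_lock(lockfile, compile_module):
        with open(lockfile, "w") as lock:
            fcntl.flock(lock, fcntl.LOCK_EX)   # blocks until the lock is free
            try:
                return compile_module()
            finally:
                fcntl.flock(lock, fcntl.LOCK_UN)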
>
> No, it didn't work before. I get things like
>
> In instant.build_module: Path
> '/home/logg/.instant/cache/form_f38430af401fbeddb9be4091a6fcde37cef9fa35'
> already exists, but module wasn't found in cache previously. Not
> overwriting, assuming this module is valid.
> Traceback (most recent call last):
>   File "demo.py", line 23, in <module>
>     V = FunctionSpace(mesh, "CG", 1)
>   File "/home/logg/scratch/src/fenics-dev/dolfin-dev/local/lib/python2.6/site-packages/dolfin/functionspace.py", line 181, in __init__
>     FunctionSpaceBase.__init__(self, mesh, element)
>   File "/home/logg/scratch/src/fenics-dev/dolfin-dev/local/lib/python2.6/site-packages/dolfin/functionspace.py", line 43, in __init__
>     ufc_element, ufc_dofmap = jit(self._element)
>   File "/home/logg/scratch/src/fenics-dev/dolfin-dev/local/lib/python2.6/site-packages/dolfin/jit.py", line 67, in jit
>     return jit_compile(form, options)
>   File "/home/logg/scratch/lib/fenics-dev/lib/python2.6/site-packages/ffc/jit/jit.py", line 56, in jit
>     return jit_element(object, options)
>   File "/home/logg/scratch/lib/fenics-dev/lib/python2.6/site-packages/ffc/jit/jit.py", line 125, in jit_element
>     (compiled_form, module, form_data) = jit_form(form, options)
>   File "/home/logg/scratch/lib/fenics-dev/lib/python2.6/site-packages/ffc/jit/jit.py", line 102, in jit_form
>     os.unlink(signature + ".h")
> OSError: [Errno 2] No such file or directory: 'form_f38430af401fbeddb9be4091a6fcde37cef9fa35.h'

It looks like the error comes from unlinking the same file more than once (done 
in ffc/jit.py), and not in Instant. I will look at it.
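
One way to make that unlink harmless when the file has already been removed by 
another process would be something like the sketch below (illustrative only, not 
necessarily how ffc/jit.py should be fixed):

    # Sketch: ignore a missing file when cleaning up generated sources, so two
    # processes racing to remove the same header do not raise OSError.
    import errno, os

    def unlink_if_exists(path):
        try:
            os.unlink(path)
        except OSError as e:
            if e.errno != errno.ENOENT:   # re-raise anything but "file not found"
                raise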

> I guess the second process tries to read the generated file but
> it's not ready yet (still being generated by the first process).
>
> It would be good to handle the parallel JIT compilation as part of
> Instant, but I don't know what the best solution is.
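(One common pattern for this is to let a single process compile while the others
wait and then load the finished module from the cache, roughly as sketched below
with mpi4py; the helper arguments are hypothetical and this is not Instant's
interface.)

    # Sketch: compile on rank 0 only, then let the other ranks load the module
    # from the shared cache once it is complete.
    from mpi4py import MPI

    def jit_in_parallel(compile_module, load_from_cache):
        comm = MPI.COMM_WORLD
        if comm.Get_rank() == 0:
            module = compile_module()   # only rank 0 generates and compiles
        comm.Barrier()                  # everyone waits until the cache is ready
        if comm.Get_rank() != 0:
            module = load_from_cache()  # other ranks read the finished module
        return module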
>
> > >   mpirun -n 4 python demo.py
> >
> > Do I have to set some environment variables to make this work? I can't
> > get it to work (probably some stupid error) :P
>
> No, nothing. It should work out of the box.
>
> > Johan
> >
> > When running the above command I get:
> >
> > ssh: connect to host hake-laptop port 22: Connection refused
>
> Can you run other processes in parallel?
>
>   mpirun -n 4 ls
>
> Maybe you need to install sshd? I didn't know it was required.

Yes, that did the trick! The package is openssh-server in Ubuntu, btw, and I 
also had to add my own public ssh key to my authorized_keys file.

Johan

> --
> Anders
>
> > --------------------------------------------------------------------------
> > A daemon (pid 32065) died unexpectedly with status 255 while attempting
> > to launch so we are aborting.
> >
> > There may be more information reported by the environment (see above).
> >
> > This may be because the daemon was unable to find all the needed shared
> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have
> > the location of the shared libraries on the remote nodes and this will
> > automatically be forwarded to the remote nodes.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpirun noticed that the job aborted, but has no info as to the process
> > that caused that situation.
> > --------------------------------------------------------------------------
> > mpirun: clean termination accomplished

