
dolfin team mailing list archive

Re: Unit test problem in parallel

 

On Wednesday March 30 2011 10:15:01 Niclas Jansson wrote:
> Johan Hake <johan.hake@xxxxxxxxx> writes:
> > On Wednesday March 30 2011 10:02:28 Niclas Jansson wrote:
> >> Johan Hake <johan.hake@xxxxxxxxx> writes:
> >> > On Tuesday March 29 2011 23:27:27 Anders Logg wrote:
> >> >> On Tue, Mar 29, 2011 at 11:10:17PM -0700, Johan Hake wrote:
> >> >> > What triggers the error? Is it writing to or reading from the file?
> >> >> > Or is it the assignment of data from within the read function in
> >> >> > the test?
> >> >> > 
> >> >> > johan
> >> >> 
> >> >> It's the next line following the read:
> >> >>   std::string filename(p1["filename"]);
> >> > 
> >> > It took some time to find the parameter unit test ;)
> >> > 
> >> >> So something goes wrong for at least one of the processes when the
> >> >> parameters are read back from file. Here's what happens:
> >> >> 
> >> >> 1. All processes create parameter set p0
> >> >> 
> >> >> 2. Process 0 writes p0 to file
> >> >> 
> >> >> 3. Everyone waits (barrier)
> >> >> 
> >> >> 4. All processes read from the file into p1
> >> >> 
> >> >> 5. All processes access parameters from p1 and compare to p0
> >> > 
> >> > I guess it is step 4 that goes wrong. I have tried to google variations
> >> > of "open shared file fstream". It looks like others have had the same
> >> > problem.
> >> > 
> >> > Johan
> >> 
> >> I don't think a barrier is enough. It doesn't guarantee that
> >> the data has been flushed to disk.
> > 
> > But how could it work for some processes and not for others? Doesn't this
> > indicate that the file is properly created?
> > 
> > Johan
> 
> True...
> 
> But say that the file is flushed some time after the barrier. Maybe
> that is long enough for some of the processes to reach the "File f1"
> statement before the file is flushed. The others arrive a bit later
> and get a valid file pointer.

OK, that makes sense.

A flush at the end of each << call might not hurt anyway.
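
To make the flush idea concrete, here is a minimal sketch using plain std::ofstream rather than the DOLFIN File class. The helper names write_flushed and read_all are hypothetical and are not part of DOLFIN. Note also that flush() only hands the data to the operating system; on a shared or network file system other processes may still see it late, so this is a sketch of the idea and not a guaranteed fix.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of DOLFIN): write `contents` to `path`,
// then flush and close the stream before returning to the caller.
void write_flushed(const std::string& path, const std::string& contents)
{
  std::ofstream out(path.c_str(), std::ios::out | std::ios::trunc);
  if (!out)
    throw std::runtime_error("Unable to open " + path + " for writing");

  out << contents;
  out.flush();  // push buffered data to the operating system
  out.close();  // release the file handle

  // flush() and close() set failbit/badbit on error, so check afterwards
  if (out.fail())
    throw std::runtime_error("Write to " + path + " failed");
}

// Hypothetical helper: read the whole file back into a string.
std::string read_all(const std::string& path)
{
  std::ifstream in(path.c_str());
  if (!in)
    throw std::runtime_error("Unable to open " + path + " for reading");

  std::ostringstream buffer;
  buffer << in.rdbuf();
  return buffer.str();
}
```

In the unit test, the writing process would call something like write_flushed before the barrier, and every process would read after it. Whether the existing << operator already flushes and closes the file is not something this thread settles.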

Johan

> Niclas
> 
> >> An option is of course to use MPI I/O, but that would lead to a
> >> painful rewrite of most I/O routines...
> >> 
> >> Niclas
> >> 
> >> >> --
> >> >> Anders
> >> >> 
> >> >> > On Tuesday March 29 2011 22:53:01 Anders Logg wrote:
> >> >> > > The parameter unit test is sometimes failing in parallel. On my
> >> >> > > local machine it always seems to work with 2 or 3 processes, but
> >> >> > > sometimes it fails with 4, giving the same error message as the
> >> >> > > buildbot:
> >> >> > > 
> >> >> > > ##Failure Location unknown## : Error
> >> >> > > Test name: InputOutput::test_simple
> >> >> > > uncaught exception of type St13runtime_error
> >> >> > > - *** Error: Unable to access parameter "filename" in parameter
> >> >> > > set "test", parameter not defined.
> >> >> > > 
> >> >> > > Failures !!!
> >> >> > > Run: 2   Failure total: 1   Failures: 0   Errors: 1
> >> >> > > 
> >> >> > > There is a check for which process writes to file and a barrier
> >> >> > > that should make sure everyone waits until the file gets
> >> >> > > written.
> >> >> > > 
> >> >> > >   // Save to file
> >> >> > >   if (dolfin::MPI::process_number() == 0)
> >> >> > >   {
> >> >> > >     File f0("test_parameters.xml");
> >> >> > >     f0 << p0;
> >> >> > >   }
> >> >> > >   dolfin::MPI::barrier();
> >> >> > >   
> >> >> > >   // Read from file
> >> >> > >   Parameters p1;
> >> >> > >   File f1("test_parameters.xml");
> >> >> > >   f1 >> p1;
> >> >> > > 
> >> >> > > I thought that should do the trick, but apparently not.
> >> >> > > 
> >> >> > > Any ideas what goes wrong?
> >> > 
> >> > _______________________________________________
> >> > Mailing list: https://launchpad.net/~dolfin
> >> > Post to     : dolfin@xxxxxxxxxxxxxxxxxxx
> >> > Unsubscribe : https://launchpad.net/~dolfin
> >> > More help   : https://help.launchpad.net/ListHelp
> 


