dolfin team mailing list archive

Re: Unit test problem in parallel

 

Johan Hake <johan.hake@xxxxxxxxx> writes:

> On Wednesday March 30 2011 10:02:28 Niclas Jansson wrote:
>> Johan Hake <johan.hake@xxxxxxxxx> writes:
>> > On Tuesday March 29 2011 23:27:27 Anders Logg wrote:
>> >> On Tue, Mar 29, 2011 at 11:10:17PM -0700, Johan Hake wrote:
>> >> > What triggers the error? Is it the writing to and/or reading from the
>> >> > file? Is it the assignment of data within the read function in the test?
>> >> > 
>> >> > johan
>> >> 
>> >> It's the next line following the read:
>> >>   std::string filename(p1["filename"]);
>> > 
>> > It took some time to find the parameter unit test ;)
>> > 
>> >> So something goes wrong for at least one of the processes when the
>> >> parameters are read back from file. Here's what happens:
>> >> 
>> >> 1. All processes create parameter set p0
>> >> 
>> >> 2. Process 0 writes p0 to file
>> >> 
>> >> 3. Everyone waits (barrier)
>> >> 
>> >> 4. All processes read from the file into p1
>> >> 
>> >> 5. All processes access parameters from p1 and compare to p0
>> > 
>> > I guess it is 4 that goes wrong. I have tried to google varieties of
>> > "open shared file fstream". It looks like others have had the same
>> > problem.
>> > 
>> > Johan
>> 
>> I don't think a barrier is enough. It doesn't guarantee that
>> the data has been flushed to disk.
>
> But how could it work for some processes and not for others? Doesn't this
> indicate that the file is properly created?
>
> Johan
>

True...

But say the file is flushed some time after the barrier. That delay may
be long enough for some of the processes to reach the "File f1"
statement before the flush has completed. The others arrive a bit later
and get a valid file pointer.
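
A rough sketch of how one could guard against this without touching
MPI I/O: the writer closes its stream explicitly before the barrier,
and each reader retries until the file looks complete. This is plain
C++ with no DOLFIN calls. The helper names write_and_close and
read_with_retry, the end marker, and the retry counts are all made up
for illustration, and I have not checked whether this is what the File
class needs. Note that close() only hands the data to the OS; on a
network file system other nodes may still see stale contents for a
while, which is what the retry loop is meant to cover.

```cpp
#include <cassert>
#include <chrono>
#include <fstream>
#include <sstream>
#include <string>
#include <thread>

// Hypothetical helper: write text and close the stream explicitly, so the
// C++ library buffer is handed to the OS before the caller hits the barrier.
bool write_and_close(const std::string& path, const std::string& text)
{
  std::ofstream out(path.c_str(), std::ios::out | std::ios::trunc);
  if (!out)
    return false;
  out << text;
  out.flush();
  out.close();
  return !out.fail();
}

// Hypothetical helper: re-open and re-read the file until it contains
// end_marker (assumed to be the closing tag of a complete file), waiting
// wait_ms milliseconds between attempts. Returns false if it never appears.
bool read_with_retry(const std::string& path, const std::string& end_marker,
                     std::string& contents, int max_attempts, int wait_ms)
{
  assert(max_attempts > 0);
  for (int attempt = 0; attempt < max_attempts; ++attempt)
  {
    std::ifstream in(path.c_str());
    if (in)
    {
      std::ostringstream buffer;
      buffer << in.rdbuf();
      contents = buffer.str();
      if (contents.find(end_marker) != std::string::npos)
        return true;
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(wait_ms));
  }
  return false;
}
```

In the test, process 0 would call something like write_and_close
before the barrier, and every process would call read_with_retry
after it, failing loudly if it returns false instead of going on to
parse an empty or partial file.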

Niclas

>> An option is of course to use MPI I/O, but that would lead to a
>> painful rewrite of most I/O routines...
>> 
>> Niclas
>> 
>> >> --
>> >> Anders
>> >> 
>> >> > On Tuesday March 29 2011 22:53:01 Anders Logg wrote:
>> >> > > The parameter unit test is sometimes failing in parallel. On my
>> >> > > local machine it always seems to work with 2 or 3 processes, but
>> >> > > sometimes it fails with 4, giving the same error message as the
>> >> > > buildbot:
>> >> > > 
>> >> > > ##Failure Location unknown## : Error
>> >> > > Test name: InputOutput::test_simple
>> >> > > uncaught exception of type St13runtime_error
>> >> > > - *** Error: Unable to access parameter "filename" in parameter set
>> >> > > "test", parameter not defined.
>> >> > > 
>> >> > > Failures !!!
>> >> > > Run: 2   Failure total: 1   Failures: 0   Errors: 1
>> >> > > 
>> >> > > There is a check for which process writes to file and a barrier that
>> >> > > should make sure everyone waits until the file gets written.
>> >> > > 
>> >> > >   // Save to file
>> >> > >   if (dolfin::MPI::process_number() == 0)
>> >> > >   {
>> >> > >     File f0("test_parameters.xml");
>> >> > >     f0 << p0;
>> >> > >   }
>> >> > >   dolfin::MPI::barrier();
>> >> > >   
>> >> > >   // Read from file
>> >> > >   Parameters p1;
>> >> > >   File f1("test_parameters.xml");
>> >> > >   f1 >> p1;
>> >> > > 
>> >> > > I thought that should do the trick, but apparently not.
>> >> > > 
>> >> > > Any ideas what goes wrong?
>> > 


