
Re: [Branch ~dolfin-core/dolfin/main] Rev 4896: Add simple Stokes solver for parallel testing.

 

On Mon, 2010-08-09 at 13:31 +0200, Anders Logg wrote:
> On Fri, Aug 06, 2010 at 08:26:38PM +0100, Garth N. Wells wrote:
> > On Fri, 2010-08-06 at 21:19 +0200, Anders Logg wrote:
> > > On Fri, Aug 06, 2010 at 08:13:13PM +0100, Garth N. Wells wrote:
> > > > On Fri, 2010-08-06 at 21:06 +0200, Anders Logg wrote:
> > > > > On Fri, Aug 06, 2010 at 07:55:54PM +0100, Garth N. Wells wrote:
> > > > > > On Fri, 2010-08-06 at 20:53 +0200, Anders Logg wrote:
> > > > > > > On Fri, Aug 06, 2010 at 07:51:18PM +0100, Garth N. Wells wrote:
> > > > > > > > On Fri, 2010-08-06 at 20:36 +0200, Anders Logg wrote:
> > > > > > > > > On Fri, Aug 06, 2010 at 04:55:44PM +0100, Garth N. Wells wrote:
> > > > > > > > > > On Fri, 2010-08-06 at 08:42 -0700, Johan Hake wrote:
> > > > > > > > > > > On Friday August 6 2010 08:16:26 you wrote:
> > > > > > > > > > > > ------------------------------------------------------------
> > > > > > > > > > > > revno: 4896
> > > > > > > > > > > > committer: Garth N. Wells <gnw20@xxxxxxxxx>
> > > > > > > > > > > > branch nick: dolfin-all
> > > > > > > > > > > > timestamp: Fri 2010-08-06 16:13:29 +0100
> > > > > > > > > > > > message:
> > > > > > > > > > > >   Add simple Stokes solver for parallel testing.
> > > > > > > > > > > >
> > > > > > > > > > > >   Other Stokes demos don't run in parallel because MeshFunction io is not
> > > > > > > > > > > >   supported in parallel.
> > > > > > > > > > >
> > > > > > > > > > > Does anyone have an overview of what is needed for this to be fixed? I
> > > > > > > > > > > couldn't find a blueprint on it.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Here it is:
> > > > > > > > > >
> > > > > > > > > >     https://blueprints.launchpad.net/dolfin/+spec/parallel-io
> > > > > > > > > >
> > > > > > > > > > > I am interested in getting this fixed :)
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Me too! We need to look at all the io since much of it is broken in
> > > > > > > > > > parallel.
> > > > > > > > > >
> > > > > > > > > > We need to settle on how to handle XML data. I favour (and I know Niclas
> > > > > > > > > > Jansson does too) the VTK approach in which we have a 'master file' that
> > > > > > > > > > points to other XML files which contain portions of the vector/mesh,
> > > > > > > > > > etc. Process zero can read the 'master file' and then instruct the other
> > > > > > > > > > processes on which file(s) they should read in.
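For concreteness, such a master file might look something like the
following sketch (the element and attribute names are hypothetical, in
the spirit of VTK's .pvtu files):

    <mesh_master num_partitions="4">
      <partition process="0" file="mesh_p0.xml"/>
      <partition process="1" file="mesh_p1.xml"/>
      <partition process="2" file="mesh_p2.xml"/>
      <partition process="3" file="mesh_p3.xml"/>
    </mesh_master>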
> > > > > > > > >
> > > > > > > > > This only works if the data is already partitioned. Most of our demos
> > > > > > > > > assume that we have the mesh in one single file which is then
> > > > > > > > > partitioned on the fly.
> > > > > > > > >
> > > > > > > >
> > > > > > > > The approach does work for data which is not partitioned. Just like with
> > > > > > > > VTK, one can read the 'master file' or the individual files.
> > > > > > > >
> > > > > > > > > The initial plan was to support two different ways of reading data in parallel:
> > > > > > > > >
> > > > > > > > > 1. One file and automatic partitioning
> > > > > > > > >
> > > > > > > > > DOLFIN gets one file "mesh.xml", each process reads one part of it (just
> > > > > > > > > skipping other parts of the file), then the mesh is partitioned and
> > > > > > > > > redistributed.
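A minimal sketch of the bookkeeping this needs, assuming the total cell
count is known from the file header (the chunking scheme itself is an
assumption, not existing DOLFIN code):

    // Contiguous chunk [first, last) of cells read by process p of P;
    // cells outside the chunk are skipped while parsing.
    std::pair<std::size_t, std::size_t>
    local_cell_range(std::size_t p, std::size_t P, std::size_t num_cells)
    {
      const std::size_t n = num_cells / P;
      const std::size_t r = num_cells % P; // remainder goes to the first r processes
      const std::size_t first = p*n + std::min(p, r);
      return std::make_pair(first, first + n + (p < r ? 1 : 0));
    }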
> > > > > > > > >
> > > > > > > > > 2. Several files and no partitioning
> > > > > > > > >
> > > > > > > > > DOLFIN gets multiple files and each process reads one part. In this
> > > > > > > > > case, the mesh and all associated data is already partitioned. This
> > > > > > > > > should be very easy to fix since everything that is needed is already
> > > > > > > > > in place; we just need to fix the logic. In particular, the data
> > > > > > > > > section of each local mesh contains all auxiliary parallel data.
> > > > > > > > >
> > > > > > > > > This can be handled in two different ways. Either a user specifies the
> > > > > > > > > name of the file as "mesh*.xml", in which case DOLFIN appends say
> > > > > > > > >
> > > > > > > > >   "_%d" % MPI::process_number()
> > > > > > > > >
> > > > > > > > > on each local process.
> > > > > > > > >
> > > > > > > > > The other way is to have a master file which lists all the other
> > > > > > > > > files. In this case, I don't see a need for process 0 to take any kind
> > > > > > > > > of responsibility for communicating file names. It would work fine for
> > > > > > > > > each process to read the master file and then check which file it
> > > > > > > > > should use. Each process could also check that the total number of
> > > > > > > > > processes matches the number of partitions in the file. We could let
> > > > > > > > > process 0 handle the parsing of the master file and then communicate
> > > > > > > > > the file names but maybe that is an extra complication.
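Each process checking for itself could be as little as the following
sketch (parse_master_file is a hypothetical helper that returns the file
names listed in the master file):

    const std::vector<std::string> files = parse_master_file("mesh_master.xml");
    if (files.size() != MPI::num_processes())
      error("Master file lists %d partitions but there are %d processes.",
            (int) files.size(), (int) MPI::num_processes());
    const std::string local_file = files[MPI::process_number()];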
> > > > > > > > >
> > > > > > > >
> > > > > > > > This fails when the number of files differs from the number of
> > > > > > > > processes. It's very important to support m files on n processes. We've
> > > > > > > > discussed this at length before.
> > > > > > >
> > > > > > > I don't remember. Can you remind me of what the reasons are?
> > > > > > >
> > > > > >
> > > > > > I perform a simulation using m processes, and write the result to m
> > > > > > files. Later I want to use the result in another computation using
> > > > > > n processors.
> > > > >
> > > > > I assume you did your first simulation (with m processors) starting
> > > > > from one big file?
> > > > >
> > > >
> > > > What do you mean? The first simulation might not read in any file.
> > >
> > > I assumed you had an input for your mesh with some nontrivial
> > > geometry. The only other possibilities I see are that you use one
> > > of the built-in meshes or that you have some custom program that
> > > generates your mesh. Or is it that you have done mesh
> > > refinement?
> > >
> > > > > Can't you just restart from that file when you later want to run with
> > > > > n processors? It would not be much extra work, and maybe it would even
> > > > > be faster considering all the extra communication.
> > > > >
> > > >
> > > > The communication cost is negligible when reading a vector just once and
> > > > distributing.
> > > >
> > > > It will be very easy to read m files on n processes, so I don't see
> > > > why we would want to prevent it.
> > >
> > > Yes, it would be easy but we would need to redistribute it before the
> > > call to ParMETIS.
> > >
> > > Another "problem" is if m > n. Then we would have to decide which
> > > processes read multiple files and how many.
> > >
> > > > Also, we can't rely on the data that we read in being suitably
> > > > partitioned.
> > >
> > > In my thinking, that is one of the main points of being able to read
> > > in multiple files, that you have already done the work of the
> > > partitioning and have a ready-made partition and just want to run the
> > > simulation again, or perhaps another simulation on the same mesh. In
> > > other words, to allow computing in parallel without needing to go
> > > through the partitioning step.
> > >
> >
> > Partitioning is not the most important issue. The important issue is
> > scalability - if I use 10^3 processes, I can't be gathering all data on
> > process 0 for io, at possibly many steps in a simulation. We should be
> > looking at this as a question of io.
> >
> > On a shared cluster one usually has time limits and will need restarts,
> > but may not always have access to the same number of processors.
> >
> > Garth
> 
> ok, good point. I agree we need to support n != m.
> 
> A few questions arise:
> 
> 1. Should we simply treat it as we do when reading one mesh? That is,
> first each process reads some data (from one or multiple files), then
> the data is partitioned with ParMETIS and redistributed.
> 

Yes, plus an option of reading in a partition file that prescribes the
partition (but I'd stick with a one-file approach).
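Something along these lines, say (a sketch only; the file names are
assumptions, and the convention is that the value attached to each cell
is the process that should own it):

    Mesh mesh("mesh.xml");
    MeshFunction<unsigned int> partition(mesh, "partition.xml");
    // Hand 'partition' to the distribution step in place of calling
    // ParMETIS.

This of course presumes that MeshFunction io itself works in parallel,
which is the very thing being fixed.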

> We might need an extra redistribution step before calling ParMETIS if
> it is not possible to organize the data directly when reading so that
> each process gets roughly the same amount of data.
> 
> 2. Should we have a special case for m = n where ParMETIS is not
> called so that one can reuse an old partitioning when doing a restart?
> 

Best to read in a file that defines the partition. 

It would be simplest to stick with a one-file approach. The XML format
is likely to be unsuitable for very large data, so we can eventually
implement (parallel) HDF5 output of a mesh.

Parallel HDF5 should take care of the case in which a file is too big to
be opened on a single process.
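The write side would look roughly like the following sketch (plain HDF5
C API, collective hyperslab output of the vertex coordinates; the
dataset name, layout and the num_*/offset variables are assumptions):

    // Open one file shared by all processes
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("mesh.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);

    // One global dataset; each process writes only its own slice
    hsize_t dims[2] = {num_global_vertices, 3};
    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate(file, "/mesh/coordinates", H5T_NATIVE_DOUBLE,
                           filespace, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t offset[2] = {local_vertex_offset, 0};
    hsize_t count[2]  = {num_local_vertices, 3};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl,
             local_coordinates);

No process ever holds the whole dataset in memory, which covers exactly
the too-big-for-one-process case.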

Garth

> --
> Anders
