dolfin team mailing list archive

Re: [Branch ~dolfin-core/dolfin/main] Rev 4896: Add simple Stokes solver for parallel testing.

To: "Garth N. Wells" <gnw20@xxxxxxxxx>
From: Anders Logg <logg@xxxxxxxxx>
Date: Mon, 9 Aug 2010 13:31:11 +0200
Cc: dolfin@xxxxxxxxxxxxxxxxxxx
In-reply-to: <1281122798.1759.24.camel@garth-laptop>
User-agent: Mutt/1.5.20 (2009-06-14)

On Fri, Aug 06, 2010 at 08:26:38PM +0100, Garth N. Wells wrote:
> On Fri, 2010-08-06 at 21:19 +0200, Anders Logg wrote:
> > On Fri, Aug 06, 2010 at 08:13:13PM +0100, Garth N. Wells wrote:
> > > On Fri, 2010-08-06 at 21:06 +0200, Anders Logg wrote:
> > > > On Fri, Aug 06, 2010 at 07:55:54PM +0100, Garth N. Wells wrote:
> > > > > On Fri, 2010-08-06 at 20:53 +0200, Anders Logg wrote:
> > > > > > On Fri, Aug 06, 2010 at 07:51:18PM +0100, Garth N. Wells wrote:
> > > > > > > On Fri, 2010-08-06 at 20:36 +0200, Anders Logg wrote:
> > > > > > > > On Fri, Aug 06, 2010 at 04:55:44PM +0100, Garth N. Wells wrote:
> > > > > > > > > On Fri, 2010-08-06 at 08:42 -0700, Johan Hake wrote:
> > > > > > > > > > On Friday August 6 2010 08:16:26 you wrote:
> > > > > > > > > > > ------------------------------------------------------------
> > > > > > > > > > > revno: 4896
> > > > > > > > > > > committer: Garth N. Wells <gnw20@xxxxxxxxx>
> > > > > > > > > > > branch nick: dolfin-all
> > > > > > > > > > > timestamp: Fri 2010-08-06 16:13:29 +0100
> > > > > > > > > > > message:
> > > > > > > > > > >   Add simple Stokes solver for parallel testing.
> > > > > > > > > > >
> > > > > > > > > > >   Other Stokes demos don't run in parallel because MeshFunction io is not
> > > > > > > > > > >   supported in parallel.
> > > > > > > > > >
> > > > > > > > > > Does anyone have an overview of what is needed for this to be fixed? I
> > > > > > > > > > couldn't find a blueprint on it.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Here it is:
> > > > > > > > >
> > > > > > > > >     https://blueprints.launchpad.net/dolfin/+spec/parallel-io
> > > > > > > > >
> > > > > > > > > > I am interested in getting this fixed :)
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Me too! We need to look at all the io since much of it is broken in
> > > > > > > > > parallel.
> > > > > > > > >
> > > > > > > > > We need to settle on how to handle XML data. I favour (and I know Niclas
> > > > > > > > > Janson does too) the VTK approach in which we have a 'master file' that
> > > > > > > > > points to other XML files which contain portions of the vector/mesh,
> > > > > > > > > etc. Process zero can read the 'master file' and then instruct the other
> > > > > > > > > processes on which file(s) they should read in.
> > > > > > > >
> > > > > > > > This only works if the data is already partitioned. Most of our demos
> > > > > > > > assume that we have the mesh in one single file which is then
> > > > > > > > partitioned on the fly.
> > > > > > > >
> > > > > > >
> > > > > > > The approach does work for data which is not partitioned. Just like with
> > > > > > > VTK, one can read the 'master file' or the individual files.
> > > > > > >
> > > > > > > > The initial plan was to support two different ways of reading data in parallel:
> > > > > > > >
> > > > > > > > 1. One file and automatic partitioning
> > > > > > > >
> > > > > > > > DOLFIN gets one file "mesh.xml", each process reads one part of it (just
> > > > > > > > skipping other parts of the file), then the mesh is partitioned and
> > > > > > > > redistributed.
> > > > > > > >
> > > > > > > > 2. Several files and no partitioning
> > > > > > > >
> > > > > > > > DOLFIN gets multiple files and each process reads one part. In this
> > > > > > > > case, the mesh and all associated data is already partitioned. This
> > > > > > > > should be very easy to fix since everything that is needed is already
> > > > > > > > in place; we just need to fix the logic. In particular, the data
> > > > > > > > section of each local mesh contains all auxiliary parallel data.
> > > > > > > >
> > > > > > > > This can be handled in two different ways. Either a user specifies the
> > > > > > > > name of the file as "mesh*.xml", in which case DOLFIN appends say
> > > > > > > >
> > > > > > > >   "_%d" % MPI::process_number()
> > > > > > > >
> > > > > > > > on each local process.
> > > > > > > >
> > > > > > > > The other way is to have a master file which lists all the other
> > > > > > > > files. In this case, I don't see a need for process 0 to take any kind
> > > > > > > > of responsibility for communicating file names. It would work fine for
> > > > > > > > each process to read the master file and then check which file it
> > > > > > > > should use. Each process could also check that the total number of
> > > > > > > > processes matches the number of partitions in the file. We could let
> > > > > > > > process 0 handle the parsing of the master file and then communicate
> > > > > > > > the file names but maybe that is an extra complication.
> > > > > > > >
> > > > > > >
> > > > > > > This fails when the number of files differs from the number of
> > > > > > > processes. It's very important to support m files on n processes. We've
> > > > > > > discussed this at length before.
> > > > > >
> > > > > > I don't remember. Can you remind me of what the reasons are?
> > > > > >
> > > > >
> > > > > I perform a simulation using m processes, and write the result to m
> > > > > files. Later I want to use the result in another computation using
> > > > > n processors.
> > > >
> > > > I assume you did your first simulation (with m processors) starting
> > > > from one big file?
> > > >
> > >
> > > What do you mean? The first simulation might not read in any file.
> >
> > I assumed you had an input for your mesh with some nontrivial
> > geometry. The only other possibilities I see are either if you use one
> > of the builtin meshes or if you have some custom program that
> > generates your mesh. Or is it that you have done mesh
> > refinement?
> >
> > > > Can't you just restart from that file when you later want to run with
> > > > n processors? It would not be much extra work, and maybe it would even
> > > > be faster considering all the extra communication.
> > > >
> > >
> > > The communication cost is negligible when reading a vector just once and
> > > distributing.
> > >
> > > It will be very easy to read m files on n processes, so I don't get why
> > > we would wish to prevent it.
> >
> > Yes, it would be easy but we would need to redistribute it before the
> > call to ParMETIS.
> >
> > Another "problem" is if m > n. Then we would have to decide which
> > processes read multiple files and how many.
> >
> > > Also, we can't rely on the data that we read in being suitably
> > > partitioned.
> >
> > In my thinking, that is one of the main points of being able to read
> > in multiple files, that you have already done the work of the
> > partitioning and have a ready-made partition and just want to run the
> > simulation again, or perhaps another simulation on the same mesh. In
> > other words to allow computing in parallel without needing to go
> > through the partitioning step.
> >
>
> Partitioning is not the most important issue. The important issue is
> scalability - if I use 10^3 processes, I can't be gathering all data on
> process 0 for io, at possibly many steps in a simulation. We should be
> looking at this as a question of io.
>
> On a shared cluster one usually has time limits and will need restarts,
> but may not always have access to the same number of processors.
>
> Garth

OK, good point. I agree we need to support n != m.

A few questions arise:

1. Should we simply treat it as we do when reading one mesh? That is,
first each process reads some data (from one or multiple files), then
the data is partitioned with ParMETIS and redistributed.

We might need an extra redistribution step before calling ParMETIS if
it is not possible to organize the data directly when reading so that
each process gets roughly the same amount of data.

2. Should we have a special case for m = n where ParMETIS is not
called so that one can reuse an old partitioning when doing restart?
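
To make the two questions concrete, here is a minimal Python sketch
(not DOLFIN code) of one way to assign m files to n processes and to
detect the m = n case. The names files_for_rank and
can_reuse_partition are hypothetical, and the contiguous block
assignment is only one possible choice, assumed here for illustration:

```python
def files_for_rank(num_files, num_procs, rank):
    """Return the indices of the files that process `rank` would read.

    Hypothetical helper: contiguous block assignment in which the first
    (num_files % num_procs) processes read one extra file, so the file
    counts of any two processes differ by at most one.
    """
    if num_procs <= 0 or not 0 <= rank < num_procs:
        raise ValueError("rank must satisfy 0 <= rank < num_procs")
    base, extra = divmod(num_files, num_procs)
    start = rank * base + min(rank, extra)
    count = base + (1 if rank < extra else 0)
    return list(range(start, start + count))


def can_reuse_partition(num_files, num_procs):
    """True when m == n, the only case where the stored partitioning
    could be reused without repartitioning (question 2)."""
    return num_files == num_procs


if __name__ == "__main__":
    # m = 5 files on n = 2 processes (m > n)
    for rank in range(2):
        print(rank, files_for_rank(5, 2, rank))
    # m = 2 files on n = 4 processes (m < n): some processes read nothing
    for rank in range(4):
        print(rank, files_for_rank(2, 4, rank))
```

With m < n some processes get an empty list, which is exactly the
situation where an extra redistribution step before ParMETIS would be
needed to balance the data.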

--
Anders

