
dolfin team mailing list archive

Re: Assembly timings

 

Great. In particular, it will be useful to know, say for Poisson and for Taylor-Hood elements in Stokes, at what polynomial degree the binary search tree starts to beat PETSc. It should be easy to just run k = 1..5 and see what happens. Doing this test on the same 128 x 128 mesh is probably fine.

If the break-even point is at k = 2 or 3, then it might make sense to try optimizing the binary search tree code. We will have to fight the battle that "assembly time doesn't matter, it's solve time that counts". If we're only getting a 10-15% speedup on assembly and assembly is 20% of solve time, that isn't really anything to brag about (0.1 * 0.2 = an overall speedup of 2%; I wouldn't try to publish that). On the other hand, we might try to find the best solver in hypre to use (Rob Falgout claims there are things that destroy his AMG for Poisson).

On the other hand, maybe we should think big. The asymptotics favor binary trees as the number of nonzeros per row increases. One could ask whether forming the Jacobian for MHD in primitive variables, with Taylor-Hood for the fluid and piecewise linear Lagrange for the electromagnetics, would give us enough dofs per row without going to high polynomial degree.

Finally, we must prioritize -- I think that getting results for various formulations of Stokes, measuring error in pressure and velocity versus dofs and nonzeros in the matrix, is a more interesting result and hence a more valuable use of time.

Any thoughts?

Rob


On Sep 26, 2005, at 7:09 PM, Andy Ray Terrel wrote:

Yeah, I will run some longer scripts; I hadn't tried to blow the cache. It seemed that anything that didn't blow the cache gave the same ratio between the different assemblies.

Andy

Anders Logg wrote:

Good point. Andy did Stokes originally though, with similar results.
We also tried a number of different problem sizes, but not higher degree.

Maybe Andy could make some more extensive testing now that we both
have looked at the code and know it's doing what it's supposed to do.

/Anders

On Mon, Sep 26, 2005 at 05:47:01PM -0500, Robert C. Kirby wrote:

Here is another thing to do before jumping into optimization: you have some knobs, so turn them:

1.) *Much* larger problem. Get out of cache, just in case.
2.) Try higher polynomial degree. The number of nonzeros on a given row just isn't that large for linears. Try k = 2, 3, 4, 5.
3.) Try a mixed operator (say Stokes) -- there will be even more nonzeros per row.

PETSc internally has an O(n^2) algorithm; you are doing O(n log n). Don't give up based on the linears -- most n log n algorithms lose for small n.
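
To see roughly where that crossover lands, a small self-contained test along these lines would do (the column-index range and row count here are made up for illustration; the linear-search insert into a sorted vector stands in for the O(n^2) per-row insertion, and std::map is the binary tree):

  #include <cstdio>
  #include <cstdlib>
  #include <ctime>
  #include <map>
  #include <utility>
  #include <vector>

  // Insert n entries per row into a sorted array (linear search per
  // insertion, O(n^2) per row) and into a std::map (O(n log n) per row).
  int main(int argc, char* argv[])
  {
    const int n = (argc > 1) ? std::atoi(argv[1]) : 100; // nonzeros per row
    const int rows = 10000;                              // simulated rows

    std::srand(0);
    std::clock_t t0 = std::clock();
    for (int r = 0; r < rows; r++)
    {
      std::vector<std::pair<int, double> > row;
      for (int j = 0; j < n; j++)
      {
        const int J = std::rand() % (10*n);
        // Linear scan for the insertion point (sorted-array behavior)
        std::vector<std::pair<int, double> >::iterator it = row.begin();
        while (it != row.end() && it->first < J)
          ++it;
        if (it != row.end() && it->first == J)
          it->second += 1.0;
        else
          row.insert(it, std::make_pair(J, 1.0));
      }
    }
    std::clock_t t1 = std::clock();

    std::srand(0);
    for (int r = 0; r < rows; r++)
    {
      std::map<int, double> row;
      for (int j = 0; j < n; j++)
        row[std::rand() % (10*n)] += 1.0; // insert-or-accumulate, O(log n)
    }
    std::clock_t t2 = std::clock();

    std::printf("sorted array: %.3f s  map: %.3f s\n",
                double(t1 - t0) / CLOCKS_PER_SEC,
                double(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
  }

Running it for increasing n should show where the tree starts to win.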

Once you find a turning point (if ever), then worry about optimizing.

Rob



On Sep 26, 2005, at 3:46 PM, Anders Logg wrote:


Hey everyone.

Andy (Terrel) has been trying out some new ways to improve the speed
of assembly in DOLFIN. We just sat down to make some timings and here
are the results/conclusions. Comments and suggestions are welcome.

The basic idea is to store the sparse matrix as a std::vector of
std::maps (each implemented as a binary search tree):

  std::vector<std::map<int, real> > rows(M);  // one ordered map per row

With this data structure, we replace the line

  A.add(block, test_dofs, m, trial_dofs, n);

which just calls PETSc's MatSetValues, with the following code for
insertion into the sparse data structure:

// Accumulate the m x n element matrix 'block' into the row-wise maps
for (uint i = 0; i < m; i++)
{
  std::map<int, real>& row = rows[test_dofs[i]];
  for (uint j = 0; j < n; j++)
  {
    const int J = trial_dofs[j];
    const real val = block[i*n + j];
    std::map<int, real>::iterator iter = row.find(J);
    if ( iter == row.end() )
      row.insert(std::map<int, real>::value_type(J, val));
    else
      iter->second += val;
  }
}

This is done once for each element. At the end of assembly, we copy
the values into a PETSc matrix.
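
(A sketch of how that copy can be done -- not the actual DOLFIN code,
just the obvious approach, assuming real is PetscScalar and the PETSc
matrix has been preallocated; MatSetValues and MatAssemblyBegin/End are
the standard PETSc calls:)

  #include <map>
  #include <vector>
  #include <petscmat.h>

  // Copy the row-wise maps into a preallocated PETSc matrix, one
  // MatSetValues call per row. INSERT_VALUES suffices since every
  // entry has already been fully accumulated.
  void copy_to_petsc(const std::vector<std::map<int, PetscScalar> >& rows,
                     Mat A)
  {
    std::vector<PetscInt> cols;
    std::vector<PetscScalar> vals;

    for (PetscInt i = 0; i < (PetscInt) rows.size(); i++)
    {
      cols.clear();
      vals.clear();
      std::map<int, PetscScalar>::const_iterator it;
      for (it = rows[i].begin(); it != rows[i].end(); ++it)
      {
        cols.push_back(it->first);
        vals.push_back(it->second);
      }
      if (!cols.empty())
        MatSetValues(A, 1, &i, (PetscInt) cols.size(),
                     &cols[0], &vals[0], INSERT_VALUES);
    }

    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
  }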

The result: PETSc MatSetValues is about twice as fast as the STL.

Here are the numbers, broken into timings for the different stages of
assembly (times in seconds for assembly of Poisson on a 128x128
triangular mesh):

  - Iteration over mesh + misc stuff    0.04
  - Mapping dofs, updating affine map   0.23 (this can be faster)
  - Computing element matrix            0.01  :-)
  - Inserting into STL binary tree      0.58
  - PETSc assemble begin/end            0.05
  - Copying data to PETSc matrix        0.18 (this can be faster)

As a comparison, just calling PETSc MatSetValues takes 0.24 s, which
should be compared with the sum 0.58 s + 0.18 s = 0.76 s.

Perhaps insertion into the STL binary tree could be made faster if one
could reserve the size of each map before assembly. (Does anyone know if
this is possible? Something like std::vector::reserve()?)
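
(As far as we know, std::map has no analogue of reserve() -- the tree
allocates one node per entry. One small change that should help
regardless, assuming real stays a plain floating-point type: operator[]
value-initializes a missing entry to zero and returns a reference, so
the find/insert pair above collapses into a single tree traversal:)

  #include <map>
  #include <vector>

  typedef double real;
  typedef unsigned int uint;

  // Same accumulation as above, but with one tree lookup per entry:
  // operator[] inserts a zero-initialized entry when the key is missing.
  void add_block(std::vector<std::map<int, real> >& rows,
                 const real* block,
                 const int* test_dofs, uint m,
                 const int* trial_dofs, uint n)
  {
    for (uint i = 0; i < m; i++)
    {
      std::map<int, real>& row = rows[test_dofs[i]];
      for (uint j = 0; j < n; j++)
        row[trial_dofs[j]] += block[i*n + j];
    }
  }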

Again, PETSc wins. The results from previous benchmarks against PETSc
seem to hold: PETSc is about twice as fast as a simple sparse data
structure.

This seems to hold both when the sparse data structure is

  double** values;   // values[i][k]: k-th stored nonzero in row i
  int** columns;     // columns[i][k]: its column index

or when it is

  std::vector<std::map<int, real> > rows(M);

/Anders






--
====================
Andy Terrel
Computer Science Dept
University of Chicago
aterrel@xxxxxxxxxxxx
---------------------


_______________________________________________
DOLFIN-dev mailing list
DOLFIN-dev@xxxxxxxxxx
http://www.fenics.org/cgi-bin/mailman/listinfo/dolfin-dev



