ffc team mailing list archive

quadrature optimisations

To: "ffc-dev@xxxxxxxxxx" <ffc-dev@xxxxxxxxxx>
From: Kristian Oelgaard <k.b.oelgaard@xxxxxxxxxx>
Date: Mon, 08 Sep 2008 15:40:55 +0200
Delivered-to: ffc-dev@xxxxxxxxxx
User-agent: Internet Messaging Program (IMP) 3.1
Hi,

Here is a comparison between tensor representation, the previous
quadrature representation, and the new, optimised version of quadrature
representation.

The FFC compile time is measured as follows:
 - simplify,  the time spent on simplifying the expression
 - repres.,   the time spent on computing the representation
 - code gen., the time spent on actual code generation
 - FFC,       total time spent on compiling the form

  The 3 stages (simplify, repres. and code gen.) account for around 95% of
  the FFC compile time.

 - size, is the size of the header file.

 - DOLFIN, is the time spent on compiling a simple main.cpp file including
           the generated header file against DOLFIN.

 - run, is the runtime measured as the time it takes to call tabulate_tensor()
        N times. No assembly is performed. If a form contains facet integrals
        tabulate_tensor() is called for each of the cases. E.g., a DG form in
        3D with one interior facet integral will call tabulate_tensor()
        N*4*4 times.

 - fac., is the runtime divided by the runtime for the previous version of
         quadrature representation
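
The call count and the fac. column described above can be sketched in a few
lines of Python. This is only an illustration: the function names are
hypothetical (they are not part of FFC or DOLFIN), and simplex cells
(triangles, tetrahedra) are assumed, so a cell in dimension d has d + 1 facets.

```python
def facets_per_cell(dim):
    """Number of facets of a simplex cell (assumption: triangles/tetrahedra)."""
    return dim + 1


def tabulate_tensor_calls(n, dim, cell_integrals=0, interior_facet_integrals=0):
    """Count tabulate_tensor() calls for n repetitions.

    A cell integral is called once per repetition. An interior facet
    integral is called once per (facet0, facet1) pair per repetition.
    """
    f = facets_per_cell(dim)
    return n * (cell_integrals + interior_facet_integrals * f * f)


def fac(run, run_old_quad):
    """Runtime relative to the old quadrature representation."""
    return run / run_old_quad


# One interior facet integral in 3D, N = 2,000 repetitions: N*4*4 calls.
print(tabulate_tensor_calls(2000, dim=3, interior_facet_integrals=1))  # 32000

# Elasticity 3D, new quad: 31.0s against 38.5s for old quad.
print(round(fac(31.0, 38.5), 3))  # 0.805
```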

All forms are bilinear forms.

Elasticity 3D, 2nd order elements, N = 500,000
Description: No functions, just basis functions and geometry terms
                                                                
         simplify  repres.  code gen.  FFC    size   DOLFIN     run    fac.
tensor     0.21s    0.05s      0.93s   1.41s  598kb   8.17s     1.8s   0.047
old quad   0.21s    0.05s      0.98s   1.45s  476kb   7.39s    38.5s   1.000
new quad   0.21s    0.05s      0.98s   1.48s  464kb   7.34s    31.0s   0.805

Note: For forms without any functions, tensor representation is ALWAYS much
      faster (about 17 times faster than the new quadrature in this case).


Plasticity 2D, 1st order elements, N = 100,000,000
Description: 9 component tangent defined on VectorQuadratureElement

         simplify  repres.  code gen.  FFC    size   DOLFIN     run    fac.
tensor     0.14s    0.11s      0.20s   0.62s  232kb   6.36s    25.5s   0.560
old quad   0.14s    0.06s      0.37s   0.77s  228kb   6.33s    45.5s   1.000
new quad   0.14s    0.06s      0.24s   0.64s  230kb   6.27s    20.8s   0.457

Note: Not much difference between tensor and the new quadrature
      representation; both are about 2 times faster than the old version of
      quadrature representation.


Plasticity 2D, 3rd order elements, N = 500,000
Description: 9 component tangent defined on VectorQuadratureElement

         simplify  repres.  code gen.  FFC    size   DOLFIN     run    fac.
tensor     0.14s    0.55s      3.37s   4.22s  1.8MB  43.61s    54.5s   1.518
old quad   0.14s    0.14s      1.69s   2.19s  414kb   7.58s    35.9s   1.000
new quad   0.14s    0.15s      0.50s   1.00s  410kb   7.51s    10.5s   0.292

Note: For higher order elements, the code generated by tensor representation
      grows in size, increasing the DOLFIN compile time. At runtime, the new
      quadrature is 3 and 5 times faster than the old quadrature and tensor
      respectively. The FFC compile time is also 2-4 times faster (not that it
      makes much of a difference since the total compile time is only 1 sec.)


Plasticity 3D, 1st order elements, N = 10,000,000
Description: 36 component tangent defined on VectorQuadratureElement

         simplify  repres.  code gen.  FFC    size   DOLFIN     run    fac.
tensor     2.04s    3.36s      5.77s  11.89s  775kb  12.76s    52.9s   1.441
old quad   2.04s    0.86s     11.71s  15.35s  670kb  11.72s    36.7s   1.000
new quad   2.01s    0.85s      1.78s   5.33s  693kb  11.89s    19.0s   0.518

Note: The new quadrature compiles 2-3 times faster with FFC and is 2-3 times
      faster at runtime.
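
The "2-3 times" figures in this note can be reproduced from the table above.
The following is a minimal sketch; the dictionary layout and the helper name
are chosen here for illustration only, and the numbers are copied from the
FFC and run columns.

```python
# (FFC total, run) in seconds, copied from the Plasticity 3D, 1st order table.
timings = {
    "tensor":   (11.89, 52.9),
    "old quad": (15.35, 36.7),
    "new quad": (5.33, 19.0),
}

FFC, RUN = 0, 1  # column indices into the tuples above


def speedup(baseline, candidate, column):
    """How many times faster `candidate` is than `baseline` in one column."""
    return timings[baseline][column] / timings[candidate][column]


for baseline in ("tensor", "old quad"):
    print(
        f"new quad vs {baseline}: "
        f"FFC x{speedup(baseline, 'new quad', FFC):.1f}, "
        f"run x{speedup(baseline, 'new quad', RUN):.1f}"
    )
```

All four ratios land between roughly 1.9 and 2.9, which is what the note
summarises as 2-3 times.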


Plasticity 3D, 2nd order elements, N = 100,000
Description: 36 component tangent defined on VectorQuadratureElement

         simplify  repres.  code gen.  FFC    size   DOLFIN     run    fac.
tensor     2.03s   34.93s    236.6s  275.30s  11MB     *        ---    ---
old quad   2.05s    2.15s     68.3s   73.30s  1.4MB  16.89s    37.8s   1.000
new quad   2.04s    2.15s      2.9s    7.82s  1.4MB  16.67s     6.7s   0.177

* ran out of memory after 8min.
  cc1plus: out of memory allocating 1477058608 bytes after a total\
   of 134725632 bytes
  (also tried to split FFC output in *.h and *.cpp, same result)

Note: Tensor representation takes forever to compile with FFC and the
      resulting code can't be compiled against DOLFIN. The new quadrature
      compiles 10 times faster with FFC and runs about 5 times faster.


PressureEquation 2D, 2nd order elements, N = 100,000
Description: Many, many functions

         simplify  repres.  code gen.  FFC    size   DOLFIN     run    fac.
tensor    23.7s     0.45s      2.20s  29.0s   2.6MB  36.05s     6.76s  0.0168
old quad  23.5s     0.41s     16.48s  43.1s   556kb   9.02s   400.40s  1.000
new quad  23.7s     0.41s      3.05s  29.9s   544kb   8.69s     1.03s  0.0025

Note: The FFC compile time has been reduced for the new quadrature so that
      it's comparable to that of tensor representation; note that most time
      is spent by simplify. The runtime of the new quadrature is now 6-7 times
      faster than tensor representation and almost 400!! times faster than the
      old version of quadrature.
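
As a rough check of this note against the table, the share of FFC time spent
in simplify and the runtime ratios can be computed as below. This is a sketch
only; the helper names are hypothetical and the numbers are copied from the
simplify, FFC and run columns.

```python
# (simplify, FFC total, run) in seconds, copied from the PressureEquation table.
pressure = {
    "tensor":   (23.7, 29.0, 6.76),
    "old quad": (23.5, 43.1, 400.40),
    "new quad": (23.7, 29.9, 1.03),
}


def simplify_share(name):
    """Fraction of the total FFC compile time spent in the simplify stage."""
    simplify, ffc_total, _run = pressure[name]
    return simplify / ffc_total


def runtime_ratio(slower, faster):
    """Runtime of `slower` divided by runtime of `faster`."""
    return pressure[slower][2] / pressure[faster][2]


for name in pressure:
    print(f"{name}: simplify is {100 * simplify_share(name):.0f}% of FFC time")

print(f"tensor / new quad:   {runtime_ratio('tensor', 'new quad'):.1f}")
print(f"old quad / new quad: {runtime_ratio('old quad', 'new quad'):.0f}")
```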


BiharmonicDG_2D, 3rd order elements, N = 200,000
Description: Interior facet integrals, higher order derivatives.

         simplify  repres.  code gen.  FFC    size   DOLFIN     run    fac.
tensor     1.11s    1.61s     13.67s  16.70s  3.2MB  46.26s    31.6s   0.280
old quad   1.12s    1.26s      4.89s   7.64s  487kb   9.62s   112.9s   1.000
new quad   1.12s    1.25s      2.95s   5.72s  427kb   7.80s    33.7s   0.298

Note: Faster compile time for both FFC and DOLFIN compared to tensor, and an
      equivalent runtime performance.
      (factor 3 better than the old quadrature)


BiharmonicDG_3D, 3rd order elements, N = 2,000
Description: Interior facet integrals, higher order derivatives.

         simplify  repres.  code gen.  FFC    size   DOLFIN     run    fac.
tensor     2.70s     *         ---     ---    ---     ---       ---    ---
old quad   2.70s    7.86s     60.5s   72.0s   2.9MB  70.2s     51.5s   1.000
new quad   2.65s    7.79s     28.7s   39.9s   2.4MB  36.8s     10.4s   0.202

* MemoryError during compute representation

Note: A factor of 2 speed-up at the code generation stage, and less
      code as output. 2 times faster DOLFIN compile time and 5 times faster
      at runtime.


DGSGPa, 3D linear elements, N = 20,000
Description: DG strain gradient plasticity form, among other crazy things
             an 81 component tangent on linear discontinuous elements.
                                                                                  
         simplify  repres.  code gen.  FFC    size   DOLFIN     run    fac.  
tensor     199s      *         ---     ---    ---     ---       ---    ---  
old quad   200s     768s      1485s   2462s   11MB    220s     34.7s   1.000
new quad   201s     763s       167s   1141s   9.0MB   114s     20.7s   0.628

* MemoryError during compute representation

Note: The FFC compile time has been reduced by a factor of 2; also note that
      the code generation is now faster than simplifying the expression. It
      might be possible to optimise the representation stage by cutting some
      corners, but that is for later. The DOLFIN compile time is a factor of 2
      faster, but unfortunately it did not have that big an impact on the
      runtime performance.


CahnHilliard, linear elements, N = 200,000
Description: Many functions.

          simplify  repres.  code gen.  FFC    size   DOLFIN     run    fac.
old quad a   6.88s   11.5s    640s      ---     ---    ---       ---    ---
old quad L   2.98s  237.1s   5571s     6470s   1.9MB  23.2s     72.1s   1.000
new quad a   6.61s   10.7s      2.30s   ---     ---    ---       ---    ---
new quad L   3.14s  229.7s      1.63s   258s   1.9MB  20.8s      1.50s  0.021

Note: I'll let the numbers on FFC compile time and runtime speak for
      themselves.


Kristian
Follow ups

Re: quadrature optimisations
From: Anders Logg, 2008-09-08