ffc team mailing list archive

Re: quadrature optimisations

To: ffc-dev@xxxxxxxxxx
From: Anders Logg <logg@xxxxxxxxx>
Date: Mon, 8 Sep 2008 15:49:49 +0200
Delivered-to: ffc-dev@xxxxxxxxxx
In-reply-to: <1220881255.48c52b67c4cb4@mech001.citg.tudelft.nl>
Mail-followup-to: ffc-dev@xxxxxxxxxx
User-agent: Mutt/1.5.17+20080114 (2008-01-14)
On Mon, Sep 08, 2008 at 03:40:55PM +0200, Kristian Oelgaard wrote:
> 
> Hi,
> 
> Here is a comparison between tensor representation and the previous
> quadrature representation and the new and optimised version of quadrature
> representation.
> 
> The FFC compile time is measured as follows:
>  - simplify,  the time spent on simplifying the expression
>  - repres.,   the time spent on computing the representation
>  - code gen., the time spent on actual code generation
>  - FFC,       total time spent on compiling the form
> 
>   The 3 stages (simplify, repres. and code gen.) account for around 95% of
>   the FFC compile time.
> 
>  - size, is the size of the header file.
> 
>  - DOLFIN, is the time spent on compiling a simple main.cpp file including
>            the generated header file against DOLFIN.
> 
>  - run, is the runtime measured as the time it takes to call tabulate_tensor()
>         N times. No assembly is performed. If a form contains facet integrals
>         tabulate_tensor() is called for each of the cases. E.g., a DG form in
>         3D with one interior facet integral will call tabulate_tensor()
>         N*4*4 times.
> 
>  - fac., is the runtime divided by the runtime for the previous version of
>          quadrature representation
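
A minimal sketch of how the "run" and "fac." columns described above could be reproduced. The names time_calls, relative_factor and interior_facet_calls are hypothetical helpers, not part of FFC or DOLFIN, and kernel stands in for whatever callable wraps tabulate_tensor(). The N*4*4 count is read here as one call per pair of local facets on the two adjacent tetrahedra, which is an assumption about "each of the cases".

```python
import time


def time_calls(kernel, n_calls):
    """Return wall-clock seconds spent calling kernel() n_calls times.

    kernel is a placeholder for a wrapper around tabulate_tensor();
    no assembly is performed, matching the description of "run".
    """
    start = time.perf_counter()
    for _ in range(n_calls):
        kernel()
    return time.perf_counter() - start


def relative_factor(run_time, baseline_run_time):
    """The "fac." column: run time divided by the old quadrature run time."""
    return run_time / baseline_run_time


def interior_facet_calls(n, facets_per_cell):
    """Number of calls for one interior facet integral.

    Assumes one call per combination of local facet indices on the two
    cells sharing the facet, e.g. n * 4 * 4 for tetrahedra.
    """
    return n * facets_per_cell * facets_per_cell
```

For example, a run time of 31.0s against an old quadrature run time of 38.5s gives relative_factor(31.0, 38.5), which rounds to 0.805.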
> 
> All forms are bilinear forms.
> 
> Elasticity 3D, 2nd order elements, N = 500,000
> Description: No functions, just basis functions and geometry terms
>                                                                 
>          simplify  repres.  code gen.  FFC    size   DOLFIN     run    fac.
> tensor     0.21s    0.05s      0.93s   1.41s  598kb   8.17s     1.8s   0.047
> old quad   0.21s    0.05s      0.98s   1.45s  476kb   7.39s    38.5s   1.000
> new quad   0.21s    0.05s      0.98s   1.48s  464kb   7.34s    31.0s   0.805
> 
> Note: For forms without any functions, tensor representation is ALWAYS much
>       faster (about 17 times in this case).
> 
> 
> Plasticity 2D, 1st order elements, N = 100,000,000
> Description: 9 component tangent defined on VectorQuadratureElement
> 
>          simplify  repres.  code gen.  FFC    size   DOLFIN     run    fac.
> tensor     0.14s    0.11s      0.20s   0.62s  232kb   6.36s    25.5s   0.560
> old quad   0.14s    0.06s      0.37s   0.77s  228kb   6.33s    45.5s   1.000
> new quad   0.14s    0.06s      0.24s   0.64s  230kb   6.27s    20.8s   0.457
> 
> Note: Not much difference between tensor and the new quadrature
>       representation, both are about 2 times faster than the old version of
>       quadrature representation.
> 
> 
> Plasticity 2D, 3rd order elements, N = 500,000
> Description: 9 component tangent defined on VectorQuadratureElement
> 
>          simplify  repres.  code gen.  FFC    size   DOLFIN     run    fac.
> tensor     0.14s    0.55s      3.37s   4.22s  1.8MB  43.61s    54.5s   1.518
> old quad   0.14s    0.14s      1.69s   2.19s  414kb   7.58s    35.9s   1.000
> new quad   0.14s    0.15s      0.50s   1.00s  410kb   7.51s    10.5s   0.292
> 
> Note: For higher order elements, the code generated by tensor representation
>       grows in size increasing the DOLFIN compile time. The new quadrature
>       is 3 and 5 times faster than the old quadrature and tensor respectively.
>       The FFC compile time is also 2-4 times faster (not that it makes much of
>       a difference since the total compile time is only 1 sec.)
> 
> 
> Plasticity 3D, 1st order elements, N = 10,000,000
> Description: 36 component tangent defined on VectorQuadratureElement
> 
>          simplify  repres.  code gen.  FFC    size   DOLFIN     run    fac.
> tensor     2.04s    3.36s      5.77s  11.89s  775kb  12.76s    52.9s   1.441
> old quad   2.04s    0.86s     11.71s  15.35s  670kb  11.72s    36.7s   1.000
> new quad   2.01s    0.85s      1.78s   5.33s  693kb  11.89s    19.0s   0.518
> 
> Note: The new quadrature compiles 2-3 times faster with FFC and is 2-3 times
>       faster at runtime.
> 
> 
> Plasticity 3D, 2nd order elements, N = 100,000
> Description: 36 component tangent defined on VectorQuadratureElement
> 
>          simplify  repres.  code gen.  FFC    size   DOLFIN     run    fac.
> tensor     2.03s   34.93s    236.6s  275.30s  11MB     *        ---    ---
> old quad   2.05s    2.15s     68.3s   73.30s  1.4MB  16.89s    37.8s   1.000
> new quad   2.04s    2.15s      2.9s    7.82s  1.4MB  16.67s     6.7s   0.177
> 
> * ran out of memory after 8min.
>   cc1plus: out of memory allocating 1477058608 bytes after a total\
>    of 134725632 bytes
>   (also tried to split FFC output in *.h and *.cpp, same result)
> 
> Note: Tensor representation takes forever to compile with FFC and the
>       resulting code can't be compiled against DOLFIN. The new quadrature
>       compiles 10 times faster with FFC and runs about 5 times faster.
> 
> 
> PressureEquation 2D, 2nd order elements, N = 100,000
> Description: Many, many functions
> 
>          simplify  repres.  code gen.  FFC    size   DOLFIN     run    fac.
> tensor    23.7s     0.45s      2.20s  29.0s   2.6MB  36.05s     6.76s  0.0168
> old quad  23.5s     0.41s     16.48s  43.1s   556kb   9.02s   400.40s  1.000
> new quad  23.7s     0.41s      3.05s  29.9s   544kb   8.69s     1.03s  0.0025
> 
> Note: The FFC compile time has been reduced for the new quadrature so that
>       it's comparable to that of tensor representation; note that most time
>       is spent in simplify. The runtime is now 6-7 times faster than tensor
>       representation and almost 400!! times faster than the old version
>       of quadrature.
> 
> 
> BiharmonicDG_2D, 3rd order elements, N = 200,000
> Description: Interior facet integrals, higher order derivatives.
> 
>          simplify  repres.  code gen.  FFC    size   DOLFIN     run    fac.
> tensor     1.11s    1.61s     13.67s  16.70s  3.2MB  46.26s    31.6s   0.280
> old quad   1.12s    1.26s      4.89s   7.64s  487kb   9.62s   112.9s   1.000
> new quad   1.12s    1.25s      2.95s   5.72s  427kb   7.80s    33.7s   0.298
> 
> Note: Faster compile time for both FFC and DOLFIN compared to tensor, and an
>       equivalent runtime performance.
>       (factor 3 better than the old quadrature)
> 
> 
> BiharmonicDG_3D, 3rd order elements, N = 2,000
> Description: Interior facet integrals, higher order derivatives.
> 
>          simplify  repres.  code gen.  FFC    size   DOLFIN     run    fac.
> tensor     2.70s     *         ---     ---    ---     ---       ---    ---
> old quad   2.70s    7.86s     60.5s   72.0s   2.9MB  70.2s     51.5s   1.000
> new quad   2.65s    7.79s     28.7s   39.9s   2.4MB  36.8s     10.4s   0.202
> 
> * MemoryError during compute representation
> 
> Note: A factor of 2 speed-up at the code generation stage, and less
>       code as output. 2 times faster DOLFIN compile time and 5 times faster
>       at runtime.
> 
> 
> DGSGPa, 3D linear elements, N = 20,000
> Description: DG strain gradient plasticity form, among other crazy things
>              an 81 component tangent on linear discontinuous elements.
>                                                                                   
>          simplify  repres.  code gen.  FFC    size   DOLFIN     run    fac.  
> tensor     199s      *         ---     ---    ---     ---       ---    ---  
> old quad   200s     768s      1485s   2462s   11MB    220s     34.7s   1.000
> new quad   201s     763s       167s   1141s   9.0MB   114s     20.7s   0.628
> 
> * MemoryError during compute representation
> 
> Note: The FFC compile time has been reduced by a factor 2, also note that
>       the code generation is now faster than simplifying the expression. It
>       might be possible to optimise the representation stage by cutting some
>       corners, but that is for later. The DOLFIN compile time is a factor 2
>       faster, but unfortunately it did not have that big an impact on the
>       runtime performance.
> 
> 
> CahnHilliard, linear elements, N = 200,000
> Description: Many functions.
>
>             simplify  repres.  code gen.  FFC    size   DOLFIN     run    fac.
> old quad a   6.88s    11.5s     640s      ---    ---     ---       ---    ---
> old quad L   2.98s   237.1s    5571s     6470s   1.9MB  23.2s     72.1s   1.000
> new quad a   6.61s    10.7s      2.30s    ---    ---     ---       ---    ---
> new quad L   3.14s   229.7s      1.63s   258s    1.9MB  20.8s      1.50s  0.021
> 
> Note: I'll let the numbers on FFC compile time and runtime speak for
>       themselves.
> 
> 
> Kristian

Very impressive!

-- 
Anders
References

quadrature optimisations
From: Kristian Oelgaard, 2008-09-08