ffc team mailing list archive
Message #01786
quadrature optimisations
Hi,
Here is a comparison between the tensor representation, the previous
quadrature representation, and the new, optimised version of the quadrature
representation.
The FFC compile time is measured as follows:
- simplify, the time spent on simplifying the expression
- repres., the time spent on computing the representation
- code gen., the time spent on actual code generation
- FFC, total time spent on compiling the form
The 3 stages (simplify, repres. and code gen.) account for around 95% of
the FFC compile time.
- size, is the size of the header file.
- DOLFIN, is the time spent on compiling a simple main.cpp file including
the generated header file against DOLFIN.
- run, is the runtime measured as the time it takes to call tabulate_tensor()
N times. No assembly is performed. If a form contains facet integrals,
tabulate_tensor() is called for each local facet case. E.g., a DG form in
3D with one interior facet integral will call tabulate_tensor()
N*4*4 times (4 local facets on each of the two cells).
- fac., is the runtime divided by the runtime for the previous version of
quadrature representation (the old quad row).
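As a rough illustration of the run measurement described above, the following Python sketch counts how many tabulate_tensor() calls N passes imply and times them. The names FACETS_PER_CELL, calls_per_pass and time_kernel are hypothetical, the kernel is a stand-in callable rather than generated code, and the facet counts for 2D and for exterior facets are assumptions extrapolated from the N*4*4 example.

```python
import time

# Local facets per cell. The 3D value (tetrahedron, 4 facets) matches the
# N*4*4 example; the 2D value (triangle, 3 facets) is an assumption.
FACETS_PER_CELL = {2: 3, 3: 4}


def calls_per_pass(integral_type, dim):
    """Return how many tabulate_tensor() cases one pass covers."""
    facets = FACETS_PER_CELL[dim]
    if integral_type == "cell":
        return 1
    if integral_type == "exterior_facet":  # assumed: one call per local facet
        return facets
    if integral_type == "interior_facet":  # one call per pair of local facets
        return facets * facets
    raise ValueError(f"unknown integral type: {integral_type}")


def time_kernel(kernel, integral_type, dim, n):
    """Call the stand-in kernel for every case, n times; return (seconds, calls)."""
    cases = calls_per_pass(integral_type, dim)
    start = time.perf_counter()
    for _ in range(n):
        for _ in range(cases):
            kernel()
    return time.perf_counter() - start, n * cases


if __name__ == "__main__":
    # A no-op kernel only exercises the loop; real timings need real kernels.
    seconds, calls = time_kernel(lambda: None, "interior_facet", 3, 1000)
    print(f"{calls} calls in {seconds:.4f}s")
```

No assembly happens in this loop, which mirrors the measurement described in the list: only the cost of the kernel calls themselves is timed.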
All forms are bilinear forms.
Elasticity 3D, 2nd order elements, N = 500,000
Description: No functions, just basis functions and geometry terms
            simplify  repres.  code gen.      FFC    size   DOLFIN      run    fac.
tensor         0.21s    0.05s      0.93s    1.41s   598kb    8.17s     1.8s   0.047
old quad       0.21s    0.05s      0.98s    1.45s   476kb    7.39s    38.5s   1.000
new quad       0.21s    0.05s      0.98s    1.48s   464kb    7.34s    31.0s   0.805
Note: For forms without any functions, tensor representation is ALWAYS much
faster (about 17 times in this case).
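The fac. column and the "about 17 times" figure can be reproduced from the run column alone. The sketch below does this for the Elasticity 3D numbers; the names run_seconds and relative_factors are hypothetical and chosen only for illustration.

```python
# Runtimes in seconds, copied from the run column of the Elasticity 3D table.
run_seconds = {"tensor": 1.8, "old quad": 38.5, "new quad": 31.0}


def relative_factors(runs, baseline="old quad"):
    """Divide every runtime by the baseline runtime, rounded to 3 decimals."""
    return {name: round(t / runs[baseline], 3) for name, t in runs.items()}


factors = relative_factors(run_seconds)

# How many times faster tensor is than the new quadrature representation.
tensor_speedup = run_seconds["new quad"] / run_seconds["tensor"]

if __name__ == "__main__":
    print(factors)
    print(f"tensor is about {tensor_speedup:.1f} times faster than new quad")
```

A fac. value below 1 means faster than the old quadrature representation, and its reciprocal is the speed-up factor.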
Plasticity 2D, 1st order elements, N = 100,000,000
Description: 9 component tangent defined on VectorQuadratureElement
            simplify  repres.  code gen.      FFC    size   DOLFIN      run    fac.
tensor         0.14s    0.11s      0.20s    0.62s   232kb    6.36s    25.5s   0.560
old quad       0.14s    0.06s      0.37s    0.77s   228kb    6.33s    45.5s   1.000
new quad       0.14s    0.06s      0.24s    0.64s   230kb    6.27s    20.8s   0.457
Note: Not much difference between tensor and the new quadrature
representation; both are about 2 times faster than the old version of
quadrature representation.
Plasticity 2D, 3rd order elements, N = 500,000
Description: 9 component tangent defined on VectorQuadratureElement
            simplify  repres.  code gen.      FFC    size   DOLFIN      run    fac.
tensor         0.14s    0.55s      3.37s    4.22s   1.8MB   43.61s    54.5s   1.518
old quad       0.14s    0.14s      1.69s    2.19s   414kb    7.58s    35.9s   1.000
new quad       0.14s    0.15s      0.50s    1.00s   410kb    7.51s    10.5s   0.292
Note: For higher order elements, the code generated by the tensor representation
grows in size, which increases the DOLFIN compile time. At runtime the new
quadrature is about 3 and 5 times faster than the old quadrature and tensor
representations respectively. The FFC compile time is also 2-4 times faster
(not that it makes much of a difference, since the total compile time is
only 1 sec.).
Plasticity 3D, 1st order elements, N = 10,000,000
Description: 36 component tangent defined on VectorQuadratureElement
            simplify  repres.  code gen.      FFC    size   DOLFIN      run    fac.
tensor         2.04s    3.36s      5.77s   11.89s   775kb   12.76s    52.9s   1.441
old quad       2.04s    0.86s     11.71s   15.35s   670kb   11.72s    36.7s   1.000
new quad       2.01s    0.85s      1.78s    5.33s   693kb   11.89s    19.0s   0.518
Note: The new quadrature compiles 2-3 times faster with FFC and is 2-3 times
faster at runtime.
Plasticity 3D, 2nd order elements, N = 100,000
Description: 36 component tangent defined on VectorQuadratureElement
            simplify  repres.  code gen.      FFC    size   DOLFIN      run    fac.
tensor         2.03s   34.93s     236.6s  275.30s    11MB        *      ---     ---
old quad       2.05s    2.15s      68.3s   73.30s   1.4MB   16.89s    37.8s   1.000
new quad       2.04s    2.15s       2.9s    7.82s   1.4MB   16.67s     6.7s   0.177
* ran out of memory after 8min.
cc1plus: out of memory allocating 1477058608 bytes after a total\
of 134725632 bytes
(also tried to split FFC output in *.h and *.cpp, same result)
Note: Tensor representation takes forever to compile with FFC and the
resulting code can't be compiled against DOLFIN. The new quadrature
compiles 10 times faster with FFC and runs about 5 times faster.
PressureEquation 2D, 2nd order elements, N = 100,000
Description: Many, many functions
            simplify  repres.  code gen.      FFC    size   DOLFIN      run    fac.
tensor         23.7s    0.45s      2.20s    29.0s   2.6MB   36.05s    6.76s  0.0168
old quad       23.5s    0.41s     16.48s    43.1s   556kb    9.02s  400.40s   1.000
new quad       23.7s    0.41s      3.05s    29.9s   544kb    8.69s    1.03s  0.0025
Note: The FFC compile time has been reduced for the new quadrature so that
it's comparable to that of tensor representation; note that most of the time
is spent in simplify. The runtime is now 6-7 times faster than tensor
representation and almost 400!! times faster than the old version of
quadrature.
BiharmonicDG_2D, 3rd order elements, N = 200,000
Description: Interior facet integrals, higher order derivatives.
            simplify  repres.  code gen.      FFC    size   DOLFIN      run    fac.
tensor         1.11s    1.61s     13.67s   16.70s   3.2MB   46.26s    31.6s   0.280
old quad       1.12s    1.26s      4.89s    7.64s   487kb    9.62s   112.9s   1.000
new quad       1.12s    1.25s      2.95s    5.72s   427kb    7.80s    33.7s   0.298
Note: Faster compile time for both FFC and DOLFIN compared to tensor, and an
equivalent runtime performance.
(factor 3 better than the old quadrature)
BiharmonicDG_3D, 3rd order elements, N = 2,000
Description: Interior facet integrals, higher order derivatives.
            simplify  repres.  code gen.      FFC    size   DOLFIN      run    fac.
tensor         2.70s        *        ---      ---     ---      ---      ---     ---
old quad       2.70s    7.86s      60.5s    72.0s   2.9MB    70.2s    51.5s   1.000
new quad       2.65s    7.79s      28.7s    39.9s   2.4MB    36.8s    10.4s   0.202
* MemoryError during compute representation
Note: A factor of 2 speed-up at the code generation stage, and less
code as output. 2 times faster DOLFIN compile time and 5 times faster
at runtime.
DGSGPa, 3D linear elements, N = 20,000
Description: DG strain gradient plasticity form, among other crazy things
an 81 component tangent on linear discontinuous elements.
            simplify  repres.  code gen.      FFC    size   DOLFIN      run    fac.
tensor          199s        *        ---      ---     ---      ---      ---     ---
old quad        200s     768s      1485s    2462s    11MB     220s    34.7s   1.000
new quad        201s     763s       167s    1141s   9.0MB     114s    20.7s   0.628
* MemoryError during compute representation
Note: The FFC compile time has been reduced by a factor of 2; also note that
the code generation is now faster than simplifying the expression. It
might be possible to optimise the representation stage by cutting some
corners, but that is for later. The DOLFIN compile time is a factor of 2
faster, but unfortunately it did not have that big an impact on the
runtime performance.
CahnHilliard, Linear elements, N = 200,000
Description: Many functions.
            simplify  repres.  code gen.      FFC    size   DOLFIN      run    fac.
old quad a     6.88s    11.5s       640s      ---     ---      ---      ---     ---
old quad L     2.98s   237.1s      5571s    6470s   1.9MB    23.2s    72.1s   1.000
new quad a     6.61s    10.7s      2.30s      ---     ---      ---      ---     ---
new quad L     3.14s   229.7s      1.63s     258s   1.9MB    20.8s    1.50s   0.021
Note: I'll let the numbers on FFC compile time and runtime speak for
themselves.
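For readers who want the ratios spelled out, here is a small Python sketch using only the CahnHilliard numbers above. It assumes (this is an inference from the figures, not something stated in the table) that the FFC total on the L row covers both the a and L forms; the variable and function names are hypothetical.

```python
# Stage times in seconds (simplify, repres., code gen.) for forms a and L.
old_stages = {"a": (6.88, 11.5, 640.0), "L": (2.98, 237.1, 5571.0)}
new_stages = {"a": (6.61, 10.7, 2.30), "L": (3.14, 229.7, 1.63)}

# Reported FFC totals and runtimes in seconds, copied from the L rows.
reported_ffc = {"old": 6470.0, "new": 258.0}
run_seconds = {"old": 72.1, "new": 1.50}


def stage_total(forms):
    """Sum the three stage times over all forms."""
    return sum(sum(stages) for stages in forms.values())


old_total = stage_total(old_stages)
new_total = stage_total(new_stages)

# Share of the reported FFC total explained by the three stages (assumption:
# the reported total includes both forms).
old_share = old_total / reported_ffc["old"]
new_share = new_total / reported_ffc["new"]

ffc_speedup = reported_ffc["old"] / reported_ffc["new"]
run_speedup = run_seconds["old"] / run_seconds["new"]

if __name__ == "__main__":
    print(f"stages cover {old_share:.1%} (old) and {new_share:.1%} (new) of FFC time")
    print(f"FFC compile speed-up: {ffc_speedup:.0f}x, runtime speed-up: {run_speedup:.0f}x")
```

Under that assumption the stage times of both forms add up to nearly the whole reported FFC time, which suggests the total is indeed a combined figure.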
Kristian