ffc team mailing list archive

Thread
Date
Re: quadrature optimisations

To: ffc-dev@xxxxxxxxxx
From: Anders Logg <logg@xxxxxxxxx>
Date: Mon, 28 Apr 2008 22:55:30 +0200
Delivered-to: ffc-dev@xxxxxxxxxx
In-reply-to: <1209407153.481616b160b41@mech001.citg.tudelft.nl>
Mail-followup-to: ffc-dev@xxxxxxxxxx
User-agent: Mutt/1.5.17+20080114 (2008-01-14)
On Mon, Apr 28, 2008 at 08:25:53PM +0200, Kristian Oelgaard wrote:
> 
> 
> Hi,
> 
> I've pushed some optimisations for quadrature representation to the repository,
> there might be a mail notification on its way. Below are some benchmarks where I
> compare the performance of tensor representation to that of quadrature
> representation. Comparison is made between FFC compile times, size of generated
> code, DOLFIN compile time, and runtime.
> 
> I'm happy with a release anytime now although there's a lot of cleaning up to do
> in the quadrature code generation files but I might get around to that before
> the next release given that this will be in May/June sometime.
> 
> Kristian
> 
> 
> DOLFIN change-set 4105
> FFC    change-set 1119
> 
> The FFC and DOLFIN compile times were measured with the command
> $ time
> 
> The run-time was measured as the time spent on assembly of the global stiffness
> matrix once, i.e. time spent in the function assembleCells() in Assembler.cpp
> 
> 
> # Form: Elasticity
> element = VectorElement("Lagrange", "triangle", 1)
> 
> v = TestFunction(element)
> u = TrialFunction(element)
> 
> E  = 10.0
> nu = 0.3
> 
> mu    = E / (2*(1 + nu))
> lmbda = E*nu / ((1 + nu)*(1 - 2*nu))
> 
> def epsilon(v):
>     return 0.5*(grad(v) + transp(grad(v)))
> 
> def sigma(v):
>     return 2*mu*epsilon(v) + lmbda*mult(trace(epsilon(v)), Identity(len(v)))
> 
> a = dot(grad(v), sigma(u))*dx
> 
> # 2D elasticity, cases
> 1) 1st order elements, mesh(1000, 1000)
> 2) 2nd order elements, mesh(500, 500)
> 3) 4th order elements, mesh(125, 125)
> 4) 6th order elements, mesh(64, 64)
> 
> # Results
>                Tensor                         Quadrature
>    FFC    size   DOLFIN  run_t |   FFC   size    DOLFIN    run_q  run_q/run_t
> 1) 0.39s  136kb   2.34s  5.35s |  0.44s  134kb    2.35s     5.66s    1.05
> 2) 0.49s  191kb   2.66s  6.35s |  0.52s  176kb    2.66s     8.31s    1.23
> 3) 0.99s  600kb   6.66s  2.89s |  0.95s  406kb    4.10s     5.81s    2.01
> 4) 3.81s  1.8Mb  32.10s  2.93s |  3.52s  1.1Mb    9.36s     9.27s    3.16
> 
> 
> # 3D elasticity, cases
> 1) 1st order elements, mesh(40,40,40)
> 2) 2nd order elements, mesh(20,20,20)
> 3) 3rd order elements, mesh(10,10,10)
> 4) 4th order elements, mesh(5,5,5)
> 
> # Results
>                 Tensor                      Quadrature
>    FFC    size   DOLFIN  run_t |  FFC    size   DOLFIN  run_q   run_q/run_t
> 1)  0.97s 279kb  3.19s   5.35s |  1.09s  267kb   3.18s   4.37s    1.09
> 2)  1.61s 700kb  6.30s   6.35s |  1.58s  478kb   4.59s   9.09s    1.63
> 3)  5.86s 2.3Mb 28.93s   2.89s |  5.03s  1.2Mb  10.08s  10.87s    2.64
> 4) 15.49s 6.8Mb  3m30s*   ---  | 11.49s  2.9Mb  31.57s   9.17s    ---
> 
> * I killed the compile after 3m30s but if it had finished I would expect
>   the tensor code to perform better than quadrature
> 
> Form: Plasticity
> 
> element  = VectorElement("Lagrange", "triangle", 1)
> elementT = VectorQuadratureElement("triangle", 1, 9)
> 
> v = TestFunction(element)
> u = TrialFunction(element)
> t = Function(elementT) # Consistent tangent coefficients
> 
> # Strain vector
> def epsilon(u):
> 	return [u[0].dx(0), u[1].dx(1), u[0].dx(1) + u[1].dx(0)]
> 
> # Consistent tangent tensor
> def tangent(t):
>   return [[t[i*3 + j] for j in range(3)] for i in range(3)]
> 
> # Incremental stresses
> def dsigma(t, u):
>   return mult(tangent(t), epsilon(u))
> 
> a = dot(epsilon(v), dsigma(t, u) )*dx
> 
> 
> # 2D plasticity, cases
> 1) 1st order elements, mesh(1000, 1000)
> 2) 2nd order elements, mesh(500, 500)
> 3) 3rd order elements, mesh(250, 250)
> 4) 4th order elements*, mesh(125, 125)
> 
> *Note that because the bilinear form in this case is a 9th order form we take
> the number of quadrature points equal to 5 when declaring the quadrature element.
> 
> # Results
>                Tensor                                Quadrature
>     FFC    size    DOLFIN   run_t  |  FFC   size   DOLFIN    run_q   run_q/run_t
> 1)  0.65s   233kb   3.59s   6.57s  | 0.78s  230kb    3.52s     7.06s    1.07
> 2)  1.26s   497kb   6.24s  13.40s  | 1.36s  293kb    3.82s    12.70s    0.95
> 3)  4.26s   1.8Mb  46.74s  18.80s  | 2.25s  415kb    4.73s    14.15s    0.75
> 4) 26.15s  11.0Mb      *     ---   | 5.05s  687kb    6.68s    20.15s    ---
> 
> * I did not even try to compile this header file in DOLFIN
> 
> 
> # 3D plasticity, cases
> 1) 1st order elements, mesh(40,40,40)
> 2) 2nd order elements, mesh(20,20,20)
> 
> # Results
>          Tensor                          Quadrature
>    FFC     size   DOLFIN  run_t |  FFC     size   DOLFIN    run_q   run_q/run_t
> 1) 12.26s  778kb   8.81s  7.52s |  16.25s  673kb    7.92s    6.69s    0.89
> 2) 4m39s  11.0Mb   11m*    ---  |  1m17s   1.4Mb   12.44s   24.80s**  ---
> 
> * I ran out of memory after 11 minutes of compiling DOLFIN.
> ** For comparison, this assembly would take 768s with the previous quadrature
>    version, nearly a factor 30 speedup.
> 
> 
> Conclusions
> 
> I general quadrature representation results in less code and a shorter compile
> time for both FFC and DOLFIN. Complex forms (or forms with high element order)
> that cannot be compiled with tensor representation can possibly be compiled
> using the quadrature option. For forms with many coefficients, it even looks like
> quadrature is faster than tensor representation.

Very nice! I'd like to add that the actual speedups, both when tensor
representation is faster and when quadrature is faster, are much
larger than the factor 1-3 you see here. Since you measure the total
time including interaction with the mesh, computing the dof map,
inserting into the matrix (which is the same for both), you only see
part of the speedup. This means that for simple forms like 2D
elasticity, the speedup (for tabulate_tensor) may very well be a
factor 1000, but you only see a factor 3. Same goes for complicated
forms when quadrature is faster.

> For those who are interested, here's a rough sketch on how it works:
> 
> 
> P[ip][num_dofs]: tabulated value of basisfunctions (or derivatives)
> W[ip]:           weight at integration point
> J_**:            Jacobian terms (constant for each element)
> 
> Using 2nd order Lagrange tets for elasticity, some typical terms for the standard
> quadrature approach might look as follows:
> 
> // 8 integration points and 30 dofs
> for (unsigned int ip = 0; ip < 8; ip++)
>   for (unsigned int i = 0; i < 30; i++)
>     for (unsigned int j = 0; j < 30; j++)
>       A[j*30 + k] += P2[ip][j]*P0[ip][k]*W0[ip]*Jinv_00*Jinv_02*det\
>                    + P1[ip][j]*P0[ip][k]*W1[ip]*Jinv_00*Jinv_01*det
> 
> The number of operations to compute the above is roughly 8*30*30*12 = 86400
> 
> First we precompute the constant terms, much like the geometry terms for the
> tensor contraction (we do NOT tabulate coefficients w[.][.] ) Thus we have
> G0 = Jinv_00*Jinv_02*det
> G1 = Jinv_00*Jinv_01*det
> 
> Some of the tables with basisfunctions and quadrature are redundant, in this
> case let P2 = P1 and W0 = W1, which means that our loop reduces to:
> for (unsigned int ip = 0; ip < 8; ip++)
>   for (unsigned int i = 0; i < 30; i++)
>     for (unsigned int j = 0; j < 30; j++)
>       A[j*30 + k] += P1[ip][j]*P0[ip][k]*W0[ip]*(G0 + G1)
> 
> which is already a factor of 3 operations less operations.
> However, many of the tabulated values of the basis functions are zero. In the
> case of gradients of 2nd order Lagrange elements on a tetrahedron there is only
> 7 non-zero columns in each table. We can tabulate these columns at compile time
> e.g.,
> 
> nz0[7] = {0,1,2,3,4,5,6}
> nz1[7] = {10,11,12,13,14,15,16}
> 
> which allow us to reduce our loop even further:
> for (unsigned int ip = 0; ip < 8; ip++)
>   for (unsigned int i = 0; i < 7; i++)
>     for (unsigned int j = 0; j < 7; j++)
> A[ nz1[j]*30 + nz0[k] ] += P1[ip][ nz1[j] ]*P0[ip][ nz0[k] ]*W0[ip]*(G0 + G1)
> 
> The number of operations is now 8*7*7*4 = 1568, a factor of 55 lower compared to
> the initial value.
> Note: The actual elasticity form was reduced from 7,754,400 to 119,952
> operations.

Impressive. At some point, we might want to consider setting
quadrature as the default option in FFC, since it obviously handles
more complex forms. The tensor representation can be invoked with
a -O flag.

-- 
Anders
Follow ups

Re: quadrature optimisations
From: Garth N. Wells, 2008-04-29
References

quadrature optimisations
From: Kristian Oelgaard, 2008-04-28