ffc team mailing list archive
-
ffc team
-
Mailing list archive
-
Message #01574
Re: quadrature optimisations
On Mon, Apr 28, 2008 at 08:25:53PM +0200, Kristian Oelgaard wrote:
>
>
> Hi,
>
> I've pushed some optimisations for quadrature representation to the repository,
> there might be a mail notification on its way. Below are some benchmarks where I
> compare the performance of tensor representation to that of quadrature
> representation. Comparison is made between FFC compile times, size of generated
> code, DOLFIN compile time, and runtime.
>
> I'm happy with a release anytime now although there's a lot of cleaning up to do
> in the quadrature code generation files but I might get around to that before
> the next release given that this will be in May/June sometime.
>
> Kristian
>
>
> DOLFIN change-set 4105
> FFC change-set 1119
>
> The FFC and DOLFIN compile times were measured with the command
> $ time
>
> The run-time was measured as the time spent on assembly of the global stiffness
> matrix once, i.e. time spent in the function assembleCells() in Assembler.cpp
>
>
> # Form: Elasticity
> element = VectorElement("Lagrange", "triangle", 1)
>
> v = TestFunction(element)
> u = TrialFunction(element)
>
> E = 10.0
> nu = 0.3
>
> mu = E / (2*(1 + nu))
> lmbda = E*nu / ((1 + nu)*(1 - 2*nu))
>
> def epsilon(v):
> return 0.5*(grad(v) + transp(grad(v)))
>
> def sigma(v):
> return 2*mu*epsilon(v) + lmbda*mult(trace(epsilon(v)), Identity(len(v)))
>
> a = dot(grad(v), sigma(u))*dx
>
> # 2D elasticity, cases
> 1) 1st order elements, mesh(1000, 1000)
> 2) 2nd order elements, mesh(500, 500)
> 3) 4th order elements, mesh(125, 125)
> 4) 6th order elements, mesh(64, 64)
>
> # Results
> Tensor Quadrature
> FFC size DOLFIN run_t | FFC size DOLFIN run_q run_q/run_t
> 1) 0.39s 136kb 2.34s 5.35s | 0.44s 134kb 2.35s 5.66s 1.05
> 2) 0.49s 191kb 2.66s 6.35s | 0.52s 176kb 2.66s 8.31s 1.23
> 3) 0.99s 600kb 6.66s 2.89s | 0.95s 406kb 4.10s 5.81s 2.01
> 4) 3.81s 1.8Mb 32.10s 2.93s | 3.52s 1.1Mb 9.36s 9.27s 3.16
>
>
> # 3D elasticity, cases
> 1) 1st order elements, mesh(40,40,40)
> 2) 2nd order elements, mesh(20,20,20)
> 3) 3rd order elements, mesh(10,10,10)
> 4) 4th order elements, mesh(5,5,5)
>
> # Results
> Tensor Quadrature
> FFC size DOLFIN run_t | FFC size DOLFIN run_q run_q/run_t
> 1) 0.97s 279kb 3.19s 5.35s | 1.09s 267kb 3.18s 4.37s 1.09
> 2) 1.61s 700kb 6.30s 6.35s | 1.58s 478kb 4.59s 9.09s 1.63
> 3) 5.86s 2.3Mb 28.93s 2.89s | 5.03s 1.2Mb 10.08s 10.87s 2.64
> 4) 15.49s 6.8Mb 3m30s* --- | 11.49s 2.9Mb 31.57s 9.17s ---
>
> * I killed the compile after 3m30s but if it had finished I would expect
> the tensor code to perform better than quadrature
>
> Form: Plasticity
>
> element = VectorElement("Lagrange", "triangle", 1)
> elementT = VectorQuadratureElement("triangle", 1, 9)
>
> v = TestFunction(element)
> u = TrialFunction(element)
> t = Function(elementT) # Consistent tangent coefficients
>
> # Strain vector
> def epsilon(u):
> return [u[0].dx(0), u[1].dx(1), u[0].dx(1) + u[1].dx(0)]
>
> # Consistent tangent tensor
> def tangent(t):
> return [[t[i*3 + j] for j in range(3)] for i in range(3)]
>
> # Incremental stresses
> def dsigma(t, u):
> return mult(tangent(t), epsilon(u))
>
> a = dot(epsilon(v), dsigma(t, u) )*dx
>
>
> # 2D plasticity, cases
> 1) 1st order elements, mesh(1000, 1000)
> 2) 2nd order elements, mesh(500, 500)
> 3) 3rd order elements, mesh(250, 250)
> 4) 4th order elements*, mesh(125, 125)
>
> *Note that because the bilinear form in this case is a 9th order form we take
> the number of quadrature points equal to 5 when declaring the quadrature element.
>
> # Results
> Tensor Quadrature
> FFC size DOLFIN run_t | FFC size DOLFIN run_q run_q/run_t
> 1) 0.65s 233kb 3.59s 6.57s | 0.78s 230kb 3.52s 7.06s 1.07
> 2) 1.26s 497kb 6.24s 13.40s | 1.36s 293kb 3.82s 12.70s 0.95
> 3) 4.26s 1.8Mb 46.74s 18.80s | 2.25s 415kb 4.73s 14.15s 0.75
> 4) 26.15s 11.0Mb * --- | 5.05s 687kb 6.68s 20.15s ---
>
> * I did not even try to compile this header file in DOLFIN
>
>
> # 3D plasticity, cases
> 1) 1st order elements, mesh(40,40,40)
> 2) 2nd order elements, mesh(20,20,20)
>
> # Results
> Tensor Quadrature
> FFC size DOLFIN run_t | FFC size DOLFIN run_q run_q/run_t
> 1) 12.26s 778kb 8.81s 7.52s | 16.25s 673kb 7.92s 6.69s 0.89
> 2) 4m39s 11.0Mb 11m* --- | 1m17s 1.4Mb 12.44s 24.80s** ---
>
> * I ran out of memory after 11 minutes of compiling DOLFIN.
> ** For comparison, this assembly would take 768s with the previous quadrature
> version, nearly a factor 30 speedup.
>
>
> Conclusions
>
> I general quadrature representation results in less code and a shorter compile
> time for both FFC and DOLFIN. Complex forms (or forms with high element order)
> that cannot be compiled with tensor representation can possibly be compiled
> using the quadrature option. For forms with many coefficients, it even looks like
> quadrature is faster than tensor representation.
Very nice! I'd like to add that the actual speedups, both when tensor
representation is faster and when quadrature is faster, are much
larger than the factor 1-3 you see here. Since you measure the total
time including interaction with the mesh, computing the dof map,
inserting into the matrix (which is the same for both), you only see
part of the speedup. This means that for simple forms like 2D
elasticity, the speedup (for tabulate_tensor) may very well be a
factor 1000, but you only see a factor 3. Same goes for complicated
forms when quadrature is faster.
> For those who are interested, here's a rough sketch on how it works:
>
>
> P[ip][num_dofs]: tabulated value of basisfunctions (or derivatives)
> W[ip]: weight at integration point
> J_**: Jacobian terms (constant for each element)
>
> Using 2nd order Lagrange tets for elasticity, some typical terms for the standard
> quadrature approach might look as follows:
>
> // 8 integration points and 30 dofs
> for (unsigned int ip = 0; ip < 8; ip++)
> for (unsigned int i = 0; i < 30; i++)
> for (unsigned int j = 0; j < 30; j++)
> A[j*30 + k] += P2[ip][j]*P0[ip][k]*W0[ip]*Jinv_00*Jinv_02*det\
> + P1[ip][j]*P0[ip][k]*W1[ip]*Jinv_00*Jinv_01*det
>
> The number of operations to compute the above is roughly 8*30*30*12 = 86400
>
> First we precompute the constant terms, much like the geometry terms for the
> tensor contraction (we do NOT tabulate coefficients w[.][.] ) Thus we have
> G0 = Jinv_00*Jinv_02*det
> G1 = Jinv_00*Jinv_01*det
>
> Some of the tables with basisfunctions and quadrature are redundant, in this
> case let P2 = P1 and W0 = W1, which means that our loop reduces to:
> for (unsigned int ip = 0; ip < 8; ip++)
> for (unsigned int i = 0; i < 30; i++)
> for (unsigned int j = 0; j < 30; j++)
> A[j*30 + k] += P1[ip][j]*P0[ip][k]*W0[ip]*(G0 + G1)
>
> which is already a factor of 3 operations less operations.
> However, many of the tabulated values of the basis functions are zero. In the
> case of gradients of 2nd order Lagrange elements on a tetrahedron there is only
> 7 non-zero columns in each table. We can tabulate these columns at compile time
> e.g.,
>
> nz0[7] = {0,1,2,3,4,5,6}
> nz1[7] = {10,11,12,13,14,15,16}
>
> which allow us to reduce our loop even further:
> for (unsigned int ip = 0; ip < 8; ip++)
> for (unsigned int i = 0; i < 7; i++)
> for (unsigned int j = 0; j < 7; j++)
> A[ nz1[j]*30 + nz0[k] ] += P1[ip][ nz1[j] ]*P0[ip][ nz0[k] ]*W0[ip]*(G0 + G1)
>
> The number of operations is now 8*7*7*4 = 1568, a factor of 55 lower compared to
> the initial value.
> Note: The actual elasticity form was reduced from 7,754,400 to 119,952
> operations.
Impressive. At some point, we might want to consider setting
quadrature as the default option in FFC, since it obviously handles
more complex forms. The tensor representation can be invoked with
a -O flag.
--
Anders
Follow ups
References