ffc team mailing list archive

Thread
Date
quadrature optimisations

To: "ffc-dev@xxxxxxxxxx" <ffc-dev@xxxxxxxxxx>
From: Kristian Oelgaard <k.b.oelgaard@xxxxxxxxxx>
Date: Mon, 28 Apr 2008 20:25:53 +0200
Delivered-to: ffc-dev@xxxxxxxxxx
User-agent: Internet Messaging Program (IMP) 3.1

Hi,

I've pushed some optimisations for quadrature representation to the repository,
there might be a mail notification on its way. Below are some benchmarks where I
compare the performance of tensor representation to that of quadrature
representation. Comparison is made between FFC compile times, size of generated
code, DOLFIN compile time, and runtime.

I'm happy with a release anytime now although there's a lot of cleaning up to do
in the quadrature code generation files but I might get around to that before
the next release given that this will be in May/June sometime.

Kristian


DOLFIN change-set 4105
FFC    change-set 1119

The FFC and DOLFIN compile times were measured with the command
$ time

The run-time was measured as the time spent on assembly of the global stiffness
matrix once, i.e. time spent in the function assembleCells() in Assembler.cpp


# Form: Elasticity
element = VectorElement("Lagrange", "triangle", 1)

v = TestFunction(element)
u = TrialFunction(element)

E  = 10.0
nu = 0.3

mu    = E / (2*(1 + nu))
lmbda = E*nu / ((1 + nu)*(1 - 2*nu))

def epsilon(v):
    return 0.5*(grad(v) + transp(grad(v)))

def sigma(v):
    return 2*mu*epsilon(v) + lmbda*mult(trace(epsilon(v)), Identity(len(v)))

a = dot(grad(v), sigma(u))*dx

# 2D elasticity, cases
1) 1st order elements, mesh(1000, 1000)
2) 2nd order elements, mesh(500, 500)
3) 4th order elements, mesh(125, 125)
4) 6th order elements, mesh(64, 64)

# Results
               Tensor                         Quadrature
   FFC    size   DOLFIN  run_t |   FFC   size    DOLFIN    run_q  run_q/run_t
1) 0.39s  136kb   2.34s  5.35s |  0.44s  134kb    2.35s     5.66s    1.05
2) 0.49s  191kb   2.66s  6.35s |  0.52s  176kb    2.66s     8.31s    1.23
3) 0.99s  600kb   6.66s  2.89s |  0.95s  406kb    4.10s     5.81s    2.01
4) 3.81s  1.8Mb  32.10s  2.93s |  3.52s  1.1Mb    9.36s     9.27s    3.16


# 3D elasticity, cases
1) 1st order elements, mesh(40,40,40)
2) 2nd order elements, mesh(20,20,20)
3) 3rd order elements, mesh(10,10,10)
4) 4th order elements, mesh(5,5,5)

# Results
                Tensor                      Quadrature
   FFC    size   DOLFIN  run_t |  FFC    size   DOLFIN  run_q   run_q/run_t
1)  0.97s 279kb  3.19s   5.35s |  1.09s  267kb   3.18s   4.37s    1.09
2)  1.61s 700kb  6.30s   6.35s |  1.58s  478kb   4.59s   9.09s    1.63
3)  5.86s 2.3Mb 28.93s   2.89s |  5.03s  1.2Mb  10.08s  10.87s    2.64
4) 15.49s 6.8Mb  3m30s*   ---  | 11.49s  2.9Mb  31.57s   9.17s    ---

* I killed the compile after 3m30s but if it had finished I would expect
  the tensor code to perform better than quadrature

Form: Plasticity

element  = VectorElement("Lagrange", "triangle", 1)
elementT = VectorQuadratureElement("triangle", 1, 9)

v = TestFunction(element)
u = TrialFunction(element)
t = Function(elementT) # Consistent tangent coefficients

# Strain vector
def epsilon(u):
	return [u[0].dx(0), u[1].dx(1), u[0].dx(1) + u[1].dx(0)]

# Consistent tangent tensor
def tangent(t):
  return [[t[i*3 + j] for j in range(3)] for i in range(3)]

# Incremental stresses
def dsigma(t, u):
  return mult(tangent(t), epsilon(u))

a = dot(epsilon(v), dsigma(t, u) )*dx


# 2D plasticity, cases
1) 1st order elements, mesh(1000, 1000)
2) 2nd order elements, mesh(500, 500)
3) 3rd order elements, mesh(250, 250)
4) 4th order elements*, mesh(125, 125)

*Note that because the bilinear form in this case is a 9th order form we take
the number of quadrature points equal to 5 when declaring the quadrature element.

# Results
               Tensor                                Quadrature
    FFC    size    DOLFIN   run_t  |  FFC   size   DOLFIN    run_q   run_q/run_t
1)  0.65s   233kb   3.59s   6.57s  | 0.78s  230kb    3.52s     7.06s    1.07
2)  1.26s   497kb   6.24s  13.40s  | 1.36s  293kb    3.82s    12.70s    0.95
3)  4.26s   1.8Mb  46.74s  18.80s  | 2.25s  415kb    4.73s    14.15s    0.75
4) 26.15s  11.0Mb      *     ---   | 5.05s  687kb    6.68s    20.15s    ---

* I did not even try to compile this header file in DOLFIN


# 3D plasticity, cases
1) 1st order elements, mesh(40,40,40)
2) 2nd order elements, mesh(20,20,20)

# Results
         Tensor                          Quadrature
   FFC     size   DOLFIN  run_t |  FFC     size   DOLFIN    run_q   run_q/run_t
1) 12.26s  778kb   8.81s  7.52s |  16.25s  673kb    7.92s    6.69s    0.89
2) 4m39s  11.0Mb   11m*    ---  |  1m17s   1.4Mb   12.44s   24.80s**  ---

* I ran out of memory after 11 minutes of compiling DOLFIN.
** For comparison, this assembly would take 768s with the previous quadrature
   version, nearly a factor 30 speedup.


Conclusions

I general quadrature representation results in less code and a shorter compile
time for both FFC and DOLFIN. Complex forms (or forms with high element order)
that cannot be compiled with tensor representation can possibly be compiled
using the quadrature option. For forms with many coefficients, it even looks like
quadrature is faster than tensor representation.


For those who are interested, here's a rough sketch on how it works:


P[ip][num_dofs]: tabulated value of basisfunctions (or derivatives)
W[ip]:           weight at integration point
J_**:            Jacobian terms (constant for each element)

Using 2nd order Lagrange tets for elasticity, some typical terms for the standard
quadrature approach might look as follows:

// 8 integration points and 30 dofs
for (unsigned int ip = 0; ip < 8; ip++)
  for (unsigned int i = 0; i < 30; i++)
    for (unsigned int j = 0; j < 30; j++)
      A[j*30 + k] += P2[ip][j]*P0[ip][k]*W0[ip]*Jinv_00*Jinv_02*det\
                   + P1[ip][j]*P0[ip][k]*W1[ip]*Jinv_00*Jinv_01*det

The number of operations to compute the above is roughly 8*30*30*12 = 86400

First we precompute the constant terms, much like the geometry terms for the
tensor contraction (we do NOT tabulate coefficients w[.][.] ) Thus we have
G0 = Jinv_00*Jinv_02*det
G1 = Jinv_00*Jinv_01*det

Some of the tables with basisfunctions and quadrature are redundant, in this
case let P2 = P1 and W0 = W1, which means that our loop reduces to:
for (unsigned int ip = 0; ip < 8; ip++)
  for (unsigned int i = 0; i < 30; i++)
    for (unsigned int j = 0; j < 30; j++)
      A[j*30 + k] += P1[ip][j]*P0[ip][k]*W0[ip]*(G0 + G1)

which is already a factor of 3 operations less operations.
However, many of the tabulated values of the basis functions are zero. In the
case of gradients of 2nd order Lagrange elements on a tetrahedron there is only
7 non-zero columns in each table. We can tabulate these columns at compile time
e.g.,

nz0[7] = {0,1,2,3,4,5,6}
nz1[7] = {10,11,12,13,14,15,16}

which allow us to reduce our loop even further:
for (unsigned int ip = 0; ip < 8; ip++)
  for (unsigned int i = 0; i < 7; i++)
    for (unsigned int j = 0; j < 7; j++)
A[ nz1[j]*30 + nz0[k] ] += P1[ip][ nz1[j] ]*P0[ip][ nz0[k] ]*W0[ip]*(G0 + G1)

The number of operations is now 8*7*7*4 = 1568, a factor of 55 lower compared to
the initial value.
Note: The actual elasticity form was reduced from 7,754,400 to 119,952
operations.
Follow ups

Re: quadrature optimisations
From: Anders Logg, 2008-04-28
Re: quadrature optimisations
From: Anders Logg, 2008-04-28