Subject: FFC performance
To: Discussion of FFC development <ffc-dev@xxxxxxxxxx>
From: Anders Logg <logg@xxxxxxxxx>
Date: Wed, 10 May 2006 13:18:12 +0200
Mail-followup-to: Discussion of FFC development <ffc-dev@xxxxxxxxxx>
Sender: Anders Logg <logg@gwaihir>
User-agent: Mutt/1.5.11
On Mon, May 08, 2006 at 05:28:00PM +0200, Garth N. Wells wrote:
> On Mon, 2006-05-08 at 15:21 +0200, Anders Logg wrote:
> > On Mon, May 08, 2006 at 03:08:55PM +0200, Garth N. Wells wrote:
> > > Hi Anders,
> > >
> > > We're using FFC here to compile some pretty complicated mixed nonlinear
> > > problems, and the FFC compile time (around 1 hour) is becoming an issue.
> > > Are there any areas in the code that are ripe for some optimisations
> > > (data structures, etc) that we could help with?
> > >
> > > Garth
> >
> > Yes.
> >
> > Check with -d0 how long it takes to compute the reference
> > tensor(s). There might be several in your case. This should be fairly
> > quick (I hope) since this has been optimized.
> >
> > I suspect the main issue may be the actual code generation going on in
> > elementtensor.py, that is, generating the list of Declarations that
> > are later written to file in dolfin.py. I haven't tried to optimize
> > this part so it should be ripe for optimization.
> >
>
> This looks like the bottleneck. The reference tensors were typically
> computed in less than one second, with the longest taking 12s. Below
> is profiler output for runs both without and with the "-f blas"
> option. More than half the run time is spent in
> __compute_element_tensor_default and dolfin.py when not using blas.
>
> Garth
>
> **** Without -f blas
>
> 47310360 function calls (47307129 primitive calls) in 463.410 CPU seconds
>
> Ordered by: internal time
>
>    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
>         2  132.800   66.400  302.460  151.230  elementtensor.py:131(__compute_element_tensor_default)
>   7131790  121.390    0.000  141.110    0.000  dolfin.py:35(<lambda>)
>         8   40.440    5.055  136.120   17.015  elementtensor.py:175(__check_used)
>   8083430   38.260    0.000   38.260    0.000  :0(join)
>   7114400   30.700    0.000   30.700    0.000  referencetensor.py:154(__call__)
>  10675073   28.710    0.000   28.710    0.000  debug.py:14(debug)
>   7114406   25.290    0.000   25.290    0.000  :0(abs)
>   3557202   15.440    0.000   15.440    0.000  dolfin.py:47(<lambda>)
>       857    5.420    0.006    5.420    0.006  :0(outer)
>
> **** With -f blas
>
> Generating XML output
> Output written to CahnHilliard2D-0.xml
> Output written to CahnHilliard2D-1.xml
> 24012820 function calls (24009589 primitive calls) in 282.910 CPU seconds
>
> Ordered by: internal time
>
>          ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
>               2   85.140   42.570  119.110   59.555  xml.py:70(__form)
>         3574590   60.310    0.000   70.320    0.000  dolfin.py:35(<lambda>)
>               8   40.760    5.095  138.210   17.276  elementtensor.py:175(__check_used)
>         7303043   27.320    0.000   27.320    0.000  :0(join)
>         3557239   17.140    0.000   17.140    0.000  :0(write)
>         3557200   14.920    0.000   14.920    0.000  referencetensor.py:154(__call__)
>         3557206   11.390    0.000   11.390    0.000  :0(abs)
>             857    5.330    0.006    5.330    0.006  :0(outer)
>           50720    4.170    0.000    8.440    0.000  geometrytensor.py:78(__call__)
>              13    3.530    0.272    8.940    0.688  monomialintegration.py:208(__compute_product)
>             133    2.950    0.022    5.130    0.039  multiindex.py:12(build_indices)
>   763343/762858    1.990    0.000    2.000    0.000  :0(len)
Yes, it does seem that the culprit is the actual code generation and
not the computation. This could "easily" be optimized by anyone with
some Python experience, without necessarily knowing how FFC works:
it's just a matter of generating a long list of strings from an
existing data structure. It would be great if someone from your group
were willing to look at it. I have done absolutely no optimization of
this part of the code, so there should be opportunity for some big
speedups.
Here's a simple sketch of the most time-consuming piece of code, that
is, __compute_element_tensor_default in elementtensor.py:
# Note: j, k, declarations and the initial "value" come from the
# surrounding code in __compute_element_tensor_default (omitted here).
for i in iindices:
    for a in aindices:
        a0 = A0(i, a)
        gk = format.format["geometry tensor"](j, a)
        name = format.format["element tensor"](i, k)
        value = format.format["sum"]([value,
            format.format["multiplication"]([
                format.format["floating point"](a0), gk])])
    declarations += [Declaration(name, value)]
This iterates over a list of indices (iindices) and, for each i,
generates a line of code that is a sum of products. The coefficients
for the products come from the reference tensor A0, and the names are
generated by looking them up in the format dictionary, which is stored
in dolfin.py. This is why dolfin.py shows up in the profiling.
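Each entry in that dictionary is a small string-building function
(typically a lambda), and "sum" and "multiplication" join their
arguments, which is why dolfin.py:35(<lambda>) and :0(join) rack up
millions of calls in the profiles above. As a purely hypothetical
sketch of such a dictionary (reached as format.format in the code
above; the entry names follow that code, but the strings produced
below are made up, not FFC's actual output):

# Hypothetical sketch of a dolfin.py-style format dictionary; the
# real table and the generated names differ, this only illustrates
# the mechanism. i, k, j, a are taken to be plain integers here.
format = {
    "element tensor": lambda i, k: "A[%d]" % k,         # made-up name
    "geometry tensor": lambda j, a: "G%d_%d" % (j, a),  # made-up name
    "floating point": lambda a0: "%.15e" % a0,
    "sum": lambda terms: " + ".join(terms),
    "multiplication": lambda factors: "*".join(factors),
}

Every pass through the inner loop calls several of these functions for
a single (i, a) pair, so call counts in the millions are expected.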
One obvious optimization is to move the generation of the variable
"name" outside of the loop over a. Another would be to fetch the
formats from the dictionary once before the loops, that is, do
something like

format_sum = format.format["sum"]

before the loop and then call format_sum etc. inside it.
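Combining both suggestions, a sketch of what the loop might then look
like (untested, assuming the same surrounding context as the sketch
above):

# Both optimizations applied (untested sketch): the format lookups
# are hoisted out of the loops, and "name" is generated once per i
# instead of once per (i, a) pair.
format_sum = format.format["sum"]
format_multiplication = format.format["multiplication"]
format_floating_point = format.format["floating point"]
format_geometry_tensor = format.format["geometry tensor"]
format_element_tensor = format.format["element tensor"]

for i in iindices:
    name = format_element_tensor(i, k)
    for a in aindices:
        a0 = A0(i, a)
        gk = format_geometry_tensor(j, a)
        value = format_sum([value,
            format_multiplication([format_floating_point(a0), gk])])
    declarations += [Declaration(name, value)]

This avoids millions of repeated dictionary lookups (and the needless
regeneration of "name") without changing the generated output.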
If someone wants to take a look and needs more directions or
clarifications, just ask.
/Anders