Subject: FFC performance
To: Discussion of FFC development <ffc-dev@xxxxxxxxxx>
From: Anders Logg <logg@xxxxxxxxx>
Date: Wed, 10 May 2006 13:18:12 +0200
Mail-followup-to: Discussion of FFC development <ffc-dev@xxxxxxxxxx>
Sender: Anders Logg <logg@gwaihir>
User-agent: Mutt/1.5.11
On Mon, May 08, 2006 at 05:28:00PM +0200, Garth N. Wells wrote:
> On Mon, 2006-05-08 at 15:21 +0200, Anders Logg wrote:
> > On Mon, May 08, 2006 at 03:08:55PM +0200, Garth N. Wells wrote:
> > > Hi Anders,
> > >
> > > We're using FFC here to compile some pretty complicated mixed nonlinear
> > > problems, and the FFC compile time (around 1 hour) is becoming an issue.
> > > Are there any areas in the code that are ripe for some optimisations
> > > (data structures, etc) that we could help with?
> > >
> > > Garth
> >
> > Yes.
> >
> > Check with -d0 how long it takes to compute the reference
> > tensor(s). There might be several in your case. This should be fairly
> > quick (I hope) since this has been optimized.
> >
> > I suspect the main issue may be the actual code generation going on in
> > elementtensor.py, that is, generating the list of Declarations that
> > are later written to file in dolfin.py. I haven't tried to optimize
> > this part so it should be ripe for optimization.
> >
>
> This looks like the bottleneck. The reference tensors were typically
> computed in less than one second, with the longest taking 12s. Below
> is profiler output for runs both without and with the "-f blas"
> option. More than half the run time is spent in
> __compute_element_tensor_default and dolfin.py when not using blas.
>
> Garth
>
> **** Without -f blas
>
> 47310360 function calls (47307129 primitive calls) in 463.410 CPU seconds
>
> Ordered by: internal time
>
>    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
>         2  132.800   66.400  302.460  151.230  elementtensor.py:131(__compute_element_tensor_default)
>   7131790  121.390    0.000  141.110    0.000  dolfin.py:35(<lambda>)
>         8   40.440    5.055  136.120   17.015  elementtensor.py:175(__check_used)
>   8083430   38.260    0.000   38.260    0.000  :0(join)
>   7114400   30.700    0.000   30.700    0.000  referencetensor.py:154(__call__)
>  10675073   28.710    0.000   28.710    0.000  debug.py:14(debug)
>   7114406   25.290    0.000   25.290    0.000  :0(abs)
>   3557202   15.440    0.000   15.440    0.000  dolfin.py:47(<lambda>)
>       857    5.420    0.006    5.420    0.006  :0(outer)
>
> **** With -f blas
>
> Generating XML output
> Output written to CahnHilliard2D-0.xml
> Output written to CahnHilliard2D-1.xml
> 24012820 function calls (24009589 primitive calls) in 282.910 CPU seconds
>
> Ordered by: internal time
>
>          ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
>               2   85.140   42.570  119.110   59.555  xml.py:70(__form)
>         3574590   60.310    0.000   70.320    0.000  dolfin.py:35(<lambda>)
>               8   40.760    5.095  138.210   17.276  elementtensor.py:175(__check_used)
>         7303043   27.320    0.000   27.320    0.000  :0(join)
>         3557239   17.140    0.000   17.140    0.000  :0(write)
>         3557200   14.920    0.000   14.920    0.000  referencetensor.py:154(__call__)
>         3557206   11.390    0.000   11.390    0.000  :0(abs)
>             857    5.330    0.006    5.330    0.006  :0(outer)
>           50720    4.170    0.000    8.440    0.000  geometrytensor.py:78(__call__)
>              13    3.530    0.272    8.940    0.688  monomialintegration.py:208(__compute_product)
>             133    2.950    0.022    5.130    0.039  multiindex.py:12(build_indices)
>   763343/762858    1.990    0.000    2.000    0.000  :0(len)
Yes, it does seem that the culprit is the actual code generation and
not the computation. This could "easily" be optimized by anyone with
some Python experience, without necessarily knowing how FFC works:
it's just a matter of generating a long list of strings from an
existing data structure. It would be great if someone from your group
were willing to look at it. I have done absolutely no optimization of
this part of the code, so there should be opportunity for some big
speedups.
Here's a simple sketch of the most time-consuming piece of code, that
is, __compute_element_tensor_default in elementtensor.py:
# Note: j, k, declarations and the initial "value" come from the
# surrounding code in __compute_element_tensor_default (omitted here).
for i in iindices:
    for a in aindices:
        a0 = A0(i, a)
        gk = format.format["geometry tensor"](j, a)
        name = format.format["element tensor"](i, k)
        value = format.format["sum"]([value,
            format.format["multiplication"]([
                format.format["floating point"](a0), gk])])
    declarations += [Declaration(name, value)]
This iterates over a list of indices (iindices) and, for each i,
generates a line of code that is a sum of products. The coefficients
for the products come from the reference tensor A0, and the names are
generated by looking them up in the format dictionary, which is stored
in dolfin.py. This is why dolfin.py shows up in the profiling.
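Each entry in that dictionary is a small string-building function
(typically a lambda), and "sum" and "multiplication" join their
arguments, which is why dolfin.py:35(<lambda>) and :0(join) rack up
millions of calls in the profiles above. As a purely hypothetical
sketch of such a dictionary (reached as format.format in the code
above; the entry names follow that code, but the strings produced
below are made up, not FFC's actual output):

# Hypothetical sketch of a dolfin.py-style format dictionary; the
# real table and the generated names differ, this only illustrates
# the mechanism. i, k, j, a are taken to be plain integers here.
format = {
    "element tensor": lambda i, k: "A[%d]" % k,         # made-up name
    "geometry tensor": lambda j, a: "G%d_%d" % (j, a),  # made-up name
    "floating point": lambda a0: "%.15e" % a0,
    "sum": lambda terms: " + ".join(terms),
    "multiplication": lambda factors: "*".join(factors),
}

Every pass through the inner loop calls several of these functions for
a single (i, a) pair, so call counts in the millions are expected.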
One obvious optimization is to move the generation of the variable
"name" outside of the loop over a. Another would be to fetch the
formats from the dictionary once before the loops, that is, do
something like

format_sum = format.format["sum"]

before the loop and then call format_sum etc. inside it.
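Combining both suggestions, a sketch of what the loop might then look
like (untested, assuming the same surrounding context as the sketch
above):

# Both optimizations applied (untested sketch): the format lookups
# are hoisted out of the loops, and "name" is generated once per i
# instead of once per (i, a) pair.
format_sum = format.format["sum"]
format_multiplication = format.format["multiplication"]
format_floating_point = format.format["floating point"]
format_geometry_tensor = format.format["geometry tensor"]
format_element_tensor = format.format["element tensor"]

for i in iindices:
    name = format_element_tensor(i, k)
    for a in aindices:
        a0 = A0(i, a)
        gk = format_geometry_tensor(j, a)
        value = format_sum([value,
            format_multiplication([format_floating_point(a0), gk])])
    declarations += [Declaration(name, value)]

This avoids millions of repeated dictionary lookups (and the needless
regeneration of "name") without changing the generated output.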
If someone wants to take a look and needs more directions or
clarifications, just ask.
/Anders