ffc team mailing list archive
Message #00600
Re: FFC performance
The problem is likely in string and list concatenation, which tend to run in
O(n^2) rather than O(n) time in Python when built up incrementally.
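[Editor's note: a minimal sketch of the cost difference. Repeated += copies the whole accumulated string on every step, giving quadratic total work, while appending pieces to a list and joining once at the end is linear. (Modern CPython sometimes optimizes the += case in place, but the list-and-join form is the reliably linear one.)]

```python
def build_quadratic(pieces):
    # Repeated += copies the whole accumulated string each time:
    # total work is O(n^2) in the number of characters produced.
    s = ""
    for p in pieces:
        s += p
    return s

def build_linear(pieces):
    # Collect the pieces in a list and join once at the end:
    # total work is O(n).
    parts = []
    for p in pieces:
        parts.append(p)
    return "".join(parts)
```

Both functions return the same string; only the cost differs.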
---- Original message ----
>Date: Wed, 10 May 2006 13:18:12 +0200
>From: Anders Logg <logg@xxxxxxxxx>
>Subject: [FFC-dev] FFC performance
>To: Discussion of FFC development <ffc-dev@xxxxxxxxxx>
>
>On Mon, May 08, 2006 at 05:28:00PM +0200, Garth N. Wells wrote:
>> On Mon, 2006-05-08 at 15:21 +0200, Anders Logg wrote:
>> > On Mon, May 08, 2006 at 03:08:55PM +0200, Garth N. Wells wrote:
>> > > Hi Anders,
>> > >
>> > > We're using FFC here to compile some pretty complicated mixed
>> > > nonlinear problems, and the FFC compile time (around 1 hour) is
>> > > becoming an issue.
>> > > Are there any areas in the code that are ripe for some optimisations
>> > > (data structures, etc) that we could help with?
>> > >
>> > > Garth
>> >
>> > Yes.
>> >
>> > Check with -d0 how long it takes to compute the reference
>> > tensor(s). There might be several in your case. This should be fairly
>> > quick (I hope) since this has been optimized.
>> >
>> > I suspect the main issue may be the actual code generation going on in
>> > elementtensor.py, that is, generating the list of Declarations that
>> > are later written to file in dolfin.py. I haven't tried to optimize
>> > this part so it should be ripe for optimization.
>> >
>>
>> This looks like the bottleneck. The reference tensors were typically
>> computed in less than one second, with the longest being 12s. Below are
>> lines of output from the profiler, both without and with the "-f
>> blas" option. More than half the run time is spent in
>> __compute_element_tensor_default and dolfin.py when not using blas.
>>
>> Garth
>>
>> **** Without -f blas
>>
>> 47310360 function calls (47307129 primitive calls) in 463.410 CPU seconds
>>
>> Ordered by: internal time
>>
>> ncalls    tottime  percall  cumtime  percall  filename:lineno(function)
>> 2         132.800  66.400   302.460  151.230  elementtensor.py:131(__compute_element_tensor_default)
>> 7131790   121.390  0.000    141.110  0.000    dolfin.py:35(<lambda>)
>> 8         40.440   5.055    136.120  17.015   elementtensor.py:175(__check_used)
>> 8083430   38.260   0.000    38.260   0.000    :0(join)
>> 7114400   30.700   0.000    30.700   0.000    referencetensor.py:154(__call__)
>> 10675073  28.710   0.000    28.710   0.000    debug.py:14(debug)
>> 7114406   25.290   0.000    25.290   0.000    :0(abs)
>> 3557202   15.440   0.000    15.440   0.000    dolfin.py:47(<lambda>)
>> 857       5.420    0.006    5.420    0.006    :0(outer)
>>
>> **** With -f blas
>>
>> Generating XML output
>> Output written to CahnHilliard2D-0.xml
>> Output written to CahnHilliard2D-1.xml
>> 24012820 function calls (24009589 primitive calls) in 282.910 CPU seconds
>>
>> Ordered by: internal time
>>
>> ncalls         tottime  percall  cumtime  percall  filename:lineno(function)
>> 2              85.140   42.570   119.110  59.555   xml.py:70(__form)
>> 3574590        60.310   0.000    70.320   0.000    dolfin.py:35(<lambda>)
>> 8              40.760   5.095    138.210  17.276   elementtensor.py:175(__check_used)
>> 7303043        27.320   0.000    27.320   0.000    :0(join)
>> 3557239        17.140   0.000    17.140   0.000    :0(write)
>> 3557200        14.920   0.000    14.920   0.000    referencetensor.py:154(__call__)
>> 3557206        11.390   0.000    11.390   0.000    :0(abs)
>> 857            5.330    0.006    5.330    0.006    :0(outer)
>> 50720          4.170    0.000    8.440    0.000    geometrytensor.py:78(__call__)
>> 13             3.530    0.272    8.940    0.688    monomialintegration.py:208(__compute_product)
>> 133            2.950    0.022    5.130    0.039    multiindex.py:12(build_indices)
>> 763343/762858  1.990    0.000    2.000    0.000    :0(len)
>
>Yes, it does seem that the code in question is the actual code
>generation and not the computation. This could "easily" be optimized
>by anyone with some Python experience, without necessarily knowing how
>FFC works. It's just a matter of generating a long list of strings
>from an existing data structure. It would be great if someone from
>your group would be willing to look at it. I have done absolutely no
>optimization of this part of the code, so there should be
>opportunities for some big speedups.
>
>Here's a simple sketch of the most time-consuming piece of code, that
>is, __compute_element_tensor_default in elementtensor.py:
>
> for i in iindices:
> for a in aindices:
> a0 = A0(i, a)
> gk = format.format["geometry tensor"](j, a)
> name = format.format["element tensor"](i, k)
> value = format.format["sum"]([value, \
> format.format["multiplication"]([ \
> format.format["floating point"](a0), gk])])
> declarations += [Declaration(name, value)]
>
>This iterates over a list of indices (iindices) and for each i
>generates a line of code that is a sum of products. The coefficients
>for the products are in the reference tensor A0 and the names are
>generated by looking in the format dictionary which is stored in
>dolfin.py. This is why dolfin.py shows up in the profiling.
>
>One obvious optimization is to move the generation of the variable
>"name" outside of the loop over a. Another would be to fetch the
>formats from the dictionary before the loop, i.e., do something like
>
> format_sum = format.format["sum"]
>
>before the loop and then use format_sum etc.
>
>If someone wants to take a look and needs more directions or
>clarifications, just ask.
>
>/Anders
>
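[Editor's note: putting the suggestions above together, a hedged sketch of the optimized loop. The formats dictionary, Declaration class, reference tensor and index lists below are simplified stand-ins for the real FFC objects, not the actual FFC API; the point is hoisting the dictionary lookups out of the loops, generating "name" once per i, and collecting the terms in a list to join once rather than concatenating incrementally.]

```python
class Declaration:
    """Stand-in for FFC's Declaration: a named code snippet."""
    def __init__(self, name, value):
        self.name = name
        self.value = value

# Stand-in for the format dictionary stored in dolfin.py.
formats = {
    "geometry tensor": lambda i, a: "G%d_%d" % (i, a),
    "element tensor":  lambda i: "A[%d]" % i,
    "floating point":  lambda x: "%.15g" % x,
    "multiplication":  lambda factors: "*".join(factors),
    "sum":             lambda terms: " + ".join(terms),
}

def compute_element_tensor(A0, iindices, aindices):
    # Hoist the dictionary lookups out of the loops (done once,
    # not once per (i, a) pair).
    format_gk = formats["geometry tensor"]
    format_name = formats["element tensor"]
    format_float = formats["floating point"]
    format_mult = formats["multiplication"]
    format_sum = formats["sum"]

    declarations = []
    for i in iindices:
        # Generate the name once per i, outside the loop over a.
        name = format_name(i)
        # Collect the terms in a list and join once: O(n), not O(n^2).
        terms = [format_mult([format_float(A0[i][a]), format_gk(i, a)])
                 for a in aindices]
        declarations.append(Declaration(name, format_sum(terms)))
    return declarations
```

Example: for A0 = {0: {0: 1.0, 1: 2.0}}, iindices = [0], aindices = [0, 1], this generates one declaration with name "A[0]" and value "1*G0_0 + 2*G0_1".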
>_______________________________________________
>FFC-dev mailing list
>FFC-dev@xxxxxxxxxx
>http://www.fenics.org/cgi-bin/mailman/listinfo/ffc-dev