Re: FFC performance

Likely the problem is in string and list concatenation: building a result by 
repeated concatenation tends to run in O(n^2) rather than O(n) time in Python, 
since each concatenation copies everything accumulated so far.
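For example, building the generated code with repeated += copies everything
accumulated so far on each step, whereas collecting the pieces in a list and
joining once at the end stays linear. A minimal sketch (with declarations
standing in for the list built in elementtensor.py):

    # Quadratic: each += copies the entire string built so far,
    # so n declarations cost O(n^2) character copies in total.
    code = ""
    for declaration in declarations:
        code += str(declaration) + "\n"

    # Linear: collect the pieces in a list and join once at the end.
    pieces = [str(declaration) + "\n" for declaration in declarations]
    code = "".join(pieces)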

---- Original message ----
>Date: Wed, 10 May 2006 13:18:12 +0200
>From: Anders Logg <logg@xxxxxxxxx>  
>Subject: [FFC-dev] FFC performance  
>To: Discussion of FFC development <ffc-dev@xxxxxxxxxx>
>
>On Mon, May 08, 2006 at 05:28:00PM +0200, Garth N. Wells wrote:
>> On Mon, 2006-05-08 at 15:21 +0200, Anders Logg wrote:
>> > On Mon, May 08, 2006 at 03:08:55PM +0200, Garth N. Wells wrote:
>> > > Hi Anders,
>> > > 
>> > > We're using FFC here to compile some pretty complicated mixed
>> > > nonlinear problems, and the FFC compile time (around 1 hour) is
>> > > becoming an issue.
>> > > Are there any areas in the code that are ripe for some optimisations
>> > > (data structures, etc) that we could help with?  
>> > > 
>> > > Garth
>> > 
>> > Yes.
>> > 
>> > Check with -d0 how long it takes to compute the reference
>> > tensor(s). There might be several in your case. This should be fairly
>> > quick (I hope) since this has been optimized.
>> > 
>> > I suspect the main issue may be the actual code generation going on in
>> > elementtensor.py, that is, generating the list of Declarations that
>> > are later written to file in dolfin.py. I haven't tried to optimize
>> > this part so it should be ripe for optimization.
>> > 
>> 
>> This looks like the bottleneck. The reference tensors were typically
>> computed in less than one second, with the longest being 12s. Below are
>> lines of output from the profiler, both without and with the "-f
>> blas" option. More than half the run time is spent in
>> __compute_element_tensor_default and dolfin.py when not using blas.
>> 
>> Garth
>> 
>> **** Without -f blas
>> 
>>          47310360 function calls (47307129 primitive calls) in 463.410 CPU seconds
>> 
>>    Ordered by: internal time
>> 
>>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>>         2  132.800   66.400  302.460  151.230 elementtensor.py:131(__compute_element_tensor_default)
>>   7131790  121.390    0.000  141.110    0.000 dolfin.py:35(<lambda>)
>>         8   40.440    5.055  136.120   17.015 elementtensor.py:175(__check_used)
>>   8083430   38.260    0.000   38.260    0.000 :0(join)
>>   7114400   30.700    0.000   30.700    0.000 referencetensor.py:154(__call__)
>>  10675073   28.710    0.000   28.710    0.000 debug.py:14(debug)
>>   7114406   25.290    0.000   25.290    0.000 :0(abs)
>>   3557202   15.440    0.000   15.440    0.000 dolfin.py:47(<lambda>)
>>       857    5.420    0.006    5.420    0.006 :0(outer)
>> 
>> **** With -f blas
>> 
>> Generating XML output
>> Output written to CahnHilliard2D-0.xml
>> Output written to CahnHilliard2D-1.xml
>> 
>>          24012820 function calls (24009589 primitive calls) in 282.910 CPU seconds
>> 
>>    Ordered by: internal time
>> 
>>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>>         2   85.140   42.570  119.110   59.555 xml.py:70(__form)
>>   3574590   60.310    0.000   70.320    0.000 dolfin.py:35(<lambda>)
>>         8   40.760    5.095  138.210   17.276 elementtensor.py:175(__check_used)
>>   7303043   27.320    0.000   27.320    0.000 :0(join)
>>   3557239   17.140    0.000   17.140    0.000 :0(write)
>>   3557200   14.920    0.000   14.920    0.000 referencetensor.py:154(__call__)
>>   3557206   11.390    0.000   11.390    0.000 :0(abs)
>>       857    5.330    0.006    5.330    0.006 :0(outer)
>>     50720    4.170    0.000    8.440    0.000 geometrytensor.py:78(__call__)
>>        13    3.530    0.272    8.940    0.688 monomialintegration.py:208(__compute_product)
>>       133    2.950    0.022    5.130    0.039 multiindex.py:12(build_indices)
>> 763343/762858    1.990    0.000    2.000    0.000 :0(len)
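Profiles like the two above can be generated with Python's standard profile
and pstats modules; a minimal sketch, with compile_forms() as a hypothetical
stand-in for the actual FFC entry point:

    import profile
    import pstats

    # compile_forms() is a hypothetical stand-in for the FFC entry point.
    profile.run("compile_forms()", "ffc.prof")

    # Sort by internal time ("tottime"), as in the tables above.
    stats = pstats.Stats("ffc.prof")
    stats.sort_stats("time").print_stats(12)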
>
>Yes, it does seem that the code in question is the actual code
>generation and not the computation. This could "easily" be optimized
>by anyone with some Python experience, without necessarily knowing how
>FFC works. It's just a matter of generating a long list of strings
>from an existing data structure. It would be great if someone from
>your group would be willing to look at it. I have done absolutely no
>optimization of this part of the code, so there should be opportunity
>for some big speedups.
>
>Here's a simple sketch of the most time-consuming piece of code, that
>is, __compute_element_tensor_default in elementtensor.py:
>
>  for i in iindices:
>      value = format.format["floating point"](0.0)  # start from zero, then accumulate
>      for a in aindices:
>          a0 = A0(i, a)
>          gk = format.format["geometry tensor"](j, a)   # j: term index, set in enclosing code
>          name = format.format["element tensor"](i, k)  # k: entry counter, set in enclosing code
>          value = format.format["sum"]([value, \
>                  format.format["multiplication"]([ \
>                  format.format["floating point"](a0), gk])])
>      declarations += [Declaration(name, value)]
>
>This iterates over a list of indices (iindices) and for each i
>generates a line of code that is a sum of products. The coefficients
>for the products are in the reference tensor A0, and the names are
>generated by lookups in the format dictionary, which is stored in
>dolfin.py. This is why dolfin.py shows up in the profiling.
>
>One obvious optimization is to move the generation of the variable
>"name" outside of the loop over a. Another would be to fetch the
>formats from the dictionary before the loop, i.e., do something like
>
>    format_sum = format.format["sum"]
>
>before the loop and then use format_sum etc.
>
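Applied to the sketch above, and assuming format.format["sum"] accepts the
full list of terms at once, the result might look something like this
(untested):

    # Hoist all dictionary lookups out of the loops.
    format_sum   = format.format["sum"]
    format_mult  = format.format["multiplication"]
    format_float = format.format["floating point"]
    format_gk    = format.format["geometry tensor"]
    format_name  = format.format["element tensor"]

    for i in iindices:
        # Generate the entry name once per i instead of once per (i, a).
        name = format_name(i, k)   # k: entry counter, as in the sketch above
        terms = [format_mult([format_float(A0(i, a)), format_gk(j, a)])
                 for a in aindices]
        declarations += [Declaration(name, format_sum(terms))]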
>If someone wants to take a look and needs more directions or
>clarifications, just ask.
>
>/Anders
>
>_______________________________________________
>FFC-dev mailing list
>FFC-dev@xxxxxxxxxx
>http://www.fenics.org/cgi-bin/mailman/listinfo/ffc-dev