
Re: Benchmark results for new BLAS mode

 


All I'm saying is that the theoretical top speed for the generated
code mode is higher than for the BLAS mode. Do you disagree with
this statement?

The reason is that the generated code mode can potentially use all
the optimizations that BLAS does (blocking, clever cache usage,
Fortran), but BLAS cannot use all the optimizations that the
generated code mode uses (FErari-style optimizations such as
skipping known zeros).
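(To make "skipping known zeros" concrete, a minimal sketch of what
the generated code can do; the tensor entries and names below are
made up, not FFC's actual output. The element tensor is computed as
A[i] = sum_a A0[i][a] * G[a], and since the reference tensor A0 is
known at generation time, terms with A0[i][a] == 0 are simply never
emitted:)

/* Hypothetical generated code for A[i] = sum_a A0[i][a] * G[a].
 * Suppose A0[0][1] == 0 and A0[1][0] == 0: those multiply-adds
 * are never emitted, whereas BLAS would multiply by the stored
 * zeros. */
void tabulate_tensor(double* A, const double* G)
{
  A[0] =  0.5 * G[0];   /* term with A0[0][1] == 0 skipped */
  A[1] = -0.5 * G[1];   /* term with A0[1][0] == 0 skipped */
}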

I see what you mean, and agree somewhat, but I am skeptical. First, the irregularity introduced by FErari optimizations destroys spatial/temporal locality: the reduced flop count comes at a price, because the irregular code may not be able to use the machine as efficiently as the very regular computation patterns in BLAS -- you do fewer flops, but each one is slower. Second, as you indicate below, nobody wants to recreate something like ATLAS or FLAME. Besides, these tools can take a very long time just to generate a single matrix-matrix multiply, and that is for dense matrices with no particular structure (so once you know how to tile one part, you know how to tile all of it).


In the short term though, I don't think anybody is going to want to
spend time on these types of optimizations for the generated code mode
(premature optimization is the root of all evil etc.). But in the long
term someone might consider it worthwhile, especially if this type of
assembly becomes an established tool.

So here's my suggested plan: let's not lock ourselves into BLAS;
let's keep both methods for the foreseeable future. BLAS is a great
tool, and may currently be faster at runtime for certain forms and
elements, but the generated code mode will be faster than BLAS for
all forms and all elements once the BLAS-style optimizations are
included (which can certainly be done).


I completely agree with the following:
i.)   There exist forms for which fine-grained FErari optimizations
      are the way to go. These are the ones with a "short G" -- the
      geometric tensor doesn't have much data in it.

ii.)  There exist forms for which level 3 BLAS is the way to go.
      These sit at the other end of the spectrum (see the sketch
      after this list).

iii.) For vector-valued Lagrange elements, it will (almost) always
      pay off to utilize block structure; even if there are no zero
      blocks, the nonzero blocks can have lots of structure and
      repetition that we should exploit.
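(To make ii.) concrete, here is a minimal sketch, assuming CBLAS and
row-major storage, of how the contraction A[i] = sum_a A0[i][a] * G[a]
maps onto BLAS; the function names are made up, not FFC's actual
interface. One cell is a level 2 matrix-vector product, and stacking
the geometry tensors of many cells column-wise turns assembly into a
single level 3 matrix-matrix multiply:)

#include <cblas.h>

/* Level 2: one cell. A (m) = A0 (m x n, row-major) * G (n). */
void tabulate_tensor_blas(double* A, const double* A0,
                          const double* G, int m, int n)
{
  cblas_dgemv(CblasRowMajor, CblasNoTrans, m, n,
              1.0, A0, n, G, 1, 0.0, A, 1);
}

/* Level 3: many cells at once. The geometry tensors of ncells
 * cells form the columns of Gbatch (n x ncells), so that
 * Abatch (m x ncells) = A0 * Gbatch is one dgemm call. */
void tabulate_batch_blas(double* Abatch, const double* A0,
                         const double* Gbatch,
                         int m, int n, int ncells)
{
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              m, ncells, n, 1.0, A0, n,
              Gbatch, ncells, 0.0, Abatch, ncells);
}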


On a side note, there can be a significant difference in top speed
between code produced by Fortran and C compilers. We had a quite
trivial example in a course, used to demonstrate why Fortran was so
popular for computing, and the Fortran version turned out to be
more than 4 times faster than the C version, with full optimization
enabled for both compilers (Sun compilers on an UltraSPARC). This
was due to aliasing: a Fortran compiler may assume that all
procedure arguments are distinct in memory, while a C compiler
cannot assume this about pointers.


I'm not sure how much this carries over to other compilers, but
aliasing can be a problem. On the other hand, since we're generating
code and know what n is up front, we could generate a very specific
struct with n fields rather than use an array. This may explain
Anders' observation that forming G slowed down when he started
putting it in an array. Generating structs is not incompatible with
BLAS -- you pass in a pointer to the head of the struct, cast to a
double pointer.
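(A sketch of that idea, picking n = 4 out of thin air; whether named
fields actually beat an array will depend on the compiler, so treat
this as an illustration only:)

/* Hypothetical generated code: G is known to have exactly four
 * entries, so emit named scalar fields instead of an array. */
typedef struct
{
  double g0, g1, g2, g3;
} geometry_tensor;

void example(void)
{
  geometry_tensor G = { 1.0, 2.0, 3.0, 4.0 };

  /* Pass to BLAS as a plain array: a pointer to the head of the
   * struct, cast to double*. This works in practice because the
   * fields are all doubles and laid out contiguously, although
   * the C standard does not strictly forbid padding here. */
  const double* Gp = (const double*) &G;
  (void) Gp;   /* e.g. cblas_dgemv(..., Gp, ...) */
}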

Also, most Fortran compilers don't prevent you from aliasing -- you
can pass the same pointer for two arguments, and the compiler won't
complain. It just optimizes under the assumption that no aliasing
occurs, which is unsafe if it does, so we have to be careful in the
generated code (not that you don't have to be careful in C).

Here's the example in C:

void horner(double* px, double* x, double* coeff, int n)
{
  int           j;
  double        xj;

  for(j = 0; j < n; j++)
    {
      xj = x[j];
      px[j] = coeff[0] + xj * (coeff[1] + xj * (coeff[2]
              + xj * (coeff[3] + xj * coeff[4])));
    }
}

and Fortran 90:

subroutine horner( px, x, coeff, n )
  implicit none
  integer           n, j
  double precision  px(n), x(n), coeff(5), xj

  do j = 1, n
     xj = x(j)
     px(j) = coeff(1) + xj * (coeff(2) + xj * (coeff(3) + xj * (coeff(4) + xj * coeff(5))))
  end do

end subroutine horner
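
(For what it's worth, C99's restrict qualifier lets a C compiler make
the same no-aliasing assumption, at the caller's risk: just as with
the Fortran version, a call like horner(x, x, coeff, n) then has
undefined behavior. A restrict-qualified variant of the C routine
above:)

void horner(double* restrict px, const double* restrict x,
            const double* restrict coeff, int n)
{
  int           j;
  double        xj;

  for(j = 0; j < n; j++)
    {
      xj = x[j];
      px[j] = coeff[0] + xj * (coeff[1] + xj * (coeff[2]
              + xj * (coeff[3] + xj * coeff[4])));
    }
}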

I don't think FEniCS should focus on these types of issues, though;
it should be enough to compare abstractions of the methods. We
shouldn't all have to become Fortran or assembler hackers to prove
our points (I certainly am not one, and have no ambition of becoming
one).

  Johan



